
Apache Spark - SparkByExamples

Apache Spark 3.5


What is Apache Spark?
Apache Spark Tutorial – Apache Spark is an open-source analytical processing engine for large-scale distributed data processing and machine learning applications. Spark was originally developed at the University of California, Berkeley, and later donated to the Apache Software Foundation. In February 2014, Spark became a Top-Level Apache Project; it has since received contributions from thousands of engineers, making Spark one of the most active open-source projects in Apache.
Apache Spark 3.5 is a framework that is supported in Scala, Python, R Programming, and Java.
Below are different implementations of Spark.
 Spark – Default interface for Scala and Java
 PySpark – Python interface for Spark
 SparklyR – R interface for Spark.
Examples in this Spark tutorial are explained with Scala, and the same material is also covered in the PySpark Tutorial (Spark with Python) examples. Python also offers Pandas, which has its own DataFrame, but a Pandas DataFrame is not distributed.
Features of Apache Spark
 In-memory computation
 Distributed processing using parallelize
 Can be used with many cluster managers (Spark, Yarn, Mesos e.t.c)
 Fault-tolerant
 Immutable
 Lazy evaluation
 Cache & persistence
 In-built optimization when using DataFrames
 Supports ANSI SQL
Advantages of Apache Spark
 Spark is a general-purpose, in-memory, fault-tolerant, distributed processing engine that
allows you to process data efficiently in a distributed fashion.
 Applications running on Spark can be up to 100x faster than traditional Hadoop MapReduce systems for in-memory workloads.
 You will get great benefits from using Spark for data ingestion pipelines.
 Using Spark we can process data from Hadoop HDFS, AWS S3, Databricks DBFS, Azure
Blob Storage, and many file systems.
 Spark is also used to process real-time data using Spark Streaming and Kafka.
 Using Spark Streaming you can also stream files from a file system, as well as stream data from a socket.


 Spark natively has machine learning and graph libraries.


 Provides connectors to store the data in NoSQL databases like MongoDB.
What Versions of Java & Scala Spark 3.5 Supports?
Apache Spark 3.5 is compatible with Java versions 8, 11, and 17, Scala versions 2.12 and 2.13,
Python 3.8 and newer, as well as R 3.5 and beyond. However, it’s important to note that support
for Java 8 versions prior to 8u371 has been deprecated starting from Spark 3.5.0.

LANGUAGE    SUPPORTED VERSION
Python      3.8 and newer
Java        8, 11, and 17 (Java 8 versions prior to 8u371 are deprecated)
Scala       2.12 and 2.13
R           3.5 and newer

Apache Spark Architecture


Spark uses a master-slave architecture where the master is called the "Driver" and the slaves are called "Workers". When you run a Spark application, the Spark Driver creates a context that is the entry point to your application; all operations (transformations and actions) are executed on worker nodes, and the resources are managed by the Cluster Manager.

Cluster Manager Types


As of writing this Apache Spark Tutorial, Spark supports the cluster managers below:
 Standalone – a simple cluster manager included with Spark that makes it easy to set up a
cluster.
 Apache Mesos – Mesos is a cluster manager that can also run Hadoop MapReduce and
Spark applications.
 Hadoop YARN – the resource manager in Hadoop 2; this is the most commonly used cluster
manager.
 Kubernetes – an open-source system for automating deployment, scaling, and
management of containerized applications.


 local – not really a cluster manager, but worth mentioning: we use "local" for master() in order to run Spark on a laptop/computer.
Spark Modules
 Spark Core
 Spark SQL
 Spark Streaming
 Spark MLlib
 Spark GraphX

Spark Core
In this section of the Apache Spark Tutorial, you will learn different concepts of the Spark Core library with examples in Scala code. Spark Core is the base library of Spark; it provides the abstractions for distributed task dispatching, scheduling, and basic I/O functionality. Before getting your hands dirty with Spark programming, have your development environment set up to run the Spark examples using IntelliJ IDEA.
SparkSession
SparkSession, introduced in version 2.0, is the entry point to underlying Spark functionality for programmatically working with Spark RDDs, DataFrames, and Datasets. Its object spark is available by default in spark-shell.
Creating a SparkSession instance is the first statement you write in a program that works with RDDs, DataFrames, and Datasets. A SparkSession is created using the SparkSession.builder() builder pattern.
// Create SparkSession
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()
Spark Context


SparkContext has been available since Spark 1.x (JavaSparkContext for Java) and was the entry point to Spark and PySpark before SparkSession was introduced in 2.0. Creating a SparkContext was the first step to program with RDDs and to connect to a Spark cluster. Its object sc is available by default in spark-shell.
Since Spark 2.x, when you create a SparkSession, a SparkContext object is created by default and can be accessed using spark.sparkContext.
Note that you can create just one SparkContext per JVM but many SparkSession objects.

RDD Spark Tutorial


RDD (Resilient Distributed Dataset) is a fundamental data structure of Spark and the primary data abstraction in Apache Spark and Spark Core. RDDs are fault-tolerant, immutable distributed collections of objects, which means once you create an RDD you cannot change it. Each dataset in an RDD is divided into logical partitions, which can be computed on different nodes of the cluster.
This Apache Spark RDD Tutorial will help you start understanding and using Apache Spark RDD
(Resilient Distributed Dataset) with Scala code examples. All RDD examples provided in this
tutorial were also tested in our development environment and are available at GitHub spark scala
examples project for quick reference.
In this section of the Apache Spark tutorial, I will introduce the RDD and explain how to create
them and use their transformation and action operations. Here is the full article on Spark RDD in
case you want to learn more about it and get your fundamentals strong.

RDD creation
RDDs are created primarily in two different ways: first, by parallelizing an existing collection, and second, by referencing a dataset in an external storage system (HDFS, S3, and many more).

sparkContext.parallelize()
sparkContext.parallelize is used to parallelize an existing collection in your driver program. This is
a basic method to create RDD.

//Create RDD from parallelize
val dataSeq = Seq(("Java", 20000), ("Python", 100000), ("Scala", 3000))
val rdd = spark.sparkContext.parallelize(dataSeq)

sparkContext.textFile()
Using the textFile() method we can read a text (.txt) file from many sources like HDFS, S3, Azure, local storage, etc. into an RDD.


//Create RDD from external data source
val rdd2 = spark.sparkContext.textFile("/path/textFile.txt")

RDD Operations
On Spark RDD, you can perform two kinds of operations.

RDD Transformations
Spark RDD Transformations are lazy operations, meaning they don't execute until you call an action on the RDD. Since RDDs are immutable, when you run a transformation (for example map()), instead of updating the current RDD it returns a new RDD.
Some transformations on RDDs are flatMap(), map(), reduceByKey(), filter(), and sortByKey(); all of these return a new RDD instead of updating the current one.
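As a quick illustration, here is a minimal sketch (assuming a SparkSession named spark, as in the other examples of this tutorial) that chains a few of these lazy transformations; the variable names are illustrative.

// A minimal sketch of chaining transformations; nothing executes until an action is called
val linesRdd = spark.sparkContext.parallelize(Seq("spark scala", "spark python"))

val wordCountsRdd = linesRdd
  .flatMap(line => line.split(" "))  // split each line into words (returns a new RDD)
  .map(word => (word, 1))            // pair each word with a count of 1 (new RDD)
  .reduceByKey(_ + _)                // sum the counts per word (new RDD)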

RDD Actions
RDD Action operations return values from an RDD to the driver node. In other words, any RDD function that returns something other than RDD[T] is considered an action. Actions trigger the computation of the lazy transformations and return the results to the driver program.
Some actions on RDDs are count(), collect(), first(), max(), reduce(), and more.
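Continuing the sketch above, calling an action is what finally triggers the computation and returns plain values to the driver:

// Actions trigger execution of the lazy transformations defined above
println("Distinct words: " + wordCountsRdd.count())  // returns a Long to the driver
wordCountsRdd.collect().foreach(println)             // brings all results to the driver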
RDD Examples
 Read CSV file into RDD
 RDD Pair Functions
 Generate DataFrame from RDD

DataFrame Spark Tutorial with Basic Examples


The DataFrame definition is very well explained by Databricks, so rather than redefining it and confusing you, below is the definition taken from Databricks.
DataFrame is a distributed collection of data organized into named columns. It is conceptually
equivalent to a table in a relational database or a data frame in R/Python, but with richer
optimizations under the hood. DataFrames can be constructed from a wide array of sources such
as structured data files, tables in Hive, external databases, or existing RDDs.
DataFrame creation
The simplest way to create a Spark DataFrame is from a Seq collection. A Spark DataFrame can also be created from an RDD or by reading files from several sources.
Using createDataFrame()
By using the createDataFrame() function of the SparkSession you can create a DataFrame.
// Create DataFrame

val data = Seq(("James","","Smith","1991-04-01","M",3000),
  ("Michael","Rose","","2000-05-19","M",4000),
  ("Robert","","Williams","1978-09-05","M",4000),
  ("Maria","Anne","Jones","1967-12-01","F",4000),
  ("Jen","Mary","Brown","1980-02-17","F",-1)
)
val columns = Seq("firstname","middlename","lastname","dob","gender","salary")
val df = spark.createDataFrame(data).toDF(columns:_*)
Since DataFrames hold data in a structured format with named columns, you can get the schema of the DataFrame using df.printSchema().
df.show() displays the first 20 rows of the DataFrame.
+---------+----------+--------+----------+------+------+
|firstname|middlename|lastname|dob       |gender|salary|
+---------+----------+--------+----------+------+------+
|James    |          |Smith   |1991-04-01|M     |3000  |
|Michael  |Rose      |        |2000-05-19|M     |4000  |
|Robert   |          |Williams|1978-09-05|M     |4000  |
|Maria    |Anne      |Jones   |1967-12-01|F     |4000  |
|Jen      |Mary      |Brown   |1980-02-17|F     |-1    |
+---------+----------+--------+----------+------+------+

In this Apache Spark SQL DataFrame Tutorial, I have explained several mostly used
operation/functions on DataFrame & DataSet with working Scala examples.
______________________________________________________________________________

What is SparkSession | Entry Point to Spark


SparkSession is a unified entry point for Spark applications; it was introduced in Spark 2.0. It acts
as a connector to all Spark’s underlying functionalities, including RDDs, DataFrames, and
Datasets, providing a unified interface to work with structured data processing. It is one of the very
first objects you create while developing a Spark SQL application. As a Spark developer, you
create a SparkSession using the SparkSession.builder() method
SparkSession consolidates several previously separate contexts, such as SQLContext,
HiveContext, and StreamingContext, into one entry point, simplifying the interaction with Spark
and its different APIs. It enables users to perform various operations like reading data from various
sources, executing SQL queries, creating DataFrames and Datasets, and performing actions on
distributed datasets efficiently.
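As a minimal sketch of those operations (the CSV path, app name, and header assumption here are placeholders, not files used elsewhere in this tutorial):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("SparkSessionUsage")
  .getOrCreate()

// Read a data source into a DataFrame (path is a placeholder)
val peopleDf = spark.read.option("header", "true").csv("data/people.csv")

// Register a temporary view and run a SQL query against it
peopleDf.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people").show()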
For those engaging with Spark through the spark-shell CLI, the ‘spark’ variable automatically
provides a default Spark Session, eliminating the need for manual creation within this context.


In this article, I’ll delve into the essence of SparkSession, how to create SparkSession object, and
explore its frequently utilized methods.

What is SparkSession
SparkSession was introduced in Spark 2.0. It is the entry point to underlying Spark functionality for programmatically creating Spark RDDs, DataFrames, and Datasets. SparkSession's object spark is the default variable available in spark-shell, and it can be created programmatically using the SparkSession builder pattern.
If you are looking for a PySpark explanation, please refer to how to create SparkSession in
PySpark.
1. SparkSession Introduction
As mentioned in the beginning, SparkSession is an entry point to Spark, and creating a
SparkSession instance would be the first statement you would write to program
with RDD, DataFrame, and Dataset. SparkSession will be created
using SparkSession.builder() builder pattern.
Before Spark 2.0, SparkContext used to be the entry point, and it has not been completely replaced by SparkSession. Many features of SparkContext are still available and used in Spark 2.0 and later. You should also know that SparkSession internally creates SparkConf and SparkContext with the configuration provided to SparkSession.
With Spark 2.0, a new class org.apache.spark.sql.SparkSession has been introduced, which is a
combined class for all the different contexts we used to have before 2.0 (SQLContext,
HiveContext, etc); hence, Spark Session can be used in the place of SQLContext, HiveContext,
and other contexts.
Spark Session also includes all the APIs available in different contexts –
 SparkContext
 SQLContext
 StreamingContext
 HiveContext
How many SparkSessions can you create in an application?
You can create as many SparkSession objects as you want in a Spark application using either SparkSession.builder() or SparkSession.newSession(). Multiple Spark session objects are useful when you want to keep Spark tables (relational entities) logically separated, as illustrated in the sketch below.
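The sketch below shows why: a temporary view registered in one session is not visible in another (assuming spark is an existing SparkSession; the view name is illustrative).

// Two independent sessions sharing the same SparkContext
val sessionA = spark.newSession()
val sessionB = spark.newSession()

// The temporary view "numbers" exists only in sessionA's catalog
sessionA.range(5).createOrReplaceTempView("numbers")

sessionA.sql("SELECT count(*) FROM numbers").show()
// sessionB.sql("SELECT count(*) FROM numbers")  // would fail: table or view not found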

2. SparkSession in spark-shell
By default, Spark shell provides spark object, which is an instance of the SparkSession class. We
can directly use this object when required in spark-shell.
// Usage of spark variable
scala> spark.version


Like the Spark shell, most tools, notebooks, and Azure Databricks environments create a default SparkSession object for you, so you don't have to worry about creating a Spark session.
3. How to Create SparkSession
Creating a SparkSession is fundamental as it initializes the environment required to leverage the
capabilities of Apache Spark.
To create a SparkSession in Scala or Python, use the builder pattern method builder() and call the getOrCreate() method. It returns an existing SparkSession if one exists; otherwise, it creates a new one. The example below creates a SparkSession in Scala.

// Create SparkSession object
import org.apache.spark.sql.SparkSession

object SparkSessionTest extends App {
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  println(spark)
  println("Spark Version : " + spark.version)
}
// Outputs
// org.apache.spark.sql.SparkSession@2fdf17dc
// Spark Version : 3.4.1

From the above code –


SparkSession.builder() – Return SparkSession.Builder class. This is a builder for SparkSession.
master(), appName(), and getOrCreate() are methods of SparkSession.Builder.
master() – This allows Spark applications to connect and run in different modes (local, standalone
cluster, Mesos, YARN), depending on the configuration.
 Use local[x] when running on your local laptop. x should be an integer value and should be
greater than 0; this represents how many partitions it should create when using RDD,
DataFrame, and Dataset. Ideally, x value should be the number of CPU cores you have.
 For standalone use spark://master:7077
appName() – Sets a name to the Spark application that shows in the Spark web UI. If no
application name is set, it sets a random name.
getOrCreate() – This returns a SparkSession object if it already exists. Creates a new one if it
does not exist.

3.1 Get Existing SparkSession


You can get the existing SparkSession in Scala programmatically using the example below. To get
the existing SparkSession, you don’t have to specify the app name, master e.t.c
// Get existing SparkSession
import org.apache.spark.sql.SparkSession

val spark2 = SparkSession.builder().getOrCreate()
print(spark2)

// Output:
// org.apache.spark.sql.SparkSession@2fdf17dc
Compare the hash of spark and spark2 object. Since it returned the existing session, both objects
have the same hash value.
3.2 Create Another SparkSession
Sometimes, you might be required to create multiple sessions, which you can easily achieve by
using newSession() method. This uses the same app name and master as the existing session.
Underlying SparkContext will be the same for both sessions, as you can have only one context per
Spark application.
// Create a new SparkSession
val spark3 = spark.newSession()
print(spark3)

// Output:
// org.apache.spark.sql.SparkSession@692dba54
Compare this hash with the hash from the above example; it should be different.
3.3 Setting Spark Configs
If you want to set some configs to SparkSession, use the config() method.
// Usage of config()
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .config("spark.some.config.option", "config-value")
  .getOrCreate()

3.4 Create SparkSession with Hive Enable


To use Hive with Spark, you need to enable it using the enableHiveSupport() method.
SparkSession from Spark 2.0 provides built-in support for Hive operations such as writing queries on Hive tables using HQL, accessing Hive UDFs, and reading data from Hive tables.
// Enabling Hive to use in Spark
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .config("spark.sql.warehouse.dir", "<path>/spark-warehouse")
  .enableHiveSupport()
  .getOrCreate()

4. Other Usages of SparkSession


4.1 Set & Get All Spark Configs
Once the SparkSession is created, you can add the spark configs during runtime or get all configs.
// Set Config
spark.conf.set("spark.sql.shuffle.partitions", "30")

// Get all Spark Configs
val configMap: Map[String, String] = spark.conf.getAll

4.2 Create DataFrame


SparkSession also provides several methods to create a Spark DataFrame and Dataset. The
below example uses the createDataFrame() method which takes a list of data.
// Create DataFrame
val df = spark.createDataFrame(
List(("Scala", 25000), ("Spark", 35000), ("PHP", 21000)))
df.show()
// Output:

// +-----+-----+

// | _1| _2|

// +-----+-----+

// |Scala|25000|

// |Spark|35000|

// | PHP|21000|

// +-----+-----+

4.3 Working with Spark SQL



Using SparkSession, you can access Spark SQL capabilities in Apache Spark. To use SQL features, you first need to create a temporary view. Once you have a temporary view, you can run ANSI SQL queries using the spark.sql() method.
// Spark SQL
df.createOrReplaceTempView("sample_table")
val df2 = spark.sql("SELECT _1,_2 FROM sample_table")
df2.show()
Spark SQL temporary views are session-scoped and will not be available once the session that created them terminates. If you want a temporary view that is shared among all sessions and kept alive until the Spark application terminates, you can create a global temporary view using createGlobalTempView().
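As a short sketch (reusing the df DataFrame from the earlier example), a global temporary view is registered under the reserved global_temp database and stays visible to other sessions of the same application:

// Create a global temporary view shared across sessions
df.createGlobalTempView("sample_global")

// Global temp views are qualified with the reserved `global_temp` database
spark.sql("SELECT _1,_2 FROM global_temp.sample_global").show()

// A new session can also query it
spark.newSession().sql("SELECT _1,_2 FROM global_temp.sample_global").show()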

4.4 Create Hive Table


As explained above, SparkSession can also be used to create Hive tables and query them. Note that you don't need Hive to be installed to try this. saveAsTable() creates a Hive managed table, which you can then query using spark.sql().
// Create Hive table & query it
spark.table("sample_table").write.saveAsTable("sample_hive_table")
val df3 = spark.sql("SELECT _1,_2 FROM sample_hive_table")
df3.show()

4.5 Working with Catalogs


To get catalog metadata, SparkSession exposes the catalog variable. Note that the methods spark.catalog.listDatabases and spark.catalog.listTables return a Dataset.
// Get metadata from the Catalog
// List databases
val ds = spark.catalog.listDatabases
ds.show(false)
// Output:

// +-------+----------------+----------------------------+

// |name |description |locationUri |

// +-------+----------------+----------------------------+

// |default|default database|file:/<path>/spark-warehouse|

// +-------+----------------+----------------------------+

// List Tables
val ds2 = spark.catalog.listTables
ds2.show(false)
// Output:

// +-----------------+--------+-----------+---------+-----------+

// |name |database|description|tableType|isTemporary|

// +-----------------+--------+-----------+---------+-----------+

// |sample_hive_table|default |null |MANAGED |false |

// |sample_table |null |null |TEMPORARY|true |

Notice the two tables we have created so far: sample_table, which was created with df.createOrReplaceTempView, is listed as a temporary table, while the Hive table is listed as a managed table.

5. SparkSession Commonly Used Methods

version – Returns the Spark version where your application is running (typically the Spark version your cluster is configured with).

catalog – Returns the catalog object to access metadata.

conf – Returns the RuntimeConfig object.

builder() – Used to create a new SparkSession; returns SparkSession.Builder.

newSession() – Creates a new SparkSession.

range(n) – Returns a single-column Dataset of LongType named id, containing elements from 0 to n (exclusive) with step value 1. There are several variations of this function; for details, refer to the Spark documentation.

createDataFrame() – Creates a DataFrame from a collection or an RDD.

createDataset() – Creates a Dataset from a collection, DataFrame, or RDD.

emptyDataset() – Creates an empty Dataset.

getActiveSession() – Returns the active Spark session for the current thread.

getDefaultSession() – Returns the default SparkSession returned by the builder.

implicits() – Gives access to the nested Scala implicits object.

read() – Returns an instance of DataFrameReader, used to read records from CSV, Parquet, Avro, and other file formats into a DataFrame.

readStream() – Returns an instance of DataStreamReader, used to read streaming data into a DataFrame.

sparkContext() – Returns the SparkContext.

sql(String sql) – Returns a DataFrame after executing the given SQL statement.

sqlContext() – Returns the SQLContext.

stop() – Stops the underlying SparkContext.

table() – Returns a DataFrame of a table or view.

udf() – Creates a Spark UDF to use on DataFrame, Dataset, and SQL.

6. FAQ’s on SparkSession

 How to create SparkSession?


A SparkSession is created using SparkSession.builder().master("master-details").appName("app-name").getOrCreate(). The getOrCreate() method returns the existing SparkSession if one already exists; if not, it creates a new SparkSession.
 How many SparkSessions can I create?
You can create as many SparkSession as you want in a Spark application using
either SparkSession.builder() or SparkSession.newSession(). Many Spark session objects are
required when you want to keep Spark tables (relational entities) logically separated.
 How to stop SparkSession?
To stop SparkSession in Apache Spark, you can use the stop() method of
the SparkSession object. If you have spark as a SparkSession object then call spark.stop() to stop
the session. Calling stop() when you're finished with your Spark application is important; this ensures that resources are properly released and the Spark application terminates gracefully.
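A common pattern, sketched below assuming spark is your SparkSession, is to wrap the job in try/finally so the session is stopped even if the job fails:

try {
  // ... run your Spark jobs here ...
} finally {
  spark.stop()  // releases the driver, executors, and other cluster resources
}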


 How is SparkSession different from SparkContext?


SparkSession and SparkContext are two core components of Apache Spark. Though they sound
similar, they serve different purposes and are used in different contexts within a Spark application.
SparkContext provides the connection to a Spark cluster and is responsible for coordinating and
distributing the operations on that cluster. SparkContext is used for low-level RDD (Resilient
Distributed Dataset) programming.
SparkSession was introduced in Spark 2.0 to provide a more convenient and unified API for
working with structured data. It’s designed to work with DataFrames and Datasets, which provide
more structured and optimized operations than RDDs.
 Do we need to stop SparkSession?
It is recommended to end the Spark session after finishing the Spark job in order for the JVMs to
close and free the resources.
 How do I know if my Spark session is active?
To check whether your SparkSession is still usable, you can inspect its underlying SparkContext. In Scala, SparkContext exposes the isStopped method: if you have spark as a SparkSession object, spark.sparkContext.isStopped returns true once the context has been stopped and false while it is still active.
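A minimal sketch of that check in Scala, assuming spark is your SparkSession:

// isStopped is false while the underlying SparkContext is still usable
if (spark.sparkContext.isStopped)
  println("The Spark context has been stopped")
else
  println("The Spark context is still active")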

7. Conclusion
In this SparkSession article, you have learned what a SparkSession is, its usage, how to create one programmatically, and some of the commonly used SparkSession methods. In summary:
 SparkSession was introduced in Spark 2.0 which is a unified API for working with
structured data.
 It combines SparkContext, SQLContext, and HiveContext. It’s designed to work with
DataFrames and Datasets, which provide more structured and optimized operations than
RDDs.
 SparkSession natively supports SQL queries, structured streaming, and DataFrame-based
machine learning APIs.
 spark-shell, Databricks, and other tools provide spark variable as the default SparkSession
object.


What is SparkContext?
SparkContext has been available since Spark 1.x (JavaSparkContext for Java) and was the entry point to Spark and PySpark before SparkSession was introduced in 2.0. Creating a SparkContext is the first step to using RDDs and connecting to a Spark cluster. In this article, you will learn how to create it using examples.
What is SparkContext
Since Spark 1.x, SparkContext has been an entry point to Spark and is defined in the org.apache.spark package. It is used to programmatically create Spark RDDs, accumulators, and broadcast variables on the cluster. Its object sc is the default variable available in spark-shell, and it can be created programmatically using the SparkContext class.
Note that you can create only one active SparkContext per JVM. You should stop() the active SparkContext before creating a new one.

The Spark driver program creates and uses SparkContext to connect to the cluster manager to submit Spark jobs, and to know which resource manager (YARN, Mesos, or Standalone) to communicate with. It is the heart of the Spark application.
Related: How to get current SparkContext & its configurations in Spark

1. SparkContext in spark-shell
By default, Spark shell provides sc object which is an instance of the SparkContext class. We can
directly use this object where required.
// 'sc' is a SparkContext variable in spark-shell
scala> sc.appName
This yields the application name as the output.


Similar to the Spark shell, most tools, notebooks, and Azure Databricks environments create a default SparkContext object for you, so you don't have to worry about creating a Spark context.
2. Spark 2.X – Create SparkContext using Scala Program
Since Spark 2.0, we mostly use SparkSession as most of the methods available in SparkContext
are also present in SparkSession. Spark session internally creates the Spark Context and exposes
the sparkContext variable to use.
At any given time only one SparkContext instance should be active per JVM. In case you want to
create another you should stop existing SparkContext (using stop()) before creating a new one.
// Imports
import org.apache.spark.sql.SparkSession

object SparkSessionTest extends App {

  // Create SparkSession object
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  // Access spark context
  println(spark.sparkContext)
  println("Spark App Name : " + spark.sparkContext.appName)
}
// Output:
//org.apache.spark.SparkContext@2fdf17dc
//Spark App Name : SparkByExamples.com
As I explained in the SparkSession article, you can create any number of SparkSession objects; however, underlying all of them there will be only one SparkContext.

3. Create RDD
Once you create a Spark Context object, use the below to create Spark RDD.
// Create RDD
val rdd = spark.sparkContext.range(1, 5)
rdd.collect().foreach(print)
// Create RDD from Text file

val rdd2 = spark.sparkContext.textFile("src/main/resources/text/alice.txt")

4. Stop SparkContext
You can stop the SparkContext by calling the stop() method. As explained above, you can have only one SparkContext per JVM. If you want to create another, you need to shut the existing one down first by using the stop() method before creating a new SparkContext.
// SparkContext stop() method
spark.sparkContext.stop()
When Spark executes this statement, it logs the message INFO SparkContext: Successfully
stopped SparkContext to the console or to a log file.

5. Spark 1.X – Creating SparkContext using Scala Program


In Spark 1.x, first, you need to create a SparkConf instance by assigning the app name and setting
the master by using the SparkConf static methods setAppName() and setMaster() respectively
and then pass the SparkConf object as an argument to the SparkContext constructor to create
Spark Context.
// Create SparkContext
import org.apache.spark.{SparkConf, SparkContext}
// Create SparkConf object
val sparkConf = new SparkConf().setAppName("sparkbyexamples.com").setMaster("local[1]")
// Create Spark context (deprecated)
val sparkContext = new SparkContext(sparkConf)
The SparkContext constructor has been deprecated since 2.0; hence, the recommendation is to use the static method getOrCreate(), which internally creates a SparkContext. This function instantiates a SparkContext and registers it as a singleton object.
// Create Spark Context
val sc = SparkContext.getOrCreate(sparkConf)

6. SparkContext Commonly Used Methods


The following are the most commonly used methods of SparkContext. For the complete list, refer
to Spark documentation.
longAccumulator() – It creates an accumulator variable of a long data type. Only a driver can
access accumulator variables.
doubleAccumulator() – It creates an accumulator variable of a double data type. Only a driver
can access accumulator variables.
applicationId – Returns a unique ID of a Spark application.
appName – Returns the app name that was given when creating the SparkContext

broadcast – read-only variable broadcast to the entire cluster. You can broadcast a variable to a
Spark cluster only once.
emptyRDD – Creates an empty RDD
getPersistentRDDs – Returns all persisted RDDs
getOrCreate() – Creates or returns a SparkContext
hadoopFile – Returns an RDD of a Hadoop file
master() – Returns the master that was set while creating the SparkContext
newAPIHadoopFile – Creates an RDD for a Hadoop file with a new API InputFormat.
sequenceFile – Get an RDD for a Hadoop SequenceFile with given key and value types.
setLogLevel – Change log level to debug, info, warn, fatal, and error
textFile – Reads a text file from HDFS, local or any Hadoop supported file systems, and returns
an RDD
union – Union two RDDs
wholeTextFiles – Reads the text files in a folder from HDFS, local, or any Hadoop-supported file system and returns an RDD of Tuple2. The first element of the tuple is the file name and the second element is the content of the text file.
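To make a couple of these concrete, the hedged sketch below combines longAccumulator and broadcast (assuming spark is an existing SparkSession; names and data are illustrative):

val sc = spark.sparkContext

val missingKeys = sc.longAccumulator("missingKeys")   // counter readable on the driver
val lookup = sc.broadcast(Map("A" -> 1, "B" -> 2))    // read-only value shipped to executors once

sc.parallelize(Seq("A", "B", "X")).foreach { key =>
  if (!lookup.value.contains(key)) missingKeys.add(1) // executors update the accumulator
}
println("Keys missing from lookup: " + missingKeys.value) // driver reads the total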

7. SparkContext Example

// Complete example of SparkContext
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextExample extends App {

  val conf = new SparkConf().setAppName("sparkbyexamples.com").setMaster("local[1]")
  val sparkContext = new SparkContext(conf)
  val rdd = sparkContext.textFile("src/main/resources/text/alice.txt")

  sparkContext.setLogLevel("ERROR")

  println("First SparkContext:")
  println("APP Name :" + sparkContext.appName)
  println("Deploy Mode :" + sparkContext.deployMode)
  println("Master :" + sparkContext.master)

  // Stop the first context before creating a new one
  // (only one active SparkContext is allowed per JVM)
  sparkContext.stop()

  val conf2 = new SparkConf().setAppName("sparkbyexamples.com-2").setMaster("local[1]")
  val sparkContext2 = new SparkContext(conf2)

  println("Second SparkContext:")
  println("APP Name :" + sparkContext2.appName)
  println("Deploy Mode :" + sparkContext2.deployMode)
  println("Master :" + sparkContext2.master)
}

FAQ’s on SparkContext
 What does SparkContext do?
SparkContext has been the entry point to a Spark application since Spark 1.x. The SparkContext is the central
entry point and controller for Spark applications. It manages resources, coordinates tasks, and
provides the necessary infrastructure for distributed data processing in Spark. It plays a vital role
in ensuring the efficient and fault-tolerant execution of Spark jobs.
 How to create SparkContext?
SparkContext is created using the SparkContext class. A Spark "driver" is the application that creates the SparkContext in order to execute the job or jobs on a cluster. You can access the Spark context from the Spark session object as spark.sparkContext. If you want to create the Spark context yourself, use the snippet below.

// Create SparkContext
import org.apache.spark.{SparkConf, SparkContext}
val sparkConf = new SparkConf().setAppName("sparkbyexamples.com").setMaster("local[1]")
val sparkContext = new SparkContext(sparkConf)
 How to stop SparkContext?
Once you have finished using Spark, you can stop the SparkContext using the stop() method. This
will release all resources associated with the SparkContext and shut down the Spark application
gracefully.
 Can I have multiple SparkContext in Spark job?
There can only be one active SparkContext per JVM. Having multiple SparkContext instances in a
single application can cause issues like resource conflicts, configuration conflicts, and unexpected
behavior.
 How to access the SparkContext variable?
A Spark "driver" is the application that creates the SparkContext in order to execute the job or jobs on a cluster. You can access the Spark context from the Spark session object as spark.sparkContext.


8. Conclusion
In this SparkContext article, you have learned what SparkContext is, how to create it in Spark 1.x and Spark 2.0, and how to use it with a few basic examples. In summary,
 SparkContext is the entry point to any Spark functionality. It represents the connection to a
Spark cluster and is responsible for coordinating and distributing the operations on that
cluster.
 It was the primary entry point for Spark applications before Spark 2.0.
 SparkContext is used for low-level RDD (Resilient Distributed Dataset) operations, which
were the core data abstraction in Spark before DataFrames and Datasets were introduced.
 It is not thread-safe, so in a multi-threaded or multi-user environment, you need to be
careful when using a single SparkContext instance.
______________________________________________________________________________

Create a Spark RDD using Parallelize


Let's see how to create a Spark RDD using the sparkContext.parallelize() method, using both the Spark shell and a Scala example.
Before we start, let me explain what an RDD is: a Resilient Distributed Dataset (RDD) is a fundamental data structure of Spark; it is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
Note: Spark parallelizes an existing collection in your driver program.
Below is an example of how to create an RDD using the parallelize method of SparkContext. For example, sparkContext.parallelize(Array(1,2,3,4,5,6,7,8,9,10)) creates an RDD from an Array of integers.

Using sc.parallelize on Spark Shell or REPL


Spark shell provides SparkContext variable “sc”, use sc.parallelize() to create an RDD.
scala> val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at :24

Using Spark sparkContext.parallelize in Scala


If you are using Scala, get the SparkContext object from the SparkSession and use sparkContext.parallelize() to create an RDD. This function also has another signature which additionally takes an integer argument to specify the number of partitions; a short sketch of that signature follows below. Partitions are the basic units of parallelism in Apache Spark. RDDs in Apache Spark are a collection of partitions that are executed by processors to achieve parallelism. The parallelize method takes a collection of objects, such as a list, tuple, or set, and creates an RDD from it. The RDD is then distributed across the nodes in the Spark cluster for parallel processing.
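As a quick aside before the full example, here is a minimal sketch of that second signature, which sets the number of partitions explicitly:

// The optional second argument controls how many partitions the RDD gets
val rddWith4Partitions = spark.sparkContext.parallelize(1 to 100, 4)
println("Partitions: " + rddWith4Partitions.getNumPartitions)  // 4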
// Imports
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RDDParallelize {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[1]")
      .appName("SparkByExamples.com")
      .getOrCreate()

    val rdd: RDD[Int] = spark.sparkContext.parallelize(List(1, 2, 3, 4, 5))
    val rddCollect: Array[Int] = rdd.collect()

    println("Number of Partitions: " + rdd.getNumPartitions)
    println("Action: First element: " + rdd.first())
    println("Action: RDD converted to Array[Int] : ")
    rddCollect.foreach(println)
  }
}
By executing the above program you should see the below output.
Number of Partitions: 1
Action: First element: 1
Action: RDD converted to Array[Int] :
1
2
3
4
5
Create empty RDD by using sparkContext.parallelize


We can create an empty RDD in two ways: one is using sparkContext.parallelize() with an empty collection, and the other is using the sparkContext.emptyRDD() method. The difference between these two ways is that the former creates no part files on disk when saved, whereas the latter does create part files on disk.
Using sparkContext.parallelize


// Create emptyRdd using parallelize
sparkContext.parallelize(Seq.empty[String])
Using sparkContext.emptyRDD
// Create emptyRdd using emptyRDD
val emptyRDD = sparkContext.emptyRDD[String]
emptyRDD.saveAsTextFile("src/main/output2/EmptyRDD2")
When the above code with emptyRDD is executed, it creates multiple part files which are empty.
______________________________________________________________________________

Spark – Read multiple text files into single RDD?


Spark Core provides the textFile() & wholeTextFiles() methods in the SparkContext class, which are used to read single or multiple text or CSV files into a single Spark RDD. Using these methods we can also read all files from a directory and files matching a specific pattern.
textFile() – Reads single or multiple text/CSV files and returns a single RDD[String].
wholeTextFiles() – Reads single or multiple files and returns a single RDD[Tuple2[String, String]], where the first value (_1) in the tuple is the file name and the second value (_2) is the content of the file.
In this article, let's see some examples of both of these methods using the Scala and PySpark languages.
 Read all text files from a directory into a single RDD
 Read multiple text files into a single RDD
 Read all text files matching a pattern to single RDD
 Read files from multiple directories into single RDD
 Reading text files from nested directories into Single RDD
 Reading all text files separately and union to create a Single RDD
 Reading CSV files
Before we start, let’s assume we have the following file names and file contents at folder
“c:/tmp/files” and I use these files to demonstrate the examples.

File Name     File Contents
text01.txt    One,1
text02.txt    Two,2
text03.txt    Three,3
text04.txt    Four,4
invalid.txt   Invalid,I

1. Spark Read all text files from a directory into a single RDD
In Spark, passing the path of a directory to the textFile() method reads all text files and creates a single RDD. Make sure you do not have a nested directory; if Spark finds one, the process fails with an error.
// Spark Read all text files from a directory into a single RDD
val rdd = spark.sparkContext.textFile("C:/tmp/files/*")
rdd.foreach(f=>{
println(f)
})
This example reads all files from a directory, creates a single RDD and prints the contents of the
RDD.
// Output:

Invalid,I

One,1

Two,2

Three,3

Four,4

If you are running on a cluster you should first collect the data in order to print on a console as
shown below.
// Collect the data
rdd.collect.foreach(f=>{
println(f)
})
Let's see a similar example with the wholeTextFiles() method. Note that this returns an RDD[Tuple2], where the first value (_1) in the tuple is the file name and the second value (_2) is the content of the file.
// Using wholeTextFiles() to load the data
val rddWhole = spark.sparkContext.wholeTextFiles("C:/tmp/files/*")
rddWhole.foreach(f=>{
println(f._1+"=>"+f._2)})
// Output:
file:/C:/tmp/files/invalid.txt=>Invalid,I

file:/C:/tmp/files/text01.txt=>One,1

file:/C:/tmp/files/text02.txt=>Two,2

file:/C:/tmp/files/text03.txt=>Three,3

file:/C:/tmp/files/text04.txt=>Four,4

2. Spark Read multiple text files into a single RDD


When you know the names of the multiple files you would like to read, just input all file names with
comma separator in order to create a single RDD.
// Read multiple textfiles (with comma separated) into Single RDD
val rdd3 = spark.sparkContext.textFile("C:/tmp/files/text01.txt,C:/tmp/files/text02.txt")
rdd3.foreach(f=>{println(f)})
This reads the text01.txt & text02.txt files and outputs the content below.
// Output:

One,1

Two,2

3. Read all text files matching a pattern to single RDD


The textFile() method also accepts pattern matching and wildcard characters. For example, the snippet below reads all files starting with "text" and with the extension ".txt" and creates a single RDD.
// Read text files matching a pattern
val rdd2 = spark.sparkContext.textFile("C:/tmp/files/text*.txt")
rdd2.foreach(f=>{println(f)})
// Output:

One,1

Two,2

Three,3

Four,4

4. Read files from multiple directories into single RDD


It also supports reading a combination of individual files and multiple directories.
// Read files from multiple directories into a single RDD
val rdd2 = spark.sparkContext.textFile("C:/tmp/dir1/*,C:/tmp/dir2/*,c:/tmp/files/text01.txt")
rdd2.foreach(f=>{println(f)})
// Output:
One,1

Two,2

Invalid,I

One,1

Two,2

Three,3
Four,4

5. Reading text files from nested directories into Single RDD


textFile() and wholeTextFiles() return an error when they find a nested folder. Hence, first build a list of file paths (using Scala, Java, Python, etc.) by traversing all nested folders, and then pass all file names separated by commas in order to create a single RDD; a sketch of this follows below.
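One hedged way to do that for a local filesystem is sketched below (for HDFS you would use the Hadoop FileSystem API instead); the helper name and paths are illustrative.

import java.io.File

// Recursively collect .txt file paths under a local directory
def listTextFiles(dir: File): Seq[String] = {
  val entries = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
  entries.flatMap {
    case d if d.isDirectory              => listTextFiles(d)
    case f if f.getName.endsWith(".txt") => Seq(f.getAbsolutePath)
    case _                               => Seq.empty
  }
}

val allFiles = listTextFiles(new File("C:/tmp/files")).mkString(",")
val rddNested = spark.sparkContext.textFile(allFiles)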
6. Reading all text files separately and union to create a Single RDD
You can also read all text files into separate RDDs and union them all to create a single RDD.
// Union multiple text files into a single RDD
val rdd1 = spark.sparkContext.textFile("C:/tmp/files/text01.txt")
val rdd2 = spark.sparkContext.textFile("C:/tmp/files/text02.txt")
val rdd3 = Seq(rdd1,rdd2)
val finalRdd = spark.sparkContext.union(rdd3)
finalRdd.foreach(x=>{println(x)})
We will see the below output after executing the above snippet.
// Output:

One,1

Two,2

7. Reading multiple CSV files into RDD


Spark RDDs don't have a method to read CSV files directly, hence we use the textFile() method to read a CSV file like any other text file into an RDD and split the records based on a comma, pipe, or any other delimiter.
// Read multiple CSV files into RDD
val rdd5 = spark.sparkContext.textFile("C:/tmp/files/*")
val rdd6 = rdd5.map(f=>{
f.split(",")
})
rdd6.foreach(f => {
println("Col1:"+f(0)+",Col2:"+f(1))
})
Here, we read all CSV files in a directory into an RDD, apply a map transformation to split each record on the comma delimiter, and map returns another RDD, "rdd6", after the transformation. Finally, we iterate over rdd6 and read the columns by index.
Note: you can't update an RDD, as RDDs are immutable. This example yields the output below.
// Output:

Col1:Invalid,Col2:I

Col1:One,Col2:1

Col1:Two,Col2:2
Col1:Three,Col2:3
Col1:Four,Col2:4

8. Complete code
package com.sparkbyexamples.spark.rdd
import org.apache.spark.sql.SparkSession
object ReadMultipleFiles extends App {
val spark:SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

println("read all text files from a directory to single RDD")


val rdd = spark.sparkContext.textFile("C:/tmp/files/*")
rdd.foreach(f=>{
println(f)
})

println("read text files base on wildcard character")


val rdd2 = spark.sparkContext.textFile("C:/tmp/files/text*.txt")
rdd2.foreach(f=>{
println(f)
})

println("read multiple text files into a RDD")


val rdd3 = spark.sparkContext.textFile("C:/tmp/files/text01.txt,C:/tmp/files/text02.txt")
rdd3.foreach(f=>{
println(f)
})

println("Read files and directory together")


val rdd4 = spark.sparkContext.textFile("C:/tmp/files/text01.txt,C:/tmp/files/text02.txt,C:/tmp/files/
*")

26
Apache Spark - SparkByExamples

rdd4.foreach(f=>{
println(f)
})

val rddWhole = spark.sparkContext.wholeTextFiles("C:/tmp/files/*")


rddWhole.foreach(f=>{
println(f._1+"=>"+f._2)
})

val rdd5 = spark.sparkContext.textFile("C:/tmp/files/*")


val rdd6 = rdd5.map(f=>{
f.split(",")
})
rdd6.foreach(f => {
println("Col1:"+f(0)+",Col2:"+f(1)) }) }
_____________________________________________________________________________

Spark Load CSV File into RDD


In this tutorial, I will explain how to load a CSV file into a Spark RDD using a Scala example. Using the textFile() method in the SparkContext class, we can read CSV files, multiple CSV files (based on pattern matching), or all files from a directory into an RDD[String] object.
Before we start, let’s assume we have the following CSV file names with comma delimited file
contents at folder “c:/tmp/files” and I use these files to demonstrate the examples.

File Name    File Contents
text01.csv   Col1,Col2
             One,1
             Eleven,11
text02.csv   Col1,Col2
             Two,2
             Twenty One,21
text03.csv   Col1,Col2
             Three,3
text04.csv   Col1,Col2
             Four,4
invalid.csv  Col1,Col2
             Invalid,I

 Read CSV file into RDD
 Skip header from CSV file
 Read multiple CSV files into RDD
 Read all CSV files in a directory into RDD
Load CSV file into RDD
The textFile() method reads each CSV record as a String and returns an RDD[String]; hence, we need to write additional code in Spark to transform the RDD[String] into an RDD[Array[String]] by splitting each string record on a delimiter.
The below example reads a file into “rddFromFile” RDD object, and each element in RDD
represents as a String.
// Read from csv file into RDD
val rddFromFile = spark.sparkContext.textFile("C:/tmp/files/text01.csv")
But we need every CSV record to be split on the comma delimiter and stored in the RDD as multiple columns. To achieve this, we use the map() transformation on the RDD to convert the RDD[String] into an RDD[Array[String]] by splitting every record on the comma delimiter. The map() method returns a new RDD instead of updating the existing one.
// Applying map() on Rdd to get array[string]
val rdd = rddFromFile.map(f=>{
f.split(",")
})
Now, read the data from rdd by using foreach, since the elements in RDD are array, we need to
use the index to retrieve each element from an array.

// Read data from RDD using foreach


rdd.foreach(f=>{
println("Col1:"+f(0)+",Col2:"+f(1))
})
Note that the output we get from the above println also contains the header names from the CSV file, because the header is considered data itself in an RDD. We need to skip the header while processing the data. This is where the DataFrame comes in handy: it can read a CSV file with a header and handles many more options and file formats. Spark's DataFrame API, Apache Spark's built-in spark-csv support, or external libraries like dataframes-csv provide more effective and efficient ways to work with CSV files; a short sketch follows after the output below.
// Output:
Col1:col1,Col2:col2
Col1:One,Col2:1
Col1:Eleven,Col2:11
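For comparison, here is a minimal sketch of the DataFrame reader handling the header directly (using the same sample file):

// Read the CSV with the DataFrame API; the header row becomes column names
val csvDf = spark.read
  .option("header", "true")       // treat the first line as column names
  .option("inferSchema", "true")  // optionally infer column types
  .csv("C:/tmp/files/text01.csv")
csvDf.show()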
Let's see how to collect the data from the RDD using collect(). In this case, the collect() method returns an Array[Array[String]], where the outer array represents the RDD data and each inner array is a record.

// Collect data using collect()


rdd.collect().foreach(f=>{
println("Col1:"+f(0)+",Col2:"+f(1))
})
This example also yields the same as the above output.
Skip Header From CSV file
When a CSV file has a header with column names and you read and process it with a Spark RDD, you need to skip the header yourself, as there is no way in an RDD to specify that your file has a header.

// Skip the header of the CSV file
rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}
Read Multiple CSV Files into RDD
To read multiple CSV files in Spark, just use the textFile() method on the SparkContext object, passing all file names separated by commas. The example below reads the text01.csv & text02.csv files into a single RDD.

// Read multiple CSVFiles into RDD


val rdd4 = spark.sparkContext.textFile("C:/tmp/files/text01.csv,C:/tmp/files/text02.csv")
rdd4.foreach(f=>{
println(f)
})
Read all CSV Files in a Directory into RDD
To read all CSV files in a directory or folder, just pass a directory path to the textFile() method.

// Read all CSVFiles in a directory


val rdd3 = spark.sparkContext.textFile("C:/tmp/files/*")
rdd3.foreach(f=>{
println(f)
})
Complete example

package com.sparkbyexamples.spark.rdd

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object ReadMultipleCSVFiles extends App {

val spark:SparkSession = SparkSession.builder()


.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

println("spark read csv files from a directory into RDD")


val rddFromFile = spark.sparkContext.textFile("C:/tmp/files/text01.csv")
println(rddFromFile.getClass)

val rdd = rddFromFile.map(f=>{


f.split(",")
})

println("Iterate RDD")
rdd.foreach(f=>{
println("Col1:"+f(0)+",Col2:"+f(1))
})
println(rdd)

println("Get data Using collect")


rdd.collect().foreach(f=>{
println("Col1:"+f(0)+",Col2:"+f(1))
})
println("read all csv files from a directory to single RDD")
val rdd2 = spark.sparkContext.textFile("C:/tmp/files/*")
rdd2.foreach(f=>{
println(f)
})

println("read csv files base on wildcard character")


val rdd3 = spark.sparkContext.textFile("C:/tmp/files/text*.csv")
rdd3.foreach(f=>{
println(f)
})

println("read multiple csv files into a RDD")


val rdd4 = spark.sparkContext.textFile("C:/tmp/files/text01.csv,C:/tmp/files/text02.csv")
rdd4.foreach(f=>{
println(f)
})
}
This complete example can be downloaded from GitHub project
Conclusion
In this tutorial, you have learned how to read a single CSV file, multiple CSV files, and all CSV files from a directory/folder into a single Spark RDD. You have also learned how to skip the header while reading CSV files into an RDD.
______________________________________________________________________________

Different ways to create Spark RDD


A Spark RDD can be created in several ways; for example, it can be created using sparkContext.parallelize(), from a text file, or from another RDD, DataFrame, or Dataset. Though we cover most of the examples in Scala here, the same concepts can be used to create RDDs in PySpark (Python Spark).
Resilient Distributed Datasets (RDDs) are the fundamental data structure of Spark. RDDs are immutable and fault-tolerant in nature. An RDD is a way of representing a dataset distributed across multiple nodes in a cluster that can be operated on in parallel. RDDs are called resilient because they can always be re-computed when a node fails.


Note that once we create an RDD, we can easily create a DataFrame from RDD.
Let’s see how to create an RDD in Apache Spark with examples:
 Spark create RDD from Seq or List (using Parallelize)
 Creating an RDD from a text file
 Creating from another RDD
 Creating from existing DataFrames and DataSet

1. Spark Create RDD from Seq or List (using Parallelize)


RDDs are generally created from a parallelized collection, i.e. by taking an existing collection from the driver program (Scala, Python, etc.) and passing it to SparkContext's parallelize() method. This method is mostly used for testing rather than real workloads, as the entire data resides on one node, which is not ideal for production.
// Spark Create RDD from Seq or List (using Parallelize)
val rdd=spark.sparkContext.parallelize(Seq(("Java", 20000),
("Python", 100000), ("Scala", 3000)))
rdd.foreach(println)
// Outputs:

(Python,100000)

(Scala,3000)

(Java,20000)

2. Create an RDD from a text file


Mostly, for production systems, we create RDDs from files. Here we will see how to create an RDD by reading data from a file.
// Create RDD using textFile
 val rdd = spark.sparkContext.textFile("/path/textFile.txt")

This creates an RDD for which each record represents a line in a file.
If you want to read the entire content of a file as a single record use wholeTextFiles() method on
sparkContext.
// RDD from wholeTextFile
 val rdd2 = spark.sparkContext.wholeTextFiles("/path/textFile.txt")
 rdd2.foreach(record=>println("FileName : "+record._1+", FileContents :"+record._2))

In this case, each text file is a single record. The name of the file is the first value of the tuple and the content of the text file is the second value.


3. Creating from another RDD


You can use transformations like map, flatmap, filter to create a new RDD from an existing one.
// Creating from another RDD
 val rdd3 = rdd.map(row=>{(row._1,row._2+100)})
The above creates a new RDD, "rdd3", by adding 100 to each record of the RDD. This example outputs the following.
// Output:
(Python,100100)
(Scala,3100)
(Java,20100)

4. From existing DataFrames and DataSet


To convert a Dataset or DataFrame to an RDD, just use the rdd method on either of these data types.
// From existing DataFrames and DataSet
 val myRdd2 = spark.range(20).toDF().rdd
toDF() creates a DataFrame, and calling rdd on the DataFrame returns an RDD.

Conclusion:
In this article, you have learned how to create a Spark RDD from a list or Seq, a text file, another RDD, a DataFrame, and a Dataset.
______________________________________________________________________________

Spark RDD Actions with examples


RDD actions are operations that return raw values. In other words, any RDD function that returns something other than RDD[T] is considered an action in Spark programming. In this tutorial, we will learn RDD actions with Scala examples.
As mentioned in RDD Transformations, all transformations are lazy, meaning they do not get executed right away; action functions trigger the execution of the transformations.
The complete code used in this article is available in the GitHub project for quick reference.

Spark RDD Actions



Select a link from the table below to jump to an example.

aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): U – Aggregate the elements of each partition, and then the results for all the partitions.

collect(): Array[T] – Return the complete dataset as an Array.

count(): Long – Return the count of elements in the dataset.

countApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble] – Return an approximate count of elements in the dataset; this method returns an incomplete result when the execution time meets the timeout.

countApproxDistinct(relativeSD: Double = 0.05): Long – Return an approximate number of distinct elements in the dataset.

countByValue(): Map[T, Long] – Return a Map[T, Long] where each key represents a unique value in the dataset and the value represents how many times that value is present.

countByValueApprox(timeout: Long, confidence: Double = 0.95)(implicit ord: Ordering[T] = null): PartialResult[Map[T, BoundedDouble]] – Same as countByValue() but returns an approximate result.

first(): T – Return the first element in the dataset.

fold(zeroValue: T)(op: (T, T) ⇒ T): T – Aggregate the elements of each partition, and then the results for all the partitions.

foreach(f: (T) ⇒ Unit): Unit – Iterates over all elements in the dataset by applying the function f to each element.

foreachPartition(f: (Iterator[T]) ⇒ Unit): Unit – Similar to foreach, but applies the function f to each partition.

min()(implicit ord: Ordering[T]): T – Return the minimum value from the dataset.

max()(implicit ord: Ordering[T]): T – Return the maximum value from the dataset.

reduce(f: (T, T) ⇒ T): T – Reduces the elements of the dataset using the specified binary operator.

saveAsObjectFile(path: String): Unit – Saves the RDD as serialized objects to the storage system.

saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit – Saves the RDD as a compressed text file.

saveAsTextFile(path: String): Unit – Saves the RDD as a text file.

take(num: Int): Array[T] – Return the first num elements of the dataset.

takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] – Return the first num (smallest) elements from the dataset; this is the opposite of the take() action. Note: use this method only when the resulting array is small, as all the data is loaded into the driver's memory.

takeSample(withReplacement: Boolean, num: Int, seed: Long = Utils.random.nextLong): Array[T] – Return a subset of the dataset in an Array. Note: use this method only when the resulting array is small, as all the data is loaded into the driver's memory.

toLocalIterator(): Iterator[T] – Return the complete dataset as an Iterator. Note: use this method only when the resulting data is small, as all the data is loaded into the driver's memory.

top(num: Int)(implicit ord: Ordering[T]): Array[T] – Return the top num elements according to the implicit ordering. Note: use this method only when the resulting array is small, as all the data is loaded into the driver's memory.

treeAggregate – Aggregates the elements of this RDD in a multi-level tree pattern.

treeReduce – Reduces the elements of this RDD in a multi-level tree pattern.
RDD Actions Example


Before we start explaining RDD actions with examples, first, let’s create an RDD.
// Create RDD from a List Using parallelize
val spark = SparkSession.builder()
.appName("SparkByExample")
.master("local")
.getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
val inputRDD = spark.sparkContext.parallelize(List(("Z", 1),("A", 20),("B", 30),("C", 40),("B", 30),
("B", 60)))

val listRdd = spark.sparkContext.parallelize(List(1,2,3,4,5,3,2))


Note that we have created two RDDs in the above code snippet, and we use these two as and
when necessary to demonstrate the RDD actions.
aggregate – action
aggregate() – Aggregates the elements of each partition, and then combines the results for all the partitions.

//aggregate
def param0= (accu:Int, v:Int) => accu + v
def param1= (accu1:Int,accu2:Int) => accu1 + accu2
println("aggregate : "+listRdd.aggregate(0)(param0,param1))
//Output: aggregate : 20

//aggregate

def param3= (accu:Int, v:(String,Int)) => accu + v._2


def param4= (accu1:Int,accu2:Int) => accu1 + accu2
println("aggregate : "+inputRDD.aggregate(0)(param3,param4))

//Output: aggregate : 181

treeAggregate – action

treeAggregate() – Aggregates the elements of this RDD in a multi-level tree pattern. The output
of this function will be similar to the aggregate function.

//treeAggregate. This is similar to aggregate


def param8= (accu:Int, v:Int) => accu + v
def param9= (accu1:Int,accu2:Int) => accu1 + accu2
println("treeAggregate : "+listRdd.treeAggregate(0)(param8,param9))
//Output: treeAggregate : 20

fold – action
fold() – Aggregate the elements of each partition, and then the results for all the partitions.
//fold
println("fold : "+listRdd.fold(0){ (acc,v) =>
val sum = acc+v
sum
})
//Output: fold : 20

println("fold : "+inputRDD.fold(("Total",0)){(acc:(String,Int),v:(String,Int))=>
val sum = acc._2 + v._2
("Total",sum)
})
//Output: fold : (Total,181)


reduce
reduce() – Reduces the elements of the dataset using the specified binary operator.
//reduce
println("reduce : "+listRdd.reduce(_ + _))
//Output: reduce : 20
println("reduce alternate : "+listRdd.reduce((x, y) => x + y))
//Output: reduce alternate : 20
println("reduce : "+inputRDD.reduce((x, y) => ("Total",x._2 + y._2)))
//Output: reduce : (Total,181)

treeReduce
treeReduce() – Reduces the elements of this RDD in a multi-level tree pattern.

//treeReduce. This is similar to reduce


println("treeReduce : "+listRdd.treeReduce(_ + _))
//Output: treeReduce : 20

collect
collect() -Return the complete dataset as an Array.

//Collect
val data:Array[Int] = listRdd.collect()
data.foreach(println)

count, countApprox, countApproxDistinct


count() – Return the count of elements in the dataset.
countApprox() – Return approximate count of elements in the dataset, this method returns
incomplete when execution time meets timeout.
countApproxDistinct() – Return an approximate number of distinct elements in the dataset.
//count, countApprox, countApproxDistinct
println("Count : "+listRdd.count)


//Output: Count : 7
println("countApprox : "+listRdd.countApprox(1200))
//Output: countApprox : (final: [7.000, 7.000])
println("countApproxDistinct : "+listRdd.countApproxDistinct())
//Output: countApproxDistinct : 5
println("countApproxDistinct : "+inputRDD.countApproxDistinct())
//Output: countApproxDistinct : 5
countByValue, countByValueApprox
countByValue() – Return Map[T,Long] key representing each unique value in dataset and value
represents count each value present.
countByValueApprox() – Same as countByValue() but returns approximate result.

//countByValue, countByValueApprox
println("countByValue : "+listRdd.countByValue())
//Output: countByValue : Map(5 -> 1, 1 -> 1, 2 -> 2, 3 -> 2, 4 -> 1)
//println(listRdd.countByValueApprox())

first
first() – Return the first element in the dataset.
//first
println("first : "+listRdd.first())
//Output: first : 1
println("first : "+inputRDD.first())
//Output: first : (Z,1)

top
top() – Return top n elements from the dataset.
Note: Use this method only when the resulting array is small, as all the data is loaded into the
driver’s memory.
//top
println("top : "+listRdd.top(2).mkString(","))
//Output: top : 5,4
println("top : "+inputRDD.top(2).mkString(","))


//Output: top : (Z,1),(C,40)


min
min() – Return the minimum value from the dataset.
//min
println("min : "+listRdd.min())
//Output: min : 1
println("min : "+inputRDD.min())
//Output: min : (A,20)

max
max() – Return the maximum value from the dataset.

//max
println("max : "+listRdd.max())
//Output: max : 5
println("max : "+inputRDD.max())
//Output: max : (Z,1)

take, takeOrdered, takeSample


take() – Return the first num elements of the dataset.
takeOrdered() – Return the first num (smallest) elements from the dataset; this is the
opposite of the top() action.
Note: Use this method only when the resulting array is small, as all the data is loaded into the
driver’s memory.
takeSample() – Return the subset of the dataset in an Array.
Note: Use this method only when the resulting array is small, as all the data is loaded into the
driver’s memory.
//take, takeOrdered, takeSample
println("take : "+listRdd.take(2).mkString(","))
//Output: take : 1,2
println("takeOrdered : "+ listRdd.takeOrdered(2).mkString(","))
//Output: takeOrdered : 1,2
//println("take : "+listRdd.takeSample())


Actions – Complete example

package com.sparkbyexamples.spark.rdd

import org.apache.spark.sql.SparkSession

import scala.collection.mutable

object RDDActions extends App {

val spark = SparkSession.builder()


.appName("SparkByExample")
.master("local")
.getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
val inputRDD = spark.sparkContext.parallelize(List(("Z", 1),("A", 20),("B", 30),("C", 40),("B", 30),
("B", 60)))

val listRdd = spark.sparkContext.parallelize(List(1,2,3,4,5,3,2))

//Collect
val data:Array[Int] = listRdd.collect()
data.foreach(println)

//aggregate
def param0= (accu:Int, v:Int) => accu + v
def param1= (accu1:Int,accu2:Int) => accu1 + accu2
println("aggregate : "+listRdd.aggregate(0)(param0,param1))
//Output: aggregate : 20


//aggregate
def param3= (accu:Int, v:(String,Int)) => accu + v._2
def param4= (accu1:Int,accu2:Int) => accu1 + accu2
println("aggregate : "+inputRDD.aggregate(0)(param3,param4))
//Output: aggregate : 181

//treeAggregate. This is similar to aggregate


def param8= (accu:Int, v:Int) => accu + v
def param9= (accu1:Int,accu2:Int) => accu1 + accu2
println("treeAggregate : "+listRdd.treeAggregate(0)(param8,param9))
//Output: treeAggregate : 20

//fold
println("fold : "+listRdd.fold(0){ (acc,v) =>
val sum = acc+v
sum
})
//Output: fold : 20

println("fold : "+inputRDD.fold(("Total",0)){(acc:(String,Int),v:(String,Int))=>
val sum = acc._2 + v._2
("Total",sum)
})
//Output: fold : (Total,181)

//reduce
println("reduce : "+listRdd.reduce(_ + _))
//Output: reduce : 20
println("reduce alternate : "+listRdd.reduce((x, y) => x + y))
//Output: reduce alternate : 20
println("reduce : "+inputRDD.reduce((x, y) => ("Total",x._2 + y._2)))
//Output: reduce : (Total,181)

//treeReduce. This is similar to reduce


println("treeReduce : "+listRdd.treeReduce(_ + _))
//Output: treeReduce : 20

//count, countApprox, countApproxDistinct


println("Count : "+listRdd.count)
//Output: Count : 7
println("countApprox : "+listRdd.countApprox(1200))
//Output: countApprox : (final: [7.000, 7.000])
println("countApproxDistinct : "+listRdd.countApproxDistinct())
//Output: countApproxDistinct : 5
println("countApproxDistinct : "+inputRDD.countApproxDistinct())
//Output: countApproxDistinct : 5

//countByValue, countByValueApprox
println("countByValue : "+listRdd.countByValue())
//Output: countByValue : Map(5 -> 1, 1 -> 1, 2 -> 2, 3 -> 2, 4 -> 1)
//println(listRdd.countByValueApprox())

//first
println("first : "+listRdd.first())
//Output: first : 1
println("first : "+inputRDD.first())
//Output: first : (Z,1)

//top
println("top : "+listRdd.top(2).mkString(","))
//Output: top : 5,4
println("top : "+inputRDD.top(2).mkString(","))
//Output: top : (Z,1),(C,40)


//min
println("min : "+listRdd.min())
//Output: min : 1
println("min : "+inputRDD.min())
//Output: min : (A,20)

//max
println("max : "+listRdd.max())
//Output: max : 5
println("max : "+inputRDD.max())
//Output: max : (Z,1)

//take, takeOrdered, takeSample


println("take : "+listRdd.take(2).mkString(","))
//Output: take : 1,2
println("takeOrdered : "+ listRdd.takeOrdered(2).mkString(","))
//Output: takeOrdered : 1,2
//println("take : "+listRdd.takeSample())

//toLocalIterator
//listRdd.toLocalIterator.foreach(println)
//Output:
}

Conclusion:
RDD actions are operations that return non-RDD values. Since RDDs are lazy, they do not execute
the transformation functions until we call actions; hence, all these functions trigger the
transformations to execute and finally return the value of the action to the driver program.
In this tutorial, you have also learned the usage of several RDD action functions with examples in the
Scala language.


Spark PairRDD Functions


Spark defines the PairRDDFunctions class with several functions for working with pair RDDs (RDDs of
key-value pairs). In this tutorial, we will learn these functions with Scala examples. Pair RDDs come
in handy when you need to apply transformations like hash partitioning, set operations, joins, etc.
All these functions are grouped into transformations and actions, similar to regular RDDs.
Spark Pair RDD Transformation Functions

PAIR RDD FUNCTIONS AND THEIR DESCRIPTIONS

aggregateByKey – Aggregate the values of each key in a dataset. This function can return a different result type than the values in the input RDD.

combineByKey – Combines the elements for each key.

combineByKeyWithClassTag – Combines the elements for each key.

flatMapValues – Flattens the values of each key without changing the keys and keeps the original RDD partitioning.

foldByKey – Merges the values of each key.

groupByKey – Returns a grouped RDD by grouping the values of each key.

mapValues – Applies a map function to each value in a pair RDD without changing the keys.

reduceByKey – Returns a merged RDD by merging the values of each key.

reduceByKeyLocally – Returns a merged RDD by merging the values of each key; the final result is sent to the master.

sampleByKey – Returns a subset of the RDD sampled by key.

subtractByKey – Returns an RDD with the pairs from this RDD whose keys are not in the other RDD.

keys – Returns all keys of this RDD as an RDD[K].

values – Returns an RDD with just the values.

partitionBy – Returns a new RDD after applying the specified partitioner.

fullOuterJoin – Returns an RDD after applying a fullOuterJoin on the current and parameter RDDs.

join – Returns an RDD after applying a join on the current and parameter RDDs.

leftOuterJoin – Returns an RDD after applying a leftOuterJoin on the current and parameter RDDs.

rightOuterJoin – Returns an RDD after applying a rightOuterJoin on the current and
parameter RDDs.

Spark Pair RDD Actions

PAIR RDD ACTION FUNCTIONS AND THEIR DESCRIPTIONS

collectAsMap – Returns the pair RDD as a Map to the Spark driver.

countByKey – Returns the count of elements for each key. This returns the final result to a local Map at the driver.

countByKeyApprox – Same as countByKey but returns a partial result. This takes a timeout as a parameter to specify how long this function may run before returning.

lookup – Returns a list of values from the RDD for a given input key.

reduceByKeyLocally – Returns a merged RDD by merging the values of each key; the final result is sent to the master.

saveAsHadoopDataset – Saves the RDD to any Hadoop-supported storage system (HDFS, S3, ElasticSearch, etc.); it uses a Hadoop JobConf object to save.

saveAsHadoopFile – Saves the RDD to any Hadoop-supported file system (HDFS, S3, ElasticSearch, etc.); it uses a Hadoop OutputFormat class to save.

saveAsNewAPIHadoopDataset – Saves the RDD to any Hadoop-supported storage system (HDFS, S3, ElasticSearch, etc.) with the new Hadoop API; it uses a Hadoop Configuration object to save.

saveAsNewAPIHadoopFile – Saves the RDD to any Hadoop-supported file system (HDFS, S3, ElasticSearch, etc.); it uses the new Hadoop API OutputFormat
class to save.

Pair RDD Functions Examples


In this section, I will explain Spark pair RDD functions with scala examples, before we get started
let’s create a pair RDD.
// Creating PairRDD
val spark = SparkSession.builder()
.appName("SparkByExample")
.master("local")
.getOrCreate()
val rdd = spark.sparkContext.parallelize(
List("Germany India USA","USA India Russia","India Brazil Canada China")
)
val wordsRdd = rdd.flatMap(_.split(" "))
val pairRDD = wordsRdd.map(f=>(f,1))
pairRDD.foreach(println)
This snippet creates a pair RDD by splitting every element in the RDD by space, flattening the result so
that each element is a single word, and finally assigning the integer 1 to every word.
// Output:

(Germany,1)

(India,1)

(USA,1)

(USA,1)

(India,1)

(Russia,1)

(India,1)

(Brazil,1)

(Canada,1)

(China,1)

distinct – Returns distinct elements (key-value pairs).


// Applying distinct()
pairRDD.distinct().foreach(println)
// Output:

(Germany,1)

(India,1)

(Brazil,1)

(China,1)

(USA,1)

(Canada,1)

(Russia,1)

sortByKey – Transformation returns an RDD after sorting by key


// SortByKey() on pairRDD
println("Sort by Key ==>")
val sortRDD = pairRDD.sortByKey()
sortRDD.foreach(println)
// Output:

Sort by Key ==>

(Brazil,1)

(Canada,1)

(China,1)

(Germany,1)

(India,1)

(India,1)

(India,1)

(Russia,1)

(USA,1)

(USA,1)

reduceByKey – Transformation returns an RDD after adding value for each key.
Result RDD contains unique keys.
// reduceByKey() on pairRDD
println("Reduce by Key ==>")
val wordCount = pairRDD.reduceByKey((a,b)=>a+b)


wordCount.foreach(println)
This merges the values for each key by summing them. Yields the below output.
// Output:

Reduce by Key ==>

(Brazil,1)

(Canada,1)

(China,1)

(USA,2)

(Germany,1)

(Russia,1)

(India,3)

aggregateByKey – Transformation same as reduceByKey


In our example, this is similar to reduceByKey but uses a different approach.
// aggregateByKey() on pairRDD
def param1= (accu:Int,v:Int) => accu + v
def param2= (accu1:Int,accu2:Int) => accu1 + accu2
println("Aggregate by Key ==> wordcount")
val wordCount2 = pairRDD.aggregateByKey(0)(param1,param2)
wordCount2.foreach(println)
This example yields the same output as reduceByKey example.

keys – Returns an RDD[K] with all keys in the dataset


// Returning all keys from pairRDD
println("Keys ==>")
wordCount2.keys.foreach(println)
// Output:
Brazil
Canada
China
USA
Germany
Russia
India


values – Returns an RDD[V] with all values in the dataset


// Get all values from pairRDD
println("Values ==>")
wordCount2.values.foreach(println)

count – This is an action function and returns a count of a dataset


// Count() to return count of a dataset
println("Count :"+wordCount2.count())

collectAsMap – This is an action function and returns the pair RDD as a Map to the driver, retrieving all data
from the dataset.
// collectAsMap() to retrieve all data from a dataset
println("collectAsMap ==>")
pairRDD.collectAsMap().foreach(println)

// Output:

(Brazil,1)

(Canada,1)

(Germany,1)

(China,1)

(Russia,1)

(India,1)
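The transformation table above also lists the join family (join, leftOuterJoin, fullOuterJoin, etc.), which the word-count example does not cover. A minimal sketch with two small, hypothetical pair RDDs (salesRDD and regionRDD are illustrative names, not part of the original example):
// Hypothetical pair RDDs keyed by country name
val salesRDD  = spark.sparkContext.parallelize(Seq(("India", 10), ("USA", 20), ("Canada", 5)))
val regionRDD = spark.sparkContext.parallelize(Seq(("India", "APAC"), ("USA", "AMER")))

// Inner join keeps only keys present in both RDDs, e.g. (India,(10,APAC)), (USA,(20,AMER))
salesRDD.join(regionRDD).foreach(println)

// Left outer join keeps every key from the left RDD, e.g. Canada appears as (Canada,(5,None))
salesRDD.leftOuterJoin(regionRDD).foreach(println)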

Complete Example

package com.sparkbyexamples.spark.rdd
import org.apache.spark.sql.SparkSession
import scala.collection.mutable

object OperationsOnPairRDD {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("SparkByExample")


.master("local")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

val rdd = spark.sparkContext.parallelize(


List("Germany India USA","USA India Russia","India Brazil Canada China")
)
val wordsRdd = rdd.flatMap(_.split(" "))
val pairRDD = wordsRdd.map(f=>(f,1))
pairRDD.foreach(println)

println("Distinct ==>")
pairRDD.distinct().foreach(println)

//SortByKey
println("Sort by Key ==>")
val sortRDD = pairRDD.sortByKey()
sortRDD.foreach(println)

//reduceByKey
println("Reduce by Key ==>")
val wordCount = pairRDD.reduceByKey((a,b)=>a+b)
wordCount.foreach(println)

def param1= (accu:Int,v:Int) => accu + v


def param2= (accu1:Int,accu2:Int) => accu1 + accu2
println("Aggregate by Key ==> wordcount")
val wordCount2 = pairRDD.aggregateByKey(0)(param1,param2)
wordCount2.foreach(println)

//keys
println("Keys ==>")
wordCount2.keys.foreach(println)

//values
println("values ==>")
wordCount2.values.foreach(println)

println("Count :"+wordCount2.count())
println("collectAsMap ==>")
pairRDD.collectAsMap().foreach(println)
  }
}

FAQs on Spark pairRDD


 What is pairRDD and when to use them?
A PairRDD (pair Resilient Distributed Dataset) is a type of RDD where each element is a key-value
pair. Pair RDDs are commonly used for operations that require data to be grouped or aggregated
by keys. They provide a convenient way to work with structured data, especially when dealing with
operations like grouping, reducing, and joining.
Here are some key characteristics and common use cases for PairRDDs:

Key-Value Structure: In a Pair RDD, each element is represented as a tuple (key, value),
where key and value can be of any data type, including basic types, custom objects, or even other
RDDs.
Grouping: PairRDDs are often used for grouping data by keys. For example, you can group
data by a specific attribute in a dataset to perform operations on groups of data with the same key.
Aggregations: PairRDDs are useful for performing aggregation operations such
as reduceByKey, groupByKey, combineByKey, and foldByKey to summarize data based on keys.
Joins: PairRDDs can be joined with other PairRDDs based on their keys using operations
like join, leftOuterJoin, rightOuterJoin, and cogroup.
Transformations: Various transformations can be applied to PairRDDs, such
as mapValues, flatMapValues, and filter, which allow you to manipulate the values associated with
keys.
 How do I create a pairRDD?
PairRDDs can be created by running a map() function that returns key/value pairs. PairRDDs are
commonly used in Spark for operations like groupByKey, reduceByKey, join, and other operations
that involve key-value pairs.
We can first create an RDD either by using parallelize() or by using an existing RDD, and then apply a
map() transformation to make key-value pairs.
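For instance, a minimal sketch, assuming the SparkSession spark used in the earlier examples (the sample strings are illustrative):
// Build a pair RDD of (word length, word) from a plain RDD of strings
val words = spark.sparkContext.parallelize(Seq("spark", "scala", "rdd"))
val byLength = words.map(w => (w.length, w)) // pair RDD of (Int, String)
byLength.groupByKey().foreach(println) // e.g. (5,CompactBuffer(spark, scala)), (3,CompactBuffer(rdd))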
 What is the difference between RDD and pairRDD?
RDD is a distributed collection of data that can be processed in parallel across a cluster of
machines.
RDD can hold any type of data, including simple values, objects, or more complex data structures.
RDD is used for general-purpose distributed data processing and can be transformed and
processed using various operations like map, filter, reduce, groupBy, join, etc. RDDs do not have
any inherent key-value structure; they are typically used for non-keyed data.
PairRDD is specialized key-value data and is designed for operations that involve keys. PairRDDs
are commonly used in Spark when you need to work with structured data that can be organized
and processed based on keys, making them suitable for many data processing tasks, especially in
the context of data analytics and transformations.
Conclusion:
In this tutorial, you have learned PairRDDFunctions class and Spark PairRDD transformations &
action functions with scala examples.
______________________________________________________________________________

Spark Repartition() vs Coalesce()


Spark repartition() vs coalesce() – repartition() is used to increase or decrease the RDD,
DataFrame, Dataset partitions whereas the coalesce() is used to only decrease the number of
partitions in an efficient way.
In this article, you will learn what is Spark repartition() and coalesce() methods? and the difference
between repartition vs coalesce with Scala examples.
 RDD Partition
o RDD repartition
o RDD coalesce
 DataFrame Partition
o DataFrame repartition
o DataFrame coalesce
One important point to note: Spark repartition() and coalesce() can be expensive operations because they
move data across partitions (repartition() performs a full shuffle), so try to minimize repartitioning as much as possible.

1. Spark RDD repartition() vs coalesce()


In RDD, you can create parallelism at the time of the creation of an
RDD using parallelize(), textFile() and wholeTextFiles(). You can download the test.txt file used in
this example from GitHub.
// Create RDD with partition size
import org.apache.spark.sql.SparkSession
val spark:SparkSession = SparkSession.builder()
.master("local[5]")
.appName("SparkByExamples.com")
.getOrCreate()
val rdd = spark.sparkContext.parallelize(Range(0,20))
println("From local[5]"+rdd.partitions.size)

val rdd1 = spark.sparkContext.parallelize(Range(0,20), 6)


println("parallelize : "+rdd1.partitions.size)

val rddFromFile = spark.sparkContext.textFile("src/main/resources/test.txt",10)


println("TextFile : "+rddFromFile.partitions.size)

The above example yields below output


From local[5] : 5
Parallelize : 6
TextFile : 10

spark.sparkContext.parallelize(Range(0,20),6) distributes RDD into 6 partitions and the data is


distributed as below.
 rdd1.saveAsTextFile("/tmp/partition")

//Writes 6 part files, one for each partition

Partition 1 : 0 1 2

Partition 2 : 3 4 5

Partition 3 : 6 7 8 9

Partition 4 : 10 11 12

Partition 5 : 13 14 15

Partition 6 : 16 17 18 19

1.1 RDD repartition()


Spark RDD repartition() method is used to increase or decrease the partitions. The below example
decreases the partitions from 6 to 4 by moving data from all partitions.
 val rdd2 = rdd1.repartition(4)
 println("Repartition size : "+rdd2.partitions.size)
 rdd2.saveAsTextFile("/tmp/re-partition")
This yields the output Repartition size : 4, and repartition() re-distributes the data (as shown below)
from all partitions, which is a full shuffle and a very expensive operation when dealing with
billions or trillions of records.


Partition 1 : 1 6 10 15 19

Partition 2 : 2 3 7 11 16

Partition 3 : 4 8 12 13 17

Partition 4 : 0 5 9 14 18

1.2 RDD coalesce()


Spark RDD coalesce() is used only to reduce the number of partitions. This is an optimized or
improved version of repartition(), where the movement of data across the partitions is lower
when using coalesce().
 val rdd3 = rdd1.coalesce(4)
 println("Repartition size : "+rdd3.partitions.size)
 rdd3.saveAsTextFile("/tmp/coalesce")
If you compare the below output with section 1, you will notice that partition 3 has been moved into partition 2
and partition 6 into partition 5, resulting in data movement from just 2 partitions.
Partition 1 : 0 1 2
Partition 2 : 3 4 5 6 7 8 9
Partition 4 : 10 11 12
Partition 5 : 13 14 15 16 17 18 19

1.3 Complete Example of Spark RDD repartition and coalesce


Below is complete example of Spark RDD repartition and coalesce in Scala language.

package com.sparkbyexamples.spark.rdd
import org.apache.spark.sql.SparkSession
object RDDRepartitionExample extends App {
val spark:SparkSession = SparkSession.builder()
.master("local[5]")
.appName("SparkByExamples.com")
.getOrCreate()

val rdd = spark.sparkContext.parallelize(Range(0,20))


println("From local[5]"+rdd.partitions.size)

val rdd1 = spark.sparkContext.parallelize(Range(0,20), 6)

println("parallelize : "+rdd1.partitions.size)
rdd1.partitions.foreach(f=> f.toString)
val rddFromFile = spark.sparkContext.textFile("src/main/resources/test.txt",9)

println("TextFile : "+rddFromFile.partitions.size)

rdd1.saveAsTextFile("c:/tmp/partition")
val rdd2 = rdd1.repartition(4)
println("Repartition size : "+rdd2.partitions.size)

rdd2.saveAsTextFile("c:/tmp/re-partition")

val rdd3 = rdd1.coalesce(4)


println("Repartition size : "+rdd3.partitions.size)

rdd3.saveAsTextFile("c:/tmp/coalesce")
}
2. Spark DataFrame repartition() vs coalesce()
Unlike RDD, you can’t specify the partition/parallelism while creating DataFrame. DataFrame or
Dataset by default uses the methods specified in Section 1 to determine the default partition and
splits the data for parallelism.
val spark:SparkSession = SparkSession.builder()
.master("local[5]")
.appName("SparkByExamples.com")
.getOrCreate()
val df = spark.range(0,20)
println(df.rdd.partitions.length)
df.write.mode(SaveMode.Overwrite).csv("partition.csv")
The above example creates 5 partitions as specified in master("local[5]") and the data is distributed across
all these 5 partitions.
Partition 1 : 0 1 2 3
Partition 2 : 4 5 6 7
Partition 3 : 8 9 10 11
Partition 4 : 12 13 14 15

Partition 5 : 16 17 18 19

2.1 DataFrame repartition()


Similar to RDD, the Spark DataFrame repartition() method is used to increase or decrease the
partitions. The below example increases the partitions from 5 to 6 by moving data from all
partitions.
 val df2 = df.repartition(6)
 println(df2.rdd.partitions.length)
Just increasing the partition count by 1 results in data movement from all partitions.

Partition 1 : 14 1 5

Partition 2 : 4 16 15

Partition 3 : 8 3 18

Partition 4 : 12 2 19

Partition 5 : 6 17 7 0

Partition 6 : 9 10 11 13

And even decreasing the partitions with repartition() also results in moving data from all partitions; hence, when you
want to decrease the number of partitions, the recommendation is to use coalesce().

2.2 DataFrame coalesce()


Spark DataFrame coalesce() is used only to decrease the number of partitions. This is an
optimized or improved version of repartition(), where the movement of data across the
partitions is lower when using coalesce().
 val df3 = df.coalesce(2)
 println(df3.rdd.partitions.length)
This yields the output 2, and the resulting partitions look like

Partition 1 : 0 1 2 3 8 9 10 11

Partition 2 : 4 5 6 7 12 13 14 15 16 17 18 19

Since we are reducing 5 partitions to 2, the data movement happens only from 3 partitions, and the data
moves into the remaining 2 partitions.

Default Shuffle Partition


Calling groupBy(), union(), join() and similar functions on a DataFrame results in shuffling data
between multiple executors and even machines, and finally repartitions the data into 200 partitions by
default. Spark sets the default shuffle partition count to 200
using the spark.sql.shuffle.partitions configuration.
 val df4 = df.groupBy("id").count()
 println(df4.rdd.getNumPartitions)
Conclusion


In this Spark repartition and coalesce article, you have learned how to create an RDD with
partition, repartition the RDD & DataFrame using repartition() and coalesce() methods, and
learned the difference between repartition and coalesce.

Spark SQL Shuffle Partitions


The Spark SQL shuffle is a mechanism for redistributing or re-partitioning data so that the data is
grouped differently across partitions. Based on your data size you may need to reduce or increase
the number of partitions of RDD/DataFrame using spark.sql.shuffle.partitions configuration or
through code.
Spark shuffle is a very expensive operation as it moves the data between executors or even
between worker nodes in a cluster so try to avoid it when possible. When you have a performance
issue on Spark jobs, you should look at the Spark transformations that involve shuffling.
In this tutorial, you will learn what triggers the shuffle on RDD and DataFrame transformations
using scala examples. The same approach also can be used with PySpark (Spark with Python)
What is Spark Shuffle?
Shuffling is a mechanism Spark uses to redistribute the data across different executors and even
across machines. Spark shuffling triggers for transformation operations
like groupByKey(), reducebyKey(), join(), groupBy() e.t.c
Spark Shuffle is an expensive operation since it involves the following
 Disk I/O
 Involves data serialization and deserialization
 Network I/O
When creating an RDD, Spark doesn’t necessarily store the data for all keys in a partition since at
the time of creation there is no way we can set the key for the data set.
Hence, when we run the reduceByKey() operation to aggregate the data on keys, Spark does the
following.
 Spark first runs map tasks on all partitions which groups all values for a single key.
 The results of the map tasks are kept in memory.
 When results do not fit in memory, Spark stores the data on a disk.
 Spark shuffles the mapped data across partitions, some times it also stores the shuffled
data into a disk for reuse when it needs to recalculate.
 Run the garbage collection
 Finally runs reduce tasks on each partition based on key.

Spark RDD Shuffle


Spark RDD triggers shuffle for several operations
like repartition(), groupByKey(), reduceByKey(), cogroup() and join() but not countByKey() .
val spark:SparkSession = SparkSession.builder()

.master("local[5]")
.appName("SparkByExamples.com")
.getOrCreate()
val sc = spark.sparkContext

val rdd:RDD[String] = sc.textFile("src/main/resources/test.txt")

println("RDD Parition Count :"+rdd.getNumPartitions)


val rdd2 = rdd.flatMap(f=>f.split(" "))
.map(m=>(m,1))

//ReduceBy transformation
val rdd5 = rdd2.reduceByKey(_ + _)

println("RDD Parition Count :"+rdd5.getNumPartitions)

// Output:
RDD Partition Count : 3
RDD Partition Count : 3
Both getNumPartitions from the above examples return the same number of partitions.
Though reduceByKey() triggers data shuffle, it doesn’t change the partition count as RDD’s inherit
the partition size from parent RDD.
You may get partition counts different based on your setup and how Spark creates partitions.

Spark SQL DataFrame Shuffle


Unlike RDD, the Spark SQL DataFrame API increases the partitions when a transformation
operation performs shuffling. DataFrame operations that trigger shuffling are join() and
all aggregate functions.

import spark.implicits._
val simpleData = Seq(("James","Sales","NY",90000,34,10000),
("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000),

("Raman","Finance","CA",99000,40,24000),
("Scott","Finance","NY",83000,36,19000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
)
val df = simpleData.toDF("employee_name","department","state","salary","age","bonus")

val df2 = df.groupBy("state").count()

println(df2.rdd.getNumPartitions)

This outputs the partition count as 200.

Spark Default Shuffle Partition


DataFrame increases the partition number to 200 automatically when Spark operation performs
data shuffling (join(), aggregation functions). This default shuffle partition number comes from
Spark SQL configuration spark.sql.shuffle.partitions which is by default set to 200.
You can change this default shuffle partition value using conf method of the SparkSession object
or using Spark Submit Command Configurations.
 spark.conf.set("spark.sql.shuffle.partitions",100)
 println(df.groupBy("_c0").count().rdd.partitions.length)

Shuffle partition size


Based on your dataset size, number of cores, and memory, Spark shuffling can benefit or harm
your jobs. When you are dealing with a small amount of data, you should typically reduce the shuffle
partitions; otherwise, you will end up with many partitions with only a few records each,
which results in running many tasks that each process very little data.
On the other hand, having too much data with too few partitions results in fewer, longer-running
tasks, and sometimes you may also get out-of-memory errors.
Getting the right size of the shuffle partition is always tricky and takes many runs with different
values to achieve the optimized number. This is one of the key properties to look at when you
have performance issues on Spark jobs.
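As a starting point, the value can be set per application either in code or at submit time. A minimal sketch (the value 50, the class name, and the jar name are illustrative only):
// Set the shuffle partition count in code, before the shuffling transformations run
spark.conf.set("spark.sql.shuffle.partitions", "50")

// Or pass it at submit time without changing code (shell):
// spark-submit --conf spark.sql.shuffle.partitions=50 --class com.example.MyApp my-app.jar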

Conclusion


In this article, you have learned what is Spark SQL shuffle, how some Spark operation triggers re-
partition of the data, how to change the default spark shuffle partition, and finally how to get the
right partition size.

Spark Difference between Cache and Persist?


Spark Cache and persist are optimization techniques for iterative and interactive Spark
applications to improve the performance of the jobs or applications. In this article, you will learn
What is Spark Caching and Persistence, the difference between cache() vs persist() methods and
how to use these two with RDD, DataFrame, and Dataset with Scala examples.
Though Spark provides computation 100x faster than traditional MapReduce jobs, if you
have not designed the jobs to reuse repeated computations, you will see a degradation in
performance when dealing with billions or trillions of records. Hence, we may need to look at the
stages and use optimization techniques as one of the ways to improve performance.
Key Points to Note:
 RDD.cache() caches the RDD with the default storage level MEMORY_ONLY
 DataFrame.cache() caches the DataFrame with the default storage
level MEMORY_AND_DISK
 The persist() method is used to store it to the user-defined storage level
 On Spark UI, the Storage tab shows where partitions exist in memory or disk across the
cluster.
 Dataset cache() is an alias for persist(StorageLevel.MEMORY_AND_DISK)
 Caching of Spark DataFrame or Dataset is a lazy operation, meaning a DataFrame will not
be cached until you trigger an action.
1. Spark Cache vs Persist
Using cache() and persist() methods, Spark provides an optimization mechanism to store the
intermediate computation of an RDD, DataFrame, and Dataset so they can be reused in
subsequent actions(reusing the RDD, Dataframe, and Dataset computation results).
Both caching and persisting are used to save the Spark RDD, Dataframe, and Datasets. But, the
difference is, RDD cache() method default saves it to memory (MEMORY_ONLY) and, DataFrame
cache() method default saves it to memory (MEMORY_AND_DISK), whereas persist() method is
used to store it to the user-defined storage level.
When you persist a dataset, each node stores its partitioned data in memory and reuses them in
other actions on that dataset. And Spark’s persisted data on nodes are fault-tolerant meaning if
any partition of a Dataset is lost, it will automatically be recomputed using the original
transformations that created it.
2. Advantages of Caching and Persistence
Below are the advantages of using Spark Cache and Persist methods.
 Cost efficient – Spark computations are very expensive; hence, reusing the computations
are used to save cost.

 Time efficient – Reusing repeated computations saves lots of time.


 Execution time – Saves execution time of the job and we can perform more jobs on the
same cluster.
Below, I will explain how to use Spark Cache and Persist with DataFrame or Dataset.
3. Spark Cache Syntax and Example
Spark DataFrame or Dataset caching by default saves it to storage level `MEMORY_AND_DISK`
because recomputing the in-memory columnar representation of the underlying table is expensive.
Note that this is different from the default cache level of `RDD.cache()` which is
‘MEMORY_ONLY‘.
Syntax
cache() : Dataset.this.type
Spark cache() method in Dataset class internally calls persist() method, which in turn
uses sparkSession.sharedState.cacheManager.cacheQuery to cache the result set of DataFrame
or Dataset. Let’s look at an example.

Example
// Create sparkSession and apply cache() on DataFrame
val spark:SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()
import spark.implicits._
val columns = Seq("Seqno","Quote")
val data = Seq(("1", "Be the change that you wish to see in the world"),
("2", "Everyone thinks of changing the world, but no one thinks of changing himself."),
("3", "The purpose of our lives is to be happy."))
val df = data.toDF(columns:_*)

val dfCache = df.cache()


dfCache.show(false)

4. Spark Persist Syntax and Example


Spark persist() has two signatures. The first signature doesn’t take any argument, which by default
saves it to MEMORY_AND_DISK storage level, and the second signature takes StorageLevel as
an argument to store it at different storage levels.
Syntax

1) persist() : Dataset.this.type
2) persist(newLevel : org.apache.spark.storage.StorageLevel) : Dataset.this.type

Example
// Persist Example
val dfPersist = df.persist()
dfPersist.show(false)
Using the second signature, you can save the DataFrame/Dataset to one of the storage
levels: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER,
DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2

// Persist with argument


val dfPersist = df.persist(StorageLevel.MEMORY_ONLY)
dfPersist.show(false)
This stores DataFrame/Dataset into Memory.

5. Unpersist syntax and Example


We can also unpersist the persistence DataFrame or Dataset to remove it from the memory or
storage.
Syntax
// unpersist() Syntax
unpersist() : Dataset.this.type
unpersist(blocking : scala.Boolean) : Dataset.this.type
Example
// unpersist() Example
val dfUnpersist = dfPersist.unpersist()
dfUnpersist.show(false)
unpersist(Boolean) with boolean as argument blocks until all blocks are deleted.

6. Spark Persistance storage levels


All different storage level Spark supports are available
at org.apache.spark.storage.StorageLevel class. The storage level specifies how and where to
persist or cache a Spark DataFrame and Dataset.
MEMORY_ONLY – This is the default behavior of the RDD cache() method and stores the RDD or
DataFrame as deserialized objects in JVM memory. When there is not enough memory available, it
will not save the partitions that don’t fit, and these will be re-computed as and when required.
This takes more memory, but unlike RDD, this would be slower than the MEMORY_AND_DISK level
as it recomputes the unsaved partitions, and recomputing the in-memory columnar representation
of the underlying table is expensive.
MEMORY_ONLY_SER – This is the same as MEMORY_ONLY but the difference being it stores
RDD as serialized objects to JVM memory. It takes less memory (space-efficient) than
MEMORY_ONLY as it saves objects as serialized and takes an additional few more CPU cycles in
order to deserialize.
MEMORY_ONLY_2 – Same as MEMORY_ONLY storage level but replicate each partition to two
cluster nodes.
MEMORY_ONLY_SER_2 – Same as MEMORY_ONLY_SER storage level but replicate each
partition to two cluster nodes.
MEMORY_AND_DISK – This is the default behavior of the DataFrame or Dataset. In this Storage
Level, The DataFrame will be stored in JVM memory as a deserialized object. When required
storage is greater than available memory, it stores some of the excess partitions into the disk and
reads the data from the disk when required. It is slower as there is I/O involved.
MEMORY_AND_DISK_SER – This is the same as MEMORY_AND_DISK storage level difference
being it serializes the DataFrame objects in memory and on disk when space is not available.
MEMORY_AND_DISK_2 – Same as MEMORY_AND_DISK storage level but replicate each
partition to two cluster nodes.
MEMORY_AND_DISK_SER_2 – Same as MEMORY_AND_DISK_SER storage level but replicate
each partition to two cluster nodes.
DISK_ONLY – In this storage level, DataFrame is stored only on disk and the CPU computation
time is high as I/O is involved.
DISK_ONLY_2 – Same as DISK_ONLY storage level but replicate each partition to two cluster
nodes.
Below is a table representation of the storage levels. Go through the impact on space, CPU, and
performance, and choose the one that best fits your use case.
Storage Level          Space used  CPU time  In memory  On-disk  Serialized  Recompute some partitions
----------------------------------------------------------------------------------------------------
MEMORY_ONLY            High        Low       Y          N        N           Y
MEMORY_ONLY_SER        Low         High      Y          N        Y           Y
MEMORY_AND_DISK        High        Medium    Some       Some     Some        N
MEMORY_AND_DISK_SER    Low         High      Some       Some     Y           N
DISK_ONLY              Low         High      N          Y        Y           N

7. Some Points to note on Persistence


 Spark automatically monitors every persist() and cache() call you make; it checks
usage on each node and drops persisted data that is not used, based on a least-recently-used
(LRU) algorithm. As discussed in one of the sections above, you can also manually remove it
using the unpersist() method.


 Spark caching and persistence is just one of the optimization techniques to improve the
performance of Spark jobs.
 For RDD cache(), the default storage level is ‘MEMORY_ONLY’ but, for DataFrame and
Dataset, the default is ‘MEMORY_AND_DISK‘
 On Spark UI, the Storage tab shows where partitions exist in memory or disk across the
cluster.
 Dataset cache() is an alias for persist(StorageLevel.MEMORY_AND_DISK)
 Caching of Spark DataFrame or Dataset is a lazy operation, meaning a DataFrame will not
be cached until you trigger an action.
Conclusion
In this article, you have learned that the Spark cache() and persist() methods are optimization techniques used
to save interim computation results and reuse them subsequently, learned the difference
between Spark cache and persist, and finally saw their syntax and usage with Scala examples.
______________________________________________________________________________

Spark Persistence Storage Levels


All different persistence (persist() method) storage level Spark/PySpark supports are available
at org.apache.spark.storage.StorageLevel and pyspark.StorageLevel classes respectively. The
storage level specifies how and where to persist or cache a Spark/PySpark RDD, DataFrame, and
Dataset.
All these Storage levels are passed as an argument to the persist() method of the Spark/Pyspark
RDD, DataFrame, and Dataset.
For example
import org.apache.spark.storage.StorageLevel
val rdd2 = rdd.persist(StorageLevel.MEMORY_ONLY_SER)
or
val df2 = df.persist(StorageLevel.MEMORY_ONLY_SER)
Here, I will describe all storage levels available in Spark.
Memory only Storage level
StorageLevel.MEMORY_ONLY is the default behavior of the RDD cache() method and stores the
RDD or DataFrame as deserialized objects to JVM memory. When there is not enough memory
available it will not save DataFrame of some partitions and these will be re-computed as and when
required.
This takes more memory. but unlike RDD, this would be slower than MEMORY_AND_DISK level
as it recomputes the unsaved partitions, and recomputing the in-memory columnar representation
of the underlying table is expensive.
Serialize in Memory

StorageLevel.MEMORY_ONLY_SER is the same as MEMORY_ONLY but the difference being it


stores RDD as serialized objects to JVM memory. It takes lesser memory (space-efficient) than
MEMORY_ONLY as it saves objects as serialized and takes an additional few more CPU cycles in
order to deserialize.
Memory only and Replicate
StorageLevel.MEMORY_ONLY_2 is same as MEMORY_ONLY storage level but replicate each
partition to two cluster nodes.
Serialized in Memory and Replicate
StorageLevel.MEMORY_ONLY_SER_2 is same as MEMORY_ONLY_SER storage level but
replicate each partition to two cluster nodes.
Memory and Disk Storage level
StorageLevel.MEMORY_AND_DISK is the default behavior of the DataFrame or Dataset. In this
Storage Level, The DataFrame will be stored in JVM memory as deserialized objects. When
required storage is greater than available memory, it stores some of the excess partitions into a
disk and reads the data from the disk when required. It is slower as there is I/O involved.
Serialize in Memory and Disk
StorageLevel.MEMORY_AND_DISK_SER is same as MEMORY_AND_DISK storage level
difference being it serializes the DataFrame objects in memory and on disk when space is not
available.
Memory, Disk and Replicate
StorageLevel.MEMORY_AND_DISK_2 is Same as MEMORY_AND_DISK storage level but
replicate each partition to two cluster nodes.
Serialize in Memory, Disk and Replicate
StorageLevel.MEMORY_AND_DISK_SER_2 is same as MEMORY_AND_DISK_SER storage
level but replicate each partition to two cluster nodes.
Disk only storage level
In the StorageLevel.DISK_ONLY storage level, the DataFrame is stored only on disk, and the CPU
computation time is high as I/O is involved.
Disk only and Replicate
StorageLevel.DISK_ONLY_2 is same as DISK_ONLY storage level but replicate each partition to
two cluster nodes.
When to use what?
Below is a table representation of the storage levels. Go through the impact on space, CPU, and
performance, and choose the one that best fits your use case.
Storage Level          Space used  CPU time  In memory  On-disk  Serialized  Recompute some partitions
----------------------------------------------------------------------------------------------------
MEMORY_ONLY            High        Low       Y          N        N           Y
MEMORY_ONLY_SER        Low         High      Y          N        Y           Y
MEMORY_AND_DISK        High        Medium    Some       Some     Some        N
MEMORY_AND_DISK_SER    Low         High      Some       Some     Y           N
DISK_ONLY              Low         High      N          Y        Y           N
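For example, a memory-constrained job that still wants to avoid recomputation could trade CPU for space with a serialized level. A minimal sketch, assuming a DataFrame df as in the persist() example above:
import org.apache.spark.storage.StorageLevel

// Serialized in memory, spilling to disk when memory runs out: a lower memory footprint
// than MEMORY_AND_DISK at the cost of extra CPU for serialization/deserialization
val dfSer = df.persist(StorageLevel.MEMORY_AND_DISK_SER)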

Spark Broadcast Variables


In Spark RDD and DataFrame, broadcast variables are read-only shared variables that are
cached and available on all nodes in a cluster in order to be accessed or used by the tasks. Instead of
sending this data along with every task, Spark distributes broadcast variables to the machines using
efficient broadcast algorithms to reduce communication costs.
Use case
Let me explain with an example, assume you are getting a two-letter country state code in a file
and you wanted to transform it to full state name, (for example CA to California, NY to New York
e.t.c) by doing a lookup to reference mapping. In some instances, this data could be large and you
may have many such lookups (like zip code).
Instead of distributing this information along with each task over the network (overhead and time
consuming), we can use the broadcast variable to cache this lookup info on each machine and
tasks use this cached info while executing the transformations.
1. How does Spark Broadcast work?
Broadcast variables are used in the same way for RDD, DataFrame, and Dataset.
When you run a Spark RDD, DataFrame jobs that has the Broadcast variables defined and used,
Spark does the following.
 Spark breaks the job into stages that have distributed shuffling, and actions are executed
within the stage.
 Later stages are also broken into tasks.
 Spark broadcasts the common (reusable) data needed by tasks within each stage.
 The broadcasted data is cached in serialized format and deserialized before executing each
task.
You should be creating and using broadcast variables for data that is shared across multiple stages
and tasks.
Note that broadcast variables are not sent to executors with sc.broadcast(variable) call instead,
they will be sent to executors when they are first used.
2. How to create Broadcast variable
The Spark Broadcast is created using the broadcast(v) method of the SparkContext class. This
method takes the argument v that you want to broadcast.
In Spark shell

scala> val broadcastVar = sc.broadcast(Array(0, 1, 2, 3))


broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)


scala> broadcastVar.value
res0: Array[Int] = Array(0, 1, 2, 3)
3. Spark RDD Broadcast variable example
Below is a very simple example of how to use broadcast variables on RDD. This example defines
commonly used data (country and states) in a Map variable and distributes the variable
using SparkContext.broadcast() and then use these variables on RDD map() transformation.
import org.apache.spark.sql.SparkSession
object RDDBroadcast extends App {
val spark = SparkSession.builder()
.appName("SparkByExamples.com")
.master("local")
.getOrCreate()
val states = Map(("NY","New York"),("CA","California"),("FL","Florida"))
val countries = Map(("USA","United States of America"),("IN","India"))

val broadcastStates = spark.sparkContext.broadcast(states)


val broadcastCountries = spark.sparkContext.broadcast(countries)

val data = Seq(("James","Smith","USA","CA"),


("Michael","Rose","USA","NY"),
("Robert","Williams","USA","CA"),
("Maria","Jones","USA","FL")
)
val rdd = spark.sparkContext.parallelize(data)
val rdd2 = rdd.map(f=>{
val country = f._3
val state = f._4
val fullCountry = broadcastCountries.value.get(country).get
val fullState = broadcastStates.value.get(state).get
(f._1,f._2,fullCountry,fullState)
})
println(rdd2.collect().mkString("\n"))
}
Yields the below output.
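With the data above, each two-letter code is looked up in the broadcast Maps and replaced by its full name, so the collected output looks like this:
(James,Smith,United States of America,California)
(Michael,Rose,United States of America,New York)
(Robert,Williams,United States of America,California)
(Maria,Jones,United States of America,Florida)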

4. Spark DataFrame Broadcast variable example


Below is an example of how to use broadcast variables on a DataFrame. Similar to the RDD
example above, this defines commonly used data (country and states) in a Map variable, distributes
the variable using SparkContext.broadcast(), and then uses these variables in a DataFrame map()
transformation.
import org.apache.spark.sql.SparkSession
object BroadcastExample extends App{
val spark = SparkSession.builder()
.appName("SparkByExamples.com")
.master("local")
.getOrCreate()
val states = Map(("NY","New York"),("CA","California"),("FL","Florida"))
val countries = Map(("USA","United States of America"),("IN","India"))

val broadcastStates = spark.sparkContext.broadcast(states)


val broadcastCountries = spark.sparkContext.broadcast(countries)

val data = Seq(("James","Smith","USA","CA"),


("Michael","Rose","USA","NY"),
("Robert","Williams","USA","CA"),
("Maria","Jones","USA","FL")
)
val columns = Seq("firstname","lastname","country","state")
import spark.sqlContext.implicits._
val df = data.toDF(columns:_*)

val df2 = df.map(row=>{


val country = row.getString(2)
val state = row.getString(3)

val fullCountry = broadcastCountries.value.get(country).get


val fullState = broadcastStates.value.get(state).get
(row.getString(0),row.getString(1),fullCountry,fullState)
}).toDF(columns:_*)
df2.show(false)
}
The above example first creates a DataFrame, transforms the data using the broadcast variables, and yields the
below output.
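With the same data, df2.show(false) prints roughly the following:
+---------+--------+------------------------+----------+
|firstname|lastname|country                 |state     |
+---------+--------+------------------------+----------+
|James    |Smith   |United States of America|California|
|Michael  |Rose    |United States of America|New York  |
|Robert   |Williams|United States of America|California|
|Maria    |Jones   |United States of America|Florida   |
+---------+--------+------------------------+----------+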

Conclusion
In this Spark broadcast variable article, you have learned what a broadcast variable is, its
advantages, and how to use it in RDD and DataFrame with Scala examples.

Spark Accumulators Explained


Spark Accumulators are shared variables which are only “added” to through an associative and
commutative operation and are used to perform counters (similar to MapReduce counters) or sum
operations.
Spark by default supports creating accumulators of any numeric type and provides the capability
to add custom accumulator types.
Programmers can create the following accumulators:
 named accumulators
 unnamed accumulators
When you create a named accumulator, you can see it on the Spark web UI under the
“Accumulator” tab. This tab shows two tables: the first table, “accumulable”, consists of
all named accumulator variables and their values; the second table, “Tasks”, shows the value of
each accumulator as modified by a task.
Unnamed accumulators are not shown on the Spark web UI, so for all practical purposes it is
advisable to use named accumulators.
1. Creating Accumulator variable
Spark by default provides accumulator methods for long, double, and collection types. All these
methods are present in the SparkContext class and return LongAccumulator, DoubleAccumulator,
and CollectionAccumulator respectively.
 Long Accumulator
 Double Accumulator
 Collection Accumulator
For example, you can create long accumulator on spark-shell using
// Creating Accumulator variable
scala> val accum = sc.longAccumulator("SumAccumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name:
Some(SumAccumulator), value: 0)
The above statement creates a named accumulator “SumAccumulator”. Now, Let’s see how to
add up the elements from an array to this accumulator.
scala> sc.parallelize(Array(1, 2, 3)).foreach(x => accum.add(x))
-----
scala> accum.value
res2: Long = 6
Each of these accumulator classes has several methods; among these, the add() method is called from
tasks running on the cluster. Tasks cannot read the value of an accumulator; only the driver
program can read an accumulator’s value, using the value method.
2. Long Accumulator
longAccumulator() methods from SparkContext returns LongAccumulator
Syntax
// Long Accumulator
def longAccumulator : org.apache.spark.util.LongAccumulator
def longAccumulator(name : scala.Predef.String) : org.apache.spark.util.LongAccumulator
You can create a named accumulator for the long type using SparkContext.longAccumulator(name), and for
an unnamed one use the signature that doesn’t take an argument.
val spark = SparkSession.builder()
.appName("SparkByExample")
.master("local")
.getOrCreate()
val longAcc = spark.sparkContext.longAccumulator("SumAccumulator")
val rdd = spark.sparkContext.parallelize(Array(1, 2, 3))
rdd.foreach(x => longAcc.add(x))
println(longAcc.value)


The LongAccumulator class provides the following methods:


 isZero
 copy
 reset
 add
 count
 sum
 avg
 merge
 value
3. Double Accumulator
For a named double accumulator, use SparkContext.doubleAccumulator(name); for an unnamed one, use the
signature that doesn’t take an argument.
Syntax
// Double Accumulator
def doubleAccumulator : org.apache.spark.util.DoubleAccumulator
def doubleAccumulator(name : scala.Predef.String) : org.apache.spark.util.DoubleAccumulator
The DoubleAccumulator class also provides methods similar to LongAccumulator; see the sketch below.
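A minimal sketch, mirroring the long accumulator example above (the accumulator name is illustrative):
// Double Accumulator example
val doubleAcc = spark.sparkContext.doubleAccumulator("SumAccumulator.Double")
spark.sparkContext.parallelize(Array(1.5, 2.5, 3.0)).foreach(x => doubleAcc.add(x))
println(doubleAcc.value) // 7.0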
4. Collection Accumulator
For a named collection accumulator, use SparkContext.collectionAccumulator[T](name); for an unnamed one,
use the signature that doesn’t take an argument.
Syntax
// Collection Accumulator
def collectionAccumulator[T] : org.apache.spark.util.CollectionAccumulator[T]
def collectionAccumulator[T](name : scala.Predef.String) :
org.apache.spark.util.CollectionAccumulator[T]
The CollectionAccumulator class provides the following methods (a usage sketch follows the list):
 isZero
 copyAndReset
 copy
 reset
 add
 merge
 value
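A minimal usage sketch, assuming the same SparkSession as above (the accumulator name is illustrative):
// Collection Accumulator example
val evenAcc = spark.sparkContext.collectionAccumulator[Int]("EvenNumbers")
spark.sparkContext.parallelize(Array(1, 2, 3, 4)).foreach(x => if (x % 2 == 0) evenAcc.add(x))
println(evenAcc.value) // e.g. [2, 4]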

Note: Each of these accumulator classes has several methods; among these, the add() method is called
from tasks running on the cluster. Tasks cannot read the value of an accumulator; only the
driver program can read an accumulator’s value, using the value method.
Conclusion
In this Spark accumulators (shared variables) article, you have learned that accumulators are only
“added” to through an associative and commutative operation and are used to perform counters
(similar to MapReduce counters) or sum operations, and you have also learned the different accumulator
classes along with their methods.
______________________________________________________________________________

Convert Spark RDD to DataFrame | Dataset


While working in Apache Spark with Scala, we often need to Convert Spark RDD to
DataFrame and Dataset as these provide more advantages over RDD. For instance, DataFrame
is a distributed collection of data organized into named columns similar to Database tables and
provides optimization and performance improvement.
In this article, I will explain how to Convert Spark RDD to Dataframe and Dataset using several
examples.
 Create Spark RDD
 Convert Spark RDD to DataFrame
o using toDF()
o using createDataFrame()
o using RDD row type & schema
 Convert Spark RDD to Dataset
Create Spark RDD
First, let’s create an RDD by passing Seq object to sparkContext.parallelize() function. We would
need this “rdd” object for all our examples below.
import spark.implicits._
val columns = Seq("language","users_count")
val data = Seq(("Java", "20000"), ("Python", "100000"), ("Scala", "3000"))
val rdd = spark.sparkContext.parallelize(data)

Convert Spark RDD to DataFrame


Converting Spark RDD to DataFrame can be done using toDF(), createDataFrame() and
transforming rdd[Row] to the data frame.


Convert RDD to DataFrame – Using toDF()


Spark provides an implicit function toDF() that can be used to convert an RDD, Seq[T], or List[T] to a
DataFrame. In order to use the toDF() function, we should first import implicits using import
spark.implicits._.
 val dfFromRDD1 = rdd.toDF()
 dfFromRDD1.printSchema()
By default, toDF() function creates column names as “_1” and “_2” like Tuples. Outputs below
schema.
root

|-- _1: string (nullable = true)

|-- _2: string (nullable = true)

toDF() has another signature that takes arguments to define column names as shown below.
 val dfFromRDD1 = rdd.toDF("language","users_count")
 dfFromRDD1.printSchema()
Outputs below schema.

root

|-- language: string (nullable = true)

|-- users_count: string (nullable = true)

By default, the data type of these columns is inferred from the data and nullable is set to true. We
can change this behavior by supplying a schema using StructType, where we can specify a column
name, data type, and nullability for each field/column.
Convert RDD to DataFrame – Using createDataFrame()
The SparkSession class provides the createDataFrame() method to create a DataFrame; it takes the rdd
object as an argument, and we chain it with toDF() to assign names to the columns.
 val columns = Seq("language","users_count")
 val dfFromRDD2 = spark.createDataFrame(rdd).toDF(columns:_*)
Here, we are using the Scala operator :_* to expand the columns sequence into a variable number of
arguments for toDF().
Using RDD Row type RDD[Row] to DataFrame
Spark createDataFrame() has another signature which takes the RDD[Row] type and schema for
column names as arguments. To use this first, we need to convert our “rdd” object from RDD[T] to
RDD[Row]. To define a schema, we use StructType that takes an array of StructField. And
StructField takes column name, data type and nullable/not as arguments.
//From RDD (USING createDataFrame and Adding schema using StructType)
val schema = StructType(columns
.map(fieldName => StructField(fieldName, StringType, nullable = true)))
//convert RDD[T] to RDD[Row]
val rowRDD = rdd.map(attributes => Row(attributes._1, attributes._2))


val dfFromRDD3 = spark.createDataFrame(rowRDD,schema)


This creates a data frame from RDD and assigns column names using schema.
Convert Spark RDD to Dataset
The DataFrame API is radically different from the RDD API because it is an API for building a
relational query plan that Spark’s Catalyst optimizer can then execute.
The Dataset API aims to provide the best of both worlds: the familiar object-oriented programming
style and compile-time type-safety of the RDD API but with the performance benefits of the
Catalyst query optimizer. Datasets also use the same efficient off-heap storage mechanism as the
DataFrame API.
DataFrame is an alias for Dataset[Row]. As mentioned before, Datasets are suited to typed
engineering tasks, where you want type checking and an object-oriented programming interface,
while DataFrames are convenient for interactive analytics and a SQL-like style.
Regarding data serialization: the Dataset API uses encoders, which translate between JVM
representations (objects) and Spark’s internal binary format. Spark’s built-in encoders are quite
advanced: they generate bytecode to interact with off-heap data and provide on-demand access to
individual attributes without having to deserialize an entire object.
 val ds = spark.createDataset(rdd)
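To get a strongly typed Dataset rather than a Dataset of tuples, you can map to a case class first. A minimal sketch, assuming a Language case class (not part of the original example) defined at the top level so Spark can derive its encoder:
// Sketch: convert the tuple RDD to a typed Dataset via a case class
case class Language(language: String, users_count: String)

import spark.implicits._
val typedDS = rdd.map { case (lang, users) => Language(lang, users) }.toDS()
typedDS.printSchema()
typedDS.show(false)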
Conclusion:
In this article, you have learned how to convert a Spark RDD to a DataFrame and a Dataset. We need
these frequently while working in Spark, as they provide optimization and better performance over
RDDs.

Spark Create DataFrame with Examples


In Spark, createDataFrame() and toDF() methods are used to create a DataFrame manually, using
these methods you can create a Spark DataFrame from already existing RDD, DataFrame,
Dataset, List, Seq data objects, here I will explain these with Scala examples.
You can also create a DataFrame from different sources
like Text, CSV, JSON, XML, Parquet, Avro, ORC, Binary files, RDBMS Tables, Hive, HBase, and
many more.
DataFrame is a distributed collection of data organized into named columns. It is conceptually
equivalent to a table in a relational database or a data frame in R/Python, but with richer
optimizations under the hood. DataFrames can be constructed from a wide array of sources such
as: structured data files, tables in Hive, external databases, or existing RDDs.
-Databricks
 Spark Create DataFrame from RDD
 Create DataFrame from List and Seq collection
 Creating Spark DataFrame from CSV file
 Creating from TXT file

 Creating from JSON file


 Creating from an XML file
 Creating from HIVE
 Creating from RDBMS Database table
 Creating from HBase table
 Other sources (Avro, Parquet e.t.c)
First, let’s import Spark implicits, as they are needed for our examples (for example, when we want to
use the .toDF() function), and create the sample data.
// Import Data
import org.apache.spark.sql.SparkSession
// Create SparkSession and Prepare Data
val spark:SparkSession = SparkSession.builder()
.master("local[1]").appName("SparkByExamples.com")
.getOrCreate()
// Create data
val columns = Seq("language","users_count")
val data = Seq(("Java", "20000"), ("Python", "100000"), ("Scala", "3000"))
1. Spark Create DataFrame from RDD
One easy way to create Spark DataFrame manually is from an existing RDD. First, let’s create an
RDD from a collection Seq by calling parallelize().
I will be using this rdd object for all our examples below.
// Spark Create DataFrame from RDD
 val rdd = spark.sparkContext.parallelize(data)
In Spark, parallelize(data) is used to create an RDD (Resilient Distributed Dataset) from a local
collection or iterable. This function distributes the data across the Spark cluster, allowing
parallel processing of the elements within the RDD. It is a fundamental operation for leveraging the
distributed computing capabilities of Apache Spark.


1.1 Using toDF() function


Use toDF() on RDD object to create a DataFrame in Spark. By default, it creates column names
as “_1” and “_2” as we have two columns for each row.
// Create DataFrame from RDD
 import spark.implicits._
 val dfFromRDD1 = rdd.toDF()
 dfFromRDD1.printSchema()
 dfFromRDD1.show()
// Output:

//+------+------+

//| _1| _2|

//+------+------+

//| Java| 20000|

//|Python|100000|

//| Scala| 3000|

//+------+------+

//root

// |-- _1: string (nullable = true)

// |-- _2: string (nullable = true)

Since RDD is schema-less without column names and data type, converting from RDD to
DataFrame gives you default column names as _1, _2 and so on and data type as String. Use
DataFrame printSchema() to print the schema to console.
Assign Column Names to DataFrame
toDF() has another signature to assign a column name, this takes a variable number of arguments
for column names as shown below.
// Create DataFrame with custom column names
 val dfFromRDD1 = rdd.toDF("language","users_count")
 dfFromRDD1.show()
 dfFromRDD1.printSchema()
// Output:

//+--------+-----------+

//|language|users_count|

//+--------+-----------+

//| Java| 20000|

//| Python| 100000|

//| Scala| 3000|

//+--------+-----------+

//root

// |-- language: string (nullable = true)
// |-- users_count: string (nullable = true)


Remember, here we just assigned column names; the data types are still String. By default, the data
type of these columns is String. We can change this behavior by supplying a schema, where we can
specify a column name, data type, and nullability for each field/column.
1.2 Using Spark createDataFrame() from SparkSession
Using createDataFrame() from SparkSession is another way to create a DataFrame. This signature also
takes the rdd object as an argument, and we chain it with toDF() to specify column names.
// Using createDataFrame()
 val dfFromRDD2 = spark.createDataFrame(rdd).toDF(columns:_*)

Here, toDF(columns: _*): assigns column names to the DataFrame using the provided columns list
or sequence. The _* is a syntax to pass a variable number of arguments. It facilitates converting
the elements in columns into separate arguments for the toDF method.
1.3 Using createDataFrame() with the Row type
createDataFrame() has another signature that takes the RDD[Row] type and schema for column
names as arguments. To use this first, we need to convert our “rdd” object
from RDD[T] to RDD[Row] and define a schema using StructType & StructField.
// Additional Imports
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.Row
// Create StructType Schema
val schema = StructType( Array(
StructField("language", StringType,true),
StructField("users", StringType,true)
))
// Use map() transformation to get Row type
val rowRDD = rdd.map(attributes => Row(attributes._1, attributes._2))
val dfFromRDD3 = spark.createDataFrame(rowRDD,schema)
Here, attributes._1 and attributes._2 represent the first and second components of each element
in the original RDD. The transformation maps each element of rdd to a Row object with two fields,
essentially converting a pair of attributes into a structured row.

2. Create Spark DataFrame from List and Seq Collection


In this section, we will see several approaches of how to create Spark DataFrame from
collection Seq[T] or List[T]. These examples would be similar to what we have seen in the above
section with RDD, but we use “data” object instead of “rdd” object.


2.1 Using toDF() on List or Seq collection


Calling toDF() on a collection (Seq, List) object creates a Spark DataFrame. Make sure to
import spark.implicits._ to use toDF(). In Apache Spark with Scala, import
spark.implicits._ enables implicit conversions to Spark’s Dataset and DataFrame API.
// Import implicits
 import spark.implicits._
// Create DF from data object
 val dfFromData1 = data.toDF()
Here, val dfFromData1 = data.toDF() creates a DataFrame (dfFromData1) from a local collection
or Seq data. The toDF() method converts the collection into a DataFrame, automatically assigning
default column names. The import statement is necessary for the implicit conversion to work.
2.2 Using createDataFrame() from SparkSession
Calling createDataFrame() from SparkSession is another way to create a Spark DataFrame; it
takes a collection object (Seq or List) as an argument, and we chain it with toDF() to specify column
names.
// From Data (USING createDataFrame)
 var dfFromData2 = spark.createDataFrame(data).toDF(columns:_*)
Here, toDF(columns: _*): assigns column names to the DataFrame using the provided columns list
or sequence. The _* is a syntax to pass a variable number of arguments. It facilitates converting
the elements in columns into separate arguments for the toDF method.
2.3 Using createDataFrame() with the Row type
createDataFrame() has another signature in Spark that takes a util.List of Row type and a schema
for column names as arguments. To use this, we first need to import
scala.collection.JavaConversions._ so the Scala Seq of Row can be implicitly converted to a java.util.List.
// Import
import scala.collection.JavaConversions._
// From Data (USING createDataFrame and Adding schema using StructType)
val rowData= Seq(Row("Java", "20000"),
Row("Python", "100000"),
Row("Scala", "3000"))
var dfFromData3 = spark.createDataFrame(rowData,schema)
3. Create Spark DataFrame from CSV
In all the above examples, you have learned how to create a DataFrame from RDDs and data
collection objects. In real-world applications, these are less commonly used. Hence, in this and the
following sections, you will learn how to create a DataFrame from data sources like CSV, text, JSON,
Avro, etc. Spark provides an API to read delimited files such as comma, pipe, and tab-separated files,
and it also provides several options for handling headers, double quotes, data types, etc.

For a detailed example, refer to create DataFrame from a CSV file.


// Create Spark DataFrame from CSV
 val df2 = spark.read.csv("/resources/file.csv")
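A sketch of the commonly used CSV options mentioned above (the path is only a placeholder):
// Read CSV with a header row and inferred data types (sketch)
val dfCsv = spark.read
  .option("header", "true")       // first line contains column names
  .option("inferSchema", "true")  // infer column types instead of defaulting to String
  .option("delimiter", ",")       // change for pipe- or tab-separated files
  .csv("/resources/file.csv")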
4. Creating from text (TXT) file
Use spark.read.text() to read a text file and create a DataFrame from it.
// Creating from text (TXT) file
val df2 = spark.read
.text("/resources/file.txt")

5. Creating from JSON file


Here, we will see how to create a DataFrame from a JSON file.
val df2 = spark.read
.json("/resources/file.json")

6. Creating from an XML file


To create a DataFrame by parsing XML, we should use the
"com.databricks.spark.xml" data source (the spark-xml API) from Databricks.
// Creating from an XML file
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-xml_2.11</artifactId>
<version>0.6.0</version>
</dependency>

val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "person")
.xml("src/main/resources/persons.xml")

7. Creating from Hive


// Creating from Hive
val hiveContext = new org.apache.spark.sql.hive.HiveContext(spark.sparkContext)
val hiveDF = hiveContext.sql("select * from emp")

8. Spark Create DataFrame from RDBMS Database


8.1 From Mysql table
Make sure you have MySQL library as a dependency in your pom.xml file or MySQL jars in your
classpath.
// From Mysql table
val df_mysql = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:port/db")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "tablename")
  .option("user", "user")
  .option("password", "password")
  .load()
8.2 From DB2 table
Make sure you have DB2 library as a dependency in your pom.xml file or DB2 jars in your
classpath.
// From DB2 table
val df_db2 = spark.read.format("jdbc")
  .option("url", "jdbc:db2://localhost:50000/dbname")
  .option("driver", "com.ibm.db2.jcc.DB2Driver")
  .option("dbtable", "tablename")
  .option("user", "user")
  .option("password", "password")
  .load()
Similarly, we can create a DataFrame in Spark from most relational databases, which I have not
covered here and will leave for you to explore.
9. Create DataFrame from HBase table
To create a Spark DataFrame from an HBase table, we should use a data source defined in Spark
HBase connectors. For example, use the data source
"org.apache.spark.sql.execution.datasources.hbase" from Hortonworks, or use
"org.apache.hadoop.hbase.spark" from the Spark HBase connector.
// Create DataFrame from HBase table
val hbaseDF = sparkSession.read
.options(Map(HBaseTableCatalog.tableCatalog -> catalog))
.format("org.apache.spark.sql.execution.datasources.hbase")
.load()

10. Other sources (Avro, Parquet, Kafka)


We can also create a Spark DataFrame from Avro, Parquet, and HBase, and by reading data from Kafka,
which are explained in the articles below. I would recommend reading these when you have time.
 Creating DataFrame from Parquet source
 Creating DataFrame from Avro source
 Creating DataFrame by Streaming data from Kafka
The complete code can be downloaded from GitHub
Conclusion
In this article, you have learned different ways to create a Spark DataFrame, for example:
1. From RDD: Convert an RDD using spark.createDataFrame(rdd).
2. From Local Collections: Use spark.createDataFrame(data: Seq[_], schema: StructType).
3. From Existing DataFrames: Employ operations like union or join on existing DataFrames.
4. From External Sources: Load data from sources like CSV, Parquet, or JSON files using spark.read.
______________________________________________________________________________

Spark where() Function


The Spark where() function is used to select rows from a DataFrame or Dataset based on a given
condition or SQL expression. In this tutorial, you will learn how to apply single and multiple
conditions on DataFrame columns using the where() function, with Scala examples.
1. Spark DataFrame where() Syntaxes
// Spark DataFrame where() Syntaxes
1) where(condition: Column): Dataset[T]
2) where(conditionExpr: String): Dataset[T] //using SQL expression
3) where(func: T => Boolean): Dataset[T]
4) where(func: FilterFunction[T]): Dataset[T]

The first signature is used with a Column-based condition; columns can be referred to
using $"colname", col("colname"), 'colname, or df("colname") in the condition expression.
The second signature is used to provide SQL expressions to filter rows.
The third signature is used with a Scala function that is applied to each row; rows for which the
function returns true are returned.
The fourth signature is used with the FilterFunction class.
Before we start with examples, first let’s create a DataFrame.

val arrayStructureData = Seq(


Row(Row("James","","Smith"),List("Java","Scala","C++"),"OH","M"),
Row(Row("Anna","Rose",""),List("Spark","Java","C++"),"NY","F"),
Row(Row("Julia","","Williams"),List("CSharp","VB"),"OH","F"),
Row(Row("Maria","Anne","Jones"),List("CSharp","VB"),"NY","M"),
Row(Row("Jen","Mary","Brown"),List("CSharp","VB"),"NY","M"),
Row(Row("Mike","Mary","Williams"),List("Python","VB"),"OH","M")
)
val arrayStructureSchema = new StructType()
.add("name",new StructType()
.add("firstname",StringType)
.add("middlename",StringType)
.add("lastname",StringType))
.add("languages", ArrayType(StringType))
.add("state", StringType)
.add("gender", StringType)
val df = spark.createDataFrame(
spark.sparkContext.parallelize(arrayStructureData),arrayStructureSchema)
df.printSchema()
df.show()
This yields below schema and DataFrame results.
// Output:

root

|-- name: struct (nullable = true)

| |-- firstname: string (nullable = true)

| |-- middlename: string (nullable = true)

| |-- lastname: string (nullable = true)

|-- languages: array (nullable = true)

| |-- element: string (containsNull = true)

|-- state: string (nullable = true)

|-- gender: string (nullable = true)

+--------------------+------------------+-----+------+

| name| languages|state|gender|

+--------------------+------------------+-----+------+

| [James, , Smith]|[Java, Scala, C++]| OH| M|

| [Anna, Rose, ]|[Spark, Java, C++]| NY| F|

| [Julia, , Williams]| [CSharp, VB]| OH| F|

|[Maria, Anne, Jones]| [CSharp, VB]| NY| M|

| [Jen, Mary, Brown]| [CSharp, VB]| NY| M|

|[Mike, Mary, Will...| [Python, VB]| OH| M|

+--------------------+------------------+-----+------+

2. DataFrame where() with Column condition


Use a Column with a condition to select rows from the DataFrame; with this you can express
complex conditions by referring to column names using col(name), $"colname", or dfObject("colname").
This approach is mostly used while working with DataFrames. Use "===" for equality comparison.
// DataFrame where() with Column condition
df.where(df("state") === "OH")
.show(false)
// Output:

+----------------------+------------------+-----+------+

|name |languages |state|gender|

+----------------------+------------------+-----+------+

|[James, , Smith] |[Java, Scala, C++]|OH |M |

|[Julia, , Williams] |[CSharp, VB] |OH |F |

|[Mike, Mary, Williams]|[Python, VB] |OH |M |

+----------------------+------------------+-----+------+

3. DataFrame where() with SQL Expression


If you are coming from a SQL background, you can use that knowledge in Spark to select
DataFrame rows with SQL expressions.
// DataFrame where() with SQL Expression
df.where("gender == 'M'")
.show(false)
// Output:

+----------------------+------------------+-----+------+

|name |languages |state|gender|

+----------------------+------------------+-----+------+

|[James, , Smith] |[Java, Scala, C++]|OH |M |

|[Maria, Anne, Jones] |[CSharp, VB] |NY |M |

|[Jen, Mary, Brown] |[CSharp, VB] |NY |M |

|[Mike, Mary, Williams]|[Python, VB] |OH |M |

4. Selecting with multiple conditions



To select rows of a DataFrame based on multiple conditions, you can use either a Column with a
condition or a SQL expression. Below is just a simple example; you can extend it with AND (&&),
OR (||), and NOT (!) conditional expressions as needed.
// Multiple condition
df.where(df("state") === "OH" && df("gender") === "M")
.show(false)
// Output:

+----------------------+------------------+-----+------+

|name |languages |state|gender|

+----------------------+------------------+-----+------+

|[James, , Smith] |[Java, Scala, C++]|OH |M |

|[Mike, Mary, Williams]|[Python, VB] |OH |M |

+----------------------+------------------+-----+------+

5. Selecting on an Array column


When you want to filter DataFrame rows based on a value present in an array collection
column, you can use the first syntax. The below example uses the array_contains() SQL function,
which checks whether a value is present in an array: it returns true if present and false otherwise.
// Filtering on an Array column
df.where(array_contains(df("languages"),"Java"))
.show(false)
// Output:

+----------------+------------------+-----+------+

|name |languages |state|gender|

+----------------+------------------+-----+------+

|[James, , Smith]|[Java, Scala, C++]|OH |M |

|[Anna, Rose, ] |[Spark, Java, C++]|NY |F |

+----------------+------------------+-----+------+

6. Selecting on Nested Struct columns


If your DataFrame consists of nested struct columns, you can use any of the above syntaxes to
filter the rows based on the nested column.
// Struct condition
df.where(df("name.lastname") === "Williams")
.show(false)
// Output:

+----------------------+------------+-----+------+

|name |languages |state|gender|


+----------------------+------------+-----+------+

|[Julia, , Williams] |[CSharp, VB]|OH |F |

|[Mike, Mary, Williams]|[Python, VB]|OH |M |

+----------------------+------------+-----+------+

7. Source code of Spark DataFrame using where()

package com.sparkbyexamples.spark.dataframe

import org.apache.spark.sql.{Row, SparkSession}


import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
import org.apache.spark.sql.functions.array_contains
object FilterExample extends App{

val spark: SparkSession = SparkSession.builder()


.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

val arrayStructureData = Seq(


Row(Row("James","","Smith"),List("Java","Scala","C++"),"OH","M"),
Row(Row("Anna","Rose",""),List("Spark","Java","C++"),"NY","F"),
Row(Row("Julia","","Williams"),List("CSharp","VB"),"OH","F"),
Row(Row("Maria","Anne","Jones"),List("CSharp","VB"),"NY","M"),
Row(Row("Jen","Mary","Brown"),List("CSharp","VB"),"NY","M"),
Row(Row("Mike","Mary","Williams"),List("Python","VB"),"OH","M")
)
val arrayStructureSchema = new StructType()
.add("name",new StructType()
.add("firstname",StringType)
.add("middlename",StringType)


.add("lastname",StringType))
.add("languages", ArrayType(StringType))
.add("state", StringType)
.add("gender", StringType)

val df = spark.createDataFrame(
spark.sparkContext.parallelize(arrayStructureData),arrayStructureSchema)
df.printSchema()
df.show()

// Condition
df.where(df("state") === "OH")
.show(false)

// SQL Expression
df.where("gender == 'M'")
.show(false)

// Multiple condition
df.where(df("state") === "OH" && df("gender") === "M")
.show(false)

// Array condition
df.where(array_contains(df("languages"),"Java"))
.show(false)

// Struct condition
df.where(df("name.lastname") === "Williams")
.show(false)
}

Conclusion

In this tutorial, I’ve explained how to select rows from a Spark DataFrame based on single or
multiple conditions and SQL expressions using the where() function, and you have also learned how to
filter rows by providing conditions on array and struct columns, with Scala examples.
Alternatively, you can also use the filter() function to filter the rows of a DataFrame.
______________________________________________________________________________

Spark DataFrame withColumn


Spark withColumn() is a DataFrame function that is used to add a new column to a DataFrame,
change the value of an existing column, convert the data type of a column, or derive a new column
from an existing column. In this post, I will walk you through commonly used DataFrame column
operations with Scala examples.
 Spark withColumn() Syntax and Usage
 Add a New Column to DataFrame
 Change Value of an Existing Column
 Derive New Column From an Existing Column
 Change Column DataType
 Add, Replace or Update Multiple Columns
 Rename Column Name
 Drop a Column From DataFrame
 Split Column into Multiple Columns
Spark withColumn() Syntax and Usage
Spark withColumn() is a transformation function of DataFrame that is used to manipulate the
column values of all rows or selected rows on DataFrame.
The withColumn() function returns a new Spark DataFrame after performing operations like adding a
new column, updating the value of an existing column, deriving a new column from an existing
column, and many more.
Below is a syntax of withColumn() function.
 withColumn(colName : String, col : Column) : DataFrame
colName: String – the name of the new column you want to create; use an existing column name to
update its value.
col: Column – a column expression.
Since withColumn() is a transformation function it doesn’t execute until action is called.


Spark withColumn() method introduces a projection internally. Therefore, calling it multiple times,
for instance, via loops in order to add multiple columns can generate big plans which can cause
performance issues and even StackOverflowException. To avoid this, use select with the multiple
columns at once.
Spark Documentation
First, let’s create a DataFrame to work with.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructType}
val data = Seq(Row(Row("James;","","Smith"),"36636","M","3000"),
Row(Row("Michael","Rose",""),"40288","M","4000"),
Row(Row("Robert","","Williams"),"42114","M","4000"),
Row(Row("Maria","Anne","Jones"),"39192","F","4000"),
Row(Row("Jen","Mary","Brown"),"","F","-1")
)
val schema = new StructType()
.add("name",new StructType()
.add("firstname",StringType)
.add("middlename",StringType)
.add("lastname",StringType))
.add("dob",StringType)
.add("gender",StringType)
.add("salary",StringType)
val df = spark.createDataFrame(spark.sparkContext.parallelize(data),schema)
1. Add a New Column to DataFrame
To create a new column, pass your desired column name as the first argument of the withColumn()
transformation function. Make sure this new column is not already present in the DataFrame; if it
is, the value of that column is updated. In the below snippet, the lit() function is used to add a
constant value to a DataFrame column. We can also chain calls in order to add multiple columns.
// Add a New Column to DataFrame
import org.apache.spark.sql.functions.lit
df.withColumn("Country", lit("USA"))

// Chaining to operate on multiple columns


df.withColumn("Country", lit("USA"))
.withColumn("anotherColumn",lit("anotherValue"))


The above approach is fine if you are manipulating a few columns, but when you want to add or
update many columns, do not chain withColumn() as it leads to performance issues;
use select() to update multiple columns instead.
2. Change Value of an Existing Column
The Spark withColumn() function of DataFrame can also be used to update the value of an existing
column. In order to change the value, pass an existing column name as the first argument and the
value to be assigned as the second argument. Note that the second argument should be of Column type.
// Change Value of an Existing Column
 import org.apache.spark.sql.functions.col
 df.withColumn("salary",col("salary")*100)
This snippet multiplies the value of “salary” with 100 and updates the value back to “salary”
column.
3. Derive New Column From an Existing Column
To create a new column, specify the first argument with a name you want your new column to be
and use the second argument to assign a value by applying an operation on an existing column.
// Derive New Column From an Existing Column
 df.withColumn("CopiedColumn",col("salary")* -1)
This snippet creates a new column “CopiedColumn” by multiplying “salary” column with value -1.
4. Change Column Data Type
By using Spark withColumn on a DataFrame and using cast function on a column, we can change
datatype of a DataFrame column. The below statement changes the datatype from String to
Integer for the “salary” column.
// Change Column Data Type
 df.withColumn("salary",col("salary").cast("Integer"))
5. Add, Replace, or Update multiple Columns
When you want to add, replace, or update multiple columns in a Spark DataFrame, it is not
advisable to chain the withColumn() function as it leads to performance issues; it is recommended to
use select(), or, as shown below, run a SQL query after creating a temporary view on the DataFrame.
// Add, Replace, or Update multiple Columns
df.createOrReplaceTempView("PERSON")
spark.sql("SELECT salary*100 as salary, salary*-1 as CopiedColumn, 'USA' as country FROM
PERSON").show()
6. Rename Column Name
Though the examples in sections 6, 7, and 8 don’t use the withColumn() function, I still feel it is worth
explaining how to rename, drop, and split columns, as these would be useful to you.
To rename an existing column, use the withColumnRenamed() function on the DataFrame.


// Rename Column Name


 df.withColumnRenamed("gender","sex")
7. Drop a Column
Use drop() function to drop a specific column from the DataFrame.
// Drop a Column
 df.drop("CopiedColumn")
8. Split Column into Multiple Columns
Though this example doesn’t use the withColumn() function, I still feel it’s good to explain how to
split one DataFrame column into multiple columns using the Spark map() transformation function.
// Split Column into Multiple Columns
import spark.implicits._

val columns = Seq("name","address")


val data = Seq(("Robert, Smith", "1 Main st, Newark, NJ, 92537"),
("Maria, Garcia","3456 Walnut st, Newark, NJ, 94732"))
var dfFromData = spark.createDataFrame(data).toDF(columns:_*)
dfFromData.printSchema()

val newDF = dfFromData.map(f=>{


val nameSplit = f.getAs[String](0).split(",")
val addSplit = f.getAs[String](1).split(",")
(nameSplit(0),nameSplit(1),addSplit(0),addSplit(1),addSplit(2),addSplit(3))
})
val finalDF = newDF.toDF("First Name","Last Name",
"Address Line1","City","State","zipCode")
finalDF.printSchema()
finalDF.show(false)
This snippet splits the name column into “First Name” and “Last Name”, and the address column into
“Address Line1”, “City”, “State”, and “zipCode”. It yields the below output:
// Output

root

|-- First Name: string (nullable = true)

|-- Last Name: string (nullable = true)

|-- Address Line1: string (nullable = true)

|-- City: string (nullable = true)

|-- State: string (nullable = true)

|-- zipCode: string (nullable = true)

+----------+---------+--------------+-------+-----+-------+

|First Name|Last Name|Address Line1 |City |State|zipCode|

+----------+---------+--------------+-------+-----+-------+

|Robert | Smith |1 Main st | Newark| NJ | 92537 |

|Maria | Garcia |3456 Walnut st| Newark| NJ | 94732 |

+----------+---------+--------------+-------+-----+-------+

Note: all of these functions return a new DataFrame after applying the operation instead of
updating the existing DataFrame.

9. Spark withColumn() Complete Example


import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructType}
import org.apache.spark.sql.functions._
object WithColumn {
def main(args:Array[String]):Unit= {
val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()
val dataRows = Seq(Row(Row("James;","","Smith"),"36636","M","3000"),
Row(Row("Michael","Rose",""),"40288","M","4000"),
Row(Row("Robert","","Williams"),"42114","M","4000"),
Row(Row("Maria","Anne","Jones"),"39192","F","4000"),
Row(Row("Jen","Mary","Brown"),"","F","-1")
)
val schema = new StructType()
.add("name",new StructType()
.add("firstname",StringType)
.add("middlename",StringType)
.add("lastname",StringType))
.add("dob",StringType)
.add("gender",StringType)

.add("salary",StringType)
val df2 = spark.createDataFrame(spark.sparkContext.parallelize(dataRows),schema)

// Change the column data type


df2.withColumn("salary",df2("salary").cast("Integer"))

// Derive a new column from existing


val df4=df2.withColumn("CopiedColumn",df2("salary")* -1)

// Transforming existing column


val df5 = df2.withColumn("salary",df2("salary")*100)

// You can also chain withColumn to change multiple columns


// Renaming a column.
val df3=df2.withColumnRenamed("gender","sex")
df3.printSchema()

// Droping a column
val df6=df4.drop("CopiedColumn")
println(df6.columns.contains("CopiedColumn"))

// Adding a literal value


df2.withColumn("Country", lit("USA")).printSchema()

// Retrieving
df2.show(false)
df2.select("name").show(false)
df2.select("name.firstname").show(false)
df2.select("name.*").show(false)

import spark.implicits._

val columns = Seq("name","address")



val data = Seq(("Robert, Smith", "1 Main st, Newark, NJ, 92537"), ("Maria, Garcia","3456 Walnut
st, Newark, NJ, 94732"))
var dfFromData = spark.createDataFrame(data).toDF(columns:_*)
dfFromData.printSchema()

val newDF = dfFromData.map(f=>{


val nameSplit = f.getAs[String](0).split(",")
val addSplit = f.getAs[String](1).split(",")
(nameSplit(0),nameSplit(1),addSplit(0),addSplit(1),addSplit(2),addSplit(3))
})
val finalDF = newDF.toDF("First Name","Last Name","Address Line1","City","State","zipCode")
finalDF.printSchema()
finalDF.show(false)

df2.createOrReplaceTempView("PERSON")
spark.sql("SELECT salary*100 as salary, salary*-1 as CopiedColumn, 'USA' as country FROM
PERSON").show()
}
}

Spark Groupby Example with DataFrame


Similar to SQL “GROUP BY” clause, Spark groupBy() function is used to collect the identical data
into groups on DataFrame/Dataset and perform aggregate functions on the grouped data. In this
article, I will explain several groupBy() examples with the Scala language.

Syntax:
groupBy(col1 : scala.Predef.String, cols : scala.Predef.String*) :
org.apache.spark.sql.RelationalGroupedDataset
When we perform groupBy() on Spark Dataframe, it returns RelationalGroupedDataset object
which contains below aggregate functions.
count() - Returns the count of rows for each group.


mean() - Returns the mean of values for each group.


max() - Returns the maximum of values for each group.
min() - Returns the minimum of values for each group.
sum() - Returns the total for values for each group.
avg() - Returns the average for values for each group.
agg() - Using agg() function, we can calculate more than one aggregate at a time.
pivot() - This function is used to pivot the DataFrame; it will not be covered in this article as there is
already a dedicated article for Pivot & Unpivot DataFrame.
Preparing Data & DataFrame
Before we start, let’s create the DataFrame from a sequence of the data to work with. This
DataFrame contains columns “employee_name”, “department”, “state“, “salary”, “age” and “bonus”
columns.
We will use this Spark DataFrame to run groupBy() on the “department” column and calculate
aggregates like the minimum, maximum, average, and total salary for each group using the min(),
max(), avg(), and sum() aggregate functions respectively. Finally, we will also see how to group and
aggregate on multiple columns.
import spark.implicits._
val simpleData = Seq(("James","Sales","NY",90000,34,10000),
("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000),
("Raman","Finance","CA",99000,40,24000),
("Scott","Finance","NY",83000,36,19000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
)
val df = simpleData.toDF("employee_name","department","state","salary","age","bonus")
df.show()
Yields below output.

+-------------+----------+-----+------+---+-----+

|employee_name|department|state|salary|age|bonus|

+-------------+----------+-----+------+---+-----+

| James| Sales| NY| 90000| 34|10000|

| Michael| Sales| NY| 86000| 56|20000|

| Robert| Sales| CA| 81000| 30|23000|

| Maria| Finance| CA| 90000| 24|23000|

| Raman| Finance| CA| 99000| 40|24000|

| Scott| Finance| NY| 83000| 36|19000|

| Jen| Finance| NY| 79000| 53|15000|

| Jeff| Marketing| CA| 80000| 25|18000|

| Kumar| Marketing| NY| 91000| 50|21000|

+-------------+----------+-----+------+---+-----+

groupBy and aggregate on DataFrame columns


Let’s do the groupBy() on department column of DataFrame and then find the sum of salary for
each department using sum() aggregate function.

df.groupBy("department").sum("salary").show(false)
+----------+-----------+

|department|sum(salary)|

+----------+-----------+

|Sales |257000 |

|Finance |351000 |

|Marketing |171000 |

+----------+-----------+

Similarly, we can calculate the number of employees in each department using count().

df.groupBy("department").count()
Calculate the minimum salary of each department using min()

df.groupBy("department").min("salary")
Calculate the maximum salary of each department using max()

df.groupBy("department").max("salary")
Calculate the average salary of each department using avg()

df.groupBy("department").avg( "salary")
Calculate the mean salary of each department using mean()

df.groupBy("department").mean( "salary")
groupBy and aggregate on multiple DataFrame columns


Similarly, we can also run groupBy and aggregate on two or more DataFrame columns; the below
example groups by department and state and does sum() on the salary and bonus columns.
//GroupBy on multiple columns
df.groupBy("department","state")
.sum("salary","bonus")
.show(false)
This yields the below output.

+----------+-----+-----------+----------+

|department|state|sum(salary)|sum(bonus)|

+----------+-----+-----------+----------+

|Finance |NY |162000 |34000 |

|Marketing |NY |91000 |21000 |

|Sales |CA |81000 |23000 |

|Marketing |CA |80000 |18000 |

|Finance |CA |189000 |47000 |

|Sales |NY |176000 |30000 |

+----------+-----+-----------+----------+

Similarly, we can run group by and aggregate on two or more columns for other aggregate
functions; please refer to the source code below for examples.
Running more aggregates at a time
Using the agg() function, we can calculate many aggregations at a time in a single
statement using Spark SQL aggregate functions such as sum(), avg(), min(), max(), mean(), etc. In order
to use these, we should add import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
df.groupBy("department")
.agg(
sum("salary").as("sum_salary"),
avg("salary").as("avg_salary"),
sum("bonus").as("sum_bonus"),
max("bonus").as("max_bonus"))
.show(false)
This example does group on department column and calculates sum() and avg() of salary for each
department and calculates sum() and max() of bonus for each department.
+----------+----------+-----------------+---------+---------+

|department|sum_salary|avg_salary |sum_bonus|max_bonus|

+----------+----------+-----------------+---------+---------+

|Sales |257000 |85666.66666666667|53000 |23000 |

|Finance |351000 |87750.0 |81000 |24000 |

|Marketing |171000 |85500.0 |39000 |21000 |

+----------+----------+-----------------+---------+---------+

Using filter on aggregate data


Similar to the SQL “HAVING” clause, on a Spark DataFrame we can use either the where() or
filter() function to filter the rows of aggregated data.
df.groupBy("department")
.agg(
sum("salary").as("sum_salary"),
avg("salary").as("avg_salary"),
sum("bonus").as("sum_bonus"),
max("bonus").as("max_bonus"))
.where(col("sum_bonus") >= 50000)
.show(false)
This removes groups whose sum of bonus is less than 50000 and yields the below output.
+----------+----------+-----------------+---------+---------+

|department|sum_salary|avg_salary |sum_bonus|max_bonus|

+----------+----------+-----------------+---------+---------+

|Sales |257000 |85666.66666666667|53000 |23000 |

|Finance |351000 |87750.0 |81000 |24000 |

+----------+----------+-----------------+---------+---------+

Source code
package com.sparkbyexamples.spark.dataframe
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object GroupbyExample extends App {
val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val simpleData = Seq(("James","Sales","NY",90000,34,10000),

("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000),
("Raman","Finance","CA",99000,40,24000),
("Scott","Finance","NY",83000,36,19000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
)
val df = simpleData.toDF("employee_name","department","state","salary","age","bonus")
df.show()

//Group By on single column


df.groupBy("department").count().show(false)
df.groupBy("department").avg("salary").show(false)
df.groupBy("department").sum("salary").show(false)
df.groupBy("department").min("salary").show(false)
df.groupBy("department").max("salary").show(false)
df.groupBy("department").mean("salary").show(false)

//GroupBy on multiple columns


df.groupBy("department","state")
.sum("salary","bonus")
.show(false)
df.groupBy("department","state")
.avg("salary","bonus")
.show(false)
df.groupBy("department","state")
.max("salary","bonus")
.show(false)
df.groupBy("department","state")
.min("salary","bonus")
.show(false)

df.groupBy("department","state")
.mean("salary","bonus")
.show(false)

//Running Filter
df.groupBy("department","state")
.sum("salary","bonus")
.show(false)

//using agg function


df.groupBy("department")
.agg(
sum("salary").as("sum_salary"),
avg("salary").as("avg_salary"),
sum("bonus").as("sum_bonus"),
max("bonus").as("max_bonus"))
.show(false)

df.groupBy("department")
.agg(
sum("salary").as("sum_salary"),
avg("salary").as("avg_salary"),
sum("bonus").as("sum_bonus"),
stddev("bonus").as("stddev_bonus"))
.where(col("sum_bonus") > 50000)
.show(false)
}
This example is also available at GitHub project for reference.
Conclusion
In this tutorial, you have learned how to use groupBy() and aggregate functions on Spark
DataFrame and also learned how to run these on multiple columns and finally filtering data on the
aggregated column.


Spark SQL Join Types with examples


Spark DataFrame supports all basic SQL join types like INNER, LEFT OUTER, RIGHT
OUTER, LEFT ANTI, LEFT SEMI, CROSS, and SELF JOIN. Spark SQL joins are wide
transformations that result in data shuffling over the network; hence, they can have serious
performance issues when not designed with care.
Related: PySpark SQL Tutorials
On the other hand, Spark SQL joins come with more optimization by default (thanks
to DataFrames & Datasets); however, there are still some performance issues to consider
when using them.
In this tutorial, you will learn different join syntaxes and how to use different join types on two
DataFrames and Datasets, using Scala examples. Please access Join on Multiple DataFrames in
case you want to join more than two DataFrames.
 Join Syntax & Types
 Inner Join
 Full Outer Join
 Left Outer Join
 Right Outer Join
 Left Anti Join
 Left Semi Join
 Self Join
 Using SQL Expression

1. SQL Join Types & Syntax


Below is the list of all Spark SQL Join Types and Syntaxes.
1) join(right: Dataset[_]): DataFrame
2) join(right: Dataset[_], usingColumn: String): DataFrame
3) join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
4) join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
5) join(right: Dataset[_], joinExprs: Column): DataFrame
6) join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame

The rest of the tutorial explains the join types using syntax 6, which takes as arguments the right
DataFrame, the join expression, and the type of join as a String.
For syntaxes 4 & 5, you can use either the “JoinType” or the “Join String” defined in the table below for
the “joinType” string argument. When you use “JoinType”, you should import
org.apache.spark.sql.catalyst.plans._ as this package defines the JoinType objects.


JoinType        Join String                          Equivalent SQL Join
Inner.sql       inner                                INNER JOIN
FullOuter.sql   outer, full, fullouter, full_outer   FULL OUTER JOIN
LeftOuter.sql   left, leftouter, left_outer          LEFT JOIN
RightOuter.sql  right, rightouter, right_outer       RIGHT JOIN
Cross.sql       cross
LeftAnti.sql    anti, leftanti, left_anti
LeftSemi.sql    semi, leftsemi, left_semi

All join objects are defined in the joinTypes class; in order to use them, you need to
import org.apache.spark.sql.catalyst.plans.{LeftOuter, Inner, ...}.
Before we jump into the Spark SQL join examples, let’s first create emp and dept DataFrames.
Here, the column emp_id is unique in emp and dept_id is unique in dept, and
emp_dept_id from emp references dept_id in dept.
val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
(2,"Rose",1,"2010","20","M",4000),
(3,"Williams",1,"2010","10","M",1000),
(4,"Jones",2,"2005","10","F",2000),
(5,"Brown",2,"2010","40","",-1),
(6,"Brown",2,"2010","50","",-1)
)
val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
"emp_dept_id","gender","salary")
import spark.sqlContext.implicits._
val empDF = emp.toDF(empColumns:_*)
empDF.show(false)

val dept = Seq(("Finance",10),


("Marketing",20),


("Sales",30),
("IT",40)
)
val deptColumns = Seq("dept_name","dept_id")
val deptDF = dept.toDF(deptColumns:_*)
deptDF.show(false)

This prints “emp” and “dept” DataFrame to the console.


Emp Dataset

+------+--------+---------------+-----------+-----------+------+------+

|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|

+------+--------+---------------+-----------+-----------+------+------+

|1 |Smith |-1 |2018 |10 |M |3000 |

|2 |Rose |1 |2010 |20 |M |4000 |

|3 |Williams|1 |2010 |10 |M |1000 |

|4 |Jones |2 |2005 |10 |F |2000 |

|5 |Brown |2 |2010 |40 | |-1 |

|6 |Brown |2 |2010 |50 | |-1 |

+------+--------+---------------+-----------+-----------+------+------+

Dept Dataset

+---------+-------+

|dept_name|dept_id|

+---------+-------+

|Finance |10 |

|Marketing|20 |

|Sales |30 |

|IT |40 |

+---------+-------+

2. Inner Join
Spark inner join is the default join and the most commonly used. It joins two DataFrames/Datasets
on key columns, and rows where the keys don’t match are dropped from both datasets
(emp & dept).
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"inner")
.show(false)
When we apply an inner join on our datasets, it drops “emp_dept_id” 50 from “emp” and “dept_id” 30
from the “dept” dataset. Below is the result of the above join expression.

+------+--------+---------------+-----------+-----------+------+------+---------+-------+

|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|

+------+--------+---------------+-----------+-----------+------+------+---------+-------+

|1 |Smith |-1 |2018 |10 |M |3000 |Finance |10 |

|2 |Rose |1 |2010 |20 |M |4000 |Marketing|20 |

|3 |Williams|1 |2010 |10 |M |1000 |Finance |10 |

|4 |Jones |2 |2005 |10 |F |2000 |Finance |10 |

|5 |Brown |2 |2010 |40 | |-1 |IT |40 |

+------+--------+---------------+-----------+-----------+------+------+---------+-------+

3. Full Outer Join


Outer, a.k.a. full or fullouter, join returns all rows from both Spark DataFrames/Datasets; where the
join expression doesn’t match, it returns null in the respective record’s columns.
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"outer")
.show(false)
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"full")
.show(false)
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"fullouter")
.show(false)
In our “emp” dataset, “emp_dept_id” with value 50 doesn’t have a record in “dept”, hence the dept
columns are null, and “dept_id” 30 doesn’t have a record in “emp”, hence you see nulls in the emp
columns. Below is the result of the above join expression.
+------+--------+---------------+-----------+-----------+------+------+---------+-------+

|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|

+------+--------+---------------+-----------+-----------+------+------+---------+-------+

|2 |Rose |1 |2010 |20 |M |4000 |Marketing|20 |

|5 |Brown |2 |2010 |40 | |-1 |IT |40 |

|1 |Smith |-1 |2018 |10 |M |3000 |Finance |10 |

|3 |Williams|1 |2010 |10 |M |1000 |Finance |10 |

|4 |Jones |2 |2005 |10 |F |2000 |Finance |10 |

|6 |Brown |2 |2010 |50 | |-1 |null |null |

|null |null |null |null |null |null |null |Sales |30 |

+------+--------+---------------+-----------+-----------+------+------+---------+-------+

4. Left Outer Join


Spark left, a.k.a. left outer, join returns all rows from the left DataFrame/Dataset regardless of
whether a match is found on the right dataset; when the join expression doesn’t match, it assigns
null for that record and drops records from the right where no match is found.
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"left")
.show(false)

empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"leftouter")


.show(false)
From our dataset, “emp_dept_id” 50 doesn’t have a record in the “dept” dataset, hence this record
contains null in the “dept” columns (dept_name & dept_id), and “dept_id” 30 from the “dept” dataset
is dropped from the results. Below is the result of the above join expression.
+------+--------+---------------+-----------+-----------+------+------+---------+-------+

|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|

+------+--------+---------------+-----------+-----------+------+------+---------+-------+

|1 |Smith |-1 |2018 |10 |M |3000 |Finance |10 |

|2 |Rose |1 |2010 |20 |M |4000 |Marketing|20 |

|3 |Williams|1 |2010 |10 |M |1000 |Finance |10 |

|4 |Jones |2 |2005 |10 |F |2000 |Finance |10 |

|5 |Brown |2 |2010 |40 | |-1 |IT |40 |

|6 |Brown |2 |2010 |50 | |-1 |null |null |

+------+--------+---------------+-----------+-----------+------+------+---------+-------+

5. Right Outer Join


Spark right, a.k.a. right outer, join is the opposite of left join: it returns all rows from the right
DataFrame/Dataset regardless of whether a match is found on the left dataset; when the join
expression doesn’t match, it assigns null for that record and drops records from the left where no
match is found.
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"right")
.show(false)
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"rightouter")
.show(false)
From our example, “dept_id” 30 from the right dataset doesn’t have a match in the left dataset “emp”,
hence this record contains null in the “emp” columns, and “emp_dept_id” 50 is dropped as no match
was found on the left. Below is the result of the above join expression.
+------+--------+---------------+-----------+-----------+------+------+---------+-------+

|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|

+------+--------+---------------+-----------+-----------+------+------+---------+-------+

|4 |Jones |2 |2005 |10 |F |2000 |Finance |10 |

|3 |Williams|1 |2010 |10 |M |1000 |Finance |10 |

|1 |Smith |-1 |2018 |10 |M |3000 |Finance |10 |

|2 |Rose |1 |2010 |20 |M |4000 |Marketing|20 |

|null |null |null |null |null |null |null |Sales |30 |

|5 |Brown |2 |2010 |40 | |-1 |IT |40 |

+------+--------+---------------+-----------+-----------+------+------+---------+-------+


6. Left Semi Join


Spark left semi join is similar to an inner join, the difference being that leftsemi returns all columns
from the left DataFrame/Dataset and ignores all columns from the right dataset. In other words, this
join returns columns only from the left dataset for records that match the right dataset on the join
expression; records not matched on the join expression are ignored from both the left and right
datasets. The same result can be achieved using a select on the result of an inner join (see the
sketch after the output below); however, using this join is more efficient.
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"leftsemi")
.show(false)
Below is the result of the above join expression.
leftsemi join
+------+--------+---------------+-----------+-----------+------+------+

|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|

+------+--------+---------------+-----------+-----------+------+------+

|1 |Smith |-1 |2018 |10 |M |3000 |

|2 |Rose |1 |2010 |20 |M |4000 |

|3 |Williams|1 |2010 |10 |M |1000 |

|4 |Jones |2 |2005 |10 |F |2000 |

|5 |Brown |2 |2010 |40 | |-1 |

+------+--------+---------------+-----------+-----------+------+------+
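As noted above, a roughly equivalent result can be produced with an inner join followed by selecting only the left-side columns. A sketch (unlike leftsemi, an inner join can duplicate left rows when a key matches multiple right rows, and it is typically less efficient):
// Sketch: inner join + select of left-side columns, roughly equivalent to leftsemi
empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "inner")
  .select(empDF.columns.map(empDF(_)): _*)
  .show(false)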

7. Left Anti Join


Left anti join does the exact opposite of the Spark leftsemi join: leftanti returns only columns
from the left DataFrame/Dataset, and only for non-matched records.
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"leftanti")
.show(false)
+------+-----+---------------+-----------+-----------+------+------+

|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|

+------+-----+---------------+-----------+-----------+------+------+

|6 |Brown|2 |2010 |50 | |-1 |

+------+-----+---------------+-----------+-----------+------+------+

8. Self Join
Spark joins are not complete without a self join. Though there is no self-join type available, we
can use any of the above-explained join types to join a DataFrame to itself. The below example
uses an inner self join.
empDF.as("emp1").join(empDF.as("emp2"),
col("emp1.superior_emp_id") === col("emp2.emp_id"),"inner")
.select(col("emp1.emp_id"),col("emp1.name"),


col("emp2.emp_id").as("superior_emp_id"),
col("emp2.name").as("superior_emp_name"))
.show(false)
Here, we are joining emp dataset with itself to find out superior emp_id and name for all
employees.
+------+--------+---------------+-----------------+

|emp_id|name |superior_emp_id|superior_emp_name|

+------+--------+---------------+-----------------+

|2 |Rose |1 |Smith |

|3 |Williams|1 |Smith |

|4 |Jones |2 |Rose |

|5 |Brown |2 |Rose |

|6 |Brown |2 |Rose |

+------+--------+---------------+-----------------+

9. Using SQL Expression


Since Spark SQL supports native SQL syntax, we can also write join operations after creating
temporary views on the DataFrames and using spark.sql().
empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")
//SQL JOIN
val joinDF = spark.sql("select * from EMP e, DEPT d where e.emp_dept_id == d.dept_id")
joinDF.show(false)
val joinDF2 = spark.sql("select * from EMP e INNER JOIN DEPT d ON e.emp_dept_id ==
d.dept_id")
joinDF2.show(false)

10. Source Code | Scala Example


package com.sparkbyexamples.spark.dataframe.join
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
object JoinExample extends App {
val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()


spark.sparkContext.setLogLevel("ERROR")
val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
(2,"Rose",1,"2010","20","M",4000),
(3,"Williams",1,"2010","10","M",1000),
(4,"Jones",2,"2005","10","F",2000),
(5,"Brown",2,"2010","40","",-1),
(6,"Brown",2,"2010","50","",-1)
)
val empColumns =
Seq("emp_id","name","superior_emp_id","year_joined","emp_dept_id","gender","salary")
import spark.sqlContext.implicits._
val empDF = emp.toDF(empColumns:_*)
empDF.show(false)
val dept = Seq(("Finance",10),
("Marketing",20),
("Sales",30),
("IT",40)
)
val deptColumns = Seq("dept_name","dept_id")
val deptDF = dept.toDF(deptColumns:_*)
deptDF.show(false)

println("Inner join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"inner")
.show(false)

println("Outer join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"outer")
.show(false)
println("full join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"full")
.show(false)

println("fullouter join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"fullouter")
.show(false)

println("right join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"right")
.show(false)
println("rightouter join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"rightouter")
.show(false)

println("left join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"left")
.show(false)
println("leftouter join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"leftouter")
.show(false)

println("leftanti join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"leftanti")
.show(false)

println("leftsemi join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"leftsemi")
.show(false)

println("cross join")
empDF.join(deptDF,empDF("emp_dept_id") === deptDF("dept_id"),"cross")
.show(false)

println("Using crossJoin()")
empDF.crossJoin(deptDF).show(false)

println("self join")
empDF.as("emp1").join(empDF.as("emp2"),
col("emp1.superior_emp_id") === col("emp2.emp_id"),"inner")
.select(col("emp1.emp_id"),col("emp1.name"),
col("emp2.emp_id").as("superior_emp_id"),
col("emp2.name").as("superior_emp_name"))
.show(false)

empDF.createOrReplaceTempView("EMP")
deptDF.createOrReplaceTempView("DEPT")

//SQL JOIN
val joinDF = spark.sql("select * from EMP e, DEPT d where e.emp_dept_id == d.dept_id")
joinDF.show(false)

val joinDF2 = spark.sql("select * from EMP e INNER JOIN DEPT d ON e.emp_dept_id == d.dept_id")
joinDF2.show(false)
}
Conclusion
In this tutorial, you have learned Spark SQL Join types INNER, LEFT OUTER, RIGHT
OUTER, LEFT ANTI, LEFT SEMI, CROSS, SELF joins usage, and examples with Scala.

Spark map() vs mapPartitions() with Examples


Spark map() and mapPartitions() transformations apply a function to each element/record/row of the DataFrame/Dataset and return a new DataFrame/Dataset. In this article, I will explain the difference between the map() and mapPartitions() transformations, their syntax, and their usage with Scala examples.
 map() – Spark map() transformation applies a function to each row in a DataFrame/Dataset
and returns the new transformed Dataset.
 mapPartitions() – This is exactly the same as map(); the difference is that Spark mapPartitions() provides a facility to do heavy initializations (for example, a database connection) once for each partition instead of on every DataFrame row. This helps the performance of the job when you are dealing with heavyweight initialization on larger datasets.
Key Points:
 One key point to remember: both of these transformations return Dataset[U], not a DataFrame (since Spark 2.0, DataFrame = Dataset[Row]).
 After applying the transformation function on each row of the input DataFrame/Dataset,
these return the same number of rows as input but the schema or number of the columns of
the result could be different.
 If you know the flatMap() transformation, this is the key difference between map() and flatMap(): map() returns exactly one row/element for every input, while flatMap() can return a list of rows/elements (see the short sketch after this list).
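A minimal sketch of that map()/flatMap() difference on a tiny Dataset (assuming the same spark session and the spark.implicits._ import used in the examples below):
// map(): exactly one output element per input; flatMap(): zero or more per input
import spark.implicits._
val ds = Seq("a b", "c d e").toDS()
ds.map(s => s.toUpperCase).show()      // 2 rows: "A B", "C D E"
ds.flatMap(s => s.split(" ")).show()   // 5 rows: a, b, c, d, e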

Spark map() vs mapPartitions() Example


Let's see the differences with an example. First, let's create a Spark DataFrame.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val structureData = Seq(
Row("James","","Smith","36636","NewYork",3100),
Row("Michael","Rose","","40288","California",4300),
Row("Robert","","Williams","42114","Florida",1400),
Row("Maria","Anne","Jones","39192","Florida",5500),
Row("Jen","Mary","Brown","34561","NewYork",3000)
)
val structureSchema = new StructType()
.add("firstname",StringType)
.add("middlename",StringType)
.add("lastname",StringType)
.add("id",StringType)
.add("location",StringType)
.add("salary",IntegerType)
val df2 = spark.createDataFrame(
spark.sparkContext.parallelize(structureData),structureSchema)
df2.printSchema()
df2.show(false)
Yields below output

root
|-- firstname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- lastname: string (nullable = true)
|-- id: string (nullable = true)
|-- location: string (nullable = true)
|-- salary: integer (nullable = true)

+---------+----------+--------+-----+----------+------+
|firstname|middlename|lastname|id   |location  |salary|
+---------+----------+--------+-----+----------+------+
|James    |          |Smith   |36636|NewYork   |3100  |
|Michael  |Rose      |        |40288|California|4300  |
|Robert   |          |Williams|42114|Florida   |1400  |
|Maria    |Anne      |Jones   |39192|Florida   |5500  |
|Jen      |Mary      |Brown   |34561|NewYork   |3000  |
+---------+----------+--------+-----+----------+------+

In order to explain map() and mapPartitions() with an example, let's also create a Util class with a method combine(). This is a simple method that takes three string arguments and combines them with a comma delimiter. In a real-world scenario, this could be a third-party class that does a complex transformation.
class Util extends Serializable {
def combine(fname:String,mname:String,lname:String):String = {
fname+","+mname+","+lname
}
}
We will create an object of this class and call its combine() method for each row in a DataFrame.
Spark map() transformation
Spark map() transformation applies a function to each row in a DataFrame/Dataset and returns the new transformed Dataset. As mentioned earlier, map() returns one row for every row in the input DataFrame; in other words, the input and the result contain exactly the same number of rows. For example, if you have 100 rows in a DataFrame, after applying the function, map() returns exactly 100 rows. However, the structure or schema of the result could be different.
Syntax:
1) map[U](func : scala.Function1[T, U])(implicit evidence$6 : org.apache.spark.sql.Encoder[U])
: org.apache.spark.sql.Dataset[U]
2) map[U](func : org.apache.spark.api.java.function.MapFunction[T, U], encoder :
org.apache.spark.sql.Encoder[U])
: org.apache.spark.sql.Dataset[U]
Spark provides 2 map() signatures: one takes scala.Function1 as an argument and the other takes a MapFunction. Notice that both of these return Dataset[U], not DataFrame (which is Dataset[Row]). If you want a DataFrame as output, you need to convert the Dataset to a DataFrame using the toDF() function.
Usage:
import spark.implicits._
val df3 = df2.map(row=>{
  // This initialization happens for every record.
  // If it is a heavy initialization, like a database connection,
  // it degrades the performance.
  val util = new Util()
  val fullName = util.combine(row.getString(0),row.getString(1),row.getString(2))
  (fullName, row.getString(3),row.getInt(5))
})
val df3Map = df3.toDF("fullName","id","salary")

df3Map.printSchema()
df3Map.show(false)
Since map() transformations execute on worker nodes, we have initialized and created an object of the Util class inside the map() function, and the initialization happens for every row in the DataFrame. This causes performance issues when you have heavyweight initializations.
Note: When you run this in standalone mode, initializing the class outside of map() still works, as both the executors and the driver run on the same JVM, but running it on a cluster fails with an exception.
Above example yields below output.
root
|-- fullName: string (nullable = true)
|-- id: string (nullable = true)
|-- salary: integer (nullable = false)
+----------------+-----+------+
|fullName |id |salary|
+----------------+-----+------+
|James,,Smith |36636|3100 |
|Michael,Rose, |40288|4300 |
|Robert,,Williams|42114|1400 |
|Maria,Anne,Jones|39192|5500 |
|Jen,Mary,Brown |34561|3000 |
+----------------+-----+------+

As you can notice in the above output, the input DataFrame has 5 rows, so the result of map() also has 5 rows, but the column count is different.
Spark mapPartitions() transformation
Spark mapPartitions() provides a facility to do heavy initializations (for example, a database connection) once for each partition instead of on every DataFrame row. This helps the performance of the job when you are dealing with heavyweight initialization on larger datasets.
Syntax:
1) mapPartitions[U](func : scala.Function1[scala.Iterator[T], scala.Iterator[U]])(implicit evidence$7 :
org.apache.spark.sql.Encoder[U])
: org.apache.spark.sql.Dataset[U]
2) mapPartitions[U](f : org.apache.spark.api.java.function.MapPartitionsFunction[T, U], encoder :
org.apache.spark.sql.Encoder[U])
: org.apache.spark.sql.Dataset[U]

mapPartitions() also has 2 signatures: one takes scala.Function1 and the other takes a MapPartitionsFunction argument.
mapPartitions() keeps the result of a partition in memory until it finishes executing all rows in that partition.

Usage:
val df4 = df2.mapPartitions(iterator => {
// Do the heavy initialization here
// Like database connections e.t.c
val util = new Util()
val res = iterator.map(row=>{
val fullName = util.combine(row.getString(0),row.getString(1),row.getString(2))
(fullName, row.getString(3),row.getInt(5))
})
res
})
val df4part = df4.toDF("fullName","id","salary")
df4part.printSchema()
df4part.show(false)
This yields the same output as above.

Complete example of Spark DataFrame map() & mapPartitions()


Below is a complete example of Spark DataFrame map() and mapPartitions().

package com.sparkbyexamples.spark.dataframe.examples

import org.apache.spark.sql.{Row, SparkSession}


import org.apache.spark.sql.types.{IntegerType, StringType, StructType,ArrayType,MapType}

// Create Util class


class Util extends Serializable {
def combine(fname:String,mname:String,lname:String):String = {
fname+","+mname+","+lname
}
}
// Create object to run
object MapTransformation extends App{

val spark:SparkSession = SparkSession.builder()


.master("local[5]")
.appName("SparkByExamples.com")
.getOrCreate()

val structureData = Seq(


Row("James","","Smith","36636","NewYork",3100),
Row("Michael","Rose","","40288","California",4300),
Row("Robert","","Williams","42114","Florida",1400),
Row("Maria","Anne","Jones","39192","Florida",5500),
Row("Jen","Mary","Brown","34561","NewYork",3000)
)

val structureSchema = new StructType()


.add("firstname",StringType)
.add("middlename",StringType)
.add("lastname",StringType)
.add("id",StringType)
.add("location",StringType)
.add("salary",IntegerType)

val df2 = spark.createDataFrame(


spark.sparkContext.parallelize(structureData),structureSchema)
df2.printSchema()
df2.show(false)

import spark.implicits._
val util = new Util()
val df3 = df2.map(row=>{

val fullName = util.combine(row.getString(0),row.getString(1),row.getString(2))


(fullName, row.getString(3),row.getInt(5))
})
val df3Map = df3.toDF("fullName","id","salary")

df3Map.printSchema()
df3Map.show(false)

val df4 = df2.mapPartitions(iterator => {


val util = new Util()
val res = iterator.map(row=>{
val fullName = util.combine(row.getString(0),row.getString(1),row.getString(2))
(fullName, row.getString(3),row.getInt(5))
})
res
})
val df4part = df4.toDF("fullName","id","salary")
df4part.printSchema()
df4part.show(false)
}
Conclusion
In this Spark DataFrame article, you have learned that the map() and mapPartitions() transformations execute a function on every row and return the same number of records as the input, but with the same or a different schema or columns. You also learned that when you have a complex initialization, you should use mapPartitions(), as it can do the initialization once per partition instead of on every DataFrame row.
______________________________________________________________________________

Spark foreachPartition vs foreach


In Spark, foreachPartition() is used when you have a heavy initialization (like a database connection) and want to perform it once per partition, whereas foreach() is used to apply a function on every element of an RDD/DataFrame/Dataset partition.
In this Spark DataFrame article, you will learn what foreachPartition() is used for and the differences from its sibling foreach() (foreachPartition vs foreach).
Spark foreachPartition() is an action operation and is available on RDD, DataFrame, and Dataset. It is different from other actions in that foreachPartition() doesn't return a value; instead, it executes the input function on each partition.
 DataFrame foreachPartition() Usage
 DataFrame foreach() Usage
 RDD foreachPartition() Usage
 RDD foreach() Usage

1. DataFrame foreachPartition() Usage


On a Spark DataFrame, foreachPartition() is similar to the foreach() action, which is used to manipulate accumulators or write to a database table or other external data sources; the difference is that foreachPartition() gives you the option to do heavy initializations once per partition and is considered more efficient.
1.1 Syntax
foreachPartition(f : scala.Function1[scala.Iterator[T], scala.Unit]) : scala.Unit


When foreachPartition() is applied on a Spark DataFrame, it executes the specified function for each partition of the DataFrame. This operation is mainly used when you want to save the DataFrame result to RDBMS tables or produce it to Kafka topics, etc.
Example
In this example, to keep it simple, we outline where the per-partition initialization and the per-record logic would go.
// ForeachPartition DataFrame
// Sample data (Product, Amount, Country)
val data = Seq(("Banana",1000,"USA"), ("Carrots",1500,"USA"), ("Beans",1600,"USA"))
val df = spark.createDataFrame(data).toDF("Product","Amount","Country")
df.foreachPartition(partition => {
// Initialize database connection or kafka
partition.foreach(fun=>{
// Apply the function to insert the database
// Or produce kafka topic
})
// If you have batch inserts, do here.
})
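For illustration, below is a minimal sketch of the same pattern with a batched JDBC write; the connection URL, the credentials, and the sales table are hypothetical placeholders, not something defined elsewhere in this article.
// Sketch: one connection and one prepared statement per partition, rows written in a single batch
import java.sql.DriverManager
import org.apache.spark.sql.Row

df.foreachPartition((partition: Iterator[Row]) => {
  val conn = DriverManager.getConnection("jdbc:postgresql://localhost/db", "user", "pass") // hypothetical
  val stmt = conn.prepareStatement("insert into sales values (?, ?, ?)")                   // hypothetical table
  partition.foreach(row => {
    stmt.setString(1, row.getString(0))
    stmt.setInt(2, row.getInt(1))
    stmt.setString(3, row.getString(2))
    stmt.addBatch()
  })
  stmt.executeBatch() // single batched round trip per partition
  conn.close()
})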

2. DataFrame foreach() Usage


When foreach() is applied on a Spark DataFrame, it executes the specified function for each element of the DataFrame/Dataset. This operation is mainly used when you want to manipulate accumulators or perform any other operations that don't have heavy initializations.
// DataFrame foreach() Usage
val longAcc = spark.sparkContext.longAccumulator("SumAccumulator")
df.foreach(f=> {
longAcc.add(f.getInt(1))
})
println("Accumulator value:"+longAcc.value)

3. RDD foreachPartition() Usage


foreachPartition() on an RDD behaves similarly to the DataFrame equivalent; hence, it has the same syntax.
Syntax
foreachPartition(f : scala.Function1[scala.Iterator[T], scala.Unit]) : scala.Unit
Example
// foreachPartition RDD example
val rdd = spark.sparkContext.parallelize(Seq(1,2,3,4,5,6,7,8,9))


rdd.foreachPartition(partition => {
// Initialize any database connection
partition.foreach(fun=>{
// Apply the function
})
})
4. Spark RDD foreach() Usage
RDD foreach() is equivalent to the DataFrame foreach() action.
// Rdd accumulator
val rdd2 = spark.sparkContext.parallelize(Seq(1,2,3,4,5,6,7,8,9))
val longAcc2 = spark.sparkContext.longAccumulator("SumAccumulator2")
rdd2.foreach(f=> {
longAcc2.add(f)
})
println("Accumulator value:"+longAcc2.value)
Conclusion
You should use the foreachPartition action when you have a heavy initialization, like a database connection or a Kafka producer, where it is initialized once per partition rather than once per element (foreach). foreach() is mostly used to update accumulator variables.
______________________________________________________________________________

How to Pivot and Unpivot a Spark Data Frame


Spark pivot() function is used to pivot/rotate data from one DataFrame/Dataset column into multiple columns (transform rows to columns), and unpivot is used to transform it back (transform columns to rows).
In this article, I will explain how to use the pivot() SQL function to transpose one or multiple rows into columns.
pivot() is an aggregation where one of the grouping column's values is transposed into individual columns with distinct data.
 Pivot Spark DataFrame
 Pivot Performance improvement in Spark 2.0
 Unpivot Spark DataFrame


 Pivot or Transpose without aggregation
Let’s create a DataFrame to work with.
val data = Seq(("Banana",1000,"USA"), ("Carrots",1500,"USA"), ("Beans",1600,"USA"),
("Orange",2000,"USA"),("Orange",2000,"USA"),("Banana",400,"China"),
("Carrots",1200,"China"),("Beans",1500,"China"),("Orange",4000,"China"),
("Banana",2000,"Canada"),("Carrots",2000,"Canada"),("Beans",2000,"Mexico"))
import spark.sqlContext.implicits._
val df = data.toDF("Product","Amount","Country")
df.show()
DataFrame ‘df’ consists of 3 columns Product, Amount and Country as shown below.
// Output:
+-------+------+-------+
|Product|Amount|Country|
+-------+------+-------+
| Banana|  1000|    USA|
|Carrots|  1500|    USA|
|  Beans|  1600|    USA|
| Orange|  2000|    USA|
| Orange|  2000|    USA|
| Banana|   400|  China|
|Carrots|  1200|  China|
|  Beans|  1500|  China|
| Orange|  4000|  China|
| Banana|  2000| Canada|
|Carrots|  2000| Canada|
|  Beans|  2000| Mexico|
+-------+------+-------+

1. Pivot Spark DataFrame


Spark SQL provides a pivot() function to rotate data from one column into multiple columns (transpose rows to columns). It is an aggregation where one of the grouping column's values is transposed into individual columns with distinct data. From the above DataFrame, to get the total amount exported to each country for each product, we group by Product, pivot by Country, and sum the Amount.
val pivotDF = df.groupBy("Product").pivot("Country").sum("Amount")
pivotDF.show()

This transposes the countries from DataFrame rows into columns and produces the below output. Wherever data is not present, it is represented as null by default.
+-------+------+-----+------+----+
|Product|Canada|China|Mexico| USA|
+-------+------+-----+------+----+
| Orange|  null| 4000|  null|4000|
|  Beans|  null| 1500|  2000|1600|
| Banana|  2000|  400|  null|1000|
|Carrots|  2000| 1200|  null|1500|
+-------+------+-----+------+----+

2. Pivot Performance improvement in Spark 2.0


From Spark 2.0 onwards, the performance of pivot has been improved; however, if you are using a lower version, note that pivot is a very expensive operation. Hence, it is recommended to provide the column data (if known) as an argument to the function, as shown below.
val countries = Seq("USA","China","Canada","Mexico")
val pivotDF = df.groupBy("Product").pivot("Country", countries).sum("Amount")
pivotDF.show()
Another approach is to do a two-phase aggregation. Spark 2.0 uses this implementation to improve performance (SPARK-13749).
val pivotDF = df.groupBy("Product","Country")
.sum("Amount")
.groupBy("Product")
.pivot("Country")
.sum("sum(Amount)")
pivotDF.show()
The above two examples return the same output but with better performance.

3. Unpivot Spark DataFrame


Unpivot is the reverse operation, which we can achieve by rotating column values back into row values. Spark SQL doesn't have an unpivot function, hence we will use the stack() function. The below code converts the country columns back to rows.
// Unpivot
val unPivotDF = pivotDF.select($"Product",
expr("stack(3, 'Canada', Canada, 'China', China, 'Mexico', Mexico) as (Country,Total)"))
.where("Total is not null")
unPivotDF.show()
It converts pivoted column “country” to rows.

+-------+-------+-----+
|Product|Country|Total|
+-------+-------+-----+
| Orange|  China| 4000|
|  Beans|  China| 1500|
|  Beans| Mexico| 2000|
| Banana| Canada| 2000|
| Banana|  China|  400|
|Carrots| Canada| 2000|
|Carrots|  China| 1200|
+-------+-------+-----+

4. Transpose or Pivot without aggregation


Can we do a Spark DataFrame transpose or pivot without aggregation? Of course you can, but unfortunately, you can't achieve it using the pivot() function. However, pivoting or transposing a DataFrame structure without aggregation (rows to columns and columns to rows) can be done with a Spark and Scala hack; please refer to this Stack Overflow example.
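As a side note, when each (Product, Country) cell holds at most one value, a common workaround is to pivot with a first() aggregation, which simply passes the single value through; a minimal sketch on the same df (this is still an aggregation under the hood, just a pass-through one):
// Pivot using first() as a pass-through aggregation
import org.apache.spark.sql.functions.first
val transposedDF = df.groupBy("Product").pivot("Country").agg(first("Amount"))
transposedDF.show()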
Conclusion:
We have seen how to pivot a DataFrame (transpose rows to columns) with a Scala example and unpivot it back using Spark SQL functions. We have also seen how Spark 2.0 improves pivot performance by doing a two-phase aggregation.
______________________________________________________________________________

Spark DataFrame Union and Union All


In this Spark article, you will learn how to union two or more DataFrames of the same schema, which is used to append one DataFrame to another or combine two DataFrames, and also the differences between union and union all, with Scala examples.

Dataframe union() – union() method of the DataFrame is used to combine two DataFrame’s of
the same structure/schema. If schemas are not the same it returns an error.
DataFrame unionAll() – unionAll() is deprecated since Spark “2.0.0” version and replaced with
union().
Note: In other SQL dialects, UNION eliminates duplicates but UNION ALL combines two datasets including duplicate records. In Spark, both behave the same way, and you use the DataFrame distinct() function to remove duplicate rows.

First, let's create two DataFrames with the same schema.


import spark.implicits._
val simpleData = Seq(("James","Sales","NY",90000,34,10000),
("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000)
)
val df = simpleData.toDF("employee_name","department","state","salary","age","bonus")
df.printSchema()
df.show()
df.printSchema() prints the schema and df.show() displays the DataFrame on the console.
// Output:
root
|-- employee_name: string (nullable = true)
|-- department: string (nullable = true)
|-- state: string (nullable = true)
|-- salary: integer (nullable = false)
|-- age: integer (nullable = false)
|-- bonus: integer (nullable = false)

+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|        James|     Sales|   NY| 90000| 34|10000|
|      Michael|     Sales|   NY| 86000| 56|20000|
|       Robert|     Sales|   CA| 81000| 30|23000|
|        Maria|   Finance|   CA| 90000| 24|23000|
+-------------+----------+-----+------+---+-----+

Now, let’s create a second Dataframe with the new records and some records from the above
Dataframe but with the same schema.
val simpleData2 = Seq(("James","Sales","NY",90000,34,10000),
("Maria","Finance","CA",90000,24,23000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
)
val df2 = simpleData2.toDF("employee_name","department","state","salary","age","bonus")
This yields below output
// Output:
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|James        |Sales     |NY   |90000 |34 |10000|
|Maria        |Finance   |CA   |90000 |24 |23000|
|Jen          |Finance   |NY   |79000 |53 |15000|
|Jeff         |Marketing |CA   |80000 |25 |18000|
|Kumar        |Marketing |NY   |91000 |50 |21000|
+-------------+----------+-----+------+---+-----+

1. Combine two or more DataFrames using union


The DataFrame union() method combines two DataFrames and returns a new DataFrame with all rows from both DataFrames, regardless of duplicate data.
// Combine two or more DataFrames using union
val df3 = df.union(df2)
df3.show(false)
As you see below it returns all records.
// Output:
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|James        |Sales     |NY   |90000 |34 |10000|
|Michael      |Sales     |NY   |86000 |56 |20000|
|Robert       |Sales     |CA   |81000 |30 |23000|
|Maria        |Finance   |CA   |90000 |24 |23000|
|James        |Sales     |NY   |90000 |34 |10000|
|Maria        |Finance   |CA   |90000 |24 |23000|
|Jen          |Finance   |NY   |79000 |53 |15000|
|Jeff         |Marketing |CA   |80000 |25 |18000|
|Kumar        |Marketing |NY   |91000 |50 |21000|
+-------------+----------+-----+------+---+-----+

2. Combine DataFrames using unionAll


The DataFrame unionAll() method has been deprecated since Spark 2.0.0, and using the union() method is recommended instead.
// Combine DataFrames using unionAll
val df4 = df.unionAll(df2)

df4.show(false)
Returns the same output as above.
3. Combine without Duplicates
Since the union() method returns all rows without removing duplicates, we will use the distinct() function to return just one record when duplicates exist.
// Combine without Duplicates
val df5 = df.union(df2).distinct()
df5.show(false)
Yields below output. As you see, this returns only distinct rows.
// Output:
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|James        |Sales     |NY   |90000 |34 |10000|
|Maria        |Finance   |CA   |90000 |24 |23000|
|Jeff         |Marketing |CA   |80000 |25 |18000|
|Jen          |Finance   |NY   |79000 |53 |15000|
|Kumar        |Marketing |NY   |91000 |50 |21000|
|Michael      |Sales     |NY   |86000 |56 |20000|
|Robert       |Sales     |CA   |81000 |30 |23000|
+-------------+----------+-----+------+---+-----+

4. Complete Example of DataFrame Union

package com.sparkbyexamples.spark.dataframe

import org.apache.spark.sql.SparkSession

object UnionExample extends App{

val spark: SparkSession = SparkSession.builder()


.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

import spark.implicits._

val simpleData = Seq(("James","Sales","NY",90000,34,10000),


("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000)
)
val df = simpleData.toDF("employee_name","department","state","salary","age","bonus")
df.printSchema()
df.show()

val simpleData2 = Seq(("James","Sales","NY",90000,34,10000),


("Maria","Finance","CA",90000,24,23000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
)
val df2 = simpleData2.toDF("employee_name","department","state","salary","age","bonus")
df2.show(false)

val df3 = df.union(df2)


df3.show(false)
df3.distinct().show(false)

val df4 = df.unionAll(df2)


df4.show(false)
}

Conclusion
In this Spark article, you have learned how to combine two or more DataFrames of the same schema into a single DataFrame using the union() method, and learned the difference between the union() and unionAll() functions.

Collect() – Retrieve data from Spark RDD/DataFrame


Spark collect() and collectAsList() are action operations used to retrieve all the elements of an RDD/DataFrame/Dataset (from all nodes) to the driver node. We should use collect() only on smaller datasets, usually after filter(), group(), count(), etc. Retrieving a larger dataset results in out-of-memory errors.
In this Spark article, I will explain the usage of collect() with DataFrame example, when to avoid it,
and the difference between collect() and select().
In order to explain with an example, first, let's create a DataFrame.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark:SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()

val data = Seq(Row(Row("James ","","Smith"),"36636","M",3000),


Row(Row("Michael ","Rose",""),"40288","M",4000),
Row(Row("Robert ","","Williams"),"42114","M",4000),
Row(Row("Maria ","Anne","Jones"),"39192","F",4000),
Row(Row("Jen","Mary","Brown"),"","F",-1)
)
val schema = new StructType()
.add("name",new StructType()
.add("firstname",StringType)
.add("middlename",StringType)
.add("lastname",StringType))
.add("id",StringType)
.add("gender",StringType)
.add("salary",IntegerType)

val df = spark.createDataFrame(spark.sparkContext.parallelize(data),schema)
df.printSchema()
df.show(false)

show() function on DataFrame prints the result of the dataset in a table format. By default, it shows
only 20 rows. The above snippet returns the data in a table.
root
|-- name: struct (nullable = true)
| |-- firstname: string (nullable = true)
| |-- middlename: string (nullable = true)
| |-- lastname: string (nullable = true)
|-- id: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: integer (nullable = true)

+---------------------+-----+------+------+
|name                 |id   |gender|salary|
+---------------------+-----+------+------+
|[James , , Smith]    |36636|M     |3000  |
|[Michael , Rose, ]   |40288|M     |4000  |
|[Robert , , Williams]|42114|M     |4000  |
|[Maria , Anne, Jones]|39192|F     |4000  |
|[Jen, Mary, Brown]   |     |F     |-1    |
+---------------------+-----+------+------+

Using collect() and collectAsList()


The collect() action is used to retrieve all elements from the dataset (RDD/DataFrame/Dataset) as an Array[Row] to the driver program.
collectAsList() is similar to collect(), but it returns a Java util List.
Syntax:
collect() : scala.Array[T]
collectAsList() : java.util.List[T]

collect() Example
val colList = df.collectAsList()
val colData = df.collect()
colData.foreach(row=>
{
val salary = row.getInt(3)//Index starts from zero
println(salary)
})

df.collect() retrieves all elements in the DataFrame as an array to the driver. From the array, I've retrieved the salary element and printed it on the console.
3000
4000
4000
4000
-1
Retrieving data from Struct column
To retrieve a struct column from Row, we should use getStruct() function.
//Retrieving data from Struct column
colData.foreach(row=>
{
val salary = row.getInt(3)
val fullName:Row = row.getStruct(0) //Index starts from zero
val firstName = fullName.getString(0)//In struct row, again index starts from zero
val middleName = fullName.get(1).toString
val lastName = fullName.getAs[String]("lastname")
println(firstName+","+middleName+","+lastName+","+salary)
})
The above example explains the use of different Row class functions: get(), getString(), getAs[String](), and getStruct().
James ,,Smith,3000
Michael ,Rose,,4000
Robert ,,Williams,4000
Maria ,Anne,Jones,4000
Jen,Mary,Brown,-1
Note that, unlike DataFrame transformations, collect() does not return a DataFrame; instead, it returns the data as an array to your driver. Once the data is collected in an array, you can use Scala for further processing.
In case you want to return only certain columns of a DataFrame, you should call select() first.
val dataCollect = df.select("name").collect()
When to avoid Collect()
Usually, collect() is used to retrieve the action output when you have a very small result set; calling collect() on an RDD/DataFrame with a bigger result set causes out-of-memory errors as it returns
the entire dataset (from all workers) to the driver; hence, we should avoid calling collect() on a larger dataset.
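When you only need a sample of a large result for inspection, prefer take(), limit(), or show(), so that only a bounded number of rows reach the driver; a small sketch:
// Safer ways to inspect a large result on the driver
val firstRows = df.take(5)           // Array[Row] with at most 5 rows brought to the driver
val limited = df.limit(5).collect()  // bound the result before collecting
df.show(5)                           // or just print a few rows without collecting at all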
collect() vs select()
The select() method on a DataFrame returns a new DataFrame that holds only the selected columns, whereas collect() returns the entire dataset to the driver.
select() is a transformation, whereas collect() is an action.
Complete Example of Spark collect()
Below is a complete Spark example of using collect() on a DataFrame; similarly, you can also create a program with an RDD.

import org.apache.spark.sql.{Row, SparkSession}


import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object CollectExample extends App {

val spark:SparkSession = SparkSession.builder()


.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()

val data = Seq(Row(Row("James ","","Smith"),"36636","M",3000),


Row(Row("Michael ","Rose",""),"40288","M",4000),
Row(Row("Robert ","","Williams"),"42114","M",4000),
Row(Row("Maria ","Anne","Jones"),"39192","F",4000),
Row(Row("Jen","Mary","Brown"),"","F",-1)
)

val schema = new StructType()


.add("name",new StructType()
.add("firstname",StringType)
.add("middlename",StringType)
.add("lastname",StringType))
.add("id",StringType)
.add("gender",StringType)
.add("salary",IntegerType)

val df = spark.createDataFrame(spark.sparkContext.parallelize(data),schema)
df.printSchema()
df.show(false)

val colData = df.collect()

colData.foreach(row=>
{
val salary = row.getInt(3)//Index starts from zero
println(salary)
})

//Retrieving data from Struct column


colData.foreach(row=>
{
val salary = row.getInt(3)
val fullName:Row = row.getStruct(0) //Index starts from zero
val firstName = fullName.getString(0)//In struct row, again index starts from zero
val middleName = fullName.get(1).toString
val lastName = fullName.getAs[String]("lastname")
println(firstName+","+middleName+","+lastName+","+salary)
})
}

Conclusion
In this Spark article, you have learned the collect() and collectAsList() function of the
RDD/DataFrame which returns all elements of the DataFrame to Driver program and also learned
it’s not a good practice to use it on the bigger dataset, finally retrieved the data from Struct field.

Spark DataFrame Cache and Persist Explained


Spark Cache and Persist are optimization techniques in DataFrame / Dataset for iterative and
interactive Spark applications to improve the performance of Jobs. In this article, you will learn
What is Spark cache() and persist(), how to use it in DataFrame, understanding the difference
between caching and persistence, and how to use these two with DataFrame and Dataset using Scala examples.
Though Spark provides computation up to 100x faster than traditional MapReduce jobs, if you have not designed your jobs to reuse repeated computations, you will see a degradation in performance when dealing with billions or trillions of records. Hence, we may need to look at the stages and use optimization techniques as one of the ways to improve performance.
Using cache() and persist() methods, Spark provides an optimization mechanism to store the
intermediate computation of a Spark DataFrame so they can be reused in subsequent actions.
When you persist a dataset, each node stores its partitioned data in memory and reuses them in
other actions on that dataset. And Spark’s persisted data on nodes are fault-tolerant meaning if
any partition of a Dataset is lost, it will automatically be recomputed using the original
transformations that created it.
Advantages for Caching and Persistence of DataFrame
Below are the advantages of using Spark Cache and Persist methods.
 Cost-efficient – Spark computations are very expensive, hence reusing the computations saves cost.
 Time-efficient – Reusing repeated computations saves lots of time.
 Execution time – Saves execution time of the job and we can perform more jobs on the
same cluster.
Spark Cache Syntax and Example
Spark DataFrame or Dataset cache() method by default saves it to storage level
`MEMORY_AND_DISK` because recomputing the in-memory columnar representation of the
underlying table is expensive. Note that this is different from the default cache level of
`RDD.cache()` which is ‘MEMORY_ONLY‘.
Syntax
cache() : Dataset.this.type
Spark cache() method in Dataset class internally calls persist() method which in turn
uses sparkSession.sharedState.cacheManager.cacheQuery to cache the result set of DataFrame
or Dataset. Let’s look at an example.

Example
val spark:SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()
//read csv with options
val df = spark.read.options(Map("inferSchema"->"true","delimiter"->",","header"->"true"))
.csv("src/main/resources/zipcodes.csv")

val df2 = df.where(col("State") === "PR").cache()


df2.show(false)

println(df2.count())

val df3 = df2.where(col("Zipcode") === 704)

println(df2.count())

DataFrame Persist Syntax and Example


Spark persist() method is used to store the DataFrame or Dataset at one of the storage levels MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, and more.
Caching or persisting of Spark DataFrame or Dataset is a lazy operation, meaning a DataFrame
will not be cached until you trigger an action.

Syntax
1) persist() : Dataset.this.type
2) persist(newLevel : org.apache.spark.storage.StorageLevel) : Dataset.this.type
Spark persist() has two signatures: the first takes no argument and by default saves to the MEMORY_AND_DISK storage level, and the second takes a StorageLevel argument to store the data at a different storage level.

Example
val dfPersist = df.persist()
dfPersist.show(false)
Using the second signature you can save DataFrame/Dataset to any storage levels.
val dfPersist = df.persist(StorageLevel.MEMORY_ONLY)
dfPersist.show(false)
This stores DataFrame/Dataset into Memory.
Note that Dataset cache() is an alias for persist(StorageLevel.MEMORY_AND_DISK)
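You can confirm whether (and at which level) a DataFrame is currently persisted through Dataset.storageLevel; a small sketch using the cached df2 from the example above:
// Inspect the current storage level of a cached DataFrame
println(df2.storageLevel)            // e.g. StorageLevel(disk, memory, deserialized, 1 replicas)
println(df2.storageLevel.useMemory)  // true once the DataFrame has been cached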
Unpersist syntax and Example
Spark automatically monitors every persist() and cache() call you make, checks usage on each node, and drops persisted data that is not used, based on the least-recently-used (LRU)
algorithm. You can also remove it manually using the unpersist() method. unpersist() marks the Dataset as non-persistent and removes all of its blocks from memory and disk.
Syntax
unpersist() : Dataset.this.type
unpersist(blocking : scala.Boolean) : Dataset.this.type
Example
val dfUnpersisted = dfPersist.unpersist()
unpersist(true), with a Boolean argument, blocks until all blocks are deleted.

Spark Persist storage levels


All the different storage levels Spark supports are available in the org.apache.spark.storage.StorageLevel class. The storage level specifies how and where to persist or cache a Spark DataFrame or Dataset.
MEMORY_ONLY – This is the default behavior of the RDD cache() method; it stores the RDD or DataFrame as deserialized objects in JVM memory. When there is not enough memory available, it will not save some partitions of the DataFrame, and these will be re-computed as and when required. This takes more memory, but unlike for an RDD, this would be slower than the MEMORY_AND_DISK level, as it recomputes the unsaved partitions, and recomputing the in-memory columnar representation of the underlying table is expensive.
MEMORY_ONLY_SER – This is the same as MEMORY_ONLY but the difference being it stores
RDD as serialized objects to JVM memory. It takes lesser memory (space-efficient) than
MEMORY_ONLY as it saves objects as serialized and takes an additional few more CPU cycles in
order to deserialize.
MEMORY_ONLY_2 – Same as MEMORY_ONLY storage level but replicate each partition to two
cluster nodes.
MEMORY_ONLY_SER_2 – Same as MEMORY_ONLY_SER storage level but replicate each
partition to two cluster nodes.
MEMORY_AND_DISK – This is the default behavior of the DataFrame or Dataset. In this Storage
Level, The DataFrame will be stored in JVM memory as a deserialized object. When required
storage is greater than available memory, it stores some of the excess partitions into a disk and
reads the data from the disk when required. It is slower as there is I/O involved.
MEMORY_AND_DISK_SER – This is the same as MEMORY_AND_DISK storage level difference
being it serializes the DataFrame objects in memory and on disk when space is not available.
MEMORY_AND_DISK_2 – Same as MEMORY_AND_DISK storage level but replicate each
partition to two cluster nodes.
MEMORY_AND_DISK_SER_2 – Same as MEMORY_AND_DISK_SER storage level but replicate
each partition to two cluster nodes.
DISK_ONLY – In this storage level, DataFrame is stored only on disk and the CPU computation
time is high as I/O is involved.
DISK_ONLY_2 – Same as DISK_ONLY storage level but replicate each partition to two cluster
nodes.
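To use any of these levels explicitly, import StorageLevel and pass it to persist(); a minimal sketch:
// Persist with an explicit (serialized, memory + disk) storage level
import org.apache.spark.storage.StorageLevel
val dfSer = df.persist(StorageLevel.MEMORY_AND_DISK_SER)
dfSer.count() // the first action materializes the persisted data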
Conclusion
In this article, you have learned Spark cache() and persist() methods are used as optimization
techniques to save interim computation results of DataFrame or Dataset and reuse them
subsequently and learned what is the difference between Spark Cache and Persist and finally saw
their syntaxes and usages with Scala examples.
______________________________________________________________________________

Spark SQL UDF (User Defined Functions)


Spark SQL UDF (a.k.a. User Defined Function) is one of the most useful features of Spark SQL and DataFrame, extending Spark's built-in capabilities. In this article, I will explain what a UDF is, why we need it, and how to create and use it on DataFrame and SQL with a Scala example.
Note: UDFs are among the most expensive operations, so use them only when you have no other choice and when essential.
What is Spark UDF?
UDF, a.k.a. User Defined Function: if you are coming from a SQL background, UDFs are nothing new to you, as most traditional RDBMS databases support user-defined functions, and Spark UDFs are similar to these.
In Spark, you create a UDF by creating a function in the language you prefer to use with Spark. For example, if you are using Spark with Scala, you create the UDF in Scala and wrap it with the udf() function or register it as a UDF to use it on DataFrames and SQL, respectively.
Why do we need a Spark UDF?
UDFs are used to extend the functions of the framework and to reuse those functions on several DataFrames. For example, if you wanted to convert the first letter of every word in a sentence to capital case, Spark's built-in features don't have this function, so you can create it as a UDF and reuse it as needed on many DataFrames. Once created, UDFs can be reused on several DataFrames and SQL expressions.
Before you create any UDF, do your research to check if a similar function is already available among the Spark SQL functions. Spark SQL provides several predefined common functions, and many more new functions are added with every release; hence, it is best to check before reinventing the wheel.
When you create UDFs, you need to design them very carefully; otherwise, you will come across performance issues.
Create a DataFrame
Before we jump into creating a UDF, first let's create a Spark DataFrame.
import spark.implicits._
val columns = Seq("Seqno","Quote")
val data = Seq(("1", "Be the change that you wish to see in the world"),
("2", "Everyone thinks of changing the world, but no one thinks of changing himself."),
("3", "The purpose of our lives is to be happy.")
)
val df = data.toDF(columns:_*)
df.show(false)
Yields below output.

+-----+------------------------------------------------------------------------------+
|Seqno|Quote                                                                         |
+-----+------------------------------------------------------------------------------+
|1    |Be the change that you wish to see in the world                              |
|2    |Everyone thinks of changing the world, but no one thinks of changing himself.|
|3    |The purpose of our lives is to be happy.                                     |
+-----+------------------------------------------------------------------------------+

Create a Function
The first step in creating a UDF is creating a Scala function. The below snippet creates a function convertCase() that takes a string parameter and converts the first letter of every word to a capital letter. UDFs take parameters of your choice and return a value.
val convertCase = (strQuote:String) => {
val arr = strQuote.split(" ")
arr.map(f=> f.substring(0,1).toUpperCase + f.substring(1,f.length)).mkString(" ")
}
Create Spark UDF to use it on DataFrame
Now convert this convertCase() function to a UDF by passing it to Spark SQL udf(); this function is available in the org.apache.spark.sql.functions package. Make sure you import this package before using it.
val convertUDF = udf(convertCase)
Now you can use convertUDF() on a DataFrame column. udf() function
return org.apache.spark.sql.expressions.UserDefinedFunction.
//Using with DataFrame
df.select(col("Seqno"),
convertUDF(col("Quote")).as("Quote") ).show(false)
This results below output.

+-----+------------------------------------------------------------------------------+
|Seqno|Quote                                                                         |
+-----+------------------------------------------------------------------------------+
|1    |Be The Change That You Wish To See In The World                              |
|2    |Everyone Thinks Of Changing The World, But No One Thinks Of Changing Himself.|
|3    |The Purpose Of Our Lives Is To Be Happy.                                     |
+-----+------------------------------------------------------------------------------+

Registering Spark UDF to use it on SQL


In order to use convertCase() function on Spark SQL, you need to register the function with Spark
using spark.udf.register().
// Using it on SQL
spark.udf.register("convertUDF", convertCase)
df.createOrReplaceTempView("QUOTE_TABLE")
spark.sql("select Seqno, convertUDF(Quote) from QUOTE_TABLE").show(false)
This yields the same output as previous example.

null check
UDFs are error-prone when not designed carefully. For example, when you have a column that contains null values on some records, not handling null inside the UDF function returns the below error.
Exception in thread "main" org.apache.spark.SparkException: Failed to execute user defined
function(anonfun$1: (string) => string)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1066)

at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:152)

at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:92)

at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$24$
$anonfun$applyOrElse$23.apply(Optimizer.scala:1364)

at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$24$
$anonfun$applyOrElse$23.apply(Optimizer.scala:1364)

at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)

at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)

at scala.collection.immutable.List.foreach(List.scala:392)

at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)

at scala.collection.immutable.List.map(List.scala:296)

at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$
$anonfun$apply$24.applyOrElse(Optimizer.scala:1364)

at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$
$anonfun$apply$24.applyOrElse(Optimizer.scala:1359)

It is always best practice to check for null inside a UDF function rather than checking for null outside.
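For example, a null-safe variant of the convertCase() function used above could guard the input before splitting; a minimal sketch (assuming the same org.apache.spark.sql.functions.udf import as earlier):
// Null-safe version of convertCase(): return null for null input instead of throwing
val convertCaseSafe = (strQuote:String) => {
  if (strQuote == null) null
  else strQuote.split(" ")
    .map(f => f.substring(0,1).toUpperCase + f.substring(1,f.length))
    .mkString(" ")
}
val convertUDFSafe = udf(convertCaseSafe)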

Performance concern using UDF

UDFs are a black box to Spark, so it cannot apply optimizations, and you lose all the optimizations Spark performs on DataFrame/Dataset. When possible, you should use Spark SQL built-in functions, as these functions provide optimization.

Complete UDF Example


Below is a complete UDF example in Scala.
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.{Row, SparkSession}

object SparkUDF extends App{


val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()

import spark.implicits._
val columns = Seq("Seqno","Quote")
val data = Seq(("1", "Be the change that you wish to see in the world"),
("2", "Everyone thinks of changing the world, but no one thinks of changing himself."),
("3", "The purpose of our lives is to be happy.")

)
val df = data.toDF(columns:_*)
df.show(false)

val convertCase = (str:String) => {


val arr = str.split(" ")
arr.map(f=> f.substring(0,1).toUpperCase + f.substring(1,f.length)).mkString(" ") }
//Using with DataFrame
val convertUDF = udf(convertCase)
df.select(col("Seqno"),
convertUDF(col("Quote")).as("Quote") ).show(false)

// Using it on SQL
spark.udf.register("convertUDF", convertCase)
df.createOrReplaceTempView("QUOTE_TABLE")
spark.sql("select Seqno, convertUDF(Quote) from QUOTE_TABLE").show(false)

}
Conclusion
In this article, you have learned that a Spark UDF is a User Defined Function used to create a reusable function that can be applied to multiple DataFrames. Once UDFs are created, they can be used on DataFrames and in SQL (after registering).
______________________________________________________________________________

Spark SQL StructType & StructField with examples


Spark SQL StructType and StructField classes are used to programmatically specify the schema of a DataFrame and to create complex columns like nested struct, array, and map columns. StructType is a collection of StructFields. Using StructField, we can define the column name, the column data type, a nullable flag (a Boolean specifying whether the field can be null or not), and metadata.
In this article, we will learn different ways to define the structure of a DataFrame using Spark SQL StructType with Scala examples. Though Spark infers a schema from data, sometimes we may need to define our own column names and data types, and this article explains how to define simple, nested, and complex schemas.

StructType – Defines the structure of the Dataframe


Spark provides the org.apache.spark.sql.types.StructType class to define the structure of the DataFrame; it is a collection (list) of StructField objects. Calling printSchema() on a Spark DataFrame prints the schema on the console, where StructType columns are represented as struct.
StructField – Defines the metadata of the DataFrame column
Spark provides the org.apache.spark.sql.types.StructField class to define the column name (String), column type (DataType), nullable column (Boolean), and metadata (Metadata).
 Using Spark StructType & StructField with DataFrame
 Defining nested StructType or struct
 Creating StructType or struct from Json file
 Adding & Changing columns of the DataFrame

 Using SQL ArrayType and MapType


 Convert case class to StructType
 Creating StructType object from DDL string
 Check if a field exists in a StructType

Using Spark StructType & StructField with DataFrame


While creating a Spark DataFrame, we can specify the structure using the StructType and StructField classes. As mentioned in the introduction, StructType is a collection of StructFields, which are used to define the column name, data type, and a flag for nullable or not. Using StructField, we can also add nested struct schemas, ArrayType for arrays, and MapType for key-value pairs, which we will discuss in detail in later sections.
The StructType and StructField case classes are defined as follows.
case class StructType(fields: Array[StructField])
case class StructField(
name: String,
dataType: DataType,
nullable: Boolean = true,
metadata: Metadata = Metadata.empty)
The below example demonstrates a very simple way to create a struct using StructType and StructField on a DataFrame, with sample data to support it.
val simpleData = Seq(Row("James ","","Smith","36636","M",3000),
Row("Michael ","Rose","","40288","M",4000),
Row("Robert ","","Williams","42114","M",4000),
Row("Maria ","Anne","Jones","39192","F",4000),
Row("Jen","Mary","Brown","","F",-1)
)
val simpleSchema = StructType(Array(
StructField("firstname",StringType,true),
StructField("middlename",StringType,true),
StructField("lastname",StringType,true),
StructField("id", StringType, true),
StructField("gender", StringType, true),
StructField("salary", IntegerType, true)
))
val df = spark.createDataFrame(
spark.sparkContext.parallelize(simpleData),simpleSchema)
df.printSchema()
df.show()
By running the above snippet, it displays the below outputs.
root
|-- firstname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- lastname: string (nullable = true)
|-- id: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: integer (nullable = true)

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|   id|gender|salary|
+---------+----------+--------+-----+------+------+
|   James |          |   Smith|36636|     M|  3000|
| Michael |      Rose|        |40288|     M|  4000|
|  Robert |          |Williams|42114|     M|  4000|
|   Maria |      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|     |     F|    -1|
+---------+----------+--------+-----+------+------+

Defining nested StructType object struct


While working with DataFrames, we often need to work with nested struct columns, and these can be defined using StructType.
In the below example, I have instantiated StructType and used the add() method (instead of StructField) to add column names and data types. Notice that the data type of the "name" column is StructType, which is nested.
val structureData = Seq(
Row(Row("James ","","Smith"),"36636","M",3100),
Row(Row("Michael ","Rose",""),"40288","M",4300),
Row(Row("Robert ","","Williams"),"42114","M",1400),
Row(Row("Maria ","Anne","Jones"),"39192","F",5500),
Row(Row("Jen","Mary","Brown"),"","F",-1) )
val structureSchema = new StructType()
.add("name",new StructType()
.add("firstname",StringType)
.add("middlename",StringType)
.add("lastname",StringType))
.add("id",StringType)
.add("gender",StringType)
.add("salary",IntegerType)

val df2 = spark.createDataFrame(


spark.sparkContext.parallelize(structureData),structureSchema)
df2.printSchema()
df2.show()

Outputs below schema and the DataFrame


root
|-- name: struct (nullable = true)
| |-- firstname: string (nullable = true)
| |-- middlename: string (nullable = true)
| |-- lastname: string (nullable = true)
|-- id: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: integer (nullable = true)

+--------------------+-----+------+------+
|                name|   id|gender|salary|
+--------------------+-----+------+------+
|   [James , , Smith]|36636|     M|  3100|
|  [Michael , Rose, ]|40288|     M|  4300|
|[Robert , , Willi...|42114|     M|  1400|
|[Maria , Anne, Jo...|39192|     F|  5500|
|  [Jen, Mary, Brown]|     |     F|    -1|
+--------------------+-----+------+------+

Creating StructType object struct from JSON file


If you have many columns and the structure of the DataFrame changes now and then, it's a good practice to load the StructType schema from a JSON file. Note that the definition in JSON uses a different layout, which you can get by using schema.prettyJson().

{
"type" : "struct",
"fields" : [ {
"name" : "name",
"type" : {
"type" : "struct",
"fields" : [ {
"name" : "firstname",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "middlename",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "lastname",
"type" : "string",
"nullable" : true,
"metadata" : { }
}]
},
"nullable" : true,
"metadata" : { }
}, {
"name" : "dob",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "gender",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "salary",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}]
}
val url = ClassLoader.getSystemResource("schema.json")
val schemaSource = Source.fromFile(url.getFile).getLines.mkString
val schemaFromJson = DataType.fromJson(schemaSource).asInstanceOf[StructType]
val df3 = spark.createDataFrame(
spark.sparkContext.parallelize(structureData),schemaFromJson)
df3.printSchema()
This prints the same output as the previous section. You can also keep the column name, type, and nullable flag in a comma-separated file and use these to create a StructType programmatically; a small sketch of that idea follows.
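A minimal sketch, with a hard-coded field list standing in for the contents of such a file:
// Build a StructType programmatically from "name,type,nullable" definitions
import org.apache.spark.sql.types._
val fieldDefs = Seq("firstname,string,true", "salary,integer,true")
val schemaFromDefs = StructType(fieldDefs.map { line =>
  val Array(name, typeName, nullable) = line.split(",")
  val dataType = typeName match {
    case "integer" => IntegerType
    case _         => StringType
  }
  StructField(name, dataType, nullable.toBoolean)
})
schemaFromDefs.printTreeString()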

Adding & Changing struct of the DataFrame


Using the Spark SQL function struct(), we can change the struct of an existing DataFrame and add a new StructType to it. The below example demonstrates how to copy columns from one struct to another and add a new column.
val updatedDF = df3.withColumn("OtherInfo",
  struct( col("id").as("identifier"),
    col("gender").as("gender"),
    col("salary").as("salary"),
    when(col("salary").cast(IntegerType) < 2000,"Low")
    .when(col("salary").cast(IntegerType) < 4000,"Medium")
    .otherwise("High").alias("Salary_Grade")
  )).drop("id","gender","salary")
updatedDF.printSchema()
updatedDF.show(false)
Here, it copies "gender", "salary", and "id" into the new struct "OtherInfo" and adds a new column "Salary_Grade".
root
|-- name: struct (nullable = true)
| |-- firstname: string (nullable = true)
| |-- middlename: string (nullable = true)
| |-- lastname: string (nullable = true)
|-- OtherInfo: struct (nullable = false)
| |-- identifier: string (nullable = true)
| |-- gender: string (nullable = true)
| |-- salary: integer (nullable = true)
| |-- Salary_Grade: string (nullable = false)

Using SQL ArrayType and MapType


StructType also supports ArrayType and MapType to define DataFrame columns for array and map collections, respectively. In the below example, the column "hobbies" is defined as ArrayType(StringType) and "properties" as MapType(StringType, StringType), meaning both key and value are Strings.
val arrayStructureData = Seq(
Row(Row("James ","","Smith"),List("Cricket","Movies"),Map("hair"->"black","eye"->"brown")),
Row(Row("Michael ","Rose",""),List("Tennis"),Map("hair"->"brown","eye"->"black")),
Row(Row("Robert ","","Williams"),List("Cooking","Football"),Map("hair"->"red","eye"->"gray")),
Row(Row("Maria ","Anne","Jones"),null,Map("hair"->"blond","eye"->"red")),
Row(Row("Jen","Mary","Brown"),List("Blogging"),Map("white"->"black","eye"->"black"))
)
val arrayStructureSchema = new StructType()
.add("name",new StructType()
.add("firstname",StringType)
.add("middlename",StringType)
.add("lastname",StringType))
.add("hobbies", ArrayType(StringType))
.add("properties", MapType(StringType,StringType))
val df5 = spark.createDataFrame(
spark.sparkContext.parallelize(arrayStructureData),arrayStructureSchema)
df5.printSchema()
df5.show()
This outputs the schema and DataFrame data shown below. Note that the field "hobbies" is of array
type and "properties" is of map type.
root

|-- name: struct (nullable = true)

| |-- firstname: string (nullable = true)

| |-- middlename: string (nullable = true)

| |-- lastname: string (nullable = true)

|-- hobbies: array (nullable = true)

| |-- element: string (containsNull = true)

|-- properties: map (nullable = true)

| |-- key: string

| |-- value: string (valueContainsNull = true)

+---------------------+-------------------+------------------------------+

|name |hobbies |properties |

+---------------------+-------------------+------------------------------+

|[James , , Smith] |[Cricket, Movies] |[hair -> black, eye -> brown] |

|[Michael , Rose, ] |[Tennis] |[hair -> brown, eye -> black] |

|[Robert , , Williams]|[Cooking, Football]|[hair -> red, eye -> gray] |

|[Maria , Anne, Jones]|null |[hair -> blond, eye -> red] |

|[Jen, Mary, Brown] |[Blogging] |[white -> black, eye -> black]|

+---------------------+-------------------+------------------------------+

Convert case class to Spark StructType


Spark SQL also provides Encoders to convert a case class to a StructType object. If you are using
older versions of Spark, you can also derive the schema from the case class using ScalaReflection.
Both approaches are shown here.

case class Name(first:String,last:String,middle:String)


case class Employee(fullName:Name,age:Integer,gender:String)

import org.apache.spark.sql.catalyst.ScalaReflection
val schema = ScalaReflection.schemaFor[Employee].dataType.asInstanceOf[StructType]

val encoderSchema = Encoders.product[Employee].schema


encoderSchema.printTreeString()
printTreeString() outputs the below schema.
root

|-- fullName: struct (nullable = true)

| |-- first: string (nullable = true)

| |-- last: string (nullable = true)

| |-- middle: string (nullable = true)

|-- age: integer (nullable = true)

|-- gender: string (nullable = true)

Creating StructType object struct from DDL String


Like loading a structure from a JSON string, we can also create one from a DDL string by using the
fromDDL() static function on the SQL StructType class (StructType.fromDDL). You can also generate a
DDL string from a schema using toDDL(). Calling printTreeString() on the struct object prints the
schema, similar to what printSchema() returns.
val ddlSchemaStr = "`fullName` STRUCT<`first`: STRING, `last`: STRING, `middle`: STRING>,`age` INT,`gender` STRING"
val ddlSchema = StructType.fromDDL(ddlSchemaStr)
ddlSchema.printTreeString()
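Going the other direction, toDDL() produces a DDL string from an existing schema; a quick sketch using the ddlSchema value created above:

// Generate a DDL string back from the StructType; the exact formatting may
// differ slightly between Spark versions.
println(ddlSchema.toDDL)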

Checking if a field exists in a DataFrame


If you want to perform checks on the metadata of a DataFrame, for example whether a column or
field exists or what the data type of a column is, we can easily do this using several functions on
SQL StructType and StructField.
println(df.schema.fieldNames.contains("firstname"))
println(df.schema.contains(StructField("firstname",StringType,true)))

This example returns "true" for both scenarios. For the second check, if you pass IntegerType
instead of StringType it returns false, because the data type of the firstname column is String and
contains() compares every property of the field. Similarly, you can also check whether two schemas
are equal, and more; a short sketch follows.
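A small sketch of a few more metadata checks, assuming the df and the type imports used earlier in this section; the checks shown are illustrative:

// Look inside the nested "name" struct
val nameStruct = df.schema("name").dataType.asInstanceOf[StructType]
println(nameStruct.fieldNames.contains("middlename"))   // true

// Check the data type of a top-level column
println(df.schema("salary").dataType == IntegerType)     // true

// StructType implements equals, so two schemas can be compared directly
println(df.schema == schemaFromJson)                     // true only if every field matches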
The complete example explained here is available in the GitHub project.

Conclusion:
In this article, you have learned the usage of SQL StructType and StructField, how to change the
structure of a Spark DataFrame at runtime, how to convert a case class to a schema, and how to use
ArrayType and MapType.

Spark SQL String Functions Explained


Spark SQL defines built-in standard String functions in the DataFrame API; these String functions
come in handy when we need to perform operations on Strings. In this article, I will explain the usage
of some of these functions with Scala examples. You can access the standard functions using the
following import statement.


// Import spark sql functions


import org.apache.spark.sql.functions._
Related: If you are looking for PySpark, please refer to PySpark SQL String Functions
Spark SQL String Functions:
Click on each link in the table below for more explanation and working examples of each String
function with Scala examples.

STRING FUNCTION SIGNATURE   STRING FUNCTION DESCRIPTION

ascii(e: Column): Column   Computes the ASCII value of the first character of the string and returns it as an integer.

base64(e: Column): Column   Encodes a binary column into a Base64-encoded string.

concat_ws(sep: String, exprs: Column*): Column   Concatenates multiple string or column values into a single string, with the specified separator between each value.

decode(value: Column, charset: String): Column   Decodes the first argument from binary into a string using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16').

encode(value: Column, charset: String): Column   Encodes the first argument from a string into binary using the provided character set.

format_number(x: Column, d: Int): Column   Formats a numeric value to a specified number of decimal places.

format_string(format: String, arguments: Column*): Column   Formats strings using placeholders, similar to the printf function in other programming languages.

initcap(e: Column): Column   Capitalizes the first letter of each word in a string.

instr(str: Column, substring: String): Column   Returns the position of the first occurrence of a substring within a string.

length(e: Column): Column   Computes the character length of a given string or the number of bytes of a binary string.

lower(e: Column): Column   Converts a string column to lower case.

levenshtein(l: Column, r: Column): Column   Computes the Levenshtein distance of the two given string columns.

locate(substr: String, str: Column): Column   Returns the position of the first occurrence of a substring within a string.

lpad(str: Column, len: Int, pad: String): Column   Pads a string with the specified characters on the left side until it reaches the desired length.

ltrim(e: Column): Column   Removes leading whitespace characters from a string.

regexp_extract(e: Column, exp: String, groupIdx: Int): Column   Extracts substrings from a string column based on a regular expression pattern and returns the matched group.

regexp_replace(e: Column, pattern: Column, replacement: Column): Column   Replaces substrings in a string column that match a regular expression pattern with a specified replacement string.

unbase64(e: Column): Column   Decodes Base64-encoded strings back into their original binary form. This is the reverse of base64.

rpad(str: Column, len: Int, pad: String): Column   Pads a string with the specified characters on the right side until it reaches the desired length.

repeat(str: Column, n: Int): Column   Repeats a string or character a specified number of times.

rtrim(e: Column): Column   Removes trailing whitespace characters from a string.

soundex(e: Column): Column   Computes the Soundex code of a string, a phonetic algorithm for indexing names by sound.

split(str: Column, regex: String): Column   Splits a string into an array of substrings based on a delimiter.

substring(str: Column, pos: Int, len: Int): Column   Extracts a substring from a string column, starting at a specified position and optionally up to a specified length.

substring_index(str: Column, delim: String, count: Int): Column   Extracts a substring from a string column before or after a specified delimiter occurrence. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. substring_index performs a case-sensitive match when searching for delim.

overlay(src: Column, replaceString: String, pos: Int, len: Int): Column   Replaces part of a string with another string, starting at a specified position and optionally for a specified length.

translate(src: Column, matchingString: String, replaceString: String): Column   Replaces characters in a string based on a mapping of each character to its replacement.

trim(e: Column): Column   Removes leading and trailing whitespace characters from a string.

trim(e: Column, trimString: String): Column   Trims the specified character from both ends of the specified string column.

upper(e: Column): Column   Converts all characters in a string column to upper case.
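To give a feel for how these functions combine in practice, here is a minimal sketch; the sample data and column names are made up for illustration, and a SparkSession named spark with its implicits is assumed to be in scope.

import org.apache.spark.sql.functions._
import spark.implicits._

val namesDF = Seq(("  john", "SMITH"), ("  jane", "DOE")).toDF("first", "last")

namesDF.select(
    initcap(trim(col("first"))).as("first_clean"),                 // trim then capitalize
    lower(col("last")).as("last_lower"),                           // lower case
    concat_ws(" ", initcap(trim(col("first"))), initcap(lower(col("last")))).as("full_name"),
    length(col("last")).as("last_len")                             // character length
  ).show(false)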

Conclusion:
Spark SQL string functions provide a powerful set of tools for manipulating and analyzing textual
data within Apache Spark. These functions allow users to perform a wide range of operations,
such as string manipulation, pattern matching, and data cleansing.

Spark SQL Date and Timestamp Functions


Spark SQL provides built-in standard Date and Timestamp (date and time) functions defined in the
DataFrame API; these come in handy when we need to perform operations on dates and times. All of
them accept input as a Date type, Timestamp type, or String. If a String is used, it should be in a
format that can be cast to a date, such as yyyy-MM-dd, or to a timestamp in yyyy-MM-dd
HH:mm:ss.SSSS, and the functions return a date or timestamp respectively; they return null if the
input string could not be cast to a date or timestamp.
When possible, try to leverage the standard library functions, as they are a little more compile-time
safe, handle nulls, and perform better than Spark UDFs. If your application is performance critical,
try to avoid custom UDFs at all costs, as their performance is not guaranteed.
For readability, I've grouped the Date and Timestamp functions into the following sections.
 Spark SQL Date Functions
 Spark SQL Timestamp Functions
 Date and Timestamp Window Functions
Before you use any of the examples below, make sure you create a SparkSession and import the SQL
functions.
import org.apache.spark.sql.SparkSession
val spark:SparkSession = SparkSession.builder()
.master("local[3]")
.appName("SparkByExample")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.sqlContext.implicits._
import org.apache.spark.sql.functions._
Spark SQL Date Functions
Click on each link from below table for more explanation and working examples in Scala.

Date Function Signature Date Function Description

current_date () : Column Returns the current date as a date column.

date_format(dateExpr: Column, format: String): Column   Converts a date/timestamp/string to a string value in the format specified by the date format given as the second argument.

to_date(e: Column): Column Converts the column into DateType by casting rules
to DateType.


to_date(e: Column, fmt: Converts the column into a DateType with a specified format
String): Column

add_months(startDate: Column, numMonths: Int): Column   Returns the date that is numMonths after startDate.

date_add(start: Column, days: Int): Column   Returns the date that is days days after start.
date_sub(start: Column, days: Int): Column   Returns the date that is days days before start.

datediff(end: Column, start: Returns the number of days from start to end.
Column): Column

months_between(end: Column, start: Column): Column   Returns the number of months between dates start and end. A whole number is returned if both inputs have the same day of the month or both are the last day of their respective months. Otherwise, the difference is calculated assuming 31 days per month.

months_between(end: Column, start: Column, roundOff: Boolean): Column   Returns the number of months between dates end and start. If roundOff is set to true, the result is rounded off to 8 digits; otherwise it is not rounded.

next_day(date: Column, Returns the first date which is later than the value of
dayOfWeek: String): Column the date column that is on the specified day of the week.
For example, next_day('2015-07-27', "Sunday") returns 2015-
08-02 because that is the first Sunday after 2015-07-27.

trunc(date: Column, format: Returns date truncated to the unit specified by the format.
String): Column For example, trunc("2018-11-19 12:01:19", "year") returns
2018-01-01
format: ‘year’, ‘yyyy’, ‘yy’ to truncate by year,
‘month’, ‘mon’, ‘mm’ to truncate by month

date_trunc(format: String, Returns timestamp truncated to the unit specified by the format.
timestamp: Column): Column For example, date_trunc("year", "2018-11-19 12:01:19") returns
2018-01-01 00:00:00
format: ‘year’, ‘yyyy’, ‘yy’ to truncate by year,
‘month’, ‘mon’, ‘mm’ to truncate by month,
‘day’, ‘dd’ to truncate by day,


Other options are: ‘second’, ‘minute’, ‘hour’, ‘week’, ‘month’,


‘quarter’

year(e: Column): Column Extracts the year as an integer from a given


date/timestamp/string

quarter(e: Column): Column Extracts the quarter as an integer from a given


date/timestamp/string.

month(e: Column): Column Extracts the month as an integer from a given


date/timestamp/string

dayofweek(e: Column): Extracts the day of the week as an integer from a given
Column date/timestamp/string. Ranges from 1 for a Sunday through to
7 for a Saturday

dayofmonth(e: Column): Extracts the day of the month as an integer from a given
Column date/timestamp/string.

dayofyear(e: Column): Column Extracts the day of the year as an integer from a given
date/timestamp/string.

weekofyear(e: Column): Column   Extracts the week number as an integer from a given date/timestamp/string. A week is considered to start on a Monday and week 1 is the first week with more than 3 days, as defined by ISO 8601.

last_day(e: Column): Column Returns the last day of the month which the given date belongs
to. For example, input “2015-07-27” returns “2015-07-31” since
July 31 is the last day of the month in July 2015.

from_unixtime(ut: Column): Converts the number of seconds from unix epoch (1970-01-01
Column 00:00:00 UTC) to a string representing the timestamp of that
moment in the current system time zone in the yyyy-MM-dd
HH:mm:ss format.

from_unixtime(ut: Column, f: Converts the number of seconds from unix epoch (1970-01-01
String): Column 00:00:00 UTC) to a string representing the timestamp of that
moment in the current system time zone in the given format.


unix_timestamp(): Column Returns the current Unix timestamp (in seconds) as a long

unix_timestamp(s: Column): Column   Converts a time string in format yyyy-MM-dd HH:mm:ss to a Unix timestamp (in seconds), using the default timezone and the default locale.

unix_timestamp(s: Column, p: Converts time string with given pattern to Unix timestamp (in
String): Column seconds).

Spark SQL Timestamp Functions


Below are some of the Spark SQL Timestamp functions, these functions operate on both date and
timestamp values. Select each link for a description and example of each function.
The default format of the Spark Timestamp is yyyy-MM-dd HH:mm:ss.SSSS

Timestamp Function Signature Timestamp Function Description

current_timestamp () : Column Returns the current timestamp as a timestamp column

hour(e: Column): Column Extracts the hours as an integer from a given


date/timestamp/string.

minute(e: Column): Column Extracts the minutes as an integer from a given


date/timestamp/string.

second(e: Column): Column Extracts the seconds as an integer from a given


date/timestamp/string.

to_timestamp(s: Column): Column Converts to a timestamp by casting rules


to TimestampType.

to_timestamp(s: Column, fmt: String): Converts time string with the given pattern to timestamp.
Column


Spark Date and Timestamp Window Functions

Date & Time Window Function Syntax   Date & Time Window Function Description

window(timeColumn: Column, Bucketize rows into one or more time windows given a
windowDuration: String, timestamp specifying column. Window starts are inclusive but
slideDuration: String, startTime: the window ends are exclusive, e.g. 12:05 will be in the
String): Column window [12:05,12:10) but not in [12:00,12:05). Windows can
support microsecond precision. Windows in the order of
months are not supported.

window(timeColumn: Column, Bucketize rows into one or more time windows given a
windowDuration: String, timestamp specifying column. Window starts are inclusive but
slideDuration: String): Column the window ends are exclusive, e.g. 12:05 will be in the
window [12:05,12:10) but not in [12:00,12:05). Windows can
support microsecond precision. Windows in the order of
months are not supported. The windows start beginning at
1970-01-01 00:00:00 UTC

window(timeColumn: Column, windowDuration: String): Column   Generates tumbling time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC.
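None of the window variants above is shown with data later in this article, so here is a minimal, self-contained sketch of a tumbling-window aggregation; the sample events and the 5-minute duration are illustrative, and a SparkSession named spark with its implicits is assumed.

import org.apache.spark.sql.functions._
import spark.implicits._

val events = Seq(
    ("2019-11-16 10:01:00", 10),
    ("2019-11-16 10:04:00", 20),
    ("2019-11-16 10:12:00", 30)
  ).toDF("event_time", "value")
  .withColumn("event_time", to_timestamp(col("event_time")))

// Bucket rows into 5-minute tumbling windows and sum the values per window
events.groupBy(window(col("event_time"), "5 minutes"))
  .agg(sum("value").as("total"))
  .orderBy("window")
  .show(false)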

Spark Date Functions Examples


Below are the most commonly used examples of the Date functions.
current_date() and date_format()
We will see how to get the current date and how to convert a date into a specific format using
date_format() with a Scala example. The example below parses the date and converts it from the
'yyyy-MM-dd' format to 'MM-dd-yyyy'.
import org.apache.spark.sql.functions._
Seq(("2019-01-23"))
.toDF("Input")
.select(
current_date()as("current_date"),
col("Input"),
date_format(col("Input"), "MM-dd-yyyy").as("format")


).show()
+------------+----------+-----------+

|current_date| Input |format |

+------------+----------+-----------+

| 2019-07-23 |2019-01-23| 01-23-2019 |

+------------+----------+-----------+

to_date()
Below example converts string in date format ‘MM/dd/yyyy’ to a DateType ‘yyyy-MM-dd’
using to_date() with Scala example.
import org.apache.spark.sql.functions._
Seq(("04/13/2019"))
.toDF("Input")
.select( col("Input"),
to_date(col("Input"), "MM/dd/yyyy").as("to_date")
).show()
+----------+----------+

|Input |to_date |

+----------+----------+

|04/13/2019|2019-04-13|

+----------+----------+

datediff()
Below example returns the difference between two dates using datediff() with Scala example.
import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
.toDF("input")
.select( col("input"), current_date(),
datediff(current_date(),col("input")).as("diff")
).show()
+----------+--------------+--------+

| input |current_date()| diff |

+----------+--------------+--------+

|2019-01-23| 2019-07-23 | 181 |

|2019-06-24| 2019-07-23 | 29 |

|2019-09-20| 2019-07-23 | -59 |


+----------+--------------+--------+

months_between()
Below example returns the months between two dates using months_between() with Scala
language.
import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
.toDF("date")
.select( col("date"), current_date(),
datediff(current_date(),col("date")).as("datediff"),
months_between(current_date(),col("date")).as("months_between")
).show()
+----------+--------------+--------+--------------+

| date |current_date()|datediff|months_between|

+----------+--------------+--------+--------------+

|2019-01-23| 2019-07-23 | 181| 6.0|

|2019-06-24| 2019-07-23 | 29| 0.96774194|

|2019-09-20| 2019-07-23 | -59| -1.90322581|

+----------+--------------+--------+--------------+

trunc()
Below example truncates date at a specified unit using trunc() with Scala language.
import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
.toDF("input")
.select( col("input"),
trunc(col("input"),"Month").as("Month_Trunc"),
trunc(col("input"),"Year").as("Month_Year"),
trunc(col("input"),"Month").as("Month_Trunc")
).show()
+----------+-----------+----------+-----------+
| input |Month_Trunc|Month_Year|Month_Trunc|
+----------+-----------+----------+-----------+
|2019-01-23| 2019-01-01|2019-01-01| 2019-01-01|
|2019-06-24| 2019-06-01|2019-01-01| 2019-06-01|
|2019-09-20| 2019-09-01|2019-01-01| 2019-09-01|

+----------+-----------+----------+-----------+

add_months() , date_add(), date_sub()


Here we are adding and subtracting days and months from a given input date.
import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20")).toDF("input")
.select( col("input"),
add_months(col("input"),3).as("add_months"),
add_months(col("input"),-3).as("sub_months"),
date_add(col("input"),4).as("date_add"),
date_sub(col("input"),4).as("date_sub")
).show()
+----------+----------+----------+----------+----------+

| input |add_months|sub_months| date_add | date_sub |

+----------+----------+----------+----------+----------+

|2019-01-23|2019-04-23|2018-10-23|2019-01-27|2019-01-19|

|2019-06-24|2019-09-24|2019-03-24|2019-06-28|2019-06-20|

|2019-09-20|2019-12-20|2019-06-20|2019-09-24|2019-09-16|

+----------+----------+----------+----------+----------+

year(), month(), dayofweek(), dayofmonth(), dayofyear(), next_day(), weekofyear()

import org.apache.spark.sql.functions._
Seq(("2019-01-23"),("2019-06-24"),("2019-09-20"))
.toDF("input")
.select( col("input"), year(col("input")).as("year"),
month(col("input")).as("month"),
dayofweek(col("input")).as("dayofweek"),
dayofmonth(col("input")).as("dayofmonth"),
dayofyear(col("input")).as("dayofyear"),
next_day(col("input"),"Sunday").as("next_day"),
weekofyear(col("input")).as("weekofyear")
).show()

+----------+----+-----+---------+----------+---------+----------+----------+

| input|year|month|dayofweek|dayofmonth|dayofyear| next_day|weekofyear|

+----------+----+-----+---------+----------+---------+----------+----------+

|2019-01-23|2019| 1| 4| 23| 23|2019-01-27| 4|

|2019-06-24|2019| 6| 2| 24| 175|2019-06-30| 26|

|2019-09-20|2019| 9| 6| 20| 263|2019-09-22| 38|

+----------+----+-----+---------+----------+---------+----------+----------+

Spark Timestamp Functions Examples


Below are the most commonly used examples of the Timestamp functions.
current_timestamp()
Returns the current timestamp in spark default format yyyy-MM-dd HH:mm:ss
import org.apache.spark.sql.functions._
val df = Seq((1)).toDF("seq")
val curDate = df.withColumn("current_date",current_date().as("current_date"))
.withColumn("current_timestamp",current_timestamp().as("current_timestamp"))
curDate.show(false)

+---+------------+-----------------------+

|seq|current_date|current_timestamp |

+---+------------+-----------------------+

|1 |2019-11-16 |2019-11-16 21:00:55.349|

+---+------------+-----------------------+

to_timestamp()
Converts string timestamp to Timestamp type format.
import org.apache.spark.sql.functions._
val dfDate = Seq(("07-01-2019 12 01 19 406"),
("06-24-2019 12 01 19 406"),
("11-16-2019 16 44 55 406"),
("11-16-2019 16 50 59 406")).toDF("input_timestamp")

dfDate.withColumn("datetype_timestamp",
to_timestamp(col("input_timestamp"),"MM-dd-yyyy HH mm ss SSS"))

.show(false)
+-----------------------+-------------------+

|input_timestamp |datetype_timestamp |

+-----------------------+-------------------+

|07-01-2019 12 01 19 406|2019-07-01 12:01:19|

|06-24-2019 12 01 19 406|2019-06-24 12:01:19|

|11-16-2019 16 44 55 406|2019-11-16 16:44:55|

|11-16-2019 16 50 59 406|2019-11-16 16:50:59|

+-----------------------+-------------------+

hour(), minute() and second()


import org.apache.spark.sql.functions._
val df = Seq(("2019-07-01 12:01:19.000"),
("2019-06-24 12:01:19.000"),
("2019-11-16 16:44:55.406"),
("2019-11-16 16:50:59.406")).toDF("input_timestamp")

df.withColumn("hour", hour(col("input_timestamp")))
.withColumn("minute", minute(col("input_timestamp")))
.withColumn("second", second(col("input_timestamp")))
.show(false)

+-----------------------+----+------+------+

|input_timestamp |hour|minute|second|

+-----------------------+----+------+------+

|2019-07-01 12:01:19.000|12 |1 |19 |

|2019-06-24 12:01:19.000|12 |1 |19 |

|2019-11-16 16:44:55.406|16 |44 |55 |

|2019-11-16 16:50:59.406|16 |50 |59 |

+-----------------------+----+------+------+
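The tables above also list unix_timestamp() and from_unixtime(), which are not demonstrated elsewhere in this section; here is a minimal sketch (the sample value and the output format are illustrative, and the spark implicits imported earlier are assumed):

import org.apache.spark.sql.functions._

Seq(("2019-07-01 12:01:19"))
  .toDF("input_timestamp")
  .select(
    col("input_timestamp"),
    unix_timestamp(col("input_timestamp")).as("unix_seconds"),   // string to epoch seconds
    from_unixtime(unix_timestamp(col("input_timestamp")), "MM-dd-yyyy HH:mm:ss").as("formatted")
  ).show(false)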

Conclusion:
In this post, I've consolidated the complete list of Spark Date and Timestamp functions with a
description and examples of the most commonly used ones.


Spark SQL Map functions – complete list


In this article, I will explain the usage of the Spark SQL map
functions map(), map_keys(), map_values(), map_concat(), map_from_entries() and others on
DataFrame columns using Scala examples.
Though I've explained this here with Scala, a similar approach works for the Spark SQL map
functions with PySpark, and if time permits I will cover it in the future. If you are looking for
PySpark, I would still recommend reading through this article as it will give you an idea of the
Spark map functions and their usage.
Spark SQL provides built-in standard map functions defined in the DataFrame API; these come in
handy when we need to perform operations on map (MapType) columns. All these functions accept
a map column as input, plus several other arguments depending on the function.
When possible, try to leverage the standard library functions, as they are a little more compile-time
safe, handle nulls, and perform better than UDFs. If your application is performance critical, try to
avoid custom UDFs at all costs, as their performance is not guaranteed.

Spark SQL map Functions


Spark SQL map functions are grouped under "collection_funcs" in Spark SQL along with several array
functions. These map functions are useful when we want to concatenate two or more map columns,
convert an array of StructType entries to a map column, etc.

map Creates a new map column.

map_keys Returns an array containing the keys of the map.

map_values Returns an array containing the values of the map.

map_concat Merges maps specified in arguments.

map_from_entries Returns a map from the given array of StructType entries.

map_entries Returns an array of all StructType in the given map.

explode(e: Column) Creates a new row for every key-value pair in the map by
ignoring null & empty. It creates two new columns one for
key and one for value.

explode_outer(e: Column) Creates a new row for every key-value pair in the map
including null & empty. It creates two new columns one for

key and one for value.

posexplode(e: Column) Creates a new row for each key-value pair in a map by
ignoring null & empty. It also creates 3 columns “pos” to
hold the position of the map element, “key” and “value”
columns for every row.

posexplode_outer(e: Column) Creates a new row for each key-value pair in a map
including null & empty. It also creates 3 columns “pos” to
hold the position of the map element, “key” and “value”
columns for every row.

transform_keys(expr: Column, f: (Column, Column) => Column)   Transforms a map by applying the function to each key-value pair and returns a map with the transformed keys.

transform_values(expr: Column, f: (Column, Column) => Column)   Transforms a map by applying the function to each key-value pair and returns a map with the transformed values.

map_zip_with( Merges two maps into a single map.


left: Column,
right: Column,
f: (Column, Column, Column) =>
Column)

element_at(column: Column, Returns a value of a key in a map.


value: Any)

size(e: Column) Returns length of a map column.

Before we start, let’s create a DataFrame with some sample data to work with.
val structureData = Seq(
Row("36636","Finance",Row(3000,"USA")),
Row("40288","Finance",Row(5000,"IND")),
Row("42114","Sales",Row(3900,"USA")),
Row("39192","Marketing",Row(2500,"CAN")),
Row("34534","Sales",Row(6500,"USA"))
)
val structureSchema = new StructType()
.add("id",StringType)

.add("dept",StringType)
.add("properties",new StructType()
.add("salary",IntegerType)
.add("location",StringType)
)
var df = spark.createDataFrame(
spark.sparkContext.parallelize(structureData),structureSchema)
df.printSchema()
df.show(false)

root

|-- id: string (nullable = true)

|-- dept: string (nullable = true)

|-- properties: struct (nullable = true)

| |-- salary: integer (nullable = true)

| |-- location: string (nullable = true)

+-----+---------+-----------+

|id |dept |properties |

+-----+---------+-----------+

|36636|Finance |[3000, USA]|

|40288|Finance |[5000, IND]|

|42114|Sales |[3900, USA]|

|39192|Marketing|[2500, CAN]|

|34534|Sales |[6500, USA]|

+-----+---------+-----------+

map() – Spark SQL map functions


Syntax - map(cols: Column*): Column
org.apache.spark.sql.functions.map() is used to create a map column of MapType on a DataFrame.
The input columns to the map function must be grouped as key-value pairs, e.g.
(key1, value1, key2, value2, ...).
Note: all key columns must have the same data type and can't be null, and all value columns must
have the same data type. The snippet below converts all columns from the "properties" struct into
map key-value pairs in a "propertiesMap" column.
val index = df.schema.fieldIndex("properties")
val propSchema = df.schema(index).dataType.asInstanceOf[StructType]
var columns = mutable.LinkedHashSet[Column]()
propSchema.fields.foreach(field =>{

columns.add(lit(field.name))
columns.add(col("properties." + field.name)) })
df = df.withColumn("propertiesMap",map(columns.toSeq:_*))
df = df.drop("properties")
df.printSchema()
df.show(false)
First, we find the "properties" column on the Spark DataFrame using df.schema.fieldIndex("properties")
and collect all of its fields and values into a LinkedHashSet. We need a LinkedHashSet in order to
maintain the insertion order of the key-value pairs, and finally we use the map() function with the
key-value set.
root

|-- id: string (nullable = true)

|-- dept: string (nullable = true)

|-- propertiesMap: map (nullable = false)

| |-- key: string

| |-- value: string (valueContainsNull = true)

+-----+---------+---------------------------------+

|id |dept |propertiesMap |

+-----+---------+---------------------------------+

|36636|Finance |[salary -> 3000, location -> USA]|

|40288|Finance |[salary -> 5000, location -> IND]|

|42114|Sales |[salary -> 3900, location -> USA]|

|39192|Marketing|[salary -> 2500, location -> CAN]|

|34534|Sales |[salary -> 6500, location -> USA]|

+-----+---------+---------------------------------+

map_keys() – Returns map keys from a Spark SQL DataFrame


Syntax - map_keys(e: Column): Column
Use the map_keys() Spark function in order to retrieve all keys from a Spark DataFrame MapType
column. Note that map_keys takes a MapType argument; passing any other type returns an error at
run time.
df.select(col("id"),map_keys(col("propertiesMap"))).show(false)
+-----+-----------------------+

|id |map_keys(propertiesMap)|

+-----+-----------------------+

|36636|[salary, location] |

|40288|[salary, location] |

|42114|[salary, location] |

|39192|[salary, location] |

|34534|[salary, location] |

map_values() – Returns map values from a Spark DataFrame


Syntax - map_values(e: Column): Column
Use the map_values() Spark function in order to retrieve all values from a Spark DataFrame MapType
column. Note that map_values takes a MapType argument; passing any other type returns an error at
run time.
df.select(col("id"),map_values(col("propertiesMap")))
.show(false)

+-----+-------------------------+

|id |map_values(propertiesMap)|

+-----+-------------------------+

|36636|[3000, USA] |

|40288|[5000, IND] |

|42114|[3900, USA] |

|39192|[2500, CAN] |

|34534|[6500, USA] |

+-----+-------------------------+

map_concat() – Concatenating two or more maps on DataFrame


Syntax - map_concat(cols: Column*): Column
Use Spark SQL map_concat() function in order to concatenate keys and values from more than
one map to a single map. All arguments to this function should be MapType, passing any other
type results a run time error.

val arrayStructureData = Seq(


Row("James",List(Row("Newark","NY"),Row("Brooklyn","NY")),Map("hair"->"black","eye"-
>"brown"), Map("height"->"5.9")),
Row("Michael",List(Row("SanJose","CA"),Row("Sandiago","CA")),Map("hair"->"brown","eye"-
>"black"),Map("height"->"6")),
Row("Robert",List(Row("LasVegas","NV")),Map("hair"->"red","eye"->"gray"),Map("height"-
>"6.3")),
Row("Maria",null,Map("hair"->"blond","eye"->"red"),Map("height"->"5.6")),
Row("Jen",List(Row("LAX","CA"),Row("Orange","CA")),Map("white"->"black","eye"-
>"black"),Map("height"->"5.2"))
)


val arrayStructureSchema = new StructType()


.add("name",StringType)
.add("addresses", ArrayType(new StructType()
.add("city",StringType)
.add("state",StringType)))
.add("properties", MapType(StringType,StringType))
.add("secondProp", MapType(StringType,StringType))
val concatDF = spark.createDataFrame(
spark.sparkContext.parallelize(arrayStructureData),arrayStructureSchema)
concatDF.withColumn("mapConcat",map_concat(col("properties"),col("secondProp")))
.select("name","mapConcat")
.show(false)
+-------+---------------------------------------------+

|name |mapConcat |

+-------+---------------------------------------------+

|James |[hair -> black, eye -> brown, height -> 5.9] |

|Michael|[hair -> brown, eye -> black, height -> 6] |

|Robert |[hair -> red, eye -> gray, height -> 6.3] |

|Maria |[hair -> blond, eye -> red, height -> 5.6] |

|Jen |[white -> black, eye -> black, height -> 5.2]|

+-------+---------------------------------------------+

map_from_entries() – convert array of StructType entries to map


Use the map_from_entries() SQL function to convert an array of StructType entries to a map
(MapType) on a Spark DataFrame. This function takes a DataFrame column of ArrayType[StructType]
as an argument; passing any other type results in an error.
Syntax - map_from_entries(e: Column): Column
concatDF.withColumn("mapFromEntries",map_from_entries(col("addresses")))
.select("name","mapFromEntries")
.show(false)
+-------+-------------------------------+

|name |mapFromEntries |

+-------+-------------------------------+

|James |[Newark -> NY, Brooklyn -> NY] |

|Michael|[SanJose -> CA, Sandiago -> CA]|

|Robert |[LasVegas -> NV] |

|Maria |null |

|Jen |[LAX -> CA, Orange -> CA] |

map_entries() – convert a map column to an array of StructType

Syntax - map_entries(e: Column): Column


Use the Spark SQL map_entries() function to convert a map column into an array of key-value
StructType entries on a DataFrame; a short sketch follows.
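A minimal sketch, reusing the concatDF and its "properties" map column created above; the output column name is illustrative.

import org.apache.spark.sql.functions.map_entries

// Each map value becomes an array of (key, value) structs
concatDF.withColumn("propEntries", map_entries(col("properties")))
  .select("name", "propEntries")
  .show(false)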

Complete Spark SQL map functions example


package com.sparkbyexamples.spark.dataframe.functions.collection
import org.apache.spark.sql.functions.{col, explode, lit, map, map_concat, map_from_entries,
map_keys, map_values}
import org.apache.spark.sql.types.{ArrayType, IntegerType, MapType, StringType, StructType}
import org.apache.spark.sql.{Column, Row, SparkSession}
import scala.collection.mutable
object MapFunctions extends App {
val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()
import spark.implicits._
val structureData = Seq(
Row("36636","Finance",Row(3000,"USA")),
Row("40288","Finance",Row(5000,"IND")),
Row("42114","Sales",Row(3900,"USA")),
Row("39192","Marketing",Row(2500,"CAN")),
Row("34534","Sales",Row(6500,"USA"))
)
val structureSchema = new StructType()
.add("id",StringType)
.add("dept",StringType)
.add("properties",new StructType()
.add("salary",IntegerType)


.add("location",StringType)
)
var df = spark.createDataFrame(
spark.sparkContext.parallelize(structureData),structureSchema)
df.printSchema()
df.show(false)
// Convert to Map
val index = df.schema.fieldIndex("properties")
val propSchema = df.schema(index).dataType.asInstanceOf[StructType]
var columns = mutable.LinkedHashSet[Column]()
propSchema.fields.foreach(field =>{
columns.add(lit(field.name))
columns.add(col("properties." + field.name))
})
df = df.withColumn("propertiesMap",map(columns.toSeq:_*))
df = df.drop("properties")
df.printSchema()
df.show(false)
//Retrieve all keys from a Map
val keys = df.select(explode(map_keys($"propertiesMap"))).as[String].distinct.collect
print(keys.mkString(","))
// map_keys
df.select(col("id"),map_keys(col("propertiesMap")))
.show(false)
//map_values
df.select(col("id"),map_values(col("propertiesMap")))
.show(false)
//Creating DF with MapType
val arrayStructureData = Seq(
Row("James",List(Row("Newark","NY"),Row("Brooklyn","NY")),Map("hair"->"black","eye"-
>"brown"), Map("height"->"5.9")),
Row("Michael",List(Row("SanJose","CA"),Row("Sandiago","CA")),Map("hair"->"brown","eye"-
>"black"),Map("height"->"6")),


Row("Robert",List(Row("LasVegas","NV")),Map("hair"->"red","eye"->"gray"),Map("height"-
>"6.3")),
Row("Maria",null,Map("hair"->"blond","eye"->"red"),Map("height"->"5.6")),
Row("Jen",List(Row("LAX","CA"),Row("Orange","CA")),Map("white"->"black","eye"-
>"black"),Map("height"->"5.2"))
)
val arrayStructureSchema = new StructType()
.add("name",StringType)
.add("addresses", ArrayType(new StructType()
.add("city",StringType)
.add("state",StringType)))
.add("properties", MapType(StringType,StringType))
.add("secondProp", MapType(StringType,StringType))
val concatDF = spark.createDataFrame(
spark.sparkContext.parallelize(arrayStructureData),arrayStructureSchema)
concatDF.printSchema()
concatDF.show()
concatDF.withColumn("mapConcat",map_concat(col("properties"),col("secondProp")))
.select("name","mapConcat")
.show(false)
concatDF.withColumn("mapFromEntries",map_from_entries(col("addresses")))
.select("name","mapFromEntries")
.show(false)
}
Conclusion
In this article, you have learned how to convert an array of StructType to a map, convert a map to an
array of StructType, and concatenate several maps using SQL map functions on Spark DataFrame
columns.


Spark SQL Sort Functions – Complete List


Spark SQL provides built-in standard sort functions defined in the DataFrame API; these come in
handy when we need to sort DataFrame columns. All of these accept a column name as a String and
return a Column type.
When possible, try to leverage the standard library functions, as they are a little more compile-time
safe, handle nulls, and perform better than UDFs. If your application is performance critical, try to
avoid custom UDFs at all costs, as UDFs do not guarantee performance.
Spark SQL sort functions are grouped as "sort_funcs" in Spark SQL; these come in handy when we
want to perform ascending or descending operations on columns. They are primarily used with the
sort function of a DataFrame or Dataset; a usage sketch is shown after the function list below.

SPARK SQL SORT FUNCTION SYNTAX   SPARK FUNCTION DESCRIPTION

asc(columnName: String): Column asc function is used to specify the ascending order of
the sorting column on DataFrame or DataSet

asc_nulls_first(columnName: String): Similar to asc function but null values return first and
Column then non-null values

asc_nulls_last(columnName: String): Similar to asc function but non-null values return first
Column and then null values

desc(columnName: String): Column desc function is used to specify the descending order of
the DataFrame or DataSet sorting column.

desc_nulls_first(columnName: String): Similar to desc function but null values return first and
Column then non-null values.

desc_nulls_last(columnName: String): Similar to desc function but non-null values return first
Column and then null values.

asc() – ascending function


asc function is used to specify the ascending order of the sorting column on DataFrame or
DataSet.
Syntax: asc(columnName: String): Column
asc_nulls_first() – ascending with nulls first


Similar to asc function but null values return first and then non-null values.
asc_nulls_first(columnName: String): Column
asc_nulls_last() – ascending with nulls last
Similar to asc function but non-null values return first and then null values.
asc_nulls_last(columnName: String): Column

desc() – descending function


desc function is used to specify the descending order of the DataFrame or DataSet sorting
column.
desc(columnName: String): Column
desc_nulls_first() – descending with nulls first
Similar to desc function but null values return first and then non-null values.
desc_nulls_first(columnName: String): Column
desc_nulls_last() – descending with nulls last
Similar to desc function but non-null values return first and then null values.
desc_nulls_last(columnName: String): Column
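A minimal sketch of these sort functions in action; the sample data is made up for illustration and a SparkSession named spark with its implicits is assumed.

import org.apache.spark.sql.functions._
import spark.implicits._

val people = Seq(("James", Some(3000)), ("Ann", None), ("Jeff", Some(2000)))
  .toDF("name", "salary")

// Ascending by salary with nulls placed last, then descending by name
people.sort(asc_nulls_last("salary"), desc("name")).show(false)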
______________________________________________________________________________

Spark SQL Aggregate Functions


Spark SQL provides built-in standard aggregate functions defined in the DataFrame API; these come
in handy when we need to perform aggregate operations on DataFrame columns. Aggregate
functions operate on a group of rows and calculate a single return value for every group.
All these aggregate functions accept input as a Column type or a column name as a String, plus
several other arguments depending on the function, and they return a Column type.
When possible, try to leverage the standard library functions, as they are a little more compile-time
safe, handle nulls, and perform better than UDFs. If your application is performance critical, try to
avoid custom UDFs at all costs, as their performance is not guaranteed.

Spark Aggregate Functions


Spark SQL aggregate functions are grouped as "agg_funcs" in Spark SQL. Below is a list of functions
defined under this group. Click on each link to learn more with a Scala example.

Note that each and every function below has another signature which takes a String as a column
name instead of a Column.

AGGREGATE FUNCTION SYNTAX   AGGREGATE FUNCTION DESCRIPTION

approx_count_distinct(e: Column)   Returns the approximate count of distinct items in a group.

approx_count_distinct(e: Column, rsd: Double)   Returns the approximate count of distinct items in a group, with rsd as the maximum allowed relative standard deviation of the estimate.

avg(e: Column) Returns the average of values in the input column.

collect_list(e: Column) Returns all values from an input column with duplicates.

collect_set(e: Column)   Returns all values from an input column with duplicate values eliminated.

corr(column1: Column, column2: Returns the Pearson Correlation Coefficient for two columns.
Column)

count(e: Column) Returns number of elements in a column.

countDistinct(expr: Column, Returns number of distinct elements in the columns.


exprs: Column*)

covar_pop(column1: Column, Returns the population covariance for two columns.


column2: Column)

covar_samp(column1: Column, Returns the sample covariance for two columns.


column2: Column)

first(e: Column, ignoreNulls: Boolean)   Returns the first element in a column; when ignoreNulls is set to true, it returns the first non-null element.

first(e: Column): Column Returns the first element in a column.

grouping(e: Column) Indicates whether a specified column in a GROUP BY list is


aggregated or not, returns 1 for aggregated or 0 for not
aggregated in the result set.


kurtosis(e: Column) Returns the kurtosis of the values in a group.

last(e: Column, ignoreNulls: Boolean)   Returns the last element in a column; when ignoreNulls is set to true, it returns the last non-null element.

last(e: Column) Returns the last element in a column.

max(e: Column) Returns the maximum value in a column.

mean(e: Column) Alias for Avg. Returns the average of the values in a column.

min(e: Column) Returns the minimum value in a column.

skewness(e: Column) Returns the skewness of the values in a group.

stddev(e: Column) alias for `stddev_samp`.

stddev_samp(e: Column) Returns the sample standard deviation of values in a


column.

stddev_pop(e: Column) Returns the population standard deviation of the values in a


column.

sum(e: Column) Returns the sum of all values in a column.

sumDistinct(e: Column) Returns the sum of all distinct values in a column.

variance(e: Column) alias for `var_samp`.

var_samp(e: Column) Returns the unbiased variance of the values in a column.

var_pop(e: Column) returns the population variance of the values in a column.


Aggregate Functions Examples


First, let's create a DataFrame to work with aggregate functions. All the examples provided here are
also available in the GitHub project.
import spark.implicits._
val simpleData = Seq(("James", "Sales", 3000),
("Michael", "Sales", 4600),
("Robert", "Sales", 4100),
("Maria", "Finance", 3000),
("James", "Sales", 3000),
("Scott", "Finance", 3300),
("Jen", "Finance", 3900),
("Jeff", "Marketing", 3000),
("Kumar", "Marketing", 2000),
("Saif", "Sales", 4100)
)
val df = simpleData.toDF("employee_name", "department", "salary")
df.show()
+-------------+----------+------+

|employee_name|department|salary|

+-------------+----------+------+

| James| Sales| 3000|

| Michael| Sales| 4600|

| Robert| Sales| 4100|

| Maria| Finance| 3000|

| James| Sales| 3000|

| Scott| Finance| 3300|

| Jen| Finance| 3900|

| Jeff| Marketing| 3000|

| Kumar| Marketing| 2000|

| Saif| Sales| 4100|

+-------------+----------+------+


approx_count_distinct Aggregate Function


approx_count_distinct() function returns the count of distinct items in a group.

//approx_count_distinct()
println("approx_count_distinct: "+
df.select(approx_count_distinct("salary")).collect()(0)(0))

//Prints approx_count_distinct: 6
avg (average) Aggregate Function
avg() function returns the average of values in the input column.

//avg
println("avg: "+
df.select(avg("salary")).collect()(0)(0))

//Prints avg: 3400.0


collect_list Aggregate Function
collect_list() function returns all values from an input column with duplicates.

//collect_list
df.select(collect_list("salary")).show(false)

+------------------------------------------------------------+
|collect_list(salary) |
+------------------------------------------------------------+
|[3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]|
+------------------------------------------------------------+


collect_set Aggregate Function


collect_set() function returns all values from an input column with duplicate values eliminated.

//collect_set
df.select(collect_set("salary")).show(false)
+------------------------------------+

|collect_set(salary) |

+------------------------------------+

|[4600, 3000, 3900, 4100, 3300, 2000]|

+------------------------------------+

countDistinct Aggregate Function


countDistinct() function returns the number of distinct elements in a column (or distinct combinations of the given columns).
//countDistinct
val df2 = df.select(countDistinct("department", "salary"))
df2.show(false)
println("Distinct Count of Department & Salary: "+df2.collect()(0)(0))

count function()
count() function returns number of elements in a column.

println("count: "+
df.select(count("salary")).collect()(0))

//Prints count: 10

grouping function()
grouping() indicates whether a given input column is aggregated or not; it returns 1 for aggregated
and 0 for not aggregated in the result. If you try grouping() directly on the salary column you will get
the error below; a working sketch with rollup() follows.
Exception in thread "main" org.apache.spark.sql.AnalysisException:
// grouping() can only be used with GroupingSets/Cube/Rollup
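A minimal sketch of grouping() used together with rollup() on the df defined above; the output column names are illustrative.

import org.apache.spark.sql.functions.{grouping, sum}

// rollup adds a grand-total row; grouping() is 1 on that row and 0 otherwise
df.rollup("department")
  .agg(sum("salary").as("total_salary"), grouping("department").as("is_total_row"))
  .show(false)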


first function()
first() function returns the first element in a column. When ignoreNulls is set to true, it returns the
first non-null element.
//first
df.select(first("salary")).show(false)
+--------------------+

|first(salary, false)|

+--------------------+

|3000 |

last()
last() function returns the last element in a column. When ignoreNulls is set to true, it returns the
last non-null element.
//last
df.select(last("salary")).show(false)
+-------------------+

|last(salary, false)|

+-------------------+

|4100 |

kurtosis()
kurtosis() function returns the kurtosis of the values in a group.
df.select(kurtosis("salary")).show(false)
+-------------------+

|kurtosis(salary) |

+-------------------+

|-0.6467803030303032|

max()
max() function returns the maximum value in a column.
df.select(max("salary")).show(false)
+-----------+
|max(salary)|
+-----------+
|4600 |

min()
min() function returns the minimum value in a column.
df.select(min("salary")).show(false)
+-----------+

|min(salary)|

+-----------+

|2000 |

mean()
mean() function returns the average of the values in a column. Alias for Avg
df.select(mean("salary")).show(false)
+-----------+

|avg(salary)|

+-----------+

|3400.0 |

skewness()
skewness() function returns the skewness of the values in a group.
df.select(skewness("salary")).show(false)
+--------------------+

|skewness(salary) |

+--------------------+

|-0.12041791181069571|

stddev(), stddev_samp() and stddev_pop()


stddev() alias for stddev_samp.
stddev_samp() function returns the sample standard deviation of values in a column.
stddev_pop() function returns the population standard deviation of the values in a column.

df.select(stddev("salary"), stddev_samp("salary"),
stddev_pop("salary")).show(false)
+-------------------+-------------------+------------------+
|stddev_samp(salary)|stddev_samp(salary)|stddev_pop(salary)|

+-------------------+-------------------+------------------+
|765.9416862050705 |765.9416862050705 |726.636084983398 |

sum()
sum() function Returns the sum of all values in a column.
df.select(sum("salary")).show(false)
+-----------+

|sum(salary)|

+-----------+

|34000 |

sumDistinct()
sumDistinct() function returns the sum of all distinct values in a column.
df.select(sumDistinct("salary")).show(false)
+--------------------+

|sum(DISTINCT salary)|

+--------------------+

|20900 |

variance(), var_samp(), var_pop()


variance() alias for var_samp
var_samp() function returns the unbiased variance of the values in a column.
var_pop() function returns the population variance of the values in a column.

df.select(variance("salary"),var_samp("salary"),var_pop("salary"))
.show(false)

+-----------------+-----------------+---------------+

|var_samp(salary) |var_samp(salary) |var_pop(salary)|

+-----------------+-----------------+---------------+

|586666.6666666666|586666.6666666666|528000.0 |
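All of the examples above aggregate over the entire DataFrame; in practice these functions are most often applied per group. A minimal sketch using the same df (the alias names are illustrative):

df.groupBy("department")
  .agg(
    count("salary").as("employees"),
    sum("salary").as("total_salary"),
    avg("salary").as("avg_salary"),
    max("salary").as("max_salary")
  ).show(false)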

Source code of Spark SQL Aggregate Functions examples

package com.sparkbyexamples.spark.dataframe.functions.aggregate

import org.apache.spark.sql.SparkSession

import org.apache.spark.sql.functions._

object AggregateFunctions extends App {


val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

import spark.implicits._
val simpleData = Seq(("James", "Sales", 3000),
("Michael", "Sales", 4600),
("Robert", "Sales", 4100),
("Maria", "Finance", 3000),
("James", "Sales", 3000),
("Scott", "Finance", 3300),
("Jen", "Finance", 3900),
("Jeff", "Marketing", 3000),
("Kumar", "Marketing", 2000),
("Saif", "Sales", 4100)
)
val df = simpleData.toDF("employee_name", "department", "salary")
df.show()

//approx_count_distinct()
println("approx_count_distinct: "+
df.select(approx_count_distinct("salary")).collect()(0)(0))

//avg
println("avg: "+
df.select(avg("salary")).collect()(0)(0))


//collect_list
df.select(collect_list("salary")).show(false)
//collect_set
df.select(collect_set("salary")).show(false)

//countDistinct
val df2 = df.select(countDistinct("department", "salary"))
df2.show(false)
println("Distinct Count of Department & Salary: "+df2.collect()(0)(0))

println("count: "+
df.select(count("salary")).collect()(0))

//first
df.select(first("salary")).show(false)

//last
df.select(last("salary")).show(false)

//Exception in thread "main" org.apache.spark.sql.AnalysisException:


// grouping() can only be used with GroupingSets/Cube/Rollup;
//df.select(grouping("salary")).show(false)

df.select(kurtosis("salary")).show(false)

df.select(max("salary")).show(false)

df.select(min("salary")).show(false)

df.select(mean("salary")).show(false)

df.select(skewness("salary")).show(false)


df.select(stddev("salary"), stddev_samp("salary"),
stddev_pop("salary")).show(false)

df.select(sum("salary")).show(false)

df.select(sumDistinct("salary")).show(false)

df.select(variance("salary"),var_samp("salary"),
var_pop("salary")).show(false)
}
Conclusion
In this article, I've consolidated and listed all Spark SQL aggregate functions with Scala examples
and also covered the benefits of using Spark SQL functions.

Spark Window Functions with Examples


Spark window functions are used to calculate results, such as the rank or row number, over a range
of input rows, and they are available by importing org.apache.spark.sql.functions._. This article
explains the concept of window functions, their usage and syntax, and finally how to use them with
Spark SQL and Spark's DataFrame API. These come in handy when we need to perform aggregate
operations within a specific window frame on DataFrame columns.
When possible, try to leverage the standard library functions, as they are a little more compile-time
safe, handle nulls, and perform better than UDFs. If your application is performance critical, try to
avoid custom UDFs at all costs, as their performance is not guaranteed.
1. Spark Window Functions
Spark window functions operate on a group of rows (like a frame or partition) and return a single
value for every input row. Spark SQL supports three kinds of window functions:
 ranking functions
 analytic functions
 aggregate functions


The table below defines the ranking and analytic functions; for aggregate functions, we can use any
existing aggregate function as a window function.
To perform an operation on a group, we first need to partition the data using Window.partitionBy(),
and for the row number and rank functions we additionally need to order the partitioned data using
the orderBy clause.
Click on each link to learn more about these functions along with the Scala examples.

WINDOW FUNCTION USAGE & SYNTAX   WINDOW FUNCTION DESCRIPTION

row_number(): Column Returns a sequential number starting from 1 within a


window partition

rank(): Column Returns the rank of rows within a window partition, with
gaps.

percent_rank(): Column Returns the percentile rank of rows within a window


partition.

dense_rank(): Column Returns the rank of rows within a window partition without
any gaps. Where as Rank() returns rank with gaps.

ntile(n: Int): Column Returns the ntile id in a window partition

cume_dist(): Column Returns the cumulative distribution of values within a


window partition

lag(e: Column, offset: Int): Column
lag(columnName: String, offset: Int): Column
lag(columnName: String, offset: Int, defaultValue: Any): Column   Returns the value that is `offset` rows before the current row, and `null` (or `defaultValue` if supplied) if there are fewer than `offset` rows before the current row.

Before we start with an example, first let’s create a DataFrame to work with.
import spark.implicits._

val simpleData = Seq(("James", "Sales", 3000),


("Michael", "Sales", 4600),
("Robert", "Sales", 4100),
("Maria", "Finance", 3000),
("James", "Sales", 3000),
("Scott", "Finance", 3300),
("Jen", "Finance", 3900),
("Jeff", "Marketing", 3000),
("Kumar", "Marketing", 2000),
("Saif", "Sales", 4100)
)
val df = simpleData.toDF("employee_name", "department", "salary")
df.show()

+-------------+----------+------+

|employee_name|department|salary|

+-------------+----------+------+

| James| Sales| 3000|

| Michael| Sales| 4600|

| Robert| Sales| 4100|

| Maria| Finance| 3000|

| James| Sales| 3000|

| Scott| Finance| 3300|

| Jen| Finance| 3900|

| Jeff| Marketing| 3000|

| Kumar| Marketing| 2000|

| Saif| Sales| 4100|

+-------------+----------+------+

2. Spark Window Ranking functions


2.1 row_number Window Function
row_number() window function is used to give the sequential row number starting from 1 to the
result of each window partition.

import org.apache.spark.sql.functions._


import org.apache.spark.sql.expressions.Window

//row_number
val windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("row_number",row_number.over(windowSpec))
.show()
+-------------+----------+------+----------+

|employee_name|department|salary|row_number|

+-------------+----------+------+----------+

| James| Sales| 3000| 1|

| James| Sales| 3000| 2|

| Robert| Sales| 4100| 3|

| Saif| Sales| 4100| 4|

| Michael| Sales| 4600| 5|

| Maria| Finance| 3000| 1|

| Scott| Finance| 3300| 2|

| Jen| Finance| 3900| 3|

| Kumar| Marketing| 2000| 1|

| Jeff| Marketing| 3000| 2|

+-------------+----------+------+----------+

2.2 rank Window Function


rank() window function is used to provide a rank to the result within a window partition. This
function leaves gaps in rank when there are ties.
import org.apache.spark.sql.functions._

//rank
df.withColumn("rank",rank().over(windowSpec))
.show()
+-------------+----------+------+----+

|employee_name|department|salary|rank|

+-------------+----------+------+----+

| James| Sales| 3000| 1|

| James| Sales| 3000| 1|

| Robert| Sales| 4100| 3|

| Saif| Sales| 4100| 3|

| Michael| Sales| 4600| 5|


| Maria| Finance| 3000| 1|

| Scott| Finance| 3300| 2|

| Jen| Finance| 3900| 3|

| Kumar| Marketing| 2000| 1|

| Jeff| Marketing| 3000| 2|

+-------------+----------+------+----+

This is the same as the RANK function in SQL.


2.3 dense_rank Window Function
dense_rank() window function is used to get the rank of rows within a window partition without any gaps. This is similar to the rank() function, the difference being that rank() leaves gaps in rank when there are ties.
import org.apache.spark.sql.functions._
//dens_rank
df.withColumn("dense_rank",dense_rank().over(windowSpec))
.show()

+-------------+----------+------+----------+

|employee_name|department|salary|dense_rank|

+-------------+----------+------+----------+

| James| Sales| 3000| 1|

| James| Sales| 3000| 1|

| Robert| Sales| 4100| 2|

| Saif| Sales| 4100| 2|

| Michael| Sales| 4600| 3|

| Maria| Finance| 3000| 1|

| Scott| Finance| 3300| 2|

| Jen| Finance| 3900| 3|

| Kumar| Marketing| 2000| 1|

| Jeff| Marketing| 3000| 2|

+-------------+----------+------+----------+

This is the same as the DENSE_RANK function in SQL.

2.4 percent_rank Window Function

percent_rank() window function returns the percentile rank of each row within a window partition, computed as (rank - 1) / (number of rows in the partition - 1).
import org.apache.spark.sql.functions._
//percent_rank
df.withColumn("percent_rank",percent_rank().over(windowSpec))
.show()

+-------------+----------+------+------------+

|employee_name|department|salary|percent_rank|

+-------------+----------+------+------------+

| James| Sales| 3000| 0.0|

| James| Sales| 3000| 0.0|

| Robert| Sales| 4100| 0.5|

| Saif| Sales| 4100| 0.5|

| Michael| Sales| 4600| 1.0|

| Maria| Finance| 3000| 0.0|

| Scott| Finance| 3300| 0.5|

| Jen| Finance| 3900| 1.0|

| Kumar| Marketing| 2000| 0.0|

| Jeff| Marketing| 3000| 1.0|

+-------------+----------+------+------------+

This is the same as the PERCENT_RANK function in SQL.

2.5 ntile Window Function


ntile() window function returns the relative rank of result rows within a window partition. In the below example, 2 is used as the argument to ntile, hence it splits each partition into 2 groups and returns the group id (1 or 2) for each row.
//ntile
df.withColumn("ntile",ntile(2).over(windowSpec))
.show()

+-------------+----------+------+-----+

|employee_name|department|salary|ntile|

+-------------+----------+------+-----+

| James| Sales| 3000| 1|

| James| Sales| 3000| 1|

| Robert| Sales| 4100| 1|

| Saif| Sales| 4100| 2|

| Michael| Sales| 4600| 2|

| Maria| Finance| 3000| 1|

| Scott| Finance| 3300| 1|

| Jen| Finance| 3900| 2|

| Kumar| Marketing| 2000| 1|

| Jeff| Marketing| 3000| 2|

+-------------+----------+------+-----+

This is the same as the NTILE function in SQL.

3. Spark Window Analytic functions


3.1 cume_dist Window Function
cume_dist() window function is used to get the cumulative distribution of values within a window
partition.
This is the same as the CUME_DIST function in SQL.
//cume_dist
df.withColumn("cume_dist",cume_dist().over(windowSpec))
.show()
+-------------+----------+------+------------------+

|employee_name|department|salary| cume_dist|

+-------------+----------+------+------------------+

| James| Sales| 3000| 0.4|

| James| Sales| 3000| 0.4|

| Robert| Sales| 4100| 0.8|

| Saif| Sales| 4100| 0.8|

| Michael| Sales| 4600| 1.0|

| Maria| Finance| 3000|0.3333333333333333|

| Scott| Finance| 3300|0.6666666666666666|

| Jen| Finance| 3900| 1.0|

| Kumar| Marketing| 2000| 0.5|

| Jeff| Marketing| 3000| 1.0|

+-------------+----------+------+------------------+

3.2 lag Window Function


lag() window function provides access to the value that is `offset` rows before the current row within the partition, returning null when there are fewer than `offset` rows before the current row. This is the same as the LAG function in SQL.
//lag
df.withColumn("lag",lag("salary",2).over(windowSpec))
.show()
+-------------+----------+------+----+

|employee_name|department|salary| lag|
+-------------+----------+------+----+

| James| Sales| 3000|null|

| James| Sales| 3000|null|

| Robert| Sales| 4100|3000|

| Saif| Sales| 4100|3000|

| Michael| Sales| 4600|4100|

| Maria| Finance| 3000|null|

| Scott| Finance| 3300|null|

| Jen| Finance| 3900|3000|

| Kumar| Marketing| 2000|null|

| Jeff| Marketing| 3000|null|

+-------------+----------+------+----+

3.3 lead Window Function


lead() window function provides access to the value that is `offset` rows after the current row within the partition, returning null when there are fewer than `offset` rows after the current row. This is the same as the LEAD function in SQL.
//lead
df.withColumn("lead",lead("salary",2).over(windowSpec))
.show()
+-------------+----------+------+----+

|employee_name|department|salary|lead|

+-------------+----------+------+----+

| James| Sales| 3000|4100|

| James| Sales| 3000|4100|

| Robert| Sales| 4100|4600|

| Saif| Sales| 4100|null|

| Michael| Sales| 4600|null|

| Maria| Finance| 3000|3900|

| Scott| Finance| 3300|null|

| Jen| Finance| 3900|null|

| Kumar| Marketing| 2000|null|

| Jeff| Marketing| 3000|null|

+-------------+----------+------+----+

4. Spark Window Aggregate Functions


In this section, I will explain how to calculate the avg, sum, min, and max for each department using Spark SQL aggregate window functions and WindowSpec. When using aggregate functions as window functions, the orderBy clause is not required on the WindowSpec.
val windowSpec = Window.partitionBy("department").orderBy("salary")

val windowSpecAgg = Window.partitionBy("department")


val aggDF = df.withColumn("row",row_number.over(windowSpec))
.withColumn("avg", avg(col("salary")).over(windowSpecAgg))
.withColumn("sum", sum(col("salary")).over(windowSpecAgg))
.withColumn("min", min(col("salary")).over(windowSpecAgg))
.withColumn("max", max(col("salary")).over(windowSpecAgg))
.where(col("row")===1).select("department","avg","sum","min","max")
.show()

+----------+------+-----+----+----+

|department| avg| sum| min| max|

+----------+------+-----+----+----+

| Sales|3760.0|18800|3000|4600|

| Finance|3400.0|10200|3000|3900|

| Marketing|2500.0| 5000|2000|3000|

+----------+------+-----+----+----+

5. Source Code of Window Functions Example


import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

object WindowFunctions extends App {

val spark: SparkSession = SparkSession.builder()


.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

import spark.implicits._

val simpleData = Seq(("James", "Sales", 3000),



("Michael", "Sales", 4600),


("Robert", "Sales", 4100),
("Maria", "Finance", 3000),
("James", "Sales", 3000),
("Scott", "Finance", 3300),
("Jen", "Finance", 3900),
("Jeff", "Marketing", 3000),
("Kumar", "Marketing", 2000),
("Saif", "Sales", 4100)
)
val df = simpleData.toDF("employee_name", "department", "salary")
df.show()

//row_number
val windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("row_number",row_number.over(windowSpec))
.show()

//rank
df.withColumn("rank",rank().over(windowSpec))
.show()

//dens_rank
df.withColumn("dense_rank",dense_rank().over(windowSpec))
.show()

//percent_rank
df.withColumn("percent_rank",percent_rank().over(windowSpec))
.show()

//ntile
df.withColumn("ntile",ntile(2).over(windowSpec))
.show()

//cume_dist
df.withColumn("cume_dist",cume_dist().over(windowSpec))
.show()

//lag
df.withColumn("lag",lag("salary",2).over(windowSpec))
.show()

//lead
df.withColumn("lead",lead("salary",2).over(windowSpec))
.show()

//Aggregate Functions
val windowSpecAgg = Window.partitionBy("department")
val aggDF = df.withColumn("row",row_number.over(windowSpec))
.withColumn("avg", avg(col("salary")).over(windowSpecAgg))
.withColumn("sum", sum(col("salary")).over(windowSpecAgg))
.withColumn("min", min(col("salary")).over(windowSpecAgg))
.withColumn("max", max(col("salary")).over(windowSpecAgg))
.where(col("row")===1).select("department","avg","sum","min","max")
.show()
}
6. Conclusion
In this tutorial, you have learned what Spark SQL window functions are, their syntax, and how to use them with aggregate functions, along with several examples in Scala.

Spark Read CSV file into DataFrame


Apache Spark provides a DataFrame API that allows an easy and efficient way to read a CSV file
into DataFrame. DataFrames are distributed collections of data organized into named columns.


Use spark.read.csv("path") to read a CSV file. Spark supports reading files that use pipe, comma, tab, or any other delimiter/separator.
In this tutorial, you will learn how to read a single file, multiple files, and all files from a local
directory into Spark DataFrame, apply some transformations, and finally write DataFrame back to
a CSV file using Scala.
Note: Spark out of the box supports reading files in CSV, JSON, TEXT, Parquet, Avro, ORC and
many more file formats into Spark DataFrame.
Table of contents:
 Spark Read CSV file into DataFrame
 Read CSV files with a user-specified schema
 Read multiple CSV files
 Read all CSV files in a directory
 Options while reading CSV file
o delimiter
o InferSchema
o header
o quotes
o nullValues
o dateFormat
 Applying DataFrame transformations
 Write DataFrame to CSV file
o Using options
o Saving Mode

Spark Read CSV file into DataFrame


Spark reads CSV files in parallel, leveraging its distributed computing capabilities. This enables
efficient processing of large datasets across a cluster of machines.
Using spark.read.csv("path") or spark.read.format("csv").load("path") you can read a CSV file into
a Spark DataFrame. These methods take a file path as an argument.
The CSV file used in this article can be found on GitHub. You can download it using the command below.
// Test CSV file
wget
https://github.com/spark-examples/spark-scala-examples/blob/3ea16e4c6c1614609c2bd7ebdffcee
01c0fe6017/src/main/resources/zipcodes.csv
Note: Spark uses lazy evaluation, which means that the actual reading of data doesn’t happen
until an action is triggered. This allows for optimizations in the execution plan.


// Import
import org.apache.spark.sql.SparkSession
// Create SparkSession
val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()
// Read CSV file into DataFrame
val df = spark.read.csv("src/main/resources/zipcodes.csv")
df.printSchema()
Here, spark is a SparkSession object, read returns a DataFrameReader, and csv() is a method of DataFrameReader.
This example reads the data into DataFrame columns named “_c0” for the first column, “_c1” for the second, and so on. By default, the data type of all these columns is String.

When you use format("csv") method, you can also specify the Data sources by their fully qualified
name (i.e., org.apache.spark.sql.csv), but for built-in sources, you can also use their short names
(csv,json, parquet, jdbc, text e.t.c). For example:
// Using format()
val df2 = spark.read.format("CSV").load("src/main/resources/zipcodes.csv")
df2.printSchema()

Spark Read CSV with Header


If the CSV file has a header with column names, you need to explicitly specify the header option using option("header",true). If this option is not set, the API treats the header row as a data record while reading the CSV file.
// Specify header to true to get column names from CSV file
val df3 = spark.read.option("header",true)
.csv("src/main/resources/zipcodes.csv")
df3.printSchema()
Note that it still reads all columns as strings (StringType) by default.


Read with Schema using inferSchema


Read CSV with Schema – Use option("inferSchema", true) to automatically derive each column’s type from the data. The default value of this option is false.
Note: This option requires reading the data again to infer the schema.
// User inferSchema option to get right data type
val df4 = spark.read.option("inferSchema",true)
.csv("src/main/resources/zipcodes.csv")
You should see appropriate datatypes assigned to columns.

You can use the options() method to specify multiple options at a time.
// User multiple options together
val options = Map("inferSchema"->"true","delimiter"->",","header"->"true")
val df5 = spark.read.options(options)
.csv("src/main/resources/zipcodes.csv")
df5.printSchema()

Read CSV with Custom Schema


While reading CSV files into Spark DataFrame, you can either infer the schema automatically or
explicitly specify it. Specifying the schema helps in avoiding schema inference overhead and
ensures accurate data types.
If you know the schema of the file ahead of time and do not want to use the inferSchema option, specify user-defined column names and types using the schema option.
// Import types
import org.apache.spark.sql.types._
// Read with custom schema
val schema = new StructType()

.add("RecordNumber",IntegerType,true)
.add("Zipcode",IntegerType,true)
.add("ZipCodeType",StringType,true)
.add("City",StringType,true)
.add("State",StringType,true)
.add("LocationType",StringType,true)
.add("Lat",DoubleType,true)
.add("Long",DoubleType,true)
.add("Xaxis",IntegerType,true)
.add("Yaxis",DoubleType,true)
.add("Zaxis",DoubleType,true)
.add("WorldRegion",StringType,true)
.add("Country",StringType,true)
.add("LocationText",StringType,true)
.add("Location",StringType,true)
.add("Decommisioned",BooleanType,true)
.add("TaxReturnsFiled",StringType,true)
.add("EstimatedPopulation",IntegerType,true)
.add("TotalWages",IntegerType,true)
.add("Notes",StringType,true)
// Read CSV file with custom schema
val df_with_schema = spark.read.format("csv")
.option("header", "true")
.schema(schema)
.load("src/main/resources/zipcodes.csv")
df_with_schema.printSchema()
df_with_schema.show(false)

Read Multiple CSV files


Using the spark.read.csv() method, you can also read multiple CSV files; just pass all file names separated by commas as the path, for example:

// Read multiple files


val df8 = spark.read.csv("path1,path2,path3")

Read all CSV files in a directory


We can read all CSV files from a directory into DataFrame just by passing the directory as a path
to the csv() method.

// Read all files from directory


val df8 = spark.read.csv("Folder path")
Caching & Persistence
After reading the CSV file, you can choose to persist the DataFrame in memory or on disk using
caching mechanisms. This enhances the performance of subsequent operations by avoiding
redundant reads from the file. It is a good practice to persist/cache the Spark DataFrame after reading from the CSV file when it will be reused by multiple operations.
// Caching & Persistence
val df6 = spark.read.option("inferSchema",true)
.csv("src/main/resources/zipcodes.csv")

// Cache DataFrame
val df7 = df6.cache()

Using SQL to Query


In Spark SQL, you can use the spark.sql() method to execute SQL queries on DataFrames. To
query DataFrame using Spark SQL, you can follow these steps:

// Create Temporary table


df7.createOrReplaceTempView("ZipCodes")

// Query table
spark.sql("select RecordNumber, Zipcode, ZipcodeType, City, State from ZipCodes")
.show()

Options while reading CSV file


Spark CSV dataset provides multiple options while reading such as setting delimiter, handling
header and footer, customizing null values, and more. Below are some of the most important


options explained with examples. These options are specified using


the option() or options() method.
delimiter
delimiter option is used to specify the column delimiter of the CSV file. By default, it is comma (,)
character, but can be set to pipe (|), tab, space, or any character using this option.

// Using delimiter option


val df2 = spark.read.options(Map("delimiter"->","))
.csv("src/main/resources/zipcodes.csv")
inferSchema
The default value of this option is false. When set to true, it automatically infers column types based on the data. Note that it requires reading the data one more time to infer the schema.
// Using inferSchema option
val df2 = spark.read.options(Map("inferSchema"->"true","delimiter"->","))
.csv("src/main/resources/zipcodes.csv")

header
This option is used to read the first line of the CSV file as column names. By default, the value of this option is false, and all columns are read as strings.
// Using header
val df2 = spark.read.options(Map("inferSchema"->"true","delimiter"->",","header"->"true"))
.csv("src/main/resources/zipcodes.csv")

quotes
When a column value contains the delimiter character, use the quote option to specify the quote character; by default it is the double quote (") and delimiters inside quotes are ignored. Using this option, you can set any character as the quote character.
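A minimal sketch of setting these options (this assumes the values in the file are wrapped in double quotes):

// Using quote and escape options
val dfQuote = spark.read
  .option("header", "true")
  .option("quote", "\"")   // character that encloses values containing the delimiter
  .option("escape", "\"")  // character used to escape quotes inside quoted values
  .csv("src/main/resources/zipcodes.csv")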
nullValues
Using the nullValue option, you can specify a string in the CSV that should be read as null. For example, you may want a date column with the value “1900-01-01” to be set to null on the DataFrame.
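A minimal sketch of this option; the value "1900-01-01" is only an illustration:

// Using nullValue option; fields equal to "1900-01-01" are read as null
val dfNull = spark.read
  .option("header", "true")
  .option("nullValue", "1900-01-01")
  .csv("src/main/resources/zipcodes.csv")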
dateFormat
The dateFormat option is used to set the format of the input DateType and TimestampType columns. It supports all java.text.SimpleDateFormat formats.
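A minimal sketch of this option (the pattern is illustrative and only applies to columns that are read as DateType or TimestampType):

// Using dateFormat option together with schema inference
val dfDate = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("dateFormat", "MM/dd/yyyy")
  .csv("src/main/resources/zipcodes.csv")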
charset


Pay attention to the character encoding of the CSV file, especially when dealing with
internationalization. Spark’s CSV reader allows specifying encoding options to handle different
character sets. By default, it uses ‘UTF-8‘ but can be set to other valid charset names.
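A minimal sketch of this option; ISO-8859-1 here is just an example value:

// Using encoding (alias charset) option for files that are not UTF-8
val dfCharset = spark.read
  .option("header", "true")
  .option("encoding", "ISO-8859-1")
  .csv("src/main/resources/zipcodes.csv")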
Note: Besides the above options, Spark CSV dataset also supports many other options, please
refer to this article for details.

Data Cleansing and Transformation


After reading the CSV file, incorporate necessary data cleansing and transformation steps in
Spark to handle missing values, outliers, or any other data quality issues specific to your use case.
Once you have created DataFrame from the CSV file, you can apply transformations and actions
to DataFrame. Please refer to the link for more details.
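As a minimal, illustrative sketch of such steps (the column name "City" and the replacement values are assumptions based on the zipcodes dataset, not prescribed by the article):

// Example cleansing on the DataFrame read above:
// drop rows where every column is null, then replace remaining nulls in "City" with an empty string
val cleanedDF = df5
  .na.drop("all")
  .na.fill("", Seq("City"))
cleanedDF.show(false)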
Write Spark DataFrame to CSV file
Use the write() method of the Spark DataFrameWriter object to write Spark DataFrame to a CSV
file. For detailed example refer to Writing Spark DataFrame to CSV File using Options.
// Write DataFrame to CSV file
df2.write.option("header","true")
.csv("/tmp/spark_output/zipcodes")
Options
While writing a CSV file you can use several options. for example, header to output the
DataFrame column names as header record and delimiter to specify the delimiter on the CSV
output file.

// Using write options

df2.write.option("header",true)
.option("delimiter",",")
.csv("/tmp/spark_output/zipcodes")
Other available options include quote, escape, nullValue, dateFormat, and quoteMode.

Saving modes
Spark DataFrameWriter also has a method mode() to specify the SaveMode; the argument to this method takes either one of the below strings or a constant from the SaveMode class.
overwrite – This mode is used to overwrite the existing file; alternatively, you can use SaveMode.Overwrite.
append – To add the data to the existing file; alternatively, you can use SaveMode.Append.
ignore – Ignores the write operation when the file already exists; alternatively, you can use SaveMode.Ignore.
errorifexists – This is the default option; when the file already exists, it returns an error. Alternatively, you can use SaveMode.ErrorIfExists.


// Import
import org.apache.spark.sql.SaveMode
// Using Saving modes
df2.write.mode(SaveMode.Overwrite).csv("/tmp/spark_output/zipcodes")

Conclusion:
In this tutorial, you have learned how to read a single CSV file, multiple CSV files, and all files from a local folder into a Spark DataFrame, how to use options to change the default behavior, and how to write the DataFrame back to a CSV file using different save options.

Spark Read and Write JSON file into DataFrame


Spark SQL provides spark.read.json("path") to read a single-line or multiline (multiple lines) JSON file into a Spark DataFrame and dataframe.write.json("path") to save or write to a JSON file. In this tutorial, you will learn how to read a single file, multiple files, and all files from a directory into a DataFrame and how to write the DataFrame back to a JSON file using a Scala example.
Note: Spark out of the box supports reading JSON files and many more file formats into a Spark DataFrame, and Spark uses the Jackson library natively to work with JSON files.
The complete example explained here is available at GitHub project to download.
Table of contents:
 Spark Read JSON file into DataFrame
 Read JSON file from multiline
 Reading multiple files at a time
 Reading all files in a directory
 Reading file with a user-specified schema
 Reading file using Spark SQL
 Options while reading JSON file
o nullValues
o dateFormat
 Spark Write DataFrame to JSON file
o Using options
o Saving Mode

1. Spark Read JSON File into DataFrame


Using spark.read.json("path") or spark.read.format("json").load("path") you can read a JSON file
into a Spark DataFrame, these methods take a file path as an argument.

Unlike the CSV reader, the JSON data source infers the schema from the input file by default.
Refer to the dataset used in this article at zipcodes.json on GitHub.

//read json file into dataframe


val df = spark.read.json("src/main/resources/zipcodes.json")
df.printSchema()
df.show(false)

When you use format("json") method, you can also specify the Data sources by their fully qualified
name (i.e., org.apache.spark.sql.json), for built-in sources, you can also use short name “json”.
2. Read JSON file from multiline
Sometimes you may want to read records from a JSON file that are scattered across multiple lines. To read such files, set the multiline option to true; by default, the multiline option is set to false.
Below is the input file we are going to read; the same file is also available at multiline-zipcode.json on GitHub.
[{
"RecordNumber": 2,
"Zipcode": 704,
"ZipCodeType": "STANDARD",
"City": "PASEO COSTA DEL SUR",
"State": "PR"
},
{
"RecordNumber": 10,
"Zipcode": 709,
"ZipCodeType": "STANDARD",
"City": "BDA SAN LUIS",
"State": "PR"
}]
Using spark.read.option("multiline","true")
//read multiline json file
val multiline_df = spark.read.option("multiline","true")
.json("src/main/resources/multiline-zipcode.json")
multiline_df.show(false)

3. Reading Multiple Files at a Time


Using the spark.read.json() method, you can also read multiple JSON files from different paths; just pass all file names with fully qualified paths separated by commas, for example:

//read multiple files


val df2 = spark.read.json(
"src/main/resources/zipcodes_streaming/zipcode1.json",
"src/main/resources/zipcodes_streaming/zipcode2.json")
df2.show(false)
4. Reading all Files in a Directory
We can read all JSON files from a directory into a DataFrame just by passing the directory as a path to the json() method. In the below snippet, “zipcodes_streaming” is a folder that contains multiple JSON files.

//read all files from a folder


val df3 = spark.read.json("src/main/resources/zipcodes_streaming")
df3.show(false)

5. Reading files with a user-specified custom schema


Spark Schema defines the structure of the data, in other words, it is the structure of the
DataFrame. Spark SQL provides StructType & StructField classes to programmatically specify the
structure to the DataFrame.
If you know the schema of the file ahead of time and do not want to use the default inferSchema option for column names and types, specify user-defined custom column names and types using the schema option.
Use the StructType class to create a custom schema; below we initialize this class and use the add method to add columns to it by providing the column name, data type, and nullable option.

//Define custom schema


val schema = new StructType()
.add("RecordNumber",IntegerType,true)
.add("Zipcode",IntegerType,true)
.add("ZipCodeType",StringType,true)
.add("City",StringType,true)
.add("State",StringType,true)


.add("LocationType",StringType,true)
.add("Lat",DoubleType,true)
.add("Long",DoubleType,true)
.add("Xaxis",IntegerType,true)
.add("Yaxis",DoubleType,true)
.add("Zaxis",DoubleType,true)
.add("WorldRegion",StringType,true)
.add("Country",StringType,true)
.add("LocationText",StringType,true)
.add("Location",StringType,true)
.add("Decommisioned",BooleanType,true)
.add("TaxReturnsFiled",StringType,true)
.add("EstimatedPopulation",IntegerType,true)
.add("TotalWages",IntegerType,true)
.add("Notes",StringType,true)
val df_with_schema = spark.read.schema(schema)
.json("src/main/resources/zipcodes.json")
df_with_schema.printSchema()
df_with_schema.show(false)

6. Read JSON file using Spark SQL


Spark SQL also provides a way to read a JSON file by creating a temporary view directly from the file using spark.sqlContext.sql().

spark.sqlContext.sql("CREATE TEMPORARY VIEW zipcode USING json OPTIONS" +
" (path 'src/main/resources/zipcodes.json')")
spark.sqlContext.sql("select * from zipcode").show(false)

7. Options while reading JSON file


7.1 nullValues
Using the nullValue option, you can specify a string in the JSON that should be read as null. For example, you may want a date column with the value “1900-01-01” to be set to null on the DataFrame.
7.2 dateFormat


The dateFormat option is used to set the format of the input DateType and TimestampType columns. It supports all java.text.SimpleDateFormat formats.
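As a minimal sketch (the date pattern shown is only an illustration; it takes effect for columns that are read as DateType or TimestampType):

// Using dateFormat option while reading a JSON file
val dfDate = spark.read
  .option("dateFormat", "MM-dd-yyyy")
  .json("src/main/resources/zipcodes.json")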
Note: Besides the above options, Spark JSON dataset also supports many other options.

8. Applying DataFrame Transformations


Once you have created DataFrame from the JSON file, you can apply all transformation and
actions DataFrame support. Please refer to the link for more details.

9. Write Spark DataFrame to JSON file


Use the Spark DataFrameWriter object “write” method on DataFrame to write a JSON file.

df2.write
.json("/tmp/spark_output/zipcodes.json")
9.1 Spark Options while writing JSON files
While writing a JSON file you can use several options; other available options include nullValue and dateFormat.
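For example, a minimal sketch of chaining write options (the output path and values here are illustrative assumptions):

// Using write options while writing a JSON file
df2.write
  .option("dateFormat", "yyyy/MM/dd")
  .option("compression", "gzip")
  .json("/tmp/spark_output/zipcodes_options.json")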
9.2 Saving modes
Spark DataFrameWriter also has a method mode() to specify the SaveMode; the argument to this method takes either one of the below strings or a constant from the SaveMode class.
overwrite – This mode is used to overwrite the existing file; alternatively, you can use SaveMode.Overwrite.
append – To add the data to the existing file; alternatively, you can use SaveMode.Append.
ignore – Ignores the write operation when the file already exists; alternatively, you can use SaveMode.Ignore.
errorifexists or error – This is the default option; when the file already exists, it returns an error. Alternatively, you can use SaveMode.ErrorIfExists.

df2.write.mode(SaveMode.Overwrite).json("/tmp/spark_output/zipcodes.json")

10. Source Code for Reference


package com.sparkbyexamples.spark.dataframe
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
object FromJsonFile {
def main(args:Array[String]): Unit = {

val spark: SparkSession = SparkSession.builder()


.master("local[3]")
.appName("SparkByExample")
.getOrCreate()
val sc = spark.sparkContext
//read json file into dataframe
val df = spark.read.json("src/main/resources/zipcodes.json")
df.printSchema()
df.show(false)
//read multiline json file
val multiline_df = spark.read.option("multiline", "true")
.json("src/main/resources/multiline-zipcode.json")
multiline_df.printSchema()
multiline_df.show(false)
//read multiple files
val df2 = spark.read.json(
"src/main/resources/zipcodes_streaming/zipcode1.json",
"src/main/resources/zipcodes_streaming/zipcode2.json")
df2.show(false)
//read all files from a folder
val df3 = spark.read.json("src/main/resources/zipcodes_streaming/*")
df3.show(false)
//Define custom schema
val schema = new StructType()
.add("City", StringType, true)
.add("Country", StringType, true)
.add("Decommisioned", BooleanType, true)
.add("EstimatedPopulation", LongType, true)
.add("Lat", DoubleType, true)
.add("Location", StringType, true)
.add("LocationText", StringType, true)
.add("LocationType", StringType, true)
.add("Long", DoubleType, true)

.add("Notes", StringType, true)


.add("RecordNumber", LongType, true)
.add("State", StringType, true)
.add("TaxReturnsFiled", LongType, true)
.add("TotalWages", LongType, true)
.add("WorldRegion", StringType, true)
.add("Xaxis", DoubleType, true)
.add("Yaxis", DoubleType, true)
.add("Zaxis", DoubleType, true)
.add("Zipcode", StringType, true)
.add("ZipCodeType", StringType, true)
val df_with_schema = spark.read.schema(schema)
.json("src/main/resources/zipcodes.json")
df_with_schema.printSchema()
df_with_schema.show(false)
spark.sqlContext.sql("CREATE TEMPORARY VIEW zipcode USING json OPTIONS" +
" (path 'src/main/resources/zipcodes.json')")
spark.sqlContext.sql("SELECT *FROM zipcode").show()
//Write json file
df2.write
.json("/tmp/spark_output/zipcodes.json")
}
}
Conclusion:
In this tutorial, you have learned how to read JSON files with single-line and multiline records into a Spark DataFrame, how to read single and multiple files at a time, and how to write the DataFrame back to a JSON file using different save options.

Spark Read and Write Apache Parquet


In this tutorial, we will learn what Apache Parquet is, its advantages, and how to read from and write a Spark DataFrame to the Parquet file format using a Scala example. The example provided here is also available at the GitHub repository for reference.
 Apache Parquet Introduction
 Apache Parquet Advantages
 Spark Write DataFrame to Parquet file format
 Spark Read Parquet file into DataFrame
 Appending to existing Parquet file
 Running SQL queries
 Partitioning and Performance Improvement
 Reading a specific Parquet Partition
 Spark parquet schema
Apache Parquet Introduction
Apache Parquet is a columnar file format that provides optimizations to speed up queries and is a
far more efficient file format than CSV or JSON, supported by many data processing systems.
It is compatible with most of the data processing frameworks in the Hadoop ecosystem. It
provides efficient data compression and encoding schemes with enhanced performance to handle
complex data in bulk.
Spark SQL provides support for both reading and writing Parquet files that automatically capture
the schema of the original data, It also reduces data storage by 75% on average. Below are some
advantages of storing data in a parquet format. Spark by default supports Parquet in its library
hence we don’t need to add any dependency libraries.
Apache Parquet Advantages:
Below are some of the advantages of using Apache Parquet. Combining these benefits with Spark improves performance and gives the ability to work with structured files.
 Reduces IO operations.
 Fetches specific columns that you need to access.
 It consumes less space.
 Support type-specific encoding.

Apache Parquet Spark Example


Before we go over the Apache Parquet with Spark example, first, let’s create a Spark DataFrame from a Seq object. Note that the toDF() function on a sequence object is available only when you import implicits using spark.sqlContext.implicits._. This complete Spark Parquet example is available at the GitHub repository for reference.

val data = Seq(("James ","","Smith","36636","M",3000),


("Michael ","Rose","","40288","M",4000),
("Robert ","","Williams","42114","M",4000),
("Maria ","Anne","Jones","39192","F",4000),
("Jen","Mary","Brown","","F",-1))

val columns = Seq("firstname","middlename","lastname","dob","gender","salary")

import spark.sqlContext.implicits._
val df = data.toDF(columns:_*)
The above example creates a data frame with columns “firstname”, “middlename”, “lastname”,
“dob”, “gender”, “salary”

Spark Write DataFrame to Parquet file format


Using the parquet() function of the DataFrameWriter class, we can write a Spark DataFrame to a Parquet file. As mentioned earlier, Spark doesn’t need any additional packages or libraries to use Parquet as it is supported by default, so we don’t have to worry about version and compatibility issues. In this example, we are writing the DataFrame to the “people.parquet” file.
df.write.parquet("/tmp/output/people.parquet")

Writing a Spark DataFrame to Parquet format preserves the column names and data types, and all columns are automatically converted to be nullable for compatibility reasons. Notice that all part files Spark creates have the parquet extension.

Spark Read Parquet file into DataFrame


Similar to write, DataFrameReader provides parquet() function (spark.read.parquet) to read the
parquet files and creates a Spark DataFrame. In this example snippet, we are reading data from
an apache parquet file we have written before.
val parqDF = spark.read.parquet("/tmp/output/people.parquet")
Printing the schema of the DataFrame shows columns with the same names and data types.

Append to existing Parquet file


Spark provides the capability to append a DataFrame to existing Parquet files using the “append” save mode. In case you want to overwrite, use the “overwrite” save mode.
df.write.mode("append").parquet("/tmp/output/people.parquet")

Using SQL queries on Parquet


We can also create a temporary view on Parquet files and then use it in Spark SQL statements. This temporary view is available as long as the SparkSession is active.
parqDF.createOrReplaceTempView("ParquetTable")
val parkSQL = spark.sql("select * from ParquetTable where salary >= 4000 ")
The above predicate on the Parquet file results in a full file scan, which is a performance bottleneck similar to a table scan on a traditional database. We should use partitioning to improve performance.

Spark parquet partition – Improving performance


Partitioning is a feature of many databases and data processing frameworks, and it is key to making jobs work at scale. We can partition a Parquet file using the Spark partitionBy() function.
df.write.partitionBy("gender","salary")
.parquet("/tmp/output/people2.parquet")

Parquet Partition creates a folder hierarchy for each spark partition; we have mentioned the first
partition as gender followed by salary hence, it creates a salary folder inside the gender folder.

This is an example of how to write a Spark DataFrame by preserving the partitioning on gender
and salary columns.

val parqDF = spark.read.parquet("/tmp/output/people2.parquet")


parqDF.createOrReplaceTempView("Table2")
val df = spark.sql("select * from Table2 where gender='M' and salary >= 4000")
The execution of this query is significantly faster than the query without partition. It filters the data
first on gender and then applies filters on salary.


Spark Read a specific Parquet partition


val parqDF = spark.read.parquet("/tmp/output/people2.parquet/gender=M")
This code snippet retrieves the data from the gender partition value “M”.
The complete code can be downloaded from GitHub

Complete Spark Parquet Example

package com.sparkbyexamples.spark.dataframe
import org.apache.spark.sql.SparkSession

object ParquetExample {
def main(args:Array[String]):Unit= {

val spark: SparkSession = SparkSession.builder()


.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()

val data = Seq(("James ","","Smith","36636","M",3000),


("Michael ","Rose","","40288","M",4000),
("Robert ","","Williams","42114","M",4000),
("Maria ","Anne","Jones","39192","F",4000),
("Jen","Mary","Brown","","F",-1))

val columns = Seq("firstname","middlename","lastname","dob","gender","salary")


import spark.sqlContext.implicits._
val df = data.toDF(columns:_*)
df.show()
df.printSchema()
df.write
.parquet("/tmp/output/people.parquet")
val parqDF = spark.read.parquet("/tmp/output/people.parquet")
parqDF.createOrReplaceTempView("ParquetTable")

spark.sql("select * from ParquetTable where salary >= 4000").explain()


val parkSQL = spark.sql("select * from ParquetTable where salary >= 4000 ")
parkSQL.show()
parkSQL.printSchema()
df.write
.partitionBy("gender","salary")
.parquet("/tmp/output/people2.parquet")
val parqDF2 = spark.read.parquet("/tmp/output/people2.parquet")
parqDF2.createOrReplaceTempView("ParquetTable2")
val df3 = spark.sql("select * from ParquetTable2 where gender='M' and salary >= 4000")
df3.explain()
df3.printSchema()
df3.show()
val parqDF3 = spark.read
.parquet("/tmp/output/people2.parquet/gender=M")
parqDF3.show()
}
}
Conclusion:
You have learned how to read and write Apache Parquet data files in Spark, how to improve performance by using partitions and filtering data with a partition key, and finally how to append to and overwrite existing Parquet files.

Spark Read XML file using Databricks API


Apache Spark can also be used to process or read simple to complex nested XML files into a Spark DataFrame and write them back to XML using the Databricks Spark XML API (spark-xml) library. In this article, I will explain how to read an XML file with several options using a Scala example.
 Spark XML Databricks dependency
 Spark Read XML into DataFrame
o Handling attributes

 Writing DataFrame to XML File


o Limitations
 Spark XML DataFrame to Avro File
 Spark XML DataFrame to Parquet File
Databricks Spark-XML Maven dependency
Processing XML files in Apache Spark is enabled by adding the below Databricks spark-xml dependency to the Maven pom.xml file.
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-xml_2.11</artifactId>
<version>0.6.0</version>
</dependency>
Spark Read XML into DataFrame
Databricks Spark-XML package allows us to read simple or nested XML files into DataFrame,
once DataFrame is created, we can leverage its APIs to perform transformations and actions like
any other DataFrame.
The Spark-XML API accepts several options while reading an XML file. For example, the option rowTag is used to specify the row tag, and rootTag is used to specify the root tag of the input nested XML.
The input XML file we use in this example is available at the GitHub repository.
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "person")
.xml("src/main/resources/persons.xml")
Alternatively, you can also use the short form format("xml") and load("src/main/resources/persons.xml").
While reading the XML file into a DataFrame, the API automatically infers the schema based on the data. The below schema is the output of df.printSchema().
root

|-- _id: long (nullable = true)

|-- dob_month: long (nullable = true)

|-- dob_year: long (nullable = true)

|-- firstname: string (nullable = true)

|-- gender: string (nullable = true)

|-- lastname: string (nullable = true)

|-- middlename: string (nullable = true)

|-- salary: struct (nullable = true)

| |-- _VALUE: long (nullable = true)

| |-- _currency: string (nullable = true)

We can also supply our own struct schema and use it while reading a file as described below.
val schema = new StructType()
.add("_id",StringType)
.add("firstname",StringType)
.add("middlename",StringType)
.add("lastname",StringType)
.add("dob_year",StringType)
.add("dob_month",StringType)
.add("gender",StringType)
.add("salary",StringType)
val df = spark.read
.option("rowTag", "book")
.schema(schema)
.xml("src/main/resources/persons.xml")
df.show()
Output:
show() on DataFrame outputs the following.
+---+---------+--------+---------+------+--------+----------+---------------+

|_id|dob_month|dob_year|firstname|gender|lastname|middlename| salary|

+---+---------+--------+---------+------+--------+----------+---------------+

| 1| 1| 1980| James| M| Smith| null| [10000, Euro]|

| 2| 6| 1990| Michael| M| null| Rose|[10000, Dollor]|

+---+---------+--------+---------+------+--------+----------+---------------+

Handling XML Attributes


An underscore (“_”) prefix is added to field names generated from XML attributes; for example, _value and _currency in the schema above come from attributes in the XML file. You can change this prefix to any other character by using the option attributePrefix. Handling attributes can be disabled entirely with the option excludeAttribute.
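A minimal sketch of these two options; the values shown match the documented defaults as far as I know, so adjust them for your own file:

// Reading XML with attribute-handling options
val dfAttr = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "person")
  .option("attributePrefix", "_")      // prefix used for fields generated from XML attributes
  .option("excludeAttribute", "false") // set to "true" to ignore attributes entirely
  .xml("src/main/resources/persons.xml")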
Spark Write DataFrame to XML File
Use the "com.databricks.spark.xml" data source on the format method of the DataFrameWriter to write a Spark DataFrame to an XML file. This data source is provided as part of the Spark-XML API. Similar to reading, the write also takes the options rootTag and rowTag to specify the root tag and row tag respectively on the output XML file.
df2.write

.format("com.databricks.spark.xml")
.option("rootTag", "persons")
.option("rowTag", "person")
.save("src/main/resources/persons_new.xml")

This snippet writes the Spark DataFrame “df2” to the XML file “persons_new.xml” with “persons” as the root tag and “person” as the row tag.
Limitations:
This API is most useful when reading and writing simple XML files. However, at the time of writing this article, this API has the following limitations.
 Reading/writing attributes to/from the root element is not supported in this API.
 It doesn’t support complex XML structures where you want to read a header and footer along with the row elements.
If you have one root element followed by data elements, then Spark-XML is the go-to API. If you want to write a complex structure and this API is not suitable for you, please read the below article where I’ve explained using the XStream API.
Spark – Writing complex XML structures using XStream API

Write Spark XML DataFrame to Avro File


Once you create a DataFrame by reading XML, we can easily write it to Avro by using the below Maven dependency.
Apache Avro is a serialization system and is used to store persistent data in a binary format. When
Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any
program.
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>2.4.0</version>
</dependency>
format("avro") is provided by spark-avro API to read/write Avro files.

df2.write.format("avro")
.mode(SaveMode.Overwrite)
.save("\tmp\spark_out\avro\persons.avro")
Below snippet provides writing to Avro file by using partitions.


df2.write.partitionBy("_id")
.format("avro").save("persons_partition.avro")

Write Spark XML DataFrame to Parquet File


Spark SQL provides a parquet method to read/write Parquet files, hence no additional libraries are needed. Once the DataFrame is created from XML, we can use the parquet method on the DataFrameWriter class to write to a Parquet file.
Apache Parquet is a columnar file format that provides optimizations to speed up queries and is a
far more efficient file format than CSV or JSON. Spark SQL comes with a parquet method to read
data. It automatically captures the schema of the original data and reduces data storage by 75%
on average.
df2.write
.parquet("/tmp/spark_output/parquet/persons.parquet")
The below snippet writes the DataFrame to a Parquet file partitioned by “_id”.
df2.write
.partitionBy("_id")
.parquet("/tmp/spark_output/parquet/persons_partition.parquet")
Conclusion:
In this article, you have learned how to read XML files into an Apache Spark DataFrame and write them back to XML, Avro, and Parquet files after processing using the Spark XML API. It also explains some limitations of using the Databricks Spark-XML API.

Read & Write Avro files using Spark DataFrame


Spark provides built-in support to read from and write a DataFrame to an Avro file using the “spark-avro” library. In this tutorial, you will learn how to read and write Avro files along with a schema and how to partition the data for performance, with a Scala example.
Table of the contents:
 Apache Avro Introduction
 Apache Avro Advantages
 Spark Avro dependency
 Writing Avro Data File from DataFrame
 Reading Avro Data File to DataFrame
 Writing DataFrame to Avro Partition


 Reading Avro Partition data to DataFrame


 Using Avro Schema
 Using Spark SQL
1. What is Apache Avro?
Apache Avro is an open-source, row-based data serialization and data exchange framework for Hadoop projects. The spark-avro library, originally developed by Databricks as an open-source library, supports reading and writing data in the Avro file format; it is mostly used in Apache Spark, especially for Kafka-based data pipelines. When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program.
Avro was built to serialize and exchange big data between different Hadoop-based projects. It serializes data in a compact binary format, and the schema is in JSON format, which defines the field names and data types.
It is similar to Thrift and Protocol Buffers, but does not require code generation, as its data is always accompanied by a schema that permits full processing of that data without code generation. This is one of the great advantages compared with other serialization systems.
2. Apache Avro Advantages
 Supports complex data structures like arrays, maps, arrays of maps, and maps of arrays.
 A compact, binary serialization format that provides fast data transfer.
 Row-based data serialization system.
 Supports multiple languages, meaning data written in one language can be read by a different language.
 Code generation is not required to read or write data files.
 Simple integration with dynamic languages.

3. Spark Avro dependencies


Since Spark 2.4, Spark SQL provides built-in support for reading and writing Apache Avro data files. However, the spark-avro module is external and, by default, it is not included in spark-submit or spark-shell, hence accessing the Avro file format in Spark is enabled by providing the package.

3.1 maven dependencies.


// Maven dependencies
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>2.4.0</version>


</dependency>

3.2 spark-submit
While using spark-submit, provide spark-avro_2.12 and its dependencies directly using --
packages, such as,
// Spark-submit
./bin/spark-submit --packages org.apache.spark:spark-avro_2.12:2.4.4

3.3 spark-shell
While working with spark-shell, you can also use --packages to add spark-avro_2.12 and its
dependencies directly,
// Spark-shell
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.4
4. Write Spark DataFrame to Avro Data File
Since Avro library is external to Spark, it doesn’t provide avro() function on DataFrameWriter ,
hence we should use DataSource “avro” or “org.apache.spark.sql.avro” to write Spark DataFrame
to Avro file.
// Write Spark DataFrame to Avro Data File
df.write.format("avro").save("person.avro")

5. Read Avro Data File to Spark DataFrame


Similarly avro() function is not provided in Spark DataFrameReader hence, we should use
DataSource format as “avro” or “org.apache.spark.sql.avro” and load() is used to read the Avro
file.
// Read Avro Data File to Spark DataFrame
val personDF= spark.read.format("avro").load("person.avro")

6. Writing Avro Partition Data


Spark DataFrameWriter provides the partitionBy() function to partition the data at the time of writing Avro. Partitioning improves read performance by reducing disk I/O.
// Writing Avro Partition Data
val data = Seq(("James ","","Smith",2018,1,"M",3000),
("Michael ","Rose","",2010,3,"M",4000),
("Robert ","","Williams",2010,3,"M",4000),
("Maria ","Anne","Jones",2005,5,"F",4000),


("Jen","Mary","Brown",2010,7,"",-1)
)
val columns = Seq("firstname", "middlename", "lastname", "dob_year",
"dob_month", "gender", "salary")
import spark.sqlContext.implicits._
val df = data.toDF(columns:_*)

df.write.partitionBy("dob_year","dob_month")
.format("avro").save("person_partition.avro")
This example creates partitions by date of birth year and month on the person data. Avro creates a folder for each partition value of dob_year, and each of those folders contains dob_month subfolders.

7. Reading Avro Partition Data


When we retrieve data from a partition, Spark reads only the data from the matching partition folder without scanning the entire set of Avro files.
// Reading Avro Partition Data
spark.read
.format("avro")
.load("person_partition.avro")
.where(col("dob_year") === 2010)
.show()

8. Using Avro Schema


Avro schemas are usually defined with the .avsc extension and the content of the file is in JSON. We will store the below schema in the person.avsc file and provide it using option() while reading the Avro file. This schema provides the structure of the Avro file with field names and their data types.
// Using Avro Schema
{
"type": "record",


"name": "Person",
"namespace": "com.sparkbyexamples",
"fields": [
{"name": "firstname","type": "string"},
{"name": "middlename","type": "string"},
{"name": "lastname","type": "string"},
{"name": "dob_year","type": "int"},
{"name": "dob_month","type": "int"},
{"name": "gender","type": "string"},
{"name": "salary","type": "int"}
] }
// Parsing the .avsc file requires the Avro library classes
import java.io.File
import org.apache.avro.Schema

val schemaAvro = new Schema.Parser()
.parse(new File("src/main/resources/person.avsc"))

val df = spark.read
.format("avro")
.option("avroSchema", schemaAvro.toString)
.load("person.avro")
Alternatively, we can also specify the StructType using the schema method.
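For example, a minimal sketch of that alternative using the same fields as the person data above (the variable names are illustrative):

// Using a StructType schema instead of an Avro .avsc schema
import org.apache.spark.sql.types._

val structSchema = new StructType()
  .add("firstname", StringType)
  .add("middlename", StringType)
  .add("lastname", StringType)
  .add("dob_year", IntegerType)
  .add("dob_month", IntegerType)
  .add("gender", StringType)
  .add("salary", IntegerType)

val dfStruct = spark.read
  .format("avro")
  .schema(structSchema)
  .load("person.avro")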

9. Using Avro with Spark SQL


We can also read Avro data files using SQL. To do this, first create a temporary view pointing to the Avro data file and then run SQL commands on it.
// Using Avro with Spark SQL
spark.sqlContext.sql("CREATE TEMPORARY VIEW PERSON USING avro
OPTIONS (path \"person.avro\")")
spark.sqlContext.sql("SELECT * FROM PERSON").show()
Conclusion:
We have seen examples of how to write Avro data files and how to read them using a Spark DataFrame. Also, I’ve explained working with Avro partitions and how they improve performance while reading Avro files. Using partitions, we can achieve a significant performance gain when reading.


Create Spark DataFrame from HBase using Hortonworks


This tutorial explains, with a Scala example, how to create a Spark DataFrame from an HBase table using the Hortonworks DataSource "org.apache.spark.sql.execution.datasources.hbase" from the shc-core library.
I would recommend reading Inserting Spark DataFrame to HBase table before you proceed to the rest of the article, where I explained the Maven dependencies and their usage.
Related: Libraries and DataSource APIs to connect Spark with HBase
In summary, to interact Spark DataFrame with HBase, you would need the hbase-client, hbase-spark, and shc-core APIs.
<dependencies>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>2.0.2.3.1.0.6-1</version> <!-- Hortonworks Latest -->
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-spark</artifactId>
<version>2.0.2.3.1.0.6-1</version> <!-- Hortonworks Latest -->
</dependency>
<dependency>
<groupId>com.hortonworks</groupId>
<artifactId>shc-core</artifactId>
<version>1.1.1-2.1-s_2.11</version> <!-- Hortonworks Latest -->
</dependency>
</dependencies>

Below is the complete example, for your reference and the same example is also available
at GitHub.

package com.sparkbyexamples.spark.dataframe.hbase.hortonworks

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.datasources.hbase.HBaseTableCatalog


object HBaseSparkRead {

def main(args: Array[String]): Unit = {

def catalog =
s"""{
|"table":{"namespace":"default", "name":"employee"},
|"rowkey":"key",
|"columns":{
|"key":{"cf":"rowkey", "col":"key", "type":"string"},
|"fName":{"cf":"person", "col":"firstName", "type":"string"},
|"lName":{"cf":"person", "col":"lastName", "type":"string"},
|"mName":{"cf":"person", "col":"middleName", "type":"string"},
|"addressLine":{"cf":"address", "col":"addressLine", "type":"string"},
|"city":{"cf":"address", "col":"city", "type":"string"},
|"state":{"cf":"address", "col":"state", "type":"string"},
|"zipCode":{"cf":"address", "col":"zipCode", "type":"string"}
|}
|}""".stripMargin

val sparkSession: SparkSession = SparkSession.builder()


.master("local[1]")
.appName("SparkByExamples.com")
.getOrCreate()

import sparkSession.implicits._

val hbaseDF = sparkSession.read


.options(Map(HBaseTableCatalog.tableCatalog -> catalog))
.format("org.apache.spark.sql.execution.datasources.hbase")
.load()


hbaseDF.printSchema()
hbaseDF.show(false)

hbaseDF.filter($"key" === "1" && $"state" === "FL")


.select("key", "fName", "lName")
.show()

//Create Temporary Table on DataFrame


hbaseDF.createOrReplaceTempView("employeeTable")

//Run SQL
sparkSession.sql("select * from employeeTable where fName = 'Amaya' ").show } }
Let me explain what’s happening in a few statements of this example.
First, we need to define a catalog to bridge the gap between the HBase key-value store and the Spark DataFrame table structure. Using this catalog, we also map the column names and keys between the two structures.
A couple of things happen in the below snippet: format takes the "org.apache.spark.sql.execution.datasources.hbase" DataSource defined in the “shc-core” API, which enables us to use DataFrames with HBase tables, and read.options takes the catalog which we defined earlier. Finally, load() reads the HBase table.

val hbaseDF = sparkSession.read


.options(Map(HBaseTableCatalog.tableCatalog -> catalog))
.format("org.apache.spark.sql.execution.datasources.hbase")
.load()
hbaseDF.printSchema() displays the below schema.
root

|-- key: string (nullable = true)

|-- fName: string (nullable = true)

|-- lName: string (nullable = true)

|-- mName: string (nullable = true)

|-- addressLine: string (nullable = true)

|-- city: string (nullable = true)

|-- state: string (nullable = true)

|-- zipCode: string (nullable = true)


hbaseDF.show(false) displays the below data. Please note the differences between the DataFrame field names and the HBase table column names.
+---+-------+--------+-----+-----------+-------+-----+-------+

|key|fName |lName |mName|addressLine|city |state|zipCode|

+---+-------+--------+-----+-----------+-------+-----+-------+

|1 |Abby |Smith |K |3456 main |Orlando|FL |45235 |

|2 |Amaya |Williams|L |123 Orange |Newark |NJ |27656 |

|3 |Alchemy|Davis |P |Warners |Sanjose|CA |34789 |

+---+-------+--------+-----+-----------+-------+-----+-------+

hbaseDF.filter() filters the data using DSL functions.


+---+-----+-----+

|key|fName|lName|

+---+-----+-----+

| 1| Abby|Smith|

Finally, we can create a temporary SQL view and run SQL queries on it.
+---+-----+--------+-----+-----------+------+-----+-------+

|key|fName| lName|mName|addressLine| city|state|zipCode|

+---+-----+--------+-----+-----------+------+-----+-------+

| 2|Amaya|Williams| L| 123 Orange|Newark| NJ| 27656|

+---+-----+--------+-----+-----------+------+-----+-------+

Conclusion:
In this tutorial, you have learned how to create a Spark DataFrame from an HBase table using the Hortonworks DataSource API and have also seen how to run DSL and SQL queries on the HBase DataFrame.

Spark Read ORC file into DataFrame


Spark natively supports the ORC data source, reading ORC into a DataFrame and writing it back to the ORC file format using the orc() method of DataFrameReader and DataFrameWriter. In this article, I will explain how to read an ORC file into a Spark DataFrame, perform some filtering, create a table by reading the ORC file, and finally write it back partitioned, using Scala examples.
Table of contents

 What is ORC?
 ORC advantages
 Write Spark DataFrame to ORC file
 Read ORC file into Spark DataFrame
 Creating a table on ORC file & using SQL
 Using Partition
 Which compression to choose
What is the ORC file?
ORC stands for Optimized Row Columnar, which provides a highly efficient way to store data in
a self-describing, type-aware, column-oriented format for the Hadoop ecosystem. It is similar to
other columnar storage formats Hadoop supports, such as RCFile and Parquet.
The ORC file format is heavily used as storage for Apache Hive due to its highly efficient way of storing
data, which enables high-speed processing. ORC is also used or natively supported by many
frameworks such as Hadoop MapReduce, Apache Spark, Pig, NiFi, and many more.
ORC Advantages
 Compression: ORC stores data as columns in a compressed format, hence it takes far
less disk storage than other formats.
 Reduces I/O: ORC reads only the columns that are mentioned in a query for processing, which
reduces I/O.
 Fast reads: ORC enables high-speed processing because it creates a built-in index by default
and stores default aggregates such as min/max values for numeric data.
ORC Compression
Spark supports the following compression options for the ORC data source. By default, it
uses SNAPPY when not specified; a short sketch of setting the codec follows the list.
 SNAPPY
 ZLIB
 LZO
 NONE
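Below is a minimal sketch of the two common ways to control the codec: the session-wide spark.sql.orc.compression.codec configuration and the per-write compression option. The output path and the spark/df values are assumed from the surrounding examples.
// Session-wide default codec for all ORC writes.
spark.conf.set("spark.sql.orc.compression.codec", "zlib")

// Per-write override via the compression option.
df.write
  .mode("overwrite")
  .option("compression", "none")
  .orc("/tmp/orc/data-nocompression.orc")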

Create a DataFrame
Spark supports ORC file formats by default, without importing third-party ORC dependencies.
Since we don’t have an ORC file to read yet, we first create one from a DataFrame. Below
is the sample DataFrame we use to create an ORC file.
val data =Seq(("James ","","Smith","36636","M",3000),
("Michael ","Rose","","40288","M",4000),
("Robert ","","Williams","42114","M",4000),
("Maria ","Anne","Jones","39192","F",4000),
("Jen","Mary","Brown","","F",-1))
val columns=Seq("firstname","middlename","lastname","dob","gender","salary")
val df=spark.createDataFrame(data).toDF(columns:_*)
df.printSchema()
df.show(false)

Spark Write ORC file


Spark DataFrameWriter provides the orc() method to write or create an ORC file from a DataFrame. This
method takes a path as an argument specifying where to write the ORC file.
df.write.orc("/tmp/orc/data.orc")
Alternatively, you can also write using format("orc")
df.write.format("orc").save("/tmp/orc/data.orc")

Spark uses snappy compression by default while writing ORC files; you can notice this in the part
file names. You can change the compression from the default snappy to either none or zlib using
the compression option.
df.write.mode("overwrite")
.option("compression","zlib")
.orc("/tmp/orc/data-zlib.orc")
This creates ORC files with zlib compression.

Using the append save mode, you can append a DataFrame to an existing ORC file; to
overwrite, use the overwrite save mode.
df.write.mode("append").orc("/tmp/orc/people.orc")
df.write.mode("overwrite").orc("/tmp/orc/people.orc")

Spark Read ORC file


Use Spark DataFrameReader’s orc() method to read an ORC file into a DataFrame. It supports
reading snappy, zlib, or uncompressed files, and it is not necessary to specify the compression option
while reading an ORC file.
val df2 = spark.read.orc("/tmp/orc/data.orc")

In order to read ORC files from Amazon S3, use one of the prefixes below in the path, along with the
required third-party dependencies and credentials (a hedged sketch follows the list).
 s3:// => First generation
 s3n:// => Second generation
 s3a:// => Third generation
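Below is a hedged sketch of such a read with the s3a connector; the bucket name, key path, and credential placeholders are illustrative assumptions, and hadoop-aws plus the AWS SDK are assumed to be on the classpath.
// Illustrative S3 read; credentials and bucket are placeholders.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "<ACCESS_KEY>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "<SECRET_KEY>")

val s3DF = spark.read.orc("s3a://my-bucket/orc/data.orc")
s3DF.show(false)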

Executing SQL queries on DataFrame


We can also create a temporary view on the Spark DataFrame that was created from the ORC file and run
SQL queries against it. These views are available only until your program exits.
df2.createOrReplaceTempView("ORCTable")
val orcSQL = spark.sql("select firstname,dob from ORCTable where salary >= 4000 ")
orcSQL.show(false)
In this example, the physical table scan loads only the columns firstname, dob, and salary at runtime,
without reading all columns from the file system. This improves read performance.
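A quick way to verify this column pruning (a small sketch, reusing the orcSQL Dataset from above) is to inspect the physical plan; the ReadSchema of the ORC scan node should list only the referenced columns.
// Print the physical plan; look for the ReadSchema of the ORC scan node.
orcSQL.explain()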

Creating a table on ORC file


Now let’s walk through executing SQL queries on the ORC file without creating a DataFrame first.
To do this, create a temporary view (or table) directly on the ORC file instead
of creating it from a DataFrame.
spark.sql("CREATE TEMPORARY VIEW PERSON USING orc OPTIONS (path
\"/tmp/orc/data.orc\")")
spark.sql("SELECT * FROM PERSON").show()

Here, we created a temporary view PERSON on the ORC file “data.orc”. This gives the following
results.
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|  dob|gender|salary|
+---------+----------+--------+-----+------+------+
|  Robert |          |Williams|42114|     M|  4000|
|   Maria |      Anne|   Jones|39192|     F|  4000|
| Michael |      Rose|        |40288|     M|  4000|
|   James |          |   Smith|36636|     M|  3000|
|      Jen|      Mary|   Brown|     |     F|    -1|
+---------+----------+--------+-----+------+------+

Using Partition
When we execute a particular query on the PERSON table, it scans through all the rows and returns
the selected columns in the results. In Spark, we can optimize query execution by
partitioning the data using the partitionBy() method. Following is an example of
partitionBy().
df.write.partitionBy("gender","salary")
.mode("overwrite").orc("/tmp/orc/data.orc")
When you check the /tmp/orc/data.orc folder, it contains two levels of partitions, “gender” followed by “salary”.
Reading a specific Partition
The example below shows how to read a partitioned ORC file into a DataFrame, selecting only the gender=M partition.
val parDF=spark.read.orc("/tmp/orc/data.orc/gender=M")
parDF.show(false)
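Alternatively (a sketch, not from the original example), you can read the partitioned root path and filter on the partition column; Spark’s partition pruning then scans only the matching gender=M directories, which you can confirm in the physical plan.
// Partition pruning via a filter on the partition column.
import org.apache.spark.sql.functions.col

val prunedDF = spark.read.orc("/tmp/orc/data.orc")
  .filter(col("gender") === "M")
prunedDF.explain()   // the ORC scan node should typically show a PartitionFilters entry for gender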

Which compression to choose


Writing ORC files without compression results in larger disk usage and slower performance.
Hence, it is advisable to use compression. Below is a basic comparison of when to use ZLIB versus
SNAPPY (a small sketch for comparing on-disk sizes follows the list).
 When you need faster reads, ZLIB compression is the go-to option, without a doubt; it also
takes less storage on disk compared with SNAPPY.
 ZLIB is slightly slower to write compared with SNAPPY. If you have a large dataset to write,
use SNAPPY; for smaller datasets, ZLIB is still preferable.
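One rough way to see the difference yourself is to compare the on-disk footprint of the outputs written earlier in this article; the sketch below uses the Hadoop FileSystem API and assumes the default-snappy output at /tmp/orc/data.orc and the zlib output at /tmp/orc/data-zlib.orc already exist.
// Compare total bytes on disk for two codecs (a rough, illustrative check).
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val snappyBytes = fs.getContentSummary(new Path("/tmp/orc/data.orc")).getLength
val zlibBytes   = fs.getContentSummary(new Path("/tmp/orc/data-zlib.orc")).getLength
println(s"snappy: $snappyBytes bytes, zlib: $zlibBytes bytes")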

Complete Example of using ORC in Spark

import org.apache.spark.sql.{SparkSession}

object ReadORCFile extends App{

val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .getOrCreate()

val data = Seq(("James ","","Smith","36636","M",3000),
  ("Michael ","Rose","","40288","M",4000),
  ("Robert ","","Williams","42114","M",4000),
  ("Maria ","Anne","Jones","39192","F",4000),
  ("Jen","Mary","Brown","","F",-1))
val columns=Seq("firstname","middlename","lastname","dob","gender","salary")
val df=spark.createDataFrame(data).toDF(columns:_*)

df.write.mode("overwrite")
.orc("/tmp/orc/data.orc")

df.write.mode("overwrite")
.option("compression","none12")
.orc("/tmp/orc/data-nocomp.orc")

df.write.mode("overwrite")
.option("compression","zlib")
.orc("/tmp/orc/data-zlib.orc")

val df2=spark.read.orc("/tmp/orc/data.orc")
df2.show(false)

df2.createOrReplaceTempView("ORCTable")
val orcSQL = spark.sql("select firstname,dob from ORCTable where salary >= 4000 ")
orcSQL.show(false)

spark.sql("CREATE TEMPORARY VIEW PERSON USING orc OPTIONS (path


\"/tmp/orc/data.orc\")")
spark.sql("SELECT * FROM PERSON").show()
}


Conclusion
In summary, ORC is a highly efficient, compressed columnar format that is capable of storing
petabytes of data without compromising fast reads. Spark natively supports the ORC data source to
read and write ORC files using the orc() method on DataFrameReader and DataFrameWriter.

Spark 3.0 Read Binary File into DataFrame


Since Spark 3.0, Spark supports the binaryFile data source format to read binary files (image, pdf,
zip, gzip, tar, etc.) into a Spark DataFrame/Dataset. With the binaryFile format,
the DataFrameReader reads the entire content of each binary file into a single record; the
resultant DataFrame contains the raw content and metadata of the file.
In this Spark 3.0 article, I will provide a Scala example of how to read single, multiple, and all
binary files from a folder into a DataFrame, and also cover the different options it supports.
Using the binaryFile data source, you should be able to read files such as images, pdf, zip, gzip, tar, and many
other binary files into a DataFrame; each file is read as a single record along with the metadata of the
file. The resultant DataFrame contains the following columns.
 path: StringType => Absolute path of the file
 modificationTime: TimestampType => Last modified time stamp of the file
 length: LongType => length of the file
 content: BinaryType => binary contents of the file

1. Read a Single Binary File


The example below reads the spark.png image binary file into a DataFrame. The raw data of the file
is loaded into the content column.
// Read a Single Binary File
val df = spark.read.format("binaryFile").load("/tmp/binary/spark.png")
df.printSchema()
df.show()
// Output:

root

 |-- path: string (nullable = true)
 |-- modificationTime: timestamp (nullable = true)
 |-- length: long (nullable = true)
 |-- content: binary (nullable = true)

+--------------------+--------------------+------+--------------------+
|                path|    modificationTime|length|             content|
+--------------------+--------------------+------+--------------------+
|file:/C:/tmp/bina...|2020-07-25 10:11:...| 74675|[89 50 4E 47 0D 0...|
+--------------------+--------------------+------+--------------------+

The content column holds the binary data of the file.


In order to read binary files from Amazon S3, use one of the prefixes below in the path, along with the
required third-party dependencies and credentials.
 s3:// => First generation
 s3n:// => Second generation
 s3a:// => Third generation

2. Read Multiple Binary Files


The example below reads all PNG image files from a path into a Spark DataFrame.
// Read Multiple Binary Files
val df3 = spark.read.format("binaryFile").load("/tmp/binary/*.png")
df3.printSchema()
df3.show(false)
It reads all PNG files and converts each file into a single record in the DataFrame.

3. Read all Binary Files in a Folder


In order to read all binary files from a folder, just pass the folder path to the load() method.
// Read all Binary Files in a Folder
val df2 = spark.read.format("binaryFile").load("/tmp/binary/")
df2.printSchema()
df2.show(false)
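Since the metadata arrives as ordinary DataFrame columns, it can be queried like any other data. A short sketch (reusing df2 from above) that lists the largest files first:
// Query the metadata columns, largest files first.
import org.apache.spark.sql.functions.desc

df2.select("path", "length", "modificationTime")
  .orderBy(desc("length"))
  .show(false)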

4. Reading Binary File Options


pathGlobFilter: To load files with paths matching a given glob pattern while keeping the behavior of
partition discovery.

For example, the following code reads all PNG files from the path, including any partitioned directories.
// Reading Binary File Options
val df = spark.read.format("binaryFile")
.option("pathGlobFilter", "*.png")
.load("/tmp/binary/")
recursiveFileLookup: Ignores partition discovery and recursively searches for files under the input
directory path.
val df = spark.read.format("binaryFile")
.option("pathGlobFilter", "*.png")
.option("recursiveFileLookup", "true")
.load("/tmp/binary/")
5. A few things to note
 While using the binaryFile data source, if you pass a text file to the load() method, it reads the
contents of the text file as binary into the DataFrame.
 A binary() method on DataFrameReader is still not available, hence you can’t
use spark.read.binary("path") yet. I will update this article when it becomes available.
 Currently, the binary file data source does not support writing a DataFrame back to the
binary file format (a small workaround sketch follows this list).
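Because the source is read-only, writing the bytes back out requires a manual step. Below is a hedged, driver-side sketch (only suitable for a small number of files; the output directory is a placeholder) that collects the rows and writes each content value to a local file.
// Collect rows on the driver and write each file's bytes back to disk.
import java.nio.file.{Files, Paths}

Files.createDirectories(Paths.get("/tmp/binary-out"))
df.select("path", "content").collect().foreach { row =>
  val name  = Paths.get(row.getString(0).stripPrefix("file:")).getFileName.toString
  val bytes = row.getAs[Array[Byte]]("content")
  Files.write(Paths.get(s"/tmp/binary-out/$name"), bytes)
}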

Conclusion
In summary, Spark 3.0 provides the binaryFile data source to read binary files into a DataFrame, but
it does not support writing the DataFrame back to a binary file. It also provides the pathGlobFilter option to
load only files matching a glob pattern while preserving partition discovery, and the recursiveFileLookup
option to recursively load files from subdirectories while ignoring partition discovery.

