
Unit 4 (Big Data Analytics)

Apache Spark: Advantages over Hadoop


Apache Spark — which is also open source — is a data processing engine for big data sets. Like Hadoop,
Spark splits up large tasks across different nodes. However, it tends to perform faster than Hadoop and
it uses random access memory (RAM) to cache and process data instead of a file system. This enables
Spark to handle use cases that Hadoop cannot.
Benefits of the Spark framework include the following:
 A unified engine that supports SQL queries, streaming data, machine learning (ML) and graph
processing
 Can be 100x faster than Hadoop for smaller workloads via in-memory processing, disk data
storage, etc.
 APIs designed for ease of use when manipulating semi-structured data and transforming data

Lazy evaluation
Lazy evaluation means that execution will not start until an action is triggered. Transformations are lazy in nature, i.e. when we call some operation on an RDD, it does not execute immediately. Spark adds the operations to a DAG of computation, and only when the driver requests some data does this DAG actually get executed.
Advantages of lazy evaluation (a short sketch follows this list):
1) It is an optimization technique, i.e. it provides optimization by reducing the number of queries.
2) It saves the round trips between driver and cluster, which speeds up the process.
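A minimal sketch of lazy evaluation in Scala (the data is illustrative): the filter and map calls below only build the lineage, and nothing executes until the count() action runs.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("LazyEvalDemo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val numbers = sc.parallelize(1 to 1000000)
val evens   = numbers.filter(_ % 2 == 0)   // transformation: nothing executed yet
val doubled = evens.map(_ * 2)             // transformation: still only extends the DAG
val total   = doubled.count()              // action: triggers execution of the whole DAG
println(total)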
In-Memory Processing in Spark
Every Spark application has the same fixed heap size and a fixed number of cores for each Spark executor. The heap size is what is referred to as the Spark executor memory, which is controlled with the spark.executor.memory property or the --executor-memory flag. Every Spark application will have one executor on each worker node.
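As a hedged illustration (the 4g and 2 values are arbitrary), the executor memory and core settings mentioned above can be supplied programmatically or through the equivalent spark-submit flags:

import org.apache.spark.sql.SparkSession

// Equivalent to: spark-submit --executor-memory 4g --executor-cores 2 ...
val spark = SparkSession.builder()
  .appName("ExecutorConfigExample")
  .config("spark.executor.memory", "4g")   // heap size per executor
  .config("spark.executor.cores", "2")     // cores per executor
  .getOrCreate()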

Executors are worker-node processes in charge of running individual tasks in a given Spark job, while the Spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master.

DAG

A DAG is a directed acyclic graph. They are commonly used in computer systems for task execution.

In this context, a graph is a collection of nodes that are connected by edges. In the case of Hadoop and Spark, the nodes represent executable tasks, and the edges are task dependencies. Think of the DAG like a flow chart that tells the system which tasks to execute and in what order. As a simple example, consider a graph in which several map tasks are connected to a single reduce task.

If this graph is undirected, it does not capture which node is the start node and which is the end node. In other words, it does not tell us whether the reduce task should feed the map tasks or vice versa.

A directed graph gives an unambiguous direction for each edge. This means that we know that the map
tasks feed into the reduce task, rather than the other way around. This property is essential for
executing complex workflows since we need to know which tasks should be executed in which order.

Lastly, the graph is acyclic, because it does not contain any cycles. A cycle happens when it is possible to loop back to a previous node (for example, task A feeding task B, which feeds task C, which in turn feeds back into task A). Cycles are useful for tasks involving recursion but not as good for large-scale distributed systems.
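Spark also exposes the lineage DAG of an RDD programmatically. A small sketch, assuming an existing SparkContext sc and illustrative data:

val words  = sc.parallelize(Seq("a", "b", "a", "c"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// Prints the lineage, i.e. the chain of transformations Spark turns into DAG stages
println(counts.toDebugString)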

Spark context

A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs,
accumulators and broadcast variables on that cluster.
Only one SparkContext should be active per JVM. You must stop() the active SparkContext before
creating a new one.
public sealed class SparkContext

Inheritance: Object -> SparkContext

Constructors
SparkContext()
    Create a SparkContext that loads settings from system properties (for instance, when launching with spark-submit).
SparkContext(SparkConf)
    Create a SparkContext object with the given config.
SparkContext(String, String)
    Initializes a SparkContext instance with a specific master and application name.
SparkContext(String, String, SparkConf)
    Alternative constructor that allows setting common Spark properties directly.
SparkContext(String, String, String)
    Alternative constructor that allows setting common Spark properties directly.

Properties
DefaultParallelism
    Default level of parallelism to use when not given by user (e.g. Parallelize()).

Methods
AddFile(String, Boolean)
    Add a file to be downloaded with this Spark job on every node.
Broadcast<T>(T)
    Broadcast a read-only variable to the cluster, returning a Microsoft.Spark.Broadcast object for reading it in distributed functions. The variable will be sent to each executor only once.
ClearJobGroup()
    Clear the current thread's job group ID and its description.
GetConf()
    Returns the SparkConf object associated with this SparkContext object. Note that modifying the SparkConf object will not have any impact.
GetOrCreate(SparkConf)
    This function may be used to get or instantiate a SparkContext and register it as a singleton object. Because we can only have one active SparkContext per JVM, this is useful when applications may wish to share a SparkContext.
SetCheckpointDir(String)
    Sets the directory under which RDDs are going to be checkpointed.
SetJobDescription(String)
    Sets a human readable description of the current job.
SetJobGroup(String, String, Boolean)
    Assigns a group ID to all the jobs started by this thread until the group ID is set to a different value or cleared.
SetLogLevel(String)
    Controls the log level. This overrides any user-defined log settings.
Stop()
    Shut down the SparkContext.
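The table above comes from the .NET (C#) binding; in Scala the same entry point is created in essentially the same way. A minimal sketch:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("SparkContextExample")
  .setMaster("local[2]")

val sc = new SparkContext(conf)        // only one active SparkContext per JVM
val rdd = sc.parallelize(Seq(1, 2, 3, 4))
println(rdd.count())
sc.stop()                              // stop it before creating a new one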

Spark Session

Spark session is the unified entry point of a Spark application from Spark 2.0 onwards. It provides a way to interact with Spark's various functionalities with a smaller number of constructs. Instead of having a Spark context, Hive context and SQL context, all of it is now encapsulated in a Spark session.


Some History….

Prior to Spark 2.0, Spark Context was the entry point of any Spark application, used to access all Spark features, and it needed a SparkConf which held all the cluster configs and parameters required to create a Spark Context object. We could primarily create just RDDs using Spark Context, and we had to create specific contexts for any other Spark interactions: for SQL an SQLContext, for Hive a HiveContext, and for streaming a StreamingContext. In a nutshell, Spark session is a combination of all these different contexts. Internally, Spark session creates a new SparkContext for all the operations, and all the above-mentioned contexts can be accessed using the SparkSession object.


How do I create a Spark session?

A Spark Session can be created using a builder pattern.


import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
.appName("SparkSessionExample")
.master("local[4]")
.config("spark.sql.warehouse.dir", "target/spark-warehouse")
.enableHiveSupport()
.getOrCreate

The Spark session builder will try to get an existing Spark session if one has already been created, or create a new one and assign the newly created SparkSession as the global default. Note that enableHiveSupport here is similar to creating a HiveContext: all it does is enable access to the Hive metastore, Hive serdes, and Hive UDFs.

Note that we don't have to create a Spark session object when using spark-shell; it is already created for us as the variable spark.

scala> spark
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@2bd158ea
What is an RDD?

RDD stands for “Resilient Distributed Dataset”. It is the fundamental data structure of Apache Spark. An RDD in Apache Spark is an immutable collection of objects which is computed on the different nodes of the cluster.
Decomposing the name RDD:
 Resilient, i.e. fault-tolerant with the help of the RDD lineage graph (DAG), and so able to recompute missing or damaged partitions due to node failures.
 Distributed, since the data resides on multiple nodes.
 Dataset represents the records of the data you work with. The user can load the dataset externally, which can be a JSON file, CSV file, text file or a database via JDBC, with no specific data structure.
Hence, each dataset in an RDD is logically partitioned across many servers so that it can be computed on different nodes of the cluster. RDDs are fault tolerant, i.e. they possess self-recovery in the case of failure.

There are three ways to create RDDs in Spark: from data in stable storage, from other RDDs, and by parallelizing an already existing collection in the driver program. One can also operate on Spark RDDs in parallel with a low-level API that offers transformations and actions. We will study these Spark RDD operations later in this section.
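A short sketch of the three creation paths, assuming an existing SparkContext sc (the file path and data are placeholders):

// 1. From data in stable storage (e.g. a text file on HDFS or the local filesystem)
val fromFile = sc.textFile("hdfs:///path/to/input.txt")   // hypothetical path

// 2. By parallelizing an existing collection in the driver program
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 3. From another RDD, via a transformation
val fromOtherRdd = fromCollection.map(_ * 10)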
Spark RDDs can also be cached and manually partitioned. Caching is beneficial when we use an RDD several times, and manual partitioning is important to correctly balance partitions. Generally, smaller partitions allow distributing RDD data more equally among more executors, whereas fewer, larger partitions make the work easier to manage.
Programmers can also call a persist method to indicate which RDDs they want to reuse in future
operations. Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not
enough RAM. Users can also request other persistence strategies, such as storing the RDD only on disk
or replicating it across machines, through flags to persist.
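For example, assuming an RDD named logs that is reused several times, persistence can be requested as follows (a sketch; the storage levels shown are standard Spark options):

import org.apache.spark.storage.StorageLevel

val cached = logs.persist(StorageLevel.MEMORY_ONLY)      // keep in memory, recompute if evicted
// logs.persist(StorageLevel.DISK_ONLY)                  // store only on disk
// logs.persist(StorageLevel.MEMORY_AND_DISK_2)          // spill to disk and replicate on two nodes

cached.count()   // the first action materialises and caches the RDD
cached.count()   // later actions reuse the cached partitions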
Several features of Apache Spark RDDs are in-memory computation, lazy evaluation, fault tolerance, immutability, partitioning, persistence, and coarse-grained operations.

Transformations
In Spark, the core data structures are immutable, meaning they cannot be changed once created. This might seem like a strange concept at first: if you cannot change it, how are you supposed to use it? In order to “change” a DataFrame you have to instruct Spark how you would like to modify the DataFrame you have into the one that you want. These instructions are called transformations. Transformations are the core of how you express your business logic using Spark. There are two types of transformations: those that specify narrow dependencies and those that specify wide dependencies.

Narrow Transformation

Narrow transformations are the result of functions such as map() and filter(); they compute data that lives on a single partition, meaning there will not be any data movement between partitions to execute narrow transformations.

Functions such as map(), mapPartitions(), flatMap(), filter() and union() are some examples of narrow transformations.
Wider Transformation

Wider transformations are the result of functions such as groupByKey() and reduceByKey(); they compute data that lives on many partitions, meaning there will be data movement between partitions to execute wider transformations. Since these shuffle the data, they are also called shuffle transformations.
Functions such as groupByKey(), aggregateByKey(), aggregate(), join() and repartition() are some examples of wider transformations.
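A small sketch contrasting the two kinds, assuming an existing SparkContext sc (the data is illustrative): filter and map stay within their partitions, while reduceByKey shuffles records with the same key across partitions.

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)), 4)   // 4 partitions

// Narrow transformations: no data movement between partitions
val narrowed = pairs.filter(_._2 > 1).map { case (k, v) => (k, v * 10) }

// Wide (shuffle) transformation: values for the same key are brought together
val wide = narrowed.reduceByKey(_ + _)

wide.collect().foreach(println)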

Spark Actions
RDD action methods and their definitions:

aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): U
    Aggregate the elements of each partition, and then the results for all the partitions.
collect(): Array[T]
    Return the complete dataset as an Array.
count(): Long
    Return the count of elements in the dataset.
countApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]
    Return an approximate count of elements in the dataset; this method returns an incomplete result when execution time exceeds the timeout.
countApproxDistinct(relativeSD: Double = 0.05): Long
    Return an approximate number of distinct elements in the dataset.
countByValue(): Map[T, Long]
    Return a Map[T, Long] whose keys represent each unique value in the dataset and whose values represent the count of each value present.
countByValueApprox(timeout: Long, confidence: Double = 0.95)(implicit ord: Ordering[T] = null): PartialResult[Map[T, BoundedDouble]]
    Same as countByValue() but returns an approximate result.
first(): T
    Return the first element in the dataset.
fold(zeroValue: T)(op: (T, T) ⇒ T): T
    Aggregate the elements of each partition, and then the results for all the partitions.
foreach(f: (T) ⇒ Unit): Unit
    Iterates over all elements in the dataset by applying function f to each element.
foreachPartition(f: (Iterator[T]) ⇒ Unit): Unit
    Similar to foreach, but applies function f to each partition.
min()(implicit ord: Ordering[T]): T
    Return the minimum value from the dataset.
max()(implicit ord: Ordering[T]): T
    Return the maximum value from the dataset.
reduce(f: (T, T) ⇒ T): T
    Reduces the elements of the dataset using the specified binary operator.
saveAsObjectFile(path: String): Unit
    Saves the RDD as serialized objects to the storage system.
saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit
    Saves the RDD as a compressed text file.
saveAsTextFile(path: String): Unit
    Saves the RDD as a text file.
take(num: Int): Array[T]
    Return the first num elements of the dataset.
takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]
    Return the first num (smallest) elements from the dataset; this is the opposite of the top() action. Note: use this method only when the resulting array is small, as all the data is loaded into the driver's memory.
takeSample(withReplacement: Boolean, num: Int, seed: Long = Utils.random.nextLong): Array[T]
    Return a subset of the dataset in an Array. Note: use this method only when the resulting array is small, as all the data is loaded into the driver's memory.
toLocalIterator(): Iterator[T]
    Return the complete dataset as an Iterator. Note: use this method only when the resulting data is small, as all the data is loaded into the driver's memory.
top(num: Int)(implicit ord: Ordering[T]): Array[T]
    Return the top num (largest) elements from the dataset. Note: use this method only when the resulting array is small, as all the data is loaded into the driver's memory.
treeAggregate
    Aggregates the elements of this RDD in a multi-level tree pattern.
treeReduce
    Reduces the elements of this RDD in a multi-level tree pattern.
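A few of the actions above in use, assuming an existing SparkContext sc (the data and output path are illustrative):

val nums = sc.parallelize(Seq(5, 1, 4, 2, 3))

println(nums.count())                  // 5
println(nums.first())                  // 5
println(nums.reduce(_ + _))            // 15
println(nums.take(2).mkString(","))    // 5,1
println(nums.top(2).mkString(","))     // 5,4
nums.saveAsTextFile("/tmp/nums-out")   // hypothetical output path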

DataFrame
A DataFrame is a programming abstraction in the Spark SQL module. DataFrames resemble relational database tables or Excel spreadsheets with headers: the data resides in rows and columns of different datatypes.

Processing is achieved using complex user-defined functions and familiar data manipulation functions, such as sort, join, group, etc.

The information for distributed data is structured into schemas. Every column in a DataFrame contains the column name, datatype, and nullable properties. When nullable is set to true, a column accepts null values as well.

RDD to Data frames

Convert RDD to DataFrame – Using toDF()

Spark provides an implicit function toDF() which would be used to convert RDD, Seq[T], List[T] to
DataFrame. In order to use toDF() function, we should import implicits first using import
spark.implicits._.

val dfFromRDD1 = rdd.toDF()


dfFromRDD1.printSchema()
By default, the toDF() function creates column names "_1" and "_2", like tuple fields, and outputs the schema below.

root
|-- _1: string (nullable = true)
|-- _2: string (nullable = true)
toDF() has another signature that takes arguments to define column names as shown below.

val dfFromRDD1 = rdd.toDF("language","users_count")


dfFromRDD1.printSchema()
This outputs the schema below.

root
|-- language: string (nullable = true)
|-- users_count: string (nullable = true)
By default, the datatype of these columns is inferred from the data and nullable is set to true. We can change this behavior by supplying a schema using StructType, where we can specify a column name, data type and nullable flag for each field/column.
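A hedged sketch of supplying an explicit schema with createDataFrame (the data mirrors the language/users_count example but is hypothetical):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

val rowRdd = spark.sparkContext.parallelize(Seq(Row("Java", 20000L), Row("Python", 100000L)))

val schema = StructType(Seq(
  StructField("language", StringType, nullable = true),
  StructField("users_count", LongType, nullable = false)
))

val dfWithSchema = spark.createDataFrame(rowRdd, schema)
dfWithSchema.printSchema()   // users_count is now long and non-nullable instead of an inferred string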

Convert RDD to DataFrame – Using createDataFrame()

The SparkSession class provides a createDataFrame() method to create a DataFrame; it takes an rdd object as an argument, and you can chain it with toDF() to specify names for the columns.

val columns = Seq("language","users_count")


val dfFromRDD2 = spark.createDataFrame(rdd).toDF(columns:_*)
Here, we are using the Scala operator :_* to expand the columns array into comma-separated values.
Catalyst optimizer
The Catalyst Optimizer in Spark offers rule-based and cost-based optimization. Rule-based optimization indicates how to execute the query from a set of defined rules. Meanwhile, cost-based optimization generates multiple execution plans and compares them to choose the lowest-cost one.
Phases
The four phases of the transformation that Catalyst performs are as follows:
1. Analysis
The first phase of Spark SQL optimization is analysis. Spark SQL starts with a relation to be processed that can come in two ways: from an AST (abstract syntax tree) returned by an SQL parser, or from a DataFrame object of the Spark SQL API.
2. Logical optimization plan
The second phase is the logical optimization plan. In this phase, rule-based optimization is applied to the logical plan. It is possible to easily add new rules.
3. Physical plan
In the physical plan phase, Spark SQL takes the logical plan and generates one or more physical plans using the physical operators that match the Spark execution engine. The plan to be executed is selected using the cost-based model (comparison between model costs).
4. Code generation
Code generation is the final phase of optimizing Spark SQL. To run on each machine, it is necessary to generate Java bytecode.

(Figure: phases of the query plan in Spark SQL; rounded squares represent the Catalyst trees.)
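These phases can be inspected for any query with explain(); a sketch assuming a DataFrame df with an age column and import spark.implicits._:

// extended = true prints the parsed and analyzed logical plans, the optimized logical plan
// produced by Catalyst, and the selected physical plan
df.filter($"age" > 21).groupBy("age").count().explain(true)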
Data Frame Transformations

The transformations themselves can be divided into two groups: DataFrame transformations and column transformations. The first group transforms the entire DataFrame, for example:
df.select(col1, col2, col3)
df.filter(col('user_id') == 123)

df.orderBy('age')
...

The most frequently used DataFrame transformations are probably the following (but it of course depends on the use case); a combined example follows the list:


1. select(), withColumn() — for projecting columns
2. filter() — for filtering
3. orderBy(), sort(), sortWithinPartitions() — for sorting
4. distinct(), dropDuplicates() — for deduplication
5. join() — for joining (see my other article about joins in Spark 3.0)
6. groupBy() — for aggregations
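A short chained sketch combining several of these (the DataFrame df and its columns are hypothetical; the $ column syntax assumes import spark.implicits._):

val result = df
  .select("user_id", "age", "country")     // projection
  .filter($"age" >= 18)                    // filtering
  .dropDuplicates("user_id")               // deduplication
  .orderBy($"age".desc)                    // sorting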
Working with Dates and Timestamps
Spark SQL provides built-in standard Date and Timestamp (date plus time) functions defined in the DataFrame API; these come in handy when we need to operate on dates and times. All of them accept input as Date, Timestamp or String types. If the input is a String, it should be in a format that can be cast to a date, such as yyyy-MM-dd, or to a timestamp, such as yyyy-MM-dd HH:mm:ss.SSSS; the functions return a date or timestamp respectively, and return null if the input was a string that could not be cast to a date or timestamp.
Date and timestamp function signatures and descriptions:

current_date(): Column
    Returns the current date as a date column.
date_format(dateExpr: Column, format: String): Column
    Converts a date/timestamp/string to a string value in the format specified by the date format given by the second argument.
to_date(e: Column): Column
    Converts the column into `DateType` by casting rules to `DateType`.
to_date(e: Column, fmt: String): Column
    Converts the column into a `DateType` with a specified format.
add_months(startDate: Column, numMonths: Int): Column
    Returns the date that is `numMonths` after `startDate`.
date_add(start: Column, days: Int): Column
    Returns the date that is `days` days after `start`.
date_sub(start: Column, days: Int): Column
    Returns the date that is `days` days before `start`.
datediff(end: Column, start: Column): Column
    Returns the number of days from `start` to `end`.
months_between(end: Column, start: Column): Column
    Returns the number of months between dates `start` and `end`. A whole number is returned if both inputs have the same day of month or both are the last day of their respective months. Otherwise, the difference is calculated assuming 31 days per month.
months_between(end: Column, start: Column, roundOff: Boolean): Column
    Returns the number of months between dates `end` and `start`. If `roundOff` is set to true, the result is rounded off to 8 digits; it is not rounded otherwise.
next_day(date: Column, dayOfWeek: String): Column
    Returns the first date which is later than the value of the `date` column that is on the specified day of the week. For example, `next_day('2015-07-27', "Sunday")` returns 2015-08-02 because that is the first Sunday after 2015-07-27.
trunc(date: Column, format: String): Column
    Returns the date truncated to the unit specified by the format. For example, `trunc("2018-11-19 12:01:19", "year")` returns 2018-01-01. Format: 'year', 'yyyy', 'yy' to truncate by year; 'month', 'mon', 'mm' to truncate by month.
date_trunc(format: String, timestamp: Column): Column
    Returns the timestamp truncated to the unit specified by the format. For example, `date_trunc("year", "2018-11-19 12:01:19")` returns 2018-01-01 00:00:00. Format: 'year', 'yyyy', 'yy' to truncate by year; 'month', 'mon', 'mm' to truncate by month; 'day', 'dd' to truncate by day. Other options are: 'second', 'minute', 'hour', 'week', 'quarter'.
year(e: Column): Column
    Extracts the year as an integer from a given date/timestamp/string.
quarter(e: Column): Column
    Extracts the quarter as an integer from a given date/timestamp/string.
month(e: Column): Column
    Extracts the month as an integer from a given date/timestamp/string.
dayofweek(e: Column): Column
    Extracts the day of the week as an integer from a given date/timestamp/string. Ranges from 1 for a Sunday through to 7 for a Saturday.
dayofmonth(e: Column): Column
    Extracts the day of the month as an integer from a given date/timestamp/string.
dayofyear(e: Column): Column
    Extracts the day of the year as an integer from a given date/timestamp/string.
weekofyear(e: Column): Column
    Extracts the week number as an integer from a given date/timestamp/string. A week is considered to start on a Monday and week 1 is the first week with more than 3 days, as defined by ISO 8601.
last_day(e: Column): Column
    Returns the last day of the month to which the given date belongs. For example, input "2015-07-27" returns "2015-07-31" since July 31 is the last day of the month in July 2015.
from_unixtime(ut: Column): Column
    Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format.
from_unixtime(ut: Column, f: String): Column
    Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
unix_timestamp(): Column
    Returns the current Unix timestamp (in seconds) as a long.
unix_timestamp(s: Column): Column
    Converts a time string in format yyyy-MM-dd HH:mm:ss to a Unix timestamp (in seconds), using the default timezone and the default locale.
unix_timestamp(s: Column, p: String): Column
    Converts a time string with the given pattern to a Unix timestamp (in seconds).
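A few of these functions in use (the literal dates are illustrative; toDF and the $ syntax assume import spark.implicits._):

import org.apache.spark.sql.functions._

val datesDf = Seq("2015-07-27", "2018-11-19").toDF("d")

datesDf.select(
  current_date().as("today"),
  to_date($"d").as("as_date"),
  add_months(to_date($"d"), 3).as("plus_3_months"),
  datediff(current_date(), to_date($"d")).as("days_since"),
  date_format(to_date($"d"), "yyyy/MM/dd").as("formatted")
).show(false)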
Working with Nulls in Data
We know Spark needs to be aware of nulls in the data, but you, as a programmer, should be aware of some details. Null in Spark is not as straightforward as we wish it to be. At the beginning of this article, I stated that this is not a simple problem we face here. Here I'm going to discuss why I think that's the case:
Spark is null safe, well, almost!

The fact that Spark functions are null safe (at least most of the time) is quite pleasant. Take a look at the following example:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
import spark.implicits._   // needed for the $"col" column syntax below

val schema = List(
  StructField("v1", IntegerType, true),
  StructField("v2", IntegerType, true)
)

val data = Seq(
  Row(1, 2),
  Row(3, 4),
  Row(null, 5)
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)

val result = df.withColumn("v3", $"v1" + $"v2")

As you can see, the third row of our data contains a null, but as shown in the following output, Spark computes the result of that row as null (which is the desired value if one part of your calculation is already null):
scala> result.show
+----+---+----+
| v1| v2| v3|
+----+---+----+
| 1| 2| 3|
| 3| 4| 7|
|null| 5|null|
+----+---+----+
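If the null should not propagate, the standard Spark helpers coalesce and na.fill can substitute a default; a sketch continuing the example above:

import org.apache.spark.sql.functions.{coalesce, lit}

// Treat a missing v1 as 0 before adding
val resultNoNull = df.withColumn("v3", coalesce($"v1", lit(0)) + $"v2")

// Or replace nulls in the input columns up front
val filled = df.na.fill(0L, Seq("v1", "v2"))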
Working with Complex Types

Apache Spark natively supports complex data types, and in some cases like JSON, where an appropriate data source connector is available, it makes a pretty decent dataframe representation of the data. Top-level key value pairs are presented in their own columns, whilst more complex hierarchical data is persisted using a column cast to a complex data type. Using dot notation within a select clause, individual data points within a complex object can be selected. For example:
from pyspark.sql.functions import col

jsonStrings = ['{"car":{"color":"red","model":"jaguar"},"name":"Jo","address":{"city":"Houston",' + \
    '"state":"Texas","zip":{"first":1234,"second":4321}}}']
otherPeopleRDD = spark.sparkContext.parallelize(jsonStrings)
source_json_df = spark.read.json(otherPeopleRDD)

source_json_df.select(col("car.color"), col("car.model")).show()

This will return the following dataframe:

This mechanism is simple and it works. However, if the data is complex, has multiple levels, spans a large number of attributes and/or columns, each aligned to a different schema, and the consumer of the data isn't able to cope (i.e. like most BI tools, which like to report from relational databases like Oracle, MySQL, etc.), then problems will ensue. The manual approach of writing out the select statement can be labour intensive too and difficult to maintain (from a coding perspective).

To simplify working with complex data, this article will present a function designed to transform multi-level complex hierarchical columns into a non-hierarchical version of themselves; essentially, a dataframe that has no complex data type columns. All nested attributes are assigned their own column named after their original location. For example:

car.color becomes car_color
Getting Started, the Approach

Let's assume that we need to transform the following JSON, which has been loaded into Spark using spark.read.json:
{
"car":{
"color":"red",
"model":"jaguar"
},

"name":"Jo",
"address":{
"city":"Houston",
"state":"Texas",
"zip":{
"first":1234,
"second":4321
}
}
}

The first task is to create a function that can parse the schema bound to the Dataframe. The schema is

accessed via a property of the same name found on the dataframe itself.
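A hedged Scala sketch of such a flattening helper (the function names are ours): it walks the schema recursively and aliases each leaf column with its dotted path, dots replaced by underscores.

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

def flattenColumns(schema: StructType, prefix: String = ""): Seq[Column] =
  schema.fields.toSeq.flatMap { field =>
    val path = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
    field.dataType match {
      case nested: StructType => flattenColumns(nested, path)                 // recurse into structs
      case _                  => Seq(col(path).alias(path.replace(".", "_"))) // leaf column
    }
  }

def flattenDataFrame(df: DataFrame): DataFrame =
  df.select(flattenColumns(df.schema): _*)

// flattenDataFrame(jsonDf) would yield columns such as car_color, address_zip_first and name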

Working with JSON


Spark SQL provides a natural syntax for querying JSON data along with automatic inference of JSON schemas for both reading and writing data. Spark SQL understands the nested fields in JSON data and allows users to directly access these fields without any explicit transformations. For example, a query over nested JSON fields in Spark SQL can be written as follows:
SELECT name, age, address.city, address.state FROM people

Loading and saving JSON datasets in Spark SQL


To query a JSON dataset in Spark SQL, one only needs to point Spark SQL to the location of the data. The
schema of the dataset is inferred and natively available without any user specification. In the
programmatic APIs, it can be done through jsonFile and jsonRDD methods provided by SQLContext. With
these two methods, you can create a SchemaRDD for a given JSON dataset and then you can register the
SchemaRDD as a table. Here is an example:
// Create a SQLContext (sc is an existing SparkContext)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Suppose that you have a text file called people with the following content:
// {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
// {"name":"Michael", "address":{"city":null, "state":"California"}}
// Create a SchemaRDD for the JSON dataset.
val people = sqlContext.jsonFile("[the path to file people]")
// Register the created SchemaRDD as a temporary table.
people.registerTempTable("people")

Group By
The GROUP BY clause is used to group the rows based on a set of specified grouping expressions and
compute aggregations on the group of rows based on one or more specified aggregate functions. Spark
also supports advanced aggregations to do multiple aggregations for the same input record set
via GROUPING SETS, CUBE, ROLLUP clauses. The grouping expressions and advanced aggregations can
be mixed in the GROUP BY clause and nested in a GROUPING SETS clause. See more details in
the Mixed/Nested Grouping Analytics section. When a FILTER clause is attached to an aggregate
function, only the matching rows are passed to that function.
Syntax
GROUP BY group_expression [ , group_expression [ , ... ] ] [ WITH { ROLLUP | CUBE } ]

GROUP BY { group_expression | { ROLLUP | CUBE | GROUPING SETS } (grouping_set [ , ...]) } [ , ... ]


While aggregate functions are defined as
aggregate_name ( [ DISTINCT ] expression [ , ... ] ) [ FILTER ( WHERE boolean_expression ) ]
 group_expression
Specifies the criteria based on which the rows are grouped together. The grouping of rows is
performed based on result values of the grouping expressions. A grouping expression may be a
column name like GROUP BY a, a column position like GROUP BY 0, or an expression like GROUP BY
a + b.
 grouping_set
A grouping set is specified by zero or more comma-separated expressions in parentheses. When the
grouping set has only one element, parentheses can be omitted. For example, GROUPING SETS ((a),
(b)) is the same as GROUPING SETS (a, b).
Syntax: { ( [ expression [ , ... ] ] ) | expression }
 GROUPING SETS
Groups the rows for each grouping set specified after GROUPING SETS. For example, GROUP BY
GROUPING SETS ((warehouse), (product)) is semantically equivalent to union of results of GROUP BY
warehouse and GROUP BY product. This clause is a shorthand for a UNION ALL where each leg of
the UNION ALL operator performs aggregation of each grouping set specified in the GROUPING
SETS clause. Similarly, GROUP BY GROUPING SETS ((warehouse, product), (product), ()) is
semantically equivalent to the union of results of GROUP BY warehouse, product, GROUP BY
product and global aggregate.
 ROLLUP
Specifies multiple levels of aggregations in a single statement. This clause is used to compute
aggregations based on multiple grouping sets. ROLLUP is a shorthand for GROUPING SETS. For
example, GROUP BY warehouse, product WITH ROLLUP or GROUP BY ROLLUP(warehouse,
product) is equivalent to GROUP BY GROUPING SETS((warehouse, product), (warehouse),
()). GROUP BY ROLLUP(warehouse, product, (warehouse, location)) is equivalent to GROUP BY
GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse), ()). The N
elements of a ROLLUP specification results in N+1 GROUPING SETS.
 CUBE
CUBE clause is used to perform aggregations based on combination of grouping columns specified in
the GROUP BY clause. CUBE is a shorthand for GROUPING SETS. For example, GROUP BY warehouse,
product WITH CUBE or GROUP BY CUBE(warehouse, product) is equivalent to GROUP BY GROUPING
SETS((warehouse, product), (warehouse), (product), ()). GROUP BY CUBE(warehouse, product,
(warehouse, location)) is equivalent to GROUP BY GROUPING SETS((warehouse, product, location),
(warehouse, product), (warehouse, location), (product, warehouse, location), (warehouse),

(product), (warehouse, product), ()). The N elements of a CUBE specification results in
2^N GROUPING SETS.
 aggregate_name
Specifies an aggregate function name (MIN, MAX, COUNT, SUM, AVG, etc.).
 DISTINCT
Removes duplicates in input rows before they are passed to aggregate functions.
 FILTER
Filters the input rows: only rows for which the boolean_expression in the WHERE clause evaluates to true are passed to the aggregate function; other rows are discarded.
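A small sketch (the sales table and its columns are hypothetical) showing a plain GROUP BY and the equivalent of WITH ROLLUP written as GROUPING SETS:

spark.sql("""
  SELECT warehouse, product, SUM(quantity) AS total
  FROM sales
  GROUP BY warehouse, product
""").show()

// Equivalent to GROUP BY warehouse, product WITH ROLLUP
spark.sql("""
  SELECT warehouse, product, SUM(quantity) AS total
  FROM sales
  GROUP BY GROUPING SETS ((warehouse, product), (warehouse), ())
""").show()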
Window Functions

Window function syntax and descriptions:

row_number(): Column
    Returns a sequential number starting from 1 within a window partition.
rank(): Column
    Returns the rank of rows within a window partition, with gaps.
percent_rank(): Column
    Returns the percentile rank of rows within a window partition.
dense_rank(): Column
    Returns the rank of rows within a window partition without any gaps, whereas rank() returns rank with gaps.
ntile(n: Int): Column
    Returns the ntile id in a window partition.
cume_dist(): Column
    Returns the cumulative distribution of values within a window partition.
lag(e: Column, offset: Int): Column
lag(columnName: String, offset: Int): Column
lag(columnName: String, offset: Int, defaultValue: Any): Column
    Returns the value that is `offset` rows before the current row, and `null` if there are fewer than `offset` rows before the current row.
lead(e: Column, offset: Int): Column
lead(columnName: String, offset: Int): Column
lead(columnName: String, offset: Int, defaultValue: Any): Column
    Returns the value that is `offset` rows after the current row, and `null` if there are fewer than `offset` rows after the current row.
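A sketch using a few of these (the employees DataFrame and its columns are hypothetical; the $ syntax assumes import spark.implicits._):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, rank, lag}

val w = Window.partitionBy("department").orderBy($"salary".desc)

employees.select(
  $"name", $"department", $"salary",
  row_number().over(w).as("row_number"),
  rank().over(w).as("rank"),
  lag("salary", 1).over(w).as("previous_salary")
).show()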

Join in Spark SQL
Following are the different types of Joins:

1. INNER JOIN
The INNER JOIN returns the dataset which has the rows that have matching values in both the datasets

i.e. value of the common field will be the same.

2. CROSS JOIN
The CROSS JOIN returns a dataset whose number of rows is the number of rows in the first dataset multiplied by the number of rows in the second dataset. Such a result is called the Cartesian product.

3. LEFT OUTER JOIN


The LEFT OUTER JOIN returns the dataset that has all rows from the left dataset, and the matched rows

from the right dataset.

4. RIGHT OUTER JOIN


The RIGHT OUTER JOIN returns the dataset that has all rows from the right dataset, and the matched

rows from the left dataset.

5. FULL OUTER JOIN


The FULL OUTER JOIN returns the dataset that has all rows when there is a match in either the left or

right dataset.

6. LEFT SEMI JOIN


The LEFT SEMI JOIN returns the dataset which has all rows from the left dataset having their

correspondence in the right dataset. Unlike the LEFT OUTER JOIN, the returned dataset in LEFT SEMI

JOIN contains only the columns from the left dataset.

7. LEFT ANTI JOIN


The LEFT ANTI JOIN returns the dataset which has all the rows from the left dataset that don't have a match in the right dataset. It also contains only the columns from the left dataset.
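In the DataFrame API the join type is passed as a string; a sketch with hypothetical emp and dept DataFrames sharing a dept_id column:

emp.join(dept, emp("dept_id") === dept("dept_id"), "inner").show()
emp.join(dept, emp("dept_id") === dept("dept_id"), "left_outer").show()
emp.join(dept, emp("dept_id") === dept("dept_id"), "full_outer").show()
emp.join(dept, emp("dept_id") === dept("dept_id"), "left_semi").show()
emp.join(dept, emp("dept_id") === dept("dept_id"), "left_anti").show()
emp.crossJoin(dept).show()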

Data sources
This section describes the Apache Spark data sources you can use in Databricks. Many include a
notebook that demonstrates how to use the data source to read and write data.

The following data sources are either directly supported in Databricks Runtime or require simple shell
commands to enable access:
 Avro file
 Binary file
 CSV file
 Hive table
 Image
 JSON file
 LZO compressed file
 MLflow experiment
 Parquet file
 XML file
 Zip files
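Reading and writing most of these sources follows the same DataFrameReader/DataFrameWriter pattern; a sketch with placeholder paths:

val csvDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/input.csv")                              // hypothetical path

val jsonDf    = spark.read.json("/data/input.json")
val parquetDf = spark.read.parquet("/data/input.parquet")

csvDf.write.mode("overwrite").parquet("/data/output.parquet")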
Broadcast Variables
In Spark RDDs and DataFrames, broadcast variables are read-only shared variables that are cached and available on all nodes in a cluster in order to be accessed or used by the tasks. Instead of sending this data along with every task, Spark distributes broadcast variables to the machines using efficient broadcast algorithms to reduce communication costs.
How does Spark Broadcast work?

Broadcast variables are used in the same way for RDD, DataFrame, and Dataset.
When you run Spark RDD or DataFrame jobs that have broadcast variables defined and used, Spark does the following.
 Spark breaks the job into stages that have distributed shuffling, and actions are executed within the stage.
 Later, stages are also broken into tasks.
 Spark broadcasts the common (reusable) data needed by tasks within each stage.
 The broadcast data is cached in serialized form and deserialized before executing each task.
You should create and use broadcast variables for data that is shared across multiple stages and tasks.
Note that broadcast variables are not sent to executors with the sc.broadcast(variable) call; instead, they are sent to the executors when they are first used.
How to create Broadcast variable

The Spark Broadcast is created using the broadcast(v) method of the SparkContext class. This method
takes the argument v that you want to broadcast.
import org.apache.spark.sql.SparkSession

object RDDBroadcast extends App {

  val spark = SparkSession.builder()
    .appName("SparkByExamples.com")
    .master("local")
    .getOrCreate()

  val states = Map(("NY", "New York"), ("CA", "California"), ("FL", "Florida"))
  val countries = Map(("USA", "United States of America"), ("IN", "India"))

  val broadcastStates = spark.sparkContext.broadcast(states)
  val broadcastCountries = spark.sparkContext.broadcast(countries)

  val data = Seq(("James", "Smith", "USA", "CA"),
    ("Michael", "Rose", "USA", "NY"),
    ("Robert", "Williams", "USA", "CA"),
    ("Maria", "Jones", "USA", "FL")
  )

  val rdd = spark.sparkContext.parallelize(data)

  val rdd2 = rdd.map(f => {
    val country = f._3
    val state = f._4
    val fullCountry = broadcastCountries.value.get(country).get
    val fullState = broadcastStates.value.get(state).get
    (f._1, f._2, fullCountry, fullState)
  })

  println(rdd2.collect().mkString("\n"))
}

Accumulators
Accumulators are variables that are only “added” to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If accumulators are created with a name, they will be displayed in Spark’s UI. This can be useful for understanding the progress of running stages (note: this is not yet supported in Python).
An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running
on the cluster can then add to it using the add method or the += operator (in Scala and Python).
However, they cannot read its value. Only the driver program can read the accumulator’s value, using
its value method.
The code given below shows an accumulator being used to add up the elements of an array −

scala> val accum = sc.accumulator(0)

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

If you want to see the output of above code then use the following command −

scala> accum.value

Output
res2: Int = 10
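Note that sc.accumulator(...) is the older accumulator API; since Spark 2.0 the usual approach is a named LongAccumulator, which also appears in the web UI. A sketch:

val accum = sc.longAccumulator("sum accumulator")
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
println(accum.value)   // 10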
On-Premises Cluster Deployments
Implementing an enterprise-ready, on-premises Spark deployment can be very complex and it requires

expertise that is generally not available to all.

BlueData makes it easier to deploy Apache Spark on-premises. With BlueData, you can spin up virtual

Spark clusters within minutes – providing secure, self-service, on-demand access to Big Data analytics

and infrastructure. You can deploy Spark in standalone mode or with Hadoop / YARN. You can also build

analytical pipelines and create Spark clusters using our RESTful APIs, and use web-based Zeppelin

notebooks for interactive data analytics.

BlueData’s software platform leverages virtualization and Docker containers – combined with our own

patent-pending innovations – to make it faster, and more cost-effective for enterprises to get up and

running with a multi-tenant Spark deployment on-premises.

Apache Spark Standalone Cluster Manager
Standalone mode is a simple cluster manager incorporated with Spark. It makes it easy to set up a cluster that Spark itself manages, and it can run on Linux, Windows, or Mac OS X. It is often the simplest way to run a Spark application in a clustered environment.
a. How does a Spark standalone cluster work?

It has a master and a number of workers, each with a configured amount of memory and CPU cores. In Spark standalone cluster mode, Spark allocates resources based on cores. By default, an application will grab all the cores in the cluster.
In the standalone cluster manager, a ZooKeeper quorum can recover the master using a standby master. Using the file system, we can achieve manual recovery of the master. Spark supports authentication with the help of a shared secret across the entire cluster: the user configures each node with the shared secret. For communication protocols, data is encrypted using SSL, but for block transfers it makes use of SASL encryption.
To monitor an application, each Apache Spark application has a web user interface. The web UI provides information about executors, storage usage, and running tasks in the application. In this cluster manager, we also have a web UI to view cluster and job statistics, with detailed log output for each job. If an application has logged events during its lifetime, the Spark web UI will reconstruct the application's UI after the application exits.
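Submitting an application to a standalone master uses the same spark-submit pattern shown for YARN below, with the master URL pointing at the standalone master (the host and class names are placeholders):

$ ./bin/spark-submit --class path.to.your.Class --master spark://master-host:7077 --deploy-mode cluster [options] <app jar> [app options]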

Spark on YARN
In cluster mode, the Spark driver runs inside an application master process which is managed by YARN
on the cluster, and the client can go away after initiating the application. In client mode, the driver runs
in the client process, and the application master is only used for requesting resources from YARN.
Unlike other cluster managers supported by Spark in which the master’s address is specified in the --
master parameter, in YARN mode the ResourceManager’s address is picked up from the Hadoop
configuration. Thus, the --master parameter is yarn.
To launch a Spark application in cluster mode:

$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar>
[app options]

Spark log files
Apache Spark log files can be useful in identifying issues with your Spark processes.
Table 1 lists the base log files that Spark generates.
Log file and location:

Master logs
    $SPARK_LOG_DIR/spark-userID-org.apache.spark.deploy.master.Master-instance-host.out
Worker logs
    $SPARK_LOG_DIR/spark-userID-org.apache.spark.deploy.worker.Worker-instance-host.out
Driver logs (client deploy mode)
    Printed on the command line by default.
Driver logs (cluster deploy mode)
    stdout: $SPARK_WORKER_DIR/driverID/stdout
    stderr: $SPARK_WORKER_DIR/driverID/stderr
Executor logs
    stdout: $SPARK_WORKER_DIR/applID/executorID/stdout
    stderr: $SPARK_WORKER_DIR/applID/executorID/stderr

Table 1. Apache Spark log files
The Spark UI- Spark UI History Server

You can use an AWS CloudFormation template to start the Apache Spark history server and view the
Spark web UI. These templates are samples that you should modify to meet your requirements.
To start the Spark history server and view the Spark UI using AWS CloudFormation
1. Choose one of the Launch Stack buttons in the following table. This launches the stack on the AWS
CloudFormation console.
Region-specific Launch Stack buttons (for Glue 1.0/2.0 and for Glue 3.0) are provided for: US East (Ohio), US East (N. Virginia), US West (N. California), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Osaka), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Paris), Europe (Stockholm), and South America (São Paulo). For Africa (Cape Town), Asia Pacific (Hong Kong), Europe (Milan), and Middle East (Bahrain), you must first enable console access to the Region.


2. On the Specify template page, choose Next.
3. On the Specify stack details page, enter the Stack name. Enter additional information under Parameters.
a. Spark UI Configuration
Provide the following information:
 IP address range — The IP address range that can be used to view the Spark UI. If you want to restrict
access from a specific IP address range, you should use a custom value.
 History server port — The port for the Spark UI. You can use the default value.
 Event log directory — Choose the location where Spark event logs are stored from the AWS Glue job or
development endpoints. You must use s3a:// for the event logs path scheme.
 Spark package location — You can use the default value.
 Keystore path — SSL/TLS keystore path for HTTPS. If you want to use a custom keystore file, you can specify the S3 path s3://path_to_your_keystore_file here. If you leave this parameter empty, a self-signed certificate based keystore is generated and used.
 Keystore password — Enter an SSL/TLS keystore password for HTTPS.

Debugging and Spark First Aid


Debug Spark application running Locally

To debug a Scala or Java application, you need to run the application with the JVM option -agentlib:jdwp, the Java Debug Wire Protocol (JDWP) option, followed by a comma-separated list of sub-options:

-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
But to run with spark-submit, you need to pass the -agentlib:jdwp option via --conf spark.driver.extraJavaOptions, as shown below.

spark-submit \
  --name SparkByExamples.com \
  --class org.sparkbyexamples.SparkWordCountExample \
  --conf spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 \
  spark-by-examples.jar
By running the above command, it prompts you with the below message, and your application pauses.

Listening for transport dt_socket at address: 5005


Now, open the IntelliJ editor and do the following.
 Open the Spark project you want to debug.
 Add debugging breakpoints to the Scala classes.
Then follow the steps below to create a Remote debug configuration and start debugging.
 Open the Spark application you want to debug in the IntelliJ IDEA IDE.
 Access Run -> Edit Configurations; this brings up the Run/Debug Configurations window.
 Select Applications, select the + sign in the top left corner, and choose the Remote option.
 Enter a debugger name in the Name field, for example SparkLocalDebug.
 For the Debugger mode option, select Attach to local JVM.
 For Transport, select Socket (selected by default).
 For Host, enter localhost as we are debugging locally, and enter the port number for Port. For our example, we are using 5005.
 Finally, select OK. This just creates the debug configuration; it doesn't start it.

Spark First Aid

The Spark project consists of different types of tightly integrated components. At its core, Spark is a
computational engine that can schedule, distribute and monitor multiple applications.
Let's understand each Spark component in detail.

Spark Core
o The Spark Core is the heart of Spark and performs the core functionality.
o It holds the components for task scheduling, fault recovery, interacting with storage systems and memory management.
Spark SQL
o The Spark SQL is built on the top of Spark Core. It provides support for structured data.
o It allows querying the data via SQL (Structured Query Language) as well as the Apache Hive variant of SQL called HQL (Hive Query Language).
o It supports JDBC and ODBC connections that establish a relation between Java objects and
existing databases, data warehouses and business intelligence tools.
o It also supports various sources of data like Hive tables, Parquet, and JSON.
Spark Streaming
o Spark Streaming is a Spark component that supports scalable and fault-tolerant processing of streaming data.
o It uses Spark Core's fast scheduling capability to perform streaming analytics.
o It accepts data in mini-batches and performs RDD transformations on that data.
o Its design ensures that the applications written for streaming data can be reused to analyze batches of historical data with little modification.
o The log files generated by web servers can be considered as a real-time example of a data stream.
MLlib
o The MLlib is a Machine Learning library that contains various machine learning algorithms.
o These include correlations and hypothesis testing, classification and regression, clustering, and
principal component analysis.
o It is nine times faster than the disk-based implementation used by Apache Mahout.
GraphX

o GraphX is a library that is used to manipulate graphs and perform graph-parallel computations.
o It facilitates creating a directed graph with arbitrary properties attached to each vertex and edge.
o To manipulate graphs, it supports various fundamental operators like subgraph, joinVertices, and aggregateMessages.

Spark debug locally with IntelliJ


In order to start the application, select Run -> Debug SparkLocalDebug; this tries to start the application by attaching to port 5005.
Now you should see your spark-submit application running, and when it encounters a debug breakpoint, you will get control in IntelliJ.
Now use the debug control keys or options to step through the application. In case you are not sure how to step through, follow this IntelliJ step-through article.
If your Spark application is not running on port 5005 on localhost, this returns the error message below.

Error running 'SparkLocalDebug': Unable to open debugger port (localhost:5005):


java.net.ConnectException "Connection refused: connect" (6 minutes ago)
Debug Spark application running on Remote server

If you are running a Spark application on a remote node and you want to debug it via IntelliJ, you need to set the environment variable SPARK_SUBMIT_OPTS with the debug information.

export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5050
Now run your spark-submit, which will wait for the debugger.
Finally, open IntelliJ and follow the points above; for the host, enter the remote host where your Spark application is running.
