Datasets and DataFrames: org.apache.spark.sql.SparkSession

This document discusses how to use Spark SQL to execute SQL queries and interact with Datasets and DataFrames. It covers creating DataFrames from different data sources, running SQL queries programmatically, and interoperating between RDDs and DataFrames/Datasets.

SQL

One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from
an existing Hive installation. When running SQL from within another programming language the
results will be returned as a Dataset/DataFrame. You can also interact with the SQL interface
using the command-line or over JDBC/ODBC.

Datasets and DataFrames


A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that
provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the
benefits of Spark SQL’s optimized execution engine. A Dataset can be constructed from JVM
objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The
Dataset API is available in Scala and Java. Python does not have support for the Dataset API, but due to Python's dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally: row.columnName). The case for R is similar.
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table
in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
DataFrames can be constructed from a wide array of sources such as: structured data files, tables
in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala,
Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the
Scala API, DataFrame is simply a type alias of Dataset[Row], while in the Java API users need to use Dataset<Row> to represent a DataFrame.
Throughout this document, we will often refer to Scala/Java Datasets of Rows as DataFrames.
Scala:
The entry point into all functionality in Spark is the SparkSession class. To create a
basic SparkSession, just use SparkSession.builder():
import org.apache.spark.sql.SparkSession

val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
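With a SparkSession and its implicits in scope, the typed manipulation of JVM objects described in the previous section can be sketched as follows. This is a minimal illustration, not part of the official example set; the Person case class and the sample values are assumptions:
// Define a case class so the Dataset gets a strong type (illustrative)
case class Person(name: String, age: Long)

// Build a Dataset from JVM objects and apply typed transformations
val people = Seq(Person("Ann", 28), Person("Bob", 35)).toDS()
people.filter(p => p.age > 30)      // lambda over the typed object
  .map(p => p.name.toUpperCase)     // typed map, yields Dataset[String]
  .show()
// +-----+
// |value|
// +-----+
// |  BOB|
// +-----+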
Creating DataFrames
Scala:
With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive
table, or from Spark data sources.
As an example, the following creates a DataFrame based on the content of a JSON file:
val df = spark.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark
repo.

Untyped Dataset Operations (aka DataFrame Operations)


DataFrames provide a domain-specific language for structured data manipulation
in Scala, Java, Python and R.
As mentioned above, in Spark 2.0, DataFrames are just Datasets of Rows in the Scala and Java APIs. These operations are also referred to as “untyped transformations”, in contrast to the “typed transformations” that come with strongly typed Scala/Java Datasets.
Here we include some basic examples of structured data processing using Datasets:

Scala:
// This import is needed to use the $-notation
import spark.implicits._
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// Select only the "name" column
df.select("name").show()
// +-------+
// | name|
// +-------+
// |Michael|
// | Andy|
// | Justin|
// +-------+

// Select everybody, but increment the age by 1
df.select($"name", $"age" + 1).show()
// +-------+---------+
// | name|(age + 1)|
// +-------+---------+
// |Michael| null|
// | Andy| 31|
// | Justin| 20|
// +-------+---------+

// Select people older than 21
df.filter($"age" > 21).show()
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+

// Count people by age
df.groupBy("age").count().show()
// +----+-----+
// | age|count|
// +----+-----+
// | 19| 1|
// |null| 1|
// | 30| 1|
// +----+-----+
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark
repo.
In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the DataFrame Function Reference.
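As a brief illustration (not taken from the official examples), a few of these built-in functions can be applied to the people DataFrame used above; the column names are carried over from that example:
import org.apache.spark.sql.functions._

// String, math and date functions from org.apache.spark.sql.functions
df.select(
  upper($"name").alias("upper_name"),  // string manipulation
  round($"age" / 10).alias("decade"),  // common math operation
  current_date().alias("today")        // date helper
).show()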

Running SQL Queries Programmatically

Scala:

The sql function on a SparkSession enables applications to run SQL queries programmatically
and returns the result as a DataFrame.
// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark
repo.

Global Temporary View


Temporary views in Spark SQL are session-scoped and will disappear if the session that creates them terminates. If you want a temporary view that is shared among all sessions and kept alive until the Spark application terminates, you can create a global temporary view. Global temporary views are tied to a system-preserved database, global_temp, and we must use the qualified name to refer to them, e.g. SELECT * FROM global_temp.view1.
// Register the DataFrame as a global temporary view
df.createGlobalTempView("people")

// Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
// Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark
repo.

Creating Datasets
Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize the objects for processing or transmitting over the network.
While both encoders and standard serialization are responsible for turning an object into bytes,
encoders are code generated dynamically and use a format that allows Spark to perform many
operations like filtering, sorting and hashing without deserializing the bytes back into an object.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface
case class Person(name: String, age: Long)

// Encoders are created for case classes
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+
// Encoders for most common types are automatically provided by importing spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)

// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark
repo.

Interoperating with RDDs


Spark SQL supports two different methods for converting existing RDDs into Datasets. The first
method uses reflection to infer the schema of an RDD that contains specific types of objects.
This reflection based approach leads to more concise code and works well when you already
know the schema while writing your Spark application.
The second method for creating Datasets is through a programmatic interface that allows you to
construct a schema and then apply it to an existing RDD. While this method is more verbose, it
allows you to construct Datasets when the columns and their types are not known until runtime.
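For completeness, here is a minimal sketch of the first, reflection-based approach; it mirrors the schema-inference example in the Spark repo and reuses the Person case class and the people.txt layout from the other examples in this document:
import spark.implicits._

// The case class defines the schema; reflection infers column names and types
case class Person(name: String, age: Long)

// Read the text file, split each line into attributes, map them to Person objects,
// then convert the RDD to a DataFrame
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toLong))
  .toDF()

// The DataFrame can then be registered as a temporary view and queried with SQL
peopleDF.createOrReplaceTempView("people")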
Programmatically Specifying the Schema
When case classes cannot be defined ahead of time (for example, the structure of records is
encoded in a string, or a text dataset will be parsed and fields will be projected differently for
different users), a DataFrame can be created programmatically with three steps.

1. Create an RDD of Rows from the original RDD;
2. Create the schema represented by a StructType matching the structure of Rows in the RDD
created in Step 1.
3. Apply the schema to the RDD of Rows via createDataFrame method provided
by SparkSession.

For example:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
.map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
.map(_.split(","))
.map(attributes => Row(attributes(0), attributes(1).trim))

// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)
// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// | value|
// +-------------+
// |Name: Michael|
// | Name: Andy|
// | Name: Justin|
// +-------------+
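As noted in the comment above, a row's columns can also be accessed by field name rather than by index; a small sketch using the same results DataFrame:
// Access the column by field name instead of positional index
results.map(attributes => "Name: " + attributes.getAs[String]("name")).show()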
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark
repo.

Aggregations
The built-in DataFrame functions provide common aggregations such
as count(), countDistinct(), avg(), max(), min(), etc. While those functions are designed for
DataFrames, Spark SQL also has type-safe versions for some of them in Scala and Java to work
with strongly typed Datasets. Moreover, users are not limited to the predefined aggregate
functions and can create their own.
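As a short illustrative sketch (not from the official examples), a few of these built-in aggregations applied to the people DataFrame from earlier could look like this; the column names are assumptions carried over from that example:
import org.apache.spark.sql.functions._

// Built-in untyped aggregations over the whole DataFrame
df.agg(count("name"), countDistinct("age"), avg("age"), max("age"), min("age")).show()

// The same functions can be combined with groupBy
df.groupBy("age").agg(count("name").alias("people")).show()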

Untyped User-Defined Aggregate Functions


Users have to extend the UserDefinedAggregateFunction abstract class to implement a custom
untyped aggregate function. For example, a user-defined average can look like:
Scala:
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

object MyAverage extends UserDefinedAggregateFunction {
// Data types of input arguments of this aggregate function
def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
// Data types of values in the aggregation buffer
def bufferSchema: StructType = {
StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil)
}
// The data type of the returned value
def dataType: DataType = DoubleType
// Whether this function always returns the same output on the identical input
def deterministic: Boolean = true
// Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
// standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
// the opportunity to update its values. Note that arrays and maps inside the buffer are still
// immutable.
def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = 0L
buffer(1) = 0L
}
// Updates the given aggregation buffer `buffer` with new input data from `input`
def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
if (!input.isNullAt(0)) {
buffer(0) = buffer.getLong(0) + input.getLong(0)
buffer(1) = buffer.getLong(1) + 1
}
}
// Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
}
// Calculates the final result
def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}

// Register the function to access it
spark.udf.register("myAverage", MyAverage)

val df = spark.read.json("examples/src/main/resources/employees.json")
df.createOrReplaceTempView("employees")
df.show()
// +-------+------+
// | name|salary|
// +-------+------+
// |Michael| 3000|
// | Andy| 4500|
// | Justin| 3500|
// | Berta| 4000|
// +-------+------+
val result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees")
result.show()
// +--------------+
// |average_salary|
// +--------------+
// | 3750.0|
// +--------------+

Type-Safe User-Defined Aggregate Functions


User-defined aggregations for strongly typed Datasets revolve around the Aggregator abstract
class. For example, a type-safe user-defined average can look like:

import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.SparkSession

case class Employee(name: String, salary: Long)


case class Average(var sum: Long, var count: Long)

object MyAverage extends Aggregator[Employee, Average, Double] {
// A zero value for this aggregation. Should satisfy the property that any b + zero = b
def zero: Average = Average(0L, 0L)
// Combine two values to produce a new value. For performance, the function may modify `buffer`
// and return it instead of constructing a new object
def reduce(buffer: Average, employee: Employee): Average = {
buffer.sum += employee.salary
buffer.count += 1
buffer
}
// Merge two intermediate values
def merge(b1: Average, b2: Average): Average = {
b1.sum += b2.sum
b1.count += b2.count
b1
}
// Transform the output of the reduction
def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
// Specifies the Encoder for the intermediate value type
def bufferEncoder: Encoder[Average] = Encoders.product
// Specifies the Encoder for the final output value type
def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

val ds = spark.read.json("examples/src/main/resources/employees.json").as[Employee]
ds.show()
// +-------+------+
// | name|salary|
// +-------+------+
// |Michael| 3000|
// | Andy| 4500|
// | Justin| 3500|
// | Berta| 4000|
// +-------+------+

// Convert the function to a `TypedColumn` and give it a name
val averageSalary = MyAverage.toColumn.name("average_salary")
val result = ds.select(averageSalary)
result.show()
// +--------------+
// |average_salary|
// +--------------+
// | 3750.0|
// +--------------+
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/UserDefinedTypedAggregation.scala"
in the Spark repo.

Data Sources
Spark SQL supports operating on a variety of data sources through the DataFrame interface. A
DataFrame can be operated on using relational transformations and can also be used to create a
temporary view. Registering a DataFrame as a temporary view allows you to run SQL queries
over its data. This section describes the general methods for loading and saving data using the
Spark Data Sources and then goes into specific options that are available for the built-in data
sources.

Generic Load/Save Functions


In the simplest form, the default data source (parquet unless otherwise configured
by spark.sql.sources.default) will be used for all operations.
val usersDF = spark.read.load("examples/src/main/resources/users.parquet")
usersDF.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the
Spark repo.
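The default source can also be changed at runtime through the session configuration; a brief sketch (setting it to json is purely illustrative):
// Change the default data source used by load/save when no format is given
spark.conf.set("spark.sql.sources.default", "json")
val peopleDF = spark.read.load("examples/src/main/resources/people.json") // now interpreted as JSON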

Manually Specifying Options


You can also manually specify the data source that will be used along with any extra options that
you would like to pass to the data source. Data sources are specified by their fully qualified name
(i.e., org.apache.spark.sql.parquet), but for built-in sources you can also use their short names
(json, parquet, jdbc, orc, libsvm, csv, text). DataFrames loaded from any data source type can be
converted into other types using this syntax.
val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the
Spark repo.
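Extra options are passed with option (or options); for example, a CSV source might be read as in the following sketch, where the file path and option values are assumptions for illustration:
// Read a CSV file, describing its delimiter and header row and asking for type inference
val csvDF = spark.read.format("csv")
  .option("sep", ";")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("examples/src/main/resources/people.csv")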

Run SQL on files directly


Instead of using the read API to load a file into a DataFrame and query it, you can also query that file directly with SQL.
val sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the
Spark repo.

Save Modes
Save operations can optionally take a SaveMode, which specifies how to handle existing data if
present. It is important to realize that these save modes do not utilize any locking and are not
atomic. Additionally, when performing an Overwrite, the data will be deleted before writing out
the new data.

Each mode is listed below as its Scala/Java constant, followed by the equivalent string usable from any language, and its meaning.

SaveMode.ErrorIfExists (default) / "error" (default): When saving a DataFrame to a data source, if data already exists, an exception is expected to be thrown.

SaveMode.Append / "append": When saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.

SaveMode.Overwrite / "overwrite": Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame.

SaveMode.Ignore / "ignore": Ignore mode means that when saving a DataFrame to a data source, if data already exists, the save operation is expected to not save the contents of the DataFrame and to not change the existing data. This is similar to a CREATE TABLE IF NOT EXISTS in SQL.
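In code, the mode is set on the DataFrameWriter before saving; either the SaveMode constant or its string form can be used, as in this brief sketch (reusing usersDF from the generic load/save example; the output path is illustrative):
import org.apache.spark.sql.SaveMode

// Using the Scala/Java constant
usersDF.write.mode(SaveMode.Overwrite).save("namesAndFavColors.parquet")
// Equivalent using the language-neutral string form
usersDF.write.mode("overwrite").save("namesAndFavColors.parquet")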

Saving to Persistent Tables


DataFrames can also be saved as persistent tables into the Hive metastore using the saveAsTable command. Notice that an existing Hive deployment is not necessary to use this
feature. Spark will create a default local Hive metastore (using Derby) for you. Unlike
the createOrReplaceTempView command, saveAsTable will materialize the contents of the
DataFrame and create a pointer to the data in the Hive metastore. Persistent tables will still exist
even after your Spark program has restarted, as long as you maintain your connection to the
same metastore. A DataFrame for a persistent table can be created by calling the table method on
a SparkSession with the name of the table.
For file-based data sources, e.g. text, parquet, json, etc., you can specify a custom table path via the path option, e.g. df.write.option("path", "/some/path").saveAsTable("t"). When the table is
dropped, the custom table path will not be removed and the table data is still there. If no custom
table path is specified, Spark will write data to a default table path under the warehouse
directory. When the table is dropped, the default table path will be removed too.
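A short sketch of both variants described above (df stands for any DataFrame; the table names and path are illustrative):
// Managed table: data is written under the warehouse directory and removed when the table is dropped
df.write.saveAsTable("people_tbl")

// External table: data stays at the custom path even after the table is dropped
df.write.option("path", "/some/path").saveAsTable("people_external")

// A persistent table can be read back as a DataFrame with the table method
val peopleTable = spark.table("people_tbl")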
Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the
Hive metastore. This brings several benefits:

 Since the metastore can return only necessary partitions for a query, discovering all the
partitions on the first query to the table is no longer needed.
 Hive DDLs such as ALTER TABLE PARTITION ... SET LOCATION are now available for
tables created with the Datasource API.

Note that partition information is not gathered by default when creating external datasource
tables (those with a path option). To sync the partition information in the metastore, you can
invoke MSCK REPAIR TABLE.
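The repair command can be issued through the sql method, for example (the table name is illustrative):
// Register the partitions found at the table's path with the Hive metastore
spark.sql("MSCK REPAIR TABLE people_external")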
