Datasets and DataFrames: org.apache.spark.sql.SparkSession
One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from
an existing Hive installation. When running SQL from within another programming language, the
results are returned as a Dataset/DataFrame. You can also interact with the SQL interface
using the command-line or over JDBC/ODBC.
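The snippets in this section assume a SparkSession named spark, the entry point to Spark SQL functionality. A minimal sketch of creating one (the application name is arbitrary, and the master is typically supplied by spark-submit or the shell):
import org.apache.spark.sql.SparkSession

// Build (or reuse) the session that the examples below refer to as `spark`
val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .getOrCreate()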
Scala
// This import is needed to use the $-notation
import spark.implicits._
// Create a DataFrame from a JSON file
val df = spark.read.json("examples/src/main/resources/people.json")
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
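The $-notation mentioned in the comment turns a string into a Column expression; a short sketch against the same df (the rows below mirror the people.json data shown later in this section):
// Select the name column and increment the age column by 1 using $-notation
df.select($"name", $"age" + 1).show()
// +-------+---------+
// |   name|(age + 1)|
// +-------+---------+
// |Michael|     null|
// |   Andy|       31|
// | Justin|       20|
// +-------+---------+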
The sql function on a SparkSession enables applications to run SQL queries programmatically
and returns the result as a DataFrame.
Scala
// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")
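With the view registered, a query can then be issued through spark.sql; a minimal sketch over the people data above:
val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+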
Creating Datasets
Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a
specialized Encoder to serialize the objects for processing or transmission over the network.
While both encoders and standard serialization are responsible for turning an object into bytes,
encoders are generated dynamically as code and use a format that allows Spark to perform many
operations like filtering, sorting and hashing without deserializing the bytes back into an object.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface
case class Person(name: String, age: Long)
// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark
repo.
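Encoders are also created implicitly for case classes and for most common Scala types via spark.implicits._, so Datasets can be built directly from in-memory objects. A short sketch reusing the Person case class defined above:
// Encoders are created for case classes
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+

// Encoders for most common types are provided automatically by importing spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)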
For example, a DataFrame can be constructed by applying a schema programmatically to an existing
RDD of text records (the intermediate schema and view steps, omitted from the original snippet,
are filled in here following the standard pattern):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")
// Convert each record to a Row, apply an explicit schema, and register a temporary view
val schema = StructType(Seq(StructField("name", StringType, true), StructField("age", StringType, true)))
val rowRDD = peopleRDD.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1).trim))
spark.createDataFrame(rowRDD, schema).createOrReplaceTempView("people")
val results = spark.sql("SELECT name FROM people")
// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// |   Name: Andy|
// | Name: Justin|
// +-------------+
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark
repo.
Aggregations
The built-in DataFrame functions provide common aggregations such
as count(), countDistinct(), avg(), max(), min(), etc. While those functions are designed for
DataFrames, Spark SQL also has type-safe versions of some of them in Scala and Java that work
with strongly typed Datasets. Moreover, users are not limited to the predefined aggregate
functions and can create their own.
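As a quick illustration, the built-in aggregates live in org.apache.spark.sql.functions; a minimal sketch assuming a DataFrame df with a numeric salary column (such as the employees DataFrame loaded just below):
import org.apache.spark.sql.functions._
// Apply several built-in aggregate functions to the salary column
df.select(count("salary"), countDistinct("salary"), avg("salary"), max("salary"), min("salary")).show()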
val df = spark.read.json("examples/src/main/resources/employees.json")
df.createOrReplaceTempView("employees")
df.show()
// +-------+------+
// |   name|salary|
// +-------+------+
// |Michael|  3000|
// |   Andy|  4500|
// | Justin|  3500|
// |  Berta|  4000|
// +-------+------+
// `myAverage` is a user-defined aggregate; it must be registered with the session before it
// can be referenced from SQL (its definition is not shown here -- see the sketch below)
val result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees")
result.show()
// +--------------+
// |average_salary|
// +--------------+
// |        3750.0|
// +--------------+
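Since the definition of myAverage is omitted from this excerpt, here is a minimal sketch of one way to define and register such an aggregate on Spark 3.x, using an Aggregator over the raw salary values together with functions.udaf; treat the details as illustrative rather than as the exact code behind the output above:
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders, functions}

// Mutable buffer holding the running sum and count
case class Average(var sum: Long, var count: Long)

object MyAverage extends Aggregator[Long, Average, Double] {
  // A zero value for this aggregation; should satisfy: anything + zero = anything
  def zero: Average = Average(0L, 0L)
  // Fold a new salary value into the running buffer
  def reduce(buffer: Average, data: Long): Average = {
    buffer.sum += data
    buffer.count += 1
    buffer
  }
  // Merge two intermediate buffers
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }
  // Transform the final buffer into the output value
  def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
  // Encoders for the intermediate and output types
  def bufferEncoder: Encoder[Average] = Encoders.product
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Register the aggregator so SQL can call it as myAverage(...)
spark.udf.register("myAverage", functions.udaf(MyAverage))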
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.SparkSession
case class Employee(name: String, salary: Long)

val ds = spark.read.json("examples/src/main/resources/employees.json").as[Employee]
ds.show()
// +-------+------+
// |   name|salary|
// +-------+------+
// |Michael|  3000|
// |   Andy|  4500|
// | Justin|  3500|
// |  Berta|  4000|
// +-------+------+
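For the strongly typed Dataset, an Aggregator can consume whole Employee objects and be applied as a TypedColumn rather than through SQL. A sketch along the same lines, reusing the Average buffer class from the earlier sketch (MyTypedAverage is a name chosen here to avoid clashing with the registered function above):
object MyTypedAverage extends Aggregator[Employee, Average, Double] {
  def zero: Average = Average(0L, 0L)
  // Fold an Employee into the running buffer using its salary field
  def reduce(buffer: Average, employee: Employee): Average = {
    buffer.sum += employee.salary
    buffer.count += 1
    buffer
  }
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }
  def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
  def bufferEncoder: Encoder[Average] = Encoders.product
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Convert the aggregator to a TypedColumn and give it a name
val averageSalary = MyTypedAverage.toColumn.name("average_salary")
ds.select(averageSalary).show()
// +--------------+
// |average_salary|
// +--------------+
// |        3750.0|
// +--------------+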
Data Sources
Spark SQL supports operating on a variety of data sources through the DataFrame interface. A
DataFrame can be operated on using relational transformations and can also be used to create a
temporary view. Registering a DataFrame as a temporary view allows you to run SQL queries
over its data. This section describes the general methods for loading and saving data using the
Spark Data Sources and then goes into specific options that are available for the built-in data
sources.
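A minimal sketch of the generic load/save path (the default source is Parquet unless spark.sql.sources.default is configured otherwise; the file paths mirror the examples shipped with Spark):
// Load using the default format (parquet) and save a projection of the columns
val usersDF = spark.read.load("examples/src/main/resources/users.parquet")
usersDF.select("name", "favorite_color").write.save("namesAndFavColors.parquet")

// Manually specify a source format
val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")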
Save Modes
Save operations can optionally take a SaveMode, which specifies how to handle existing data if
present. It is important to realize that these save modes do not use any locking and are not
atomic. Additionally, when performing an Overwrite, the existing data will be deleted before the
new data is written out.
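A short sketch of specifying a save mode, reusing the hypothetical usersDF from the sketch above (SaveMode.ErrorIfExists is the default; string forms such as "append", "overwrite" and "ignore" are equivalent to the enum values):
import org.apache.spark.sql.SaveMode

// Append to the existing data instead of failing when the target already exists
usersDF.select("name", "favorite_color").write.mode(SaveMode.Append).save("namesAndFavColors.parquet")
// The equivalent using the string form
usersDF.select("name", "favorite_color").write.mode("append").save("namesAndFavColors.parquet")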
Since the metastore can return only necessary partitions for a query, discovering all the
partitions on the first query to the table is no longer needed.
Hive DDLs such as ALTER TABLE PARTITION ... SET LOCATION are now available for
tables created with the Datasource API.
Note that partition information is not gathered by default when creating external datasource
tables (those with a path option). To sync the partition information in the metastore, you can
invoke MSCK REPAIR TABLE.
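A minimal sketch of issuing these statements through spark.sql (the table name partitioned_table and the partition column are hypothetical):
// Point an existing partition of a Datasource table at a new location
spark.sql("ALTER TABLE partitioned_table PARTITION (year=2024) SET LOCATION '/data/new_location'")
// Sync partition information discovered on the filesystem into the metastore
spark.sql("MSCK REPAIR TABLE partitioned_table")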