Datasets and DataFrames: org.apache.spark.sql.SparkSession

This document discusses how to use Spark SQL to execute SQL queries and interact with Datasets and DataFrames. It covers creating DataFrames from different data sources, running SQL queries programmatically, and interoperating between RDDs and DataFrames/Datasets.

SQL

One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from
an existing Hive installation. When running SQL from within another programming language the
results will be returned as a Dataset/DataFrame. You can also interact with the SQL interface
using the command-line or over JDBC/ODBC.

Datasets and DataFrames


A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that
provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the
benefits of Spark SQL’s optimized execution engine. A Dataset can be constructed from JVM
objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The
Dataset API is available in Scala and Java. Python does not have support for the Dataset API, but due to Python's dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally: row.columnName). The case for R is similar.
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table
in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
DataFrames can be constructed from a wide array of sources such as: structured data files, tables
in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala,
Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the
Scala API, DataFrame is simply a type alias of Dataset[Row], while in the Java API users need to use Dataset<Row> to represent a DataFrame.
Throughout this document, we will often refer to Scala/Java Datasets of Rows as DataFrames.
Scala:
The entry point into all functionality in Spark is the SparkSession class. To create a
basic SparkSession, just use SparkSession.builder():
import org.apache.spark.sql.SparkSession

val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
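With a SparkSession and its implicits in scope, the typed manipulation of JVM objects described in the previous section can be sketched as follows. This is a minimal illustration, not part of the official example set; the Person case class and the sample values are assumptions:
// Define a case class so the Dataset gets a strong type (illustrative)
case class Person(name: String, age: Long)

// Build a Dataset from JVM objects and apply typed transformations
val people = Seq(Person("Ann", 28), Person("Bob", 35)).toDS()
people.filter(p => p.age > 30)      // lambda over the typed object
  .map(p => p.name.toUpperCase)     // typed map, yields Dataset[String]
  .show()
// +-----+
// |value|
// +-----+
// |  BOB|
// +-----+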
Creating DataFrames
Scala:
With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive
table, or from Spark data sources.
As an example, the following creates a DataFrame based on the content of a JSON file:
val df = spark.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark
repo.

Untyped Dataset Operations (aka DataFrame Operations)


DataFrames provide a domain-specific language for structured data manipulation
in Scala, Java, Python and R.
As mentioned above, in Spark 2.0, DataFrames are just Datasets of Rows in the Scala and Java APIs. These operations are also referred to as “untyped transformations”, in contrast to the “typed transformations” that come with strongly typed Scala/Java Datasets.
Here we include some basic examples of structured data processing using Datasets:

Scala:
// This import is needed to use the $-notation
import spark.implicits._
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// Select only the "name" column
df.select("name").show()
// +-------+
// | name|
// +-------+
// |Michael|
// | Andy|
// | Justin|
// +-------+

// Select everybody, but increment the age by 1
df.select($"name", $"age" + 1).show()
// +-------+---------+
// | name|(age + 1)|
// +-------+---------+
// |Michael| null|
// | Andy| 31|
// | Justin| 20|
// +-------+---------+

// Select people older than 21
df.filter($"age" > 21).show()
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+

// Count people by age
df.groupBy("age").count().show()
// +----+-----+
// | age|count|
// +----+-----+
// | 19| 1|
// |null| 1|
// | 30| 1|
// +----+-----+
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark
repo.
In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the DataFrame Function Reference.
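As a brief illustration (not taken from the official examples), a few of these built-in functions can be applied to the people DataFrame used above; the column names are carried over from that example:
import org.apache.spark.sql.functions._

// String, math and date functions from org.apache.spark.sql.functions
df.select(
  upper($"name").alias("upper_name"),  // string manipulation
  round($"age" / 10).alias("decade"),  // common math operation
  current_date().alias("today")        // date helper
).show()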

Running SQL Queries Programmatically

Scala:

The sql function on a SparkSession enables applications to run SQL queries programmatically
and returns the result as a DataFrame.
// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark
repo.

Global Temporary View


Temporary views in Spark SQL are session-scoped and will disappear if the session that creates them terminates. If you want a temporary view that is shared among all sessions and kept alive until the Spark application terminates, you can create a global temporary view. Global temporary views are tied to a system-preserved database, global_temp, and we must use the qualified name to refer to them, e.g. SELECT * FROM global_temp.view1.
// Register the DataFrame as a global temporary view
df.createGlobalTempView("people")

// Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
// Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark
repo.

Creating Datasets
Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize the objects for processing or transmitting over the network.
While both encoders and standard serialization are responsible for turning an object into bytes,
encoders are code generated dynamically and use a format that allows Spark to perform many
operations like filtering, sorting and hashing without deserializing the bytes back into an object.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface
case class Person(name: String, age: Long)

// Encoders are created for case classes
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+
// Encoders for most common types are automatically provided by importing spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)

// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark
repo.

Interoperating with RDDs


Spark SQL supports two different methods for converting existing RDDs into Datasets. The first
method uses reflection to infer the schema of an RDD that contains specific types of objects.
This reflection based approach leads to more concise code and works well when you already
know the schema while writing your Spark application.
The second method for creating Datasets is through a programmatic interface that allows you to
construct a schema and then apply it to an existing RDD. While this method is more verbose, it
allows you to construct Datasets when the columns and their types are not known until runtime.
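For completeness, here is a minimal sketch of the first, reflection-based approach; it mirrors the schema-inference example in the Spark repo and reuses the Person case class and the people.txt layout from the other examples in this document:
import spark.implicits._

// The case class defines the schema; reflection infers column names and types
case class Person(name: String, age: Long)

// Read the text file, split each line into attributes, map them to Person objects,
// then convert the RDD to a DataFrame
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toLong))
  .toDF()

// The DataFrame can then be registered as a temporary view and queried with SQL
peopleDF.createOrReplaceTempView("people")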
Programmatically Specifying the Schema
When case classes cannot be defined ahead of time (for example, the structure of records is
encoded in a string, or a text dataset will be parsed and fields will be projected differently for
different users), a DataFrame can be created programmatically with three steps.

1. Create an RDD of Rows from the original RDD;
2. Create the schema represented by a StructType matching the structure of Rows in the RDD
created in Step 1.
3. Apply the schema to the RDD of Rows via createDataFrame method provided
by SparkSession.

For example:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
.map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
.map(_.split(","))
.map(attributes => Row(attributes(0), attributes(1).trim))

// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)
// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// | value|
// +-------------+
// |Name: Michael|
// | Name: Andy|
// | Name: Justin|
// +-------------+
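As noted in the comment above, a row's columns can also be accessed by field name rather than by index; a small sketch using the same results DataFrame:
// Access the column by field name instead of positional index
results.map(attributes => "Name: " + attributes.getAs[String]("name")).show()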
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark
repo.

Aggregations
The built-in DataFrame functions provide common aggregations such
as count(), countDistinct(), avg(), max(), min(), etc. While those functions are designed for
DataFrames, Spark SQL also has type-safe versions for some of them in Scala and Java to work
with strongly typed Datasets. Moreover, users are not limited to the predefined aggregate
functions and can create their own.
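As a short illustrative sketch (not from the official examples), a few of these built-in aggregations applied to the people DataFrame from earlier could look like this; the column names are assumptions carried over from that example:
import org.apache.spark.sql.functions._

// Built-in untyped aggregations over the whole DataFrame
df.agg(count("name"), countDistinct("age"), avg("age"), max("age"), min("age")).show()

// The same functions can be combined with groupBy
df.groupBy("age").agg(count("name").alias("people")).show()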

Untyped User-Defined Aggregate Functions


Users have to extend the UserDefinedAggregateFunction abstract class to implement a custom
untyped aggregate function. For example, a user-defined average can look like:
Scala:
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

object MyAverage extends UserDefinedAggregateFunction {
// Data types of input arguments of this aggregate function
def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
// Data types of values in the aggregation buffer
def bufferSchema: StructType = {
StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil)
}
// The data type of the returned value
def dataType: DataType = DoubleType
// Whether this function always returns the same output on the identical input
def deterministic: Boolean = true
// Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
// standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
// the opportunity to update its values. Note that arrays and maps inside the buffer are still
// immutable.
def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = 0L
buffer(1) = 0L
}
// Updates the given aggregation buffer `buffer` with new input data from `input`
def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
if (!input.isNullAt(0)) {
buffer(0) = buffer.getLong(0) + input.getLong(0)
buffer(1) = buffer.getLong(1) + 1
}
}
// Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
}
// Calculates the final result
def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}

// Register the function to access it
spark.udf.register("myAverage", MyAverage)

val df = spark.read.json("examples/src/main/resources/employees.json")
df.createOrReplaceTempView("employees")
df.show()
// +-------+------+
// | name|salary|
// +-------+------+
// |Michael| 3000|
// | Andy| 4500|
// | Justin| 3500|
// | Berta| 4000|
// +-------+------+
val result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees")
result.show()
// +--------------+
// |average_salary|
// +--------------+
// | 3750.0|
// +--------------+

Type-Safe User-Defined Aggregate Functions


User-defined aggregations for strongly typed Datasets revolve around the Aggregator abstract
class. For example, a type-safe user-defined average can look like:

import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.SparkSession

case class Employee(name: String, salary: Long)


case class Average(var sum: Long, var count: Long)

object MyAverage extends Aggregator[Employee, Average, Double] {
// A zero value for this aggregation. Should satisfy the property that any b + zero = b
def zero: Average = Average(0L, 0L)
// Combine two values to produce a new value. For performance, the function may modify `buffer`
// and return it instead of constructing a new object
def reduce(buffer: Average, employee: Employee): Average = {
buffer.sum += employee.salary
buffer.count += 1
buffer
}
// Merge two intermediate values
def merge(b1: Average, b2: Average): Average = {
b1.sum += b2.sum
b1.count += b2.count
b1
}
// Transform the output of the reduction
def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
// Specifies the Encoder for the intermediate value type
def bufferEncoder: Encoder[Average] = Encoders.product
// Specifies the Encoder for the final output value type
def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

val ds = spark.read.json("examples/src/main/resources/employees.json").as[Employee]
ds.show()
// +-------+------+
// | name|salary|
// +-------+------+
// |Michael| 3000|
// | Andy| 4500|
// | Justin| 3500|
// | Berta| 4000|
// +-------+------+

// Convert the function to a `TypedColumn` and give it a name
val averageSalary = MyAverage.toColumn.name("average_salary")
val result = ds.select(averageSalary)
result.show()
// +--------------+
// |average_salary|
// +--------------+
// | 3750.0|
// +--------------+
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/UserDefinedTypedAggregation.scala"
in the Spark repo.

Data Sources
Spark SQL supports operating on a variety of data sources through the DataFrame interface. A
DataFrame can be operated on using relational transformations and can also be used to create a
temporary view. Registering a DataFrame as a temporary view allows you to run SQL queries
over its data. This section describes the general methods for loading and saving data using the
Spark Data Sources and then goes into specific options that are available for the built-in data
sources.

Generic Load/Save Functions


In the simplest form, the default data source (parquet unless otherwise configured
by spark.sql.sources.default) will be used for all operations.
val usersDF = spark.read.load("examples/src/main/resources/users.parquet")
usersDF.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the
Spark repo.
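The default source can also be changed at runtime through the session configuration; a brief sketch (setting it to json is purely illustrative):
// Change the default data source used by load/save when no format is given
spark.conf.set("spark.sql.sources.default", "json")
val peopleDF = spark.read.load("examples/src/main/resources/people.json") // now interpreted as JSON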

Manually Specifying Options


You can also manually specify the data source that will be used along with any extra options that
you would like to pass to the data source. Data sources are specified by their fully qualified name
(i.e., org.apache.spark.sql.parquet), but for built-in sources you can also use their short names
(json, parquet, jdbc, orc, libsvm, csv, text). DataFrames loaded from any data source type can be
converted into other types using this syntax.
val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the
Spark repo.
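Extra options are passed with option (or options); for example, a CSV source might be read as in the following sketch, where the file path and option values are assumptions for illustration:
// Read a CSV file, describing its delimiter and header row and asking for type inference
val csvDF = spark.read.format("csv")
  .option("sep", ";")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("examples/src/main/resources/people.csv")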

Run SQL on files directly


Instead of using the read API to load a file into a DataFrame and query it, you can also query that file directly with SQL.
val sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
Find full example code at
"examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the
Spark repo.

Save Modes
Save operations can optionally take a SaveMode, which specifies how to handle existing data if
present. It is important to realize that these save modes do not utilize any locking and are not
atomic. Additionally, when performing an Overwrite, the data will be deleted before writing out
the new data.

Each mode is listed below as its Scala/Java constant, followed by the equivalent string usable from any language, and its meaning.

SaveMode.ErrorIfExists (default) / "error" (default): When saving a DataFrame to a data source, if data already exists, an exception is expected to be thrown.

SaveMode.Append / "append": When saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.

SaveMode.Overwrite / "overwrite": Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame.

SaveMode.Ignore / "ignore": Ignore mode means that when saving a DataFrame to a data source, if data already exists, the save operation is expected to not save the contents of the DataFrame and to not change the existing data. This is similar to a CREATE TABLE IF NOT EXISTS in SQL.
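In code, the mode is set on the DataFrameWriter before saving; either the SaveMode constant or its string form can be used, as in this brief sketch (reusing usersDF from the generic load/save example; the output path is illustrative):
import org.apache.spark.sql.SaveMode

// Using the Scala/Java constant
usersDF.write.mode(SaveMode.Overwrite).save("namesAndFavColors.parquet")
// Equivalent using the language-neutral string form
usersDF.write.mode("overwrite").save("namesAndFavColors.parquet")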

Saving to Persistent Tables


DataFrames can also be saved as persistent tables into the Hive metastore using the saveAsTable command. Notice that an existing Hive deployment is not necessary to use this
feature. Spark will create a default local Hive metastore (using Derby) for you. Unlike
the createOrReplaceTempView command, saveAsTable will materialize the contents of the
DataFrame and create a pointer to the data in the Hive metastore. Persistent tables will still exist
even after your Spark program has restarted, as long as you maintain your connection to the
same metastore. A DataFrame for a persistent table can be created by calling the table method on
a SparkSession with the name of the table.
For file-based data sources, e.g. text, parquet, json, etc., you can specify a custom table path via the path option, e.g. df.write.option("path", "/some/path").saveAsTable("t"). When the table is
dropped, the custom table path will not be removed and the table data is still there. If no custom
table path is specified, Spark will write data to a default table path under the warehouse
directory. When the table is dropped, the default table path will be removed too.
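A short sketch of both variants described above (df stands for any DataFrame; the table names and path are illustrative):
// Managed table: data is written under the warehouse directory and removed when the table is dropped
df.write.saveAsTable("people_tbl")

// External table: data stays at the custom path even after the table is dropped
df.write.option("path", "/some/path").saveAsTable("people_external")

// A persistent table can be read back as a DataFrame with the table method
val peopleTable = spark.table("people_tbl")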
Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the
Hive metastore. This brings several benefits:

 Since the metastore can return only necessary partitions for a query, discovering all the
partitions on the first query to the table is no longer needed.
 Hive DDLs such as ALTER TABLE PARTITION ... SET LOCATION are now available for
tables created with the Datasource API.

Note that partition information is not gathered by default when creating external datasource
tables (those with a path option). To sync the partition information in the metastore, you can
invoke MSCK REPAIR TABLE.
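The repair command can be issued through the sql method, for example (the table name is illustrative):
// Register the partitions found at the table's path with the Hive metastore
spark.sql("MSCK REPAIR TABLE people_external")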
