Making sense of RDDs, DataFrames, SparkSQL and Datasets APIs
Motivation
An overview of the different Spark APIs for working with structured data.
Timeline of Spark APIs
Spark 1.0 used the RDD API. A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark: an immutable, partitioned collection of elements that can be operated on in parallel.
Spark 1.3 introduced the DataFrames API: a distributed collection of data organized into named columns, a concept also well known from R and Python (Pandas).
Spark 1.6 introduced an experimental Datasets API: an extension of the DataFrame API that provides a type-safe, object-oriented programming interface.
RDD
RDD - Resilient Distributed Dataset
Functional transformations on partitioned collections of opaque objects.
Define a case class representing the schema of our data. Each field represents a column of the table.
case class Person(name: String, age: Int)

defined class Person

Create a parallelized collection (RDD):

val peopleRDD = sc.parallelize(Array(
  Person("Lars", 37),
  Person("Sven", 38),
  Person("Florian", 39),
  Person("Dieter", 37)
))

peopleRDD: org.apache.spark.rdd.RDD[Person] = ParallelCollectionRDD[5885] at parallelize at <console>:95

RDD of type Person:

val rdd = peopleRDD
  .filter(_.age > 37)

rdd: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[5886] at filter at <console>:98

NB: Returns Person objects

rdd
  .collect
  .foreach(println(_))

Person(Sven,38)
Person(Florian,39)
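To emphasize that RDD transformations are opaque functions that Spark cannot optimize, here is a minimal sketch (assuming the peopleRDD defined above) that computes the average age using plain RDD operations:

// Sketch: average age via plain RDD operations on the peopleRDD above.
// Spark only sees opaque closures here; no optimizer is involved.
val (totalAge, count) = peopleRDD
  .map(p => (p.age, 1))
  .reduce((a, b) => (a._1 + b._1, a._2 + b._2))
val avgAge = totalAge.toDouble / count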
DataFrames
Declarative transformations on a partitioned collection of tuples.

val peopleDF = peopleRDD.toDF

peopleDF: org.apache.spark.sql.DataFrame = [name: string, age: int]

peopleDF.show()

+-------+---+
|   name|age|
+-------+---+
|   Lars| 37|
|   Sven| 38|
|Florian| 39|
| Dieter| 37|
+-------+---+

peopleDF.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)

Show only the age column:

peopleDF.select("age").show()

+---+
|age|
+---+
| 37|
| 38|
| 39|
| 37|
+---+

NB: The result set consists of Rows containing Strings and Ints

peopleDF
  .filter("age > 37")
  .collect
  .foreach(row => println(row))

[Sven,38]
[Florian,39]
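Because DataFrame results are untyped Rows, individual fields have to be pulled out by position or by name. A small sketch (assuming the peopleDF above) using Row.getAs:

// Sketch: extracting typed values from untyped Rows by column name.
val labels = peopleDF
  .filter("age > 37")
  .collect
  .map(row => row.getAs[String]("name") + " is " + row.getAs[Int]("age"))
labels.foreach(println)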
DataSets
Create Datasets from RDDs. An implicit conversion is also available.

val peopleDS = peopleRDD.toDS

peopleDS: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]

NB: The result set consists of Person objects
val ds = peopleDS
  .filter(_.age > 37)

ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]

ds.collect
  .foreach(println(_))

Person(Sven,38)
Person(Florian,39)

Analyzed logical plan:

ds.queryExecution.analyzed

res294: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
MapPartitions <function1>, class[name[0]: string, age[0]: int], class[name[0]: string, age[0]: int], [name#11044,age#11045]
+- LogicalRDD [name#11041,age#11042], MapPartitionsRDD[5894] at rddToDatasetHolder at <console>:97

Optimized logical plan:

ds.queryExecution.optimizedPlan

res295: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
MapPartitions <function1>, class[name[0]: string, age[0]: int], class[name[0]: string, age[0]: int], [name#11044,age#11045]
+- LogicalRDD [name#11041,age#11042], MapPartitionsRDD[5894] at rddToDatasetHolder at <console>:97

ds.collect
  .foreach(println(_))

Person(Sven,38)
Person(Florian,39)
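Datasets interoperate with DataFrames: toDF drops down to the untyped view, and as[T] recovers a typed Dataset from a DataFrame. A small sketch of the round trip, assuming the peopleDF and peopleDS defined above (Spark 1.6 APIs):

// Sketch: converting between the typed and untyped views.
// In a Databricks notebook the required encoders/implicits are already in scope;
// in a plain spark-shell you would import sqlContext.implicits._ first.
val untyped    = peopleDS.toDF        // Dataset[Person] -> DataFrame
val typedAgain = peopleDF.as[Person]  // DataFrame -> Dataset[Person]
typedAgain
  .filter(_.age > 37)
  .collect
  .foreach(println(_))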
Spark SQL

// Get the SQL context from the Spark context
// NB: "In Databricks, developers should utilize the shared HiveContext instead of creating one using the constructor.
// In Scala and Python notebooks, the shared context can be accessed as sqlContext. When running a job, you can access the
// shared context by calling SQLContext.getOrCreate(SparkContext.getOrCreate())."
import org.apache.spark.sql.SQLContext
val sqlContext = SQLContext.getOrCreate(SparkContext.getOrCreate())

import org.apache.spark.sql.SQLContext
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.hive.HiveContext@50ec5129

Register the DataFrame for usage via SQL:

peopleDF.registerTempTable("sparkPeopleTbl")

The results of SQL queries are DataFrames and support all the usual RDD operations.

sqlContext.sql("SELECT * FROM sparkPeopleTbl WHERE age > 37")

res298: org.apache.spark.sql.DataFrame = [name: string, age: int]

Print the analyzed execution plan:

sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").queryExecution.analyzed

res299: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [name#11037,age#11038]
+- Filter (age#11038 > 37)
   +- Subquery sparkpeopletbl
      +- LogicalRDD [name#11037,age#11038], MapPartitionsRDD[5887] at rddToDataFrameHolder at <console>:97

The execution plan after the built-in optimizer has run:
sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").queryExecution.optimizedPlan

res300: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [name#11037,age#11038]
+- Filter (age#11038 > 37)
   +- LogicalRDD [name#11037,age#11038], MapPartitionsRDD[5887] at rddToDataFrameHolder at <console>:97

%sql SELECT * FROM sparkPeopleTbl WHERE age > 37

name    age
Sven    38
Florian 39

sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").collect

res301: Array[org.apache.spark.sql.Row] = Array([Sven,38], [Florian,39])

sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").map(row => "NAME: " + row(0)).collect

res302: Array[String] = Array(NAME: Sven, NAME: Florian)

sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").rdd.map(row => "NAME: " + row(0)).collect()

res303: Array[String] = Array(NAME: Sven, NAME: Florian)

sql("SELECT * FROM sparkPeopleTbl WHERE age > 37")

res304: org.apache.spark.sql.DataFrame = [name: string, age: int]
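The same query can also be expressed with the DataFrame DSL instead of an SQL string; both go through the same Catalyst optimizer. A sketch of the equivalent, assuming the peopleDF from above:

// Sketch: the DataFrame DSL equivalent of SELECT * FROM sparkPeopleTbl WHERE age > 37.
peopleDF
  .filter(peopleDF("age") > 37)
  .select("name", "age")
  .collect
  .foreach(println(_))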
Running SQL queries against Parquet files directly
> 
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/_SUCCESS [0.00 MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/_common_metadata [0.00 MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/_metadata [0.03 MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00000-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.92
MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00001-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.87
MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00002-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.84
MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00003-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.92
MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00004-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89
MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00005-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.87
MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00006-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89
MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00007-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89
MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00008-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.89
MiB]
? s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/part-r-00009-5df3eb09-ce03-4ac6-8770-287ce4782749.gz.parquet [53.95
MiB]
> 
0E0157EB3E6F927FE7FA86C3A0762B3B 4E0569072023608FFEB72F454CCF408B VTS 1 2013-01-
13T11:08:00.000+0000
19BF1BB516C4E992EA3FBAEDA73D6262 E4CAC9101BFE631554B57906364761D3 VTS 2 2013-01-
13T10:33:00.000+0000
D57C7392455C38D9404660F7BC63D1B6 EC5837D805127379D72FF6C35279890B VTS 1 2013-01-
13T04:58:00.000+0000
67108EDF8123623806A1DAFE8811EE63 36C9437BD2FF31940BEBED44DDDDDB8A VTS 5 2013-01-
name age
medallion hack_license vendor_id rate_code store_and_fwd_flag pickup_datetime
sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").queryExecution.optimizedPlan
%sql SELECT * FROM sparkPeopleTbl WHERE age > 37
sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").collect
sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").map(row => "NAME: " + row(0)).collect
sql("SELECT * FROM sparkPeopleTbl WHERE age > 37").rdd.map(row => "NAME: " + row(0)).collect()
sql("SELECT * FROM sparkPeopleTbl WHERE age > 37")
ls3("s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/")
%sql SELECT * FROM parquet.`s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/` WHERE trip_time_in_secs <= 60
Showing the first 1000 rows.
13T04:58:00.000+0000
5F78CC6D4ECD0541B765FECE17075B6F 39703F5449DADC0AEFFAFEFB1A6A7E64 VTS 1 2013-01-
13T08:56:00.000+0000
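The same Parquet data can also be queried programmatically through the DataFrame reader; a sketch assuming the same S3 path and the trip_time_in_secs column:

// Sketch: reading the Parquet files into a DataFrame and applying the same predicate.
val trips = sqlContext.read.parquet("s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/")
trips
  .where("trip_time_in_secs <= 60")
  .show()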
Conclusions
RDDs
RDDs remain the core abstraction for native distributed collections. But because they offer no built-in optimization, DataFrames and Datasets should be preferred.
DataFrames
DataFrames and Spark SQL are very flexible and bring built-in optimization also to dynamic languages like Python and R. Besides this, they allow combining declarative and functional ways of working with structured data. This is regarded as the most stable and flexible API.
Datasets
Datasets unify the best of both worlds: the type safety of RDDs and the built-in optimization available to DataFrames. The API is still experimental and can currently be used from Scala and Java. Datasets allow even further optimizations (memory compaction plus faster serialization using encoders). They are not yet mature, but they are expected to quickly become a common way to work with structured data, and Spark plans to converge the APIs even further.
Here are some benchmarks of Datasets:
[Chart: Transformation of Data Types in Spark]