Type Checking Scala Spark Datasets: Dataset Transforms
John Nestor 47 Degrees
www.47deg.com
Seattle Spark Meetup
September 22, 2016
47deg.com © Copyright 2016 47 Degrees
Outline
• Introduction
• Transforms
• Demos
• Implementation
• Getting the Code
Introduction
Spark Scala APIs
• RDD (pass closures)
  • Functional programming model
  • Types checked at compile time
• DataFrame (pass SQL)
  • SQL programming model (can be optimized)
  • Types checked at run time
• Dataset (pass SQL)
  • Combines best of RDDs and DataFrames
  • Some (not all) types checked at compile time
Run-Time Scala Checking
• Field/column names
  • Names specified as strings
  • Run-time error if no such field
• Field/column types
  • Specified via casting to expected type
  • Run-time error if not of expected type
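The two failure modes above can be illustrated without Spark. The sketch below is only an analogy: `Row`, `getInt`, and the sample data are hypothetical names, not part of Spark or of the talk. It models a row as a map from string field names to untyped values, so a wrong name or a wrong type compiles cleanly and only fails when the lookup runs.

```scala
import scala.util.Try

object RuntimeChecks {
  // Hypothetical stand-in for an untyped row: names are strings, values are Any.
  type Row = Map[String, Any]
  val row: Row = Map("a" -> 3, "b" -> "foo")

  // Looks a field up by name and checks its type only when called.
  def getInt(r: Row, name: String): Int = r(name) match {
    case i: Int => i
    case other  => throw new ClassCastException(s"field $name is not an Int: $other")
  }

  val ok      = Try(getInt(row, "a")) // succeeds
  val badName = Try(getInt(row, "x")) // no such field: fails at run time
  val badType = Try(getInt(row, "b")) // field is a String: fails at run time
}
```

All three lookups compile; the compiler cannot tell the good one from the bad ones, which is the gap the transforms described later aim to close.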
Dataset Example
case class ABC(a: Int, b: String, c: String)
case class CA(c: String, a: Int)

// Assumes a SparkSession named spark is in scope
import spark.implicits._ // enables toDS() and the $"..." column syntax

val abc = ABC(3, "foo", "test")
val abc1 = ABC(5, "xxx", "alpha")
val abc3 = ABC(10, "aaa", "aaa")
val abcs = Seq(abc, abc1, abc3)
val ds = abcs.toDS()

/* Compile-time type checking;
   but must pass a closure, so the query can't be optimized */
val ds1 = ds.map(abc => CA(abc.b, abc.a * 2 + abc.a))

/* Can be query optimized;
   but types and field names are checked only at run time */
val ds2 = ds.select($"b" as "c", ($"a" * 2 + $"a") as "a").as[CA]
Transforms
Goal
• Add strong typing to Scala Spark Datasets
  • Check field names at compile time
  • Check field types at compile time
• Each transform maps one or more Datasets to a new Dataset.
• Dataset rows are compile-time types: Scala case classes
Transform Example
case class ABC(a: Int, b: String, c: String)
case class CA(c: String, a: Int)

val abc = ABC(3, "foo", "test")
val abc1 = ABC(5, "xxx", "alpha")
val abc3 = ABC(10, "aaa", "aaa")
val abcs = Seq(abc, abc1, abc3)
val ds = abcs.toDS()

/* Compile-time type checking;
   and can still do query optimization */
val smap = SqlMap[ABC, CA]
  .act(cols => (cols.b, cols.a * 2 + cols.a))
val ds3 = smap(ds)


Current Transforms
• Filter
• Map
• Sort
• Join (combines 2 Datasets)
• Aggregate (sum, count, max)
Demos
Demo
• Dataset example
  • map
  • select
• Transform examples
  • Map
  • Sort
  • Join
  • Filter
  • Aggregate
Implementation
Scala Macros
• Scala code executed at compile time
• Kinds
  • Black box - single result type specified
  • White box - result type computed (the kind used by these transforms)
Transform Implementation
• case class Person(name: String, age: Int)
  val p = Person("Sam", 30)
• Scala macro converts
  • from: an arbitrary case class type
    • classOf[Person]
  • to: a meta structure that encodes field names and types
    • case class PersonM(name: StringCol, age: IntCol)
      val cols = PersonM(name = StringCol("name"), age = IntCol("age"))
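To make the "from" and "to" sides concrete, here is a hand-written version of what the slide says the macro derives. This is a sketch only: the `StringCol` and `IntCol` definitions are assumed minimal wrappers, the `MetaSketch` object exists just to keep the example self-contained, and the real library builds the mirror type with a macro rather than by hand.

```scala
object MetaSketch {
  // Assumed minimal column wrappers: each records the column's name.
  final case class StringCol(name: String)
  final case class IntCol(name: String)

  // The user's ordinary row type.
  final case class Person(name: String, age: Int)

  // Mirror of Person: same field names, but each field's type says
  // what kind of column it is rather than holding a value.
  final case class PersonM(name: StringCol, age: IntCol)

  val cols: PersonM = PersonM(name = StringCol("name"), age = IntCol("age"))
}
```

With such a mirror, `cols.age` is known to the compiler to be an `IntCol`, and a misspelling such as `cols.agee` is rejected at compile time instead of surfacing as a missing-column error at run time.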
Column Operations
• StrCol("A") === StrCol("B") => BoolCol("A === B")
• IntCol("A") + IntCol("B") => IntCol("A + B")
• IntCol("A").max => IntCol("A.max")
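One way to read these rules is that each typed column wraps the text of the expression it stands for, and each operator returns a new typed column. The sketch below follows that reading; the class bodies, the `expr` field, and the `*` operator (taken from the earlier `cols.a * 2 + cols.a` example) are assumptions, not the library's actual definitions.

```scala
object ColumnOps {
  // Assumed representation: a column is the text of its expression.
  final case class BoolCol(expr: String)

  final case class StrCol(expr: String) {
    // Comparing two string columns yields a boolean column.
    def ===(that: StrCol): BoolCol = BoolCol(s"${expr} === ${that.expr}")
  }

  final case class IntCol(expr: String) {
    def +(that: IntCol): IntCol = IntCol(s"${expr} + ${that.expr}")
    def *(n: Int): IntCol = IntCol(s"${expr} * ${n}")
    def max: IntCol = IntCol(s"${expr}.max")
  }

  val same: BoolCol = StrCol("A") === StrCol("B") // BoolCol("A === B")
  val sum: IntCol   = IntCol("A") + IntCol("B")   // IntCol("A + B")
  val top: IntCol   = IntCol("A").max             // IntCol("A.max")

  // The expression from the transform example, built from a typed column.
  val a = IntCol("a")
  val mapped: IntCol = a * 2 + a                  // IntCol("a * 2 + a")
}
```

Because the operators are ordinary methods, a mismatched expression such as `IntCol("A") + StrCol("B")` does not compile, which is how field types get checked before the job runs.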
White Box Macro Restrictions
• Works fine in SBT and Eclipse
• Not supported in IntelliJ, but can still be used there:
  • Reports type errors
  • Does not show available completions
Getting the Code
Transforms Code
• https://github.com/nestorpersist/dataset-transform
  • Code
  • Documentation
  • Examples
• "com.persist" % "dataset-transforms_2.11" % "0.0.5"
Questions