Scala and Spark are
Ideal for Big Data
John Nestor
47 Degrees
Seattle Unstructured Data Science Pop-Up
October 7, 2015
www.47deg.com
Why Scala?
• Strong typing
• Concise elegant syntax
• Runs on JVM (Java Virtual Machine)
• Supports both object-oriented and functional programming
• Scales from small simple programs to large parallel distributed
systems
• Easy to cleanly extend with new libraries and DSLs
• Ideal for parallel and distributed systems
Scala: Strong Typing and Concise Syntax
• Strong typing like Java.
• Compile time checks
• Better modularity via strongly typed interfaces
• Easier maintenance: types make code easier to
understand
• Concise syntax like Python.
• Type inference. Compiler infers most types that had to be
explicit in Java.
• Powerful syntax that avoids much of the boilerplate of Java
code (see next slide).
• Best of both worlds: safety of strong typing with conciseness
(like Python).
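A minimal sketch of the type-inference point above, using only the Scala standard library. The object and value names are made up for illustration.

```scala
object TypeInferenceSketch {
  // No type annotations: the compiler infers Int, String, and List[Int].
  val count = 3
  val label = "spark"
  val squares = List(1, 2, 3).map(n => n * n)

  // Inferred types are still checked at compile time.
  // Uncommenting the next line is a compile error, not a runtime failure:
  // val bad: Int = label

  // Explicit annotations remain available where they help the reader.
  def describe(n: Int): String = s"$label has $n items"
}
```

Here `squares` is a `List[Int]` even though no type was written, which is the "concise like Python, checked like Java" combination the slide describes.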
Scala Case Class
• Java version
class User {
  private String name;
  private int age;
  public User(String name, int age) {
    this.name = name; this.age = age;
  }
  public int getAge() { return age; }
  public void setAge(int age) { this.age = age; }
}
User joe = new User("Joe", 30);
• Scala version
case class User(name:String, var age:Int)
val joe = User("Joe", 30)
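As a rough illustration of what the one-line case class provides beyond the Java version, the sketch below exercises the generated setter, structural equality, `copy`, and `toString`. The wrapper object and the extra values (`twin`, `older`) are hypothetical names added for the example.

```scala
object CaseClassSketch {
  case class User(name: String, var age: Int)

  val joe = User("Joe", 30)        // no `new` needed
  joe.age = 31                     // the `var` field gets a setter
  val twin = User("Joe", 31)
  val older = joe.copy(age = 40)   // new instance with one field changed

  val sameValue = joe == twin      // equality compares field values: true
  val text = joe.toString          // "User(Joe,31)"
}
```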
Functional Scala
• Anonymous functions.
(a:Int,b:Int) => a+b
• Functions that take and return other functions.
• Rarely need variables or loops
• Immutable collections: Seq[T], Map[K,V], …
• Works well with concurrent or distributed systems
• Natural for functional programming
• Functional collection operations (a small sample)
• map, flatMap, reduce, …
• filter, groupBy, sortBy, take, drop, …
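The bullets above can be sketched with the standard library alone. The sample data and the names `twice` and `addTen` are assumptions made for this example.

```scala
object FunctionalSketch {
  // Anonymous function bound to a name.
  val add = (a: Int, b: Int) => a + b

  // A function that takes a function and returns a new one.
  def twice(f: Int => Int): Int => Int = x => f(f(x))
  val addTen = twice(_ + 5)

  // Immutable collection: every operation returns a new collection.
  val words = Seq("spark", "scala", "akka", "sbt", "slick")

  val total = words.map(_.length).reduce(add)    // sum of word lengths
  val byInitial = words.groupBy(_.head)          // Map from first letter to words
  val shortest = words.sortBy(_.length).take(2)  // two shortest words
  val longWords = words.filter(_.length > 4)     // words longer than 4 letters
}
```

No variables are reassigned and no loops are written. `words` itself is never modified, which is what makes this style safe to share across threads or machines.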
Scala Availability and Support
• Open Source
• Typesafe provides support. Founded by Martin
Odersky, who designed Scala.
• IDEs: IntelliJ IDEA and Eclipse
• Libraries: lots now and more every day
• ScalaNLP - Epic (natural language processing)
• Major Scala users: LinkedIn, Twitter, Goldman Sachs,
Coursera, Angie's List, Whitepages
• Major systems written in Scala: Spark, Kafka
Typesafe Scala Components
• Scala Compiler (includes REPL)
• Scala Standard Libraries
• SBT - Scala Build Tool
• Play - scalable web applications
• Scala.js - compiles Scala to JavaScript
• Akka - for parallel and distributed computation
• Spray - high performance asynchronous TCP/HTTP library
• Spark - Typesafe also supports Spark
• Slick - for SQL database access
• ConductR - Scala deployment/devops tool
• Reactive Monitoring (Beta)
Why Spark?
• Supports not only batch but also (near) real-time processing
• Fast - keeps data in memory as much as possible
• Often 10X to 100X Hadoop speed
• A clean easy-to-use API
• A richer set of functional operations than just map and
reduce
• A foundation for a wide set of integrated data
applications
• Can recover from failures - recompute or (optional)
replication
• Scales to very large data sets and reduces processing time
Spark RDDs
• RDD[T] - Resilient Distributed Dataset
• typed (must be serializable)
• immutable
• ordered
• can be processed in parallel
• lazy evaluation - permits more global optimizations
• Rich set of functional operations (a small sample)
• map, flatMap, reduce, …
• filter, groupBy, sortBy, take, drop, …
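The lazy-evaluation bullet can be illustrated by analogy with a plain Scala `Iterator`. This is not the Spark API, only a standard-library sketch of the same idea: chained steps describe a pipeline, and work happens only when a result is requested. The counter and all names are assumptions made for the example.

```scala
object LazyPipelineSketch {
  var evaluated = 0  // counts how many elements the map step actually touches

  // "Transformations": nothing runs yet, the iterator only records the steps.
  val pipeline = (1 to 1000000).iterator
    .map { n => evaluated += 1; n * n }
    .filter(_ % 2 == 0)
    .take(3)

  val touchedBefore = evaluated   // no element has been computed so far

  // "Action": asking for a result pulls elements through the pipeline.
  val result = pipeline.toList    // first three even squares

  val touchedAfter = evaluated    // far fewer than 1,000,000 elements
}
```

Because the whole chain is known before anything runs, only the handful of elements needed for `take(3)` are squared. Spark applies the same principle to RDD transformations and actions, at cluster scale.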
Spark Components
• Spark Core
• Scalable multi-node cluster
• Failure detection and recovery
• RDDs and functional operations
• MLlib - for machine learning
• linear regression, SVMs, clustering, collaborative
filtering, dimension reduction
• more on the way!
• GraphX - for graph computation
• Streaming - for near real-time
• DataFrames - for SQL and JSON
Spark Availability and Support
• Open Source - top level Apache project
• Over 750 contributors from over 200 organizations
• Can process multiple petabytes on clusters of over
8000 nodes
• Databricks: Matei Zaharia, who wrote the original Spark,
is a founder and CTO
• Packages (more every day)
• Zeppelin - Scala notebooks
• Cassandra, Kafka connectors
Clusters and Scalability
• Scala Akka clusters (process distribution, micro services)
• message passing
• remote Actors
• Spark clusters (data distribution)
• local
• Standalone (optionally with ZooKeeper)
• Apache Mesos
• Hadoop YARN
• All of the above can run on Amazon and Google clouds
Why Scala for Spark?
• Why not Python, R, or Java for Spark?
• Spark is written in Scala
• Scala source code is important Spark documentation
• Spark is best extended in Scala
• The primary API for Spark is Scala
• The functional features of Scala and Spark are a
natural fit and easiest to use in Scala
• If you want to build scalable, high-performance
production code based on Spark, R by itself is too
specialized, Python is too slow, and Java is tedious to
write and maintain
Demo
Seattle Resources
• Seattle Meetups
• Scala at the Sea Meetup
http://www.meetup.com/Seattle-Scala-User-Group/
• Seattle Spark Meetup
http://www.meetup.com/Seattle-Spark-Meetup/
• Seattle Training: Spark and Typesafe Scala Classes
http://www.47deg.com/events#training
• UW Scala Professional Certificate Program
http://www.pce.uw.edu/certificates/scala-functional-reactive-programming.html