Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
About the speaker
• Software Engineer at Databricks
• Previously interned at Facebook, LinkedIn, etc.
• Competitive programmer, red on TopCoder, 13th at ACM ICPC finals

About Databricks
• Company founded by the creators of Apache Spark
• Remains the largest contributor to Spark and builds a platform that makes working with Spark easy
• Raised over 100M USD in funding
Big Data - why you should care
• Data grows faster than computing power
Some big data use cases
• Log mining and processing.
• Recommendation systems.
• Palantir’s solution for small businesses.
How it all started
• In 2004 Google published the MapReduce paper.
• In 2006 Hadoop was started, soon adopted by Yahoo.
MapReduce
(diagram slides)
Map
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
Reduce
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
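To see what these two Hadoop classes compute, here is a minimal pure-Python sketch of the same word-count semantics. The function names are illustrative, not a real API, and the shuffle step that moves pairs between machines is elided:

```python
from collections import defaultdict

def map_phase(text):
    # Like TokenizerMapper: emit a (word, 1) pair for every token.
    for word in text.split():
        yield (word, 1)

def reduce_phase(pairs):
    # Like IntSumReducer: group the pairs by key and sum the ones.
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

counts = reduce_phase(map_phase("to be or not to be"))
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The verbosity problem discussed later becomes obvious here: the Hadoop version needs two classes and a job driver for what is conceptually a six-line computation.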
Recommender systems at LinkedIn
• Pipeline of nearly 80 individual jobs.
• Various data formats: JSON, binary JSON, Avro, etc.
• Entire pipeline took around 7 hours.
• LinkedIn used in-house solutions (e.g. Azkaban for scheduling, its own HDFS).
Problems
• Interactively checking data was inconvenient.
• Slow: not even close to realtime.
• Problems working with some formats; as a result, an extra job was required to convert them.
• Some jobs were "one-liners" and could have been avoided.
How it all started
• In 2012 Spark was created as a research project at
Berkeley to address shortcomings of Hadoop
MapReduce.
What’s Apache Spark?
Spark is a framework for doing distributed computations on a cluster.
Large Scale Usage:
• Largest cluster: 8,000 nodes
• Largest single job: 1 petabyte
• Top streaming intake: 1 TB/hour
• 2014 on-disk 100 TB sort record: 23 mins / 207 EC2 nodes
Writing Spark programs - RDD
• Resilient Distributed Dataset.
• Basically a collection of data that is spread across many computers.
• Can be thought of as a list that doesn't allow random access.
• RDDs are built and manipulated through a diverse set of parallel transformations (map, filter, join) and actions (count, collect, save).
• RDDs are automatically rebuilt on machine failure.
Transformations (lazy):
map(), flatMap(), filter(), mapPartitions(), mapPartitionsWithIndex(), sample(), union(), intersection(), distinct(), groupByKey(), reduceByKey(), sortByKey(), join(), cogroup(), cartesian(), pipe(), coalesce(), repartition(), partitionBy(), ...

Actions:
reduce(), collect(), count(), first(), take(), takeSample(), takeOrdered(), saveAsTextFile(), saveAsSequenceFile(), saveAsObjectFile(), countByKey(), foreach(), saveToCassandra(), ...
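The transformation/action split means transformations only describe a computation; nothing runs until an action asks for results. A minimal pure-Python sketch of that laziness, using generators (all names here are illustrative stand-ins, not Spark's API):

```python
trace = []

def source():
    # Stand-in for reading a partition from disk; records each element read.
    for x in [1, 2, 3, 4]:
        trace.append(x)
        yield x

def my_filter(pred, data):
    # "Transformation": returns a new lazy dataset, does no work yet.
    return (x for x in data if pred(x))

def my_map(f, data):
    # Also a lazy "transformation".
    return (f(x) for x in data)

pipeline = my_map(lambda x: x * 10, my_filter(lambda x: x % 2 == 0, source()))
nothing_read_yet = (trace == [])   # building the pipeline touched no data

result = list(pipeline)            # the "action" forces evaluation
print(result)                      # [20, 40]
```

Because the whole chain is declared before anything executes, Spark gets to see the full pipeline and optimize it, which the query-planning slides later rely on.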
Writing Spark programs - RDD
scala> val rdd = sc.parallelize(List(1, 2, 3))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:27

scala> rdd.count()
res1: Long = 3
Writing Spark programs - RDD
scala> rdd.collect()
res8: Array[Int] = Array(1, 2, 3)

scala> rdd.map(x => 2 * x).collect()
res2: Array[Int] = Array(2, 4, 6)

scala> rdd.filter(x => x % 2 == 0).collect()
res3: Array[Int] = Array(2)
Lifecycle of a Spark Program
1) Create some input RDDs from external data or parallelize a collection in your driver program.
2) Lazily transform them to define new RDDs using transformations like filter() or map().
3) Ask Spark to cache() any intermediate RDDs that will need to be reused.
4) Launch actions such as count() and collect() to kick off a parallel computation, which is then optimized and executed by Spark.
Problem #1: Hadoop MR is verbose
text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
Writing Spark programs: ML
# Every record of this DataFrame contains the label and
# features represented by a vector.
df = sqlContext.createDataFrame(data, ["label", "features"])
# Set parameters for the algorithm.
lr = LogisticRegression(maxIter=10)
# Fit the model to the data.
model = lr.fit(df)
# Given a dataset, predict each point's label, and show the results.
model.transform(df).show()
Problem #2: Hadoop MR is slow
• Spark is 10-100x faster than Hadoop MR.
• Hadoop MR uses checkpointing to achieve resiliency; Spark uses lineage.
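The lineage idea is that each dataset remembers how it was derived from its parents, so a lost partition can be recomputed instead of restored from a checkpoint written to disk. A toy pure-Python sketch of the concept (the class and method names are invented for illustration; real Spark tracks lineage per partition across a cluster):

```python
class LineageRDD:
    """Toy dataset that keeps its recipe (lineage) alongside its data."""

    def __init__(self, compute):
        self.compute = compute   # parameterless function: the lineage
        self.cached = None

    def materialize(self):
        if self.cached is None:
            self.cached = self.compute()   # (re)build from lineage
        return self.cached

    def lose_partition(self):
        self.cached = None                 # simulate a machine failure

base = LineageRDD(lambda: list(range(5)))
doubled = LineageRDD(lambda: [x * 2 for x in base.materialize()])

first = doubled.materialize()
doubled.lose_partition()                   # data gone, recipe kept
rebuilt = doubled.materialize()            # recomputed, no checkpoint needed
```

Avoiding checkpoint I/O on the common (failure-free) path is a large part of why Spark is faster than Hadoop MR.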
Other Spark optimizations: DataFrame API

val users = spark.sql("select * from users")
val massUsers = users(users("country") === "NL")
massUsers.count()
massUsers.groupBy("name").avg("age")

^ Expression AST
Other Spark optimizations
• Dataframe operations are executed in Scala even if
you run them in Python/R.
Other Spark optimizations
• Project Tungsten
Other Spark optimizations
• Project Tungsten (simple aggregation)
Other Spark optimizations
• Query optimization (taking advantage of lazy
computation)
Plan Optimization & Execution

joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date >= "2015-01-01")

Logical plan: scan (users) and scan (events) feed a join, and the filter runs on the join's output. This join is expensive: it processes every event, including rows the filter will throw away.

Optimized plan: the filter is pushed below the join, so events are filtered by date right after the scan and only the remaining rows are joined with users.
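Why pushing the filter below the join is safe and profitable can be checked with a small pure-Python model of the two plans (toy data and a naive nested-loop join, invented for illustration):

```python
users = [{"id": i, "name": f"u{i}"} for i in range(4)]
events = [{"uid": i, "date": d}
          for i, d in enumerate(["2014-12-31", "2015-01-01",
                                 "2015-06-01", "2014-01-01"])]

def join(left, right, lkey, rkey):
    # Naive nested-loop equi-join; cost grows with len(left) * len(right).
    return [{**l, **r} for l in left for r in right if l[lkey] == r[rkey]]

# Logical plan: join everything first, filter afterwards.
naive = [row for row in join(users, events, "id", "uid")
         if row["date"] >= "2015-01-01"]

# Optimized plan: filter events first, then join the smaller input.
recent = [e for e in events if e["date"] >= "2015-01-01"]
optimized = join(users, recent, "id", "uid")
```

Both plans return the same rows, but the optimized one joins 4 users against 2 events instead of 4, and the gap widens as the filter gets more selective.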
Other Spark optimizations
In Spark 1.3:
myRdd.toDF() or myDataframe.rdd()
Convert Rows that contain Scala types to Rows that have Catalyst-approved types (e.g. Seq for arrays) and back.
Other Spark optimizations: toDF, rdd
Approach:
• Construct converter functions
• Avoid using map() etc. for operations that will be executed for each row, when possible.
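A sketch of the converter-function idea in pure Python (the schema, types, and helper names are invented for illustration; real Spark builds converters between Scala types and Catalyst's internal representation): inspect the schema once, build one small function per field, and then run rows through those prebuilt functions instead of re-dispatching on types inside a per-row map().

```python
def build_converter(field_types):
    # Decide ONCE, per schema, how each field is converted.
    converters = []
    for t in field_types:
        if t is list:
            # Stand-in for "Catalyst-approved" collection type (Seq).
            converters.append(tuple)
        else:
            converters.append(lambda v: v)   # identity for primitives

    def convert(row):
        # Per-row work is just applying the prebuilt functions: no type
        # checks, no branching, in the hot loop.
        return [c(v) for c, v in zip(converters, row)]

    return convert

convert = build_converter([int, list])
converted = [convert(row) for row in [[1, [1, 2]], [2, [3]]]]
```

The point is moving all schema reasoning out of the per-row path, which is what makes toDF()/rdd() conversions cheap even on large datasets.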
Hands-on Spark: Analyzing Brexit tweets
• Let's do some simple tweet analysis with Spark on Databricks.
• Try Databricks Community Edition at databricks.com/try-databricks
What Spark is used for
• Interactive analysis
• Extract Transform Load
• Machine Learning
• Streaming
Spark Caveats
• collect()-ing large amounts of data OOMs the driver.
• Avoid cartesian products in SQL (use a proper join!).
• Don't overuse cache.
• If you're using S3, use s3a:// rather than s3n://.
• Don't use spot instances for the driver node.
• Data format matters a lot.
Today you've learned
• What Apache Spark is and what it's used for.
• How to write simple programs in Spark: what RDDs and DataFrames are.
• Optimizations in Apache Spark.
Thank you for your attention!
Volodymyr Lyubinets, vlad@databricks.com
07/04/2017
