SlideShare a Scribd company logo
Spark Performance
Past, Future, and Present
Kay Ousterhout
Joint work with Christopher Canel, Ryan Rasti,
Sylvia Ratnasamy, Scott Shenker, Byung-Gon
Chun
About Me
Apache Spark PMC Member
Recent PhD graduate from UC Berkeley
Thesis work on performance of large-scale data analytics
Co-founder at Kelda (kelda.io)
How can I make
this faster?
How can I make
this faster?
Should I use a
different cloud
instance type?
Should I trade
more CPU for less
I/O by using
better
compression?
How can I make
this faster?
???
How can I make
this faster?
???
How can I make
this faster?
???
Major performance improvements
possible via tuning, configuration
…if only you knew which knobs to turn
Past: Performance instrumentation in Spark
Future: New architecture that provides performance clarity
Present: Improving Spark’s performance instrumentation
This talk
spark.textFile(“hdfs://…”) 
.flatMap(lambda l: l.split(“ “)) 
.map(lambda w: (w, 1)) 
.reduceByKey(lambda a, b: a + b) 
.saveAsTextFile(“hdfs://…”)
Example Spark Job
Split input file into words
and emit count of 1 for each
Word Count:
Example Spark Job
Split input file into words
and emit count of 1 for each
Word Count:
For each word, combine the
counts, and save the output
spark.textFile(“hdfs://…”) 
.flatMap(lambda l: l.split(“ “)) 
.map(lambda w: (w, 1)) 
.reduceByKey(lambda a, b: a + b) 
.saveAsTextFile(“hdfs://…”)
spark.textFile(“hdfs://…”)
.flatMap(lambda l: l.split(“ “))
.map(lambda w: (w, 1))
Map Stage: Split input file into words
and emit count of 1 for each
Reduce Stage: For each word, combine
the counts, and save the output
Spark Word Count Job:
.reduceByKey(lambda a, b: a + b)
.saveAsTextFile(“hdfs://…”)
…	
Worker 1
Worker n
Tasks
…	
Worker 1
Worker n
Spark Word Count Job:
Reduce Stage: For each word, combine
the counts, and save the output
.reduceByKey(lambda a, b: a + b)
.saveAsTextFile(“hdfs://…”)
…	
Worker 1
Worker n
compute
network
time
(1) Request a few
shuffle blocks
disk
(5) Continue fetching
remote data
: time to handle one shuffle block
(2) Process local
data
What happens in a reduce task?
(4) Process data fetched remotely
(3) Write output to disk
compute
network
time
disk
: time to handle one shuffle block
What happens in a reduce task?
Bottlenecked on
network and disk
Bottlenecked on network
Bottlenecked on
CPU
compute
network
time
disk
: time to handle one shuffle block
What happens in a reduce task?
Bottlenecked on
network and disk
Bottlenecked on network
Bottlenecked on
CPU
compute
network
time
disk
What instrumentation exists today?
Instrumentation centered on single, main task thread
: shuffle read blocked time
: executor
computing time (!)
actual
What instrumentation exists today?
timeline version
What instrumentation exists today?
Instrumentation centered on single, main task thread
Shuffle read and shuffle write blocked time
Input read and output write blocked time not instrumented
Possible to add!
compute
disk
Instrumenting read and write time
Process shuffle block This is a lie
compute
Reality:
Spark processes and then writes one record at a time
Most writes get buffered Occasionally the buffer is flushed
compute
Spark processes and then writes one record at a time
Most writes get buffered Occasionally the buffer is flushed
Challenges with reality:
Record-level instrumentation is too high overhead
Spark doesn’t know when buffers get flushed
(HDFS does!)
Tasks use fine-grained pipelining to parallelize
resources
Instrumented times are blocked times only
(task is doing other things in background)
Opportunities to improve instrumentation
Past: Performance instrumentation in Spark
Future: New architecture that provides performance clarity
Present: Improving Spark’s performance instrumentation
This talk
Task 1
Task 2
Task 5
Task 3
Task 4
Task 7
Task 6
Task 8
time
4 concurrent tasks
on a worker
Task 1
Task 2
Task 5
Task 3
Task 4
Task 7
Task 6
Task 8
time
Concurrent tasks may
contend for
the same resource
(e.g., network)
What’s the bottleneck?
Task 1
Task 2
Task 5
Task 3
Task 4
Task 7
Task 6
Task 8
Time t: different
tasks may be
bottlenecked on
different resources
Single task may be
bottlenecked on
different resources
at different times
Task 1
Task 2
Task 5
Task 3
Task 4
Task 7
Task 6
Task 8
How much faster
would my job be with
2x disk throughput?
How would runtimes for these
disk writes change?
How would that change timing of
(and contention for) other resources?
Today: tasks use pipelining to parallelize
multiple resources
Proposal: build systems using monotasks
that each consume just one resource
Monotasks: Each task uses one resource
Network
monotask Disk monotask
Compute
monotask
Today’s task:
Monotasks don’t start until all dependencies complete
Task 1
Network read
CPU
Disk write
Dedicated schedulers control contention
Network
scheduler
CPU scheduler:
1 monotask / core
Disk drive scheduler:
1 monotask / disk
Monotasks for one of today’s tasks:
Spark today:
Tasks have non-
uniform resource
use
4 multi-resource
tasks run
concurrently
Single-resource
monotasks
scheduled by
per-resource
schedulers
Monotasks:
API-compatible, performance
parity with Spark
Performance telemetry trivial!
How much faster would job run if...
4x more machines
Input stored in-memory
No disk read
No CPU time to deserialize
Flash drives instead of disks
Faster shuffle read/write time 10x improvement predicted
with at most 23% error
�
���
���
���
���
����
����
����
����
��� ����� �������� �� ����� �� �����
����������
��������
�������� ������� �� ��������� ���� ���� ������
��������� ��� ������� ��� ��������� ���� ���� ������
������ ��� ������� ��� ��������� ���� ���� ������
Monotasks: Break jobs into single-resource tasks
Using single-resource monotasks provides clarity
without sacrificing performance
Massive change to Spark internals (>20K lines of code)
Past: Performance instrumentation in Spark
Future: New architecture that provides performance clarity
Present: Improving Spark’s performance instrumentation
This talk
Spark today:
Task resource use
changes at fine
time granularity
4 multi-resource
tasks run
concurrently
Monotasks:
Single-resource tasks lead
to complete, trivial
performance metrics
Can we get monotask-like per-resource metrics for each
task today?
compute
network
time
disk
Can we get per-task resource use?
Measure machine
resource utilization?
Task 1
Task 2
Task 5
Task 3
Task 4
Task 7
Task 6
Task 8
time
4 concurrent tasks
on a worker
Task 1
Task 2
Task 5
Task 3
Task 4
Task 7
Task 6
Task 8
time
Concurrent tasks may
contend for
the same resource
(e.g., network)
Contention controlled by
lower layers (e.g.,
operating system)
compute
network
time
disk
Can we get per-task resource use?
Machine utilization
includes other tasks
Can’t directly measure
per-task I/O:
in background, mixed
with other tasks
compute
network
time
disk
Can we get per-task resource use?
Existing metrics: total
data read
How long did it take?
Use machine utilization
metrics to get
bandwidth!
Existing per-task I/O counters (e.g., shuffle bytes read)
+
Machine-level utilization (and bandwidth) metrics
=
Complete metrics about time spent using each resource
Goal: provide performance clarity
Only way to improve performance is to know what to speed up
Why do we care about performance clarity?
Typical performance eval:
group of experts Practical performance: 1 novice
Goal: provide performance clarity
Only way to improve performance is to know what to speed up
Some instrumentation exists already
Focuses on blocked times in the main task thread
Many opportunities to improve instrumentation
(1) Add read/write instrumentation to lower level (e.g., HDFS)
(2) Add machine-level utilization info
(3) Calculate per-resource time
More details at kayousterhout.org
Ad

More Related Content

What's hot (20)

Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
Jen Aman
 
A Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen FanA Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen Fan
Databricks
 
Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark
Databricks
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Databricks
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
Evan Chan
 
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan ZhuBuilding a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Databricks
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Databricks
 
Robust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache SparkRobust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache Spark
Databricks
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
Databricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
Databricks
 
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUsScaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Jim Dowling
 
How To Connect Spark To Your Own Datasource
How To Connect Spark To Your Own DatasourceHow To Connect Spark To Your Own Datasource
How To Connect Spark To Your Own Datasource
MongoDB
 
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David OjikaSpeeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Databricks
 
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma TangOptimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Databricks
 
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark EcosystemSpark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Daniel Rodriguez
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
Spark Summit
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Databricks
 
Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015
Databricks
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
Rafal Kwasny
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easier
Databricks
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
Jen Aman
 
A Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen FanA Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen Fan
Databricks
 
Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark Parallelize R Code Using Apache Spark
Parallelize R Code Using Apache Spark
Databricks
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Databricks
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
Evan Chan
 
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan ZhuBuilding a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Databricks
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Databricks
 
Robust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache SparkRobust and Scalable ETL over Cloud Storage with Apache Spark
Robust and Scalable ETL over Cloud Storage with Apache Spark
Databricks
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
Databricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
Databricks
 
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUsScaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Jim Dowling
 
How To Connect Spark To Your Own Datasource
How To Connect Spark To Your Own DatasourceHow To Connect Spark To Your Own Datasource
How To Connect Spark To Your Own Datasource
MongoDB
 
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David OjikaSpeeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika
Databricks
 
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma TangOptimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Databricks
 
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark EcosystemSpark Summit 2016: Connecting Python to the Spark Ecosystem
Spark Summit 2016: Connecting Python to the Spark Ecosystem
Daniel Rodriguez
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
Spark Summit
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Databricks
 
Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015Spark Community Update - Spark Summit San Francisco 2015
Spark Community Update - Spark Summit San Francisco 2015
Databricks
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
Rafal Kwasny
 
Apache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easierApache Spark Performance is too hard. Let's make it easier
Apache Spark Performance is too hard. Let's make it easier
Databricks
 

Similar to Apache Spark Performance: Past, Future and Present (20)

Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
Jen Aman
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Spark Summit
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
Reynold Xin
 
Google Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 DayGoogle Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 Day
programmermag
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
DataStax Academy
 
How Many Slaves (Ukoug)
How Many Slaves (Ukoug)How Many Slaves (Ukoug)
How Many Slaves (Ukoug)
Doug Burns
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
Antonios Katsarakis
 
Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5
Haoyuan Li
 
Making Apache Kafka Even Faster And More Scalable
Making Apache Kafka Even Faster And More ScalableMaking Apache Kafka Even Faster And More Scalable
Making Apache Kafka Even Faster And More Scalable
PaulBrebner2
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
Spark Summit
 
Seastar at Linux Foundation Collaboration Summit
Seastar at Linux Foundation Collaboration SummitSeastar at Linux Foundation Collaboration Summit
Seastar at Linux Foundation Collaboration Summit
Don Marti
 
High Performance With Java
High Performance With JavaHigh Performance With Java
High Performance With Java
malduarte
 
Leveraging Databricks for Spark pipelines
Leveraging Databricks for Spark pipelinesLeveraging Databricks for Spark pipelines
Leveraging Databricks for Spark pipelines
Rose Toomey
 
Leveraging Databricks for Spark Pipelines
Leveraging Databricks for Spark PipelinesLeveraging Databricks for Spark Pipelines
Leveraging Databricks for Spark Pipelines
Rose Toomey
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
confluent
 
Ai meetup Neural machine translation updated
Ai meetup Neural machine translation updatedAi meetup Neural machine translation updated
Ai meetup Neural machine translation updated
2040.io
 
Software and the Concurrency Revolution : Notes
Software and the Concurrency Revolution : NotesSoftware and the Concurrency Revolution : Notes
Software and the Concurrency Revolution : Notes
Subhajit Sahu
 
Gears of Perforce: AAA Game Development Challenges
Gears of Perforce: AAA Game Development ChallengesGears of Perforce: AAA Game Development Challenges
Gears of Perforce: AAA Game Development Challenges
Perforce
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
Ricardo Varela
 
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
Glenn K. Lockwood
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
Jen Aman
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Spark Summit
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
Reynold Xin
 
Google Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 DayGoogle Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 Day
programmermag
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
DataStax Academy
 
How Many Slaves (Ukoug)
How Many Slaves (Ukoug)How Many Slaves (Ukoug)
How Many Slaves (Ukoug)
Doug Burns
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
Antonios Katsarakis
 
Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5Tachyon-2014-11-21-amp-camp5
Tachyon-2014-11-21-amp-camp5
Haoyuan Li
 
Making Apache Kafka Even Faster And More Scalable
Making Apache Kafka Even Faster And More ScalableMaking Apache Kafka Even Faster And More Scalable
Making Apache Kafka Even Faster And More Scalable
PaulBrebner2
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
Spark Summit
 
Seastar at Linux Foundation Collaboration Summit
Seastar at Linux Foundation Collaboration SummitSeastar at Linux Foundation Collaboration Summit
Seastar at Linux Foundation Collaboration Summit
Don Marti
 
High Performance With Java
High Performance With JavaHigh Performance With Java
High Performance With Java
malduarte
 
Leveraging Databricks for Spark pipelines
Leveraging Databricks for Spark pipelinesLeveraging Databricks for Spark pipelines
Leveraging Databricks for Spark pipelines
Rose Toomey
 
Leveraging Databricks for Spark Pipelines
Leveraging Databricks for Spark PipelinesLeveraging Databricks for Spark Pipelines
Leveraging Databricks for Spark Pipelines
Rose Toomey
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
confluent
 
Ai meetup Neural machine translation updated
Ai meetup Neural machine translation updatedAi meetup Neural machine translation updated
Ai meetup Neural machine translation updated
2040.io
 
Software and the Concurrency Revolution : Notes
Software and the Concurrency Revolution : NotesSoftware and the Concurrency Revolution : Notes
Software and the Concurrency Revolution : Notes
Subhajit Sahu
 
Gears of Perforce: AAA Game Development Challenges
Gears of Perforce: AAA Game Development ChallengesGears of Perforce: AAA Game Development Challenges
Gears of Perforce: AAA Game Development Challenges
Perforce
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
Ricardo Varela
 
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
Glenn K. Lockwood
 
Ad

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Ad

Recently uploaded (20)

Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)
Allon Mureinik
 
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRYLEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
NidaFarooq10
 
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& ConsiderationsDesigning AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Dinusha Kumarasiri
 
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdfMicrosoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
TechSoup
 
Expand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchangeExpand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchange
Fexle Services Pvt. Ltd.
 
Adobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest VersionAdobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest Version
kashifyounis067
 
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
How can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptxHow can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptx
laravinson24
 
Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]
saniaaftab72555
 
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
University of Hawai‘i at Mānoa
 
Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025
mu394968
 
PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025
mu394968
 
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New VersionPixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
saimabibi60507
 
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
AxisTechnolabs
 
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Ranjan Baisak
 
Maxon CINEMA 4D 2025 Crack FREE Download LINK
Maxon CINEMA 4D 2025 Crack FREE Download LINKMaxon CINEMA 4D 2025 Crack FREE Download LINK
Maxon CINEMA 4D 2025 Crack FREE Download LINK
younisnoman75
 
Douwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License codeDouwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License code
aneelaramzan63
 
FL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full VersionFL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full Version
tahirabibi60507
 
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Lionel Briand
 
WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)
sh607827
 
Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)
Allon Mureinik
 
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRYLEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
NidaFarooq10
 
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& ConsiderationsDesigning AI-Powered APIs on Azure: Best Practices& Considerations
Designing AI-Powered APIs on Azure: Best Practices& Considerations
Dinusha Kumarasiri
 
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdfMicrosoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
TechSoup
 
Expand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchangeExpand your AI adoption with AgentExchange
Expand your AI adoption with AgentExchange
Fexle Services Pvt. Ltd.
 
Adobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest VersionAdobe Illustrator Crack FREE Download 2025 Latest Version
Adobe Illustrator Crack FREE Download 2025 Latest Version
kashifyounis067
 
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
How can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptxHow can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptx
laravinson24
 
Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]Get & Download Wondershare Filmora Crack Latest [2025]
Get & Download Wondershare Filmora Crack Latest [2025]
saniaaftab72555
 
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
University of Hawai‘i at Mānoa
 
Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025
mu394968
 
PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025
mu394968
 
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New VersionPixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
saimabibi60507
 
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
Interactive odoo dashboards for sales, CRM , Inventory, Invoice, Purchase, Pr...
AxisTechnolabs
 
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Ranjan Baisak
 
Maxon CINEMA 4D 2025 Crack FREE Download LINK
Maxon CINEMA 4D 2025 Crack FREE Download LINKMaxon CINEMA 4D 2025 Crack FREE Download LINK
Maxon CINEMA 4D 2025 Crack FREE Download LINK
younisnoman75
 
Douwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License codeDouwan Crack 2025 new verson+ License code
Douwan Crack 2025 new verson+ License code
aneelaramzan63
 
FL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full VersionFL Studio Producer Edition Crack 2025 Full Version
FL Studio Producer Edition Crack 2025 Full Version
tahirabibi60507
 
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Lionel Briand
 
WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)
sh607827
 

Apache Spark Performance: Past, Future and Present

  • 1. Spark Performance Past, Future, and Present Kay Ousterhout Joint work with Christopher Canel, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, Byung-Gon Chun
  • 2. About Me Apache Spark PMC Member Recent PhD graduate from UC Berkeley Thesis work on performance of large-scale data analytics Co-founder at Kelda (kelda.io)
  • 3. How can I make this faster?
  • 4. How can I make this faster?
  • 5. Should I use a different cloud instance type?
  • 6. Should I trade more CPU for less I/O by using better compression?
  • 7. How can I make this faster? ???
  • 8. How can I make this faster? ???
  • 9. How can I make this faster? ???
  • 10. Major performance improvements possible via tuning, configuration …if only you knew which knobs to turn
  • 11. Past: Performance instrumentation in Spark Future: New architecture that provides performance clarity Present: Improving Spark’s performance instrumentation This talk
  • 12. spark.textFile(“hdfs://…”) .flatMap(lambda l: l.split(“ “)) .map(lambda w: (w, 1)) .reduceByKey(lambda a, b: a + b) .saveAsTextFile(“hdfs://…”) Example Spark Job Split input file into words and emit count of 1 for each Word Count:
  • 13. Example Spark Job Split input file into words and emit count of 1 for each Word Count: For each word, combine the counts, and save the output spark.textFile(“hdfs://…”) .flatMap(lambda l: l.split(“ “)) .map(lambda w: (w, 1)) .reduceByKey(lambda a, b: a + b) .saveAsTextFile(“hdfs://…”)
  • 14. spark.textFile(“hdfs://…”) .flatMap(lambda l: l.split(“ “)) .map(lambda w: (w, 1)) Map Stage: Split input file into words and emit count of 1 for each Reduce Stage: For each word, combine the counts, and save the output Spark Word Count Job: .reduceByKey(lambda a, b: a + b) .saveAsTextFile(“hdfs://…”) … Worker 1 Worker n Tasks … Worker 1 Worker n
  • 15. Spark Word Count Job: Reduce Stage: For each word, combine the counts, and save the output .reduceByKey(lambda a, b: a + b) .saveAsTextFile(“hdfs://…”) … Worker 1 Worker n
  • 16. compute network time (1) Request a few shuffle blocks disk (5) Continue fetching remote data : time to handle one shuffle block (2) Process local data What happens in a reduce task? (4) Process data fetched remotely (3) Write output to disk
  • 17. compute network time disk : time to handle one shuffle block What happens in a reduce task? Bottlenecked on network and disk Bottlenecked on network Bottlenecked on CPU
  • 18. compute network time disk : time to handle one shuffle block What happens in a reduce task? Bottlenecked on network and disk Bottlenecked on network Bottlenecked on CPU
  • 19. compute network time disk What instrumentation exists today? Instrumentation centered on single, main task thread : shuffle read blocked time : executor computing time (!)
  • 20. actual What instrumentation exists today? timeline version
  • 21. What instrumentation exists today? Instrumentation centered on single, main task thread Shuffle read and shuffle write blocked time Input read and output write blocked time not instrumented Possible to add!
  • 22. compute disk Instrumenting read and write time Process shuffle block This is a lie compute Reality: Spark processes and then writes one record at a time Most writes get buffered Occasionally the buffer is flushed
  • 23. compute Spark processes and then writes one record at a time Most writes get buffered Occasionally the buffer is flushed Challenges with reality: Record-level instrumentation is too high overhead Spark doesn’t know when buffers get flushed (HDFS does!)
  • 24. Tasks use fine-grained pipelining to parallelize resources Instrumented times are blocked times only (task is doing other things in background) Opportunities to improve instrumentation
  • 25. Past: Performance instrumentation in Spark Future: New architecture that provides performance clarity Present: Improving Spark’s performance instrumentation This talk
  • 26. Task 1 Task 2 Task 5 Task 3 Task 4 Task 7 Task 6 Task 8 time 4 concurrent tasks on a worker
  • 27. Task 1 Task 2 Task 5 Task 3 Task 4 Task 7 Task 6 Task 8 time Concurrent tasks may contend for the same resource (e.g., network)
  • 28. What’s the bottleneck? Task 1 Task 2 Task 5 Task 3 Task 4 Task 7 Task 6 Task 8 Time t: different tasks may be bottlenecked on different resources Single task may be bottlenecked on different resources at different times
  • 29. Task 1 Task 2 Task 5 Task 3 Task 4 Task 7 Task 6 Task 8 How much faster would my job be with 2x disk throughput? How would runtimes for these disk writes change? How would that change timing of (and contention for) other resources?
  • 30. Today: tasks use pipelining to parallelize multiple resources Proposal: build systems using monotasks that each consume just one resource
  • 31. Monotasks: Each task uses one resource Network monotask Disk monotask Compute monotask Today’s task: Monotasks don’t start until all dependencies complete Task 1 Network read CPU Disk write
  • 32. Dedicated schedulers control contention Network scheduler CPU scheduler: 1 monotask / core Disk drive scheduler: 1 monotask / disk Monotasks for one of today’s tasks:
  • 33. Spark today: Tasks have non- uniform resource use 4 multi-resource tasks run concurrently Single-resource monotasks scheduled by per-resource schedulers Monotasks: API-compatible, performance parity with Spark Performance telemetry trivial!
  • 34. How much faster would job run if... 4x more machines Input stored in-memory No disk read No CPU time to deserialize Flash drives instead of disks Faster shuffle read/write time 10x improvement predicted with at most 23% error � ��� ��� ��� ��� ���� ���� ���� ���� ��� ����� �������� �� ����� �� ����� ���������� �������� �������� ������� �� ��������� ���� ���� ������ ��������� ��� ������� ��� ��������� ���� ���� ������ ������ ��� ������� ��� ��������� ���� ���� ������
  • 35. Monotasks: Break jobs into single-resource tasks Using single-resource monotasks provides clarity without sacrificing performance Massive change to Spark internals (>20K lines of code)
  • 36. Past: Performance instrumentation in Spark Future: New architecture that provides performance clarity Present: Improving Spark’s performance instrumentation This talk
  • 37. Spark today: Task resource use changes at fine time granularity 4 multi-resource tasks run concurrently Monotasks: Single-resource tasks lead to complete, trivial performance metrics Can we get monotask-like per-resource metrics for each task today?
  • 38. compute network time disk Can we get per-task resource use? Measure machine resource utilization?
  • 39. Task 1 Task 2 Task 5 Task 3 Task 4 Task 7 Task 6 Task 8 time 4 concurrent tasks on a worker
  • 40. Task 1 Task 2 Task 5 Task 3 Task 4 Task 7 Task 6 Task 8 time Concurrent tasks may contend for the same resource (e.g., network) Contention controlled by lower layers (e.g., operating system)
  • 41. compute network time disk Can we get per-task resource use? Machine utilization includes other tasks Can’t directly measure per-task I/O: in background, mixed with other tasks
  • 42. compute network time disk Can we get per-task resource use? Existing metrics: total data read How long did it take? Use machine utilization metrics to get bandwidth!
  • 43. Existing per-task I/O counters (e.g., shuffle bytes read) + Machine-level utilization (and bandwidth) metrics = Complete metrics about time spent using each resource
  • 44. Goal: provide performance clarity Only way to improve performance is to know what to speed up Why do we care about performance clarity? Typical performance eval: group of experts Practical performance: 1 novice
  • 45. Goal: provide performance clarity Only way to improve performance is to know what to speed up Some instrumentation exists already Focuses on blocked times in the main task thread Many opportunities to improve instrumentation (1) Add read/write instrumentation to lower level (e.g., HDFS) (2) Add machine-level utilization info (3) Calculate per-resource time More details at kayousterhout.org