SlideShare a Scribd company logo
Performance Optimization Case
Study: Shattering Hadoop's Sort
Record with Spark and Scala
Reynold Xin (@rxin)
March 17, 2015
Who am I?
Reynold Xin (@rxin)
Co-founder, Databricks
Apache Spark committer & maintainer on core
API, shuffle, network, block manager, SQL, GraphX
On leave from PhD @ UC Berkeley AMPLab
Spark is In-Memory and Fast
0.96
110
0 25 50 75 100 125
Logistic Regression
4.1
155
0 30 60 90 120 150 180
K-Means Clustering
Hadoop MR
Spark
Time per Iteration (s)
“Spark is in-memory. It doesn’t work with BIG
DATA.”
“It is too expensive to buy a cluster with enough
memory to fit our data.”
Common Misconception About Spark
“Spark is in-memory. It doesn’t work with BIG
DATA.”
“It is too expensive to buy a cluster with enough
memory to fit our data.”
Spark Project Goals
Works well with GBs, TBs, or PBs or data
Works well with different storage media (RAM,
HDDs, SSDs)
Works well from low-latency streaming jobs to
long batch jobs
Spark Project Goals
Works well with KBs, MBs, … or PBs or data
Works well with different storage media (RAM,
HDDs, SSDs)
Works well from low-latency streaming jobs to
long batch jobs
Sort Benchmark
Originally sponsored by Jim Gray to measure
advancements in software and hardware in 1987
Participants often used purpose-built hardware/
software to compete
–  Large companies: IBM, Microsoft, Yahoo, …
–  Academia: UC Berkeley, UCSD, MIT, …
Sort Benchmark
1MB -> 100MB -> 1TB (1998) -> 100TB (2009)
Sorting 100 TB and 1 PB
Participated in the most challenging category
–  Daytona GraySort
–  Sort 100TB in a fault-tolerant manner
–  1 trillion records generated by a 3rd-party harness (10B
key + 90B payload)
Set a new world record (tied with UCSD)
–  Saturated throughput of 8 SSDs and 10Gbps network
–  1st time public cloud + open source won
Sorted 1PB on our own to push scalability
11
On-Disk Sort Record:
Time to sort 100TB
2100 machines2013 Record:
Hadoop
2014 Record:
Spark
Source: Daytona GraySort benchmark, sortbenchmark.org
72 minutes
207 machines
23 minutes
Also sorted 1PB in 4 hours
Why Sorting?
Sorting stresses “shuffle”, which underpins
everything from SQL to machine learning (group
by, join, sort, ALS, etc)
Sorting is challenging because there is no
reduction in data.
Sort 100TB = 500TB disk I/O + 200TB network
What made this possible?
Power of the Cloud
Engineering Investment in Spark
–  Sort-based shuffle (SPARK-2045)
–  Netty native network transport (SPARK-2468)
–  External shuffle service (SPARK-3796)
Clever Application Level Techniques
–  Avoid Scala/JVM gotchas
–  Cache-friendly & GC-friendly sorting
–  Pipelining
What made this possible?
Power of the Cloud
Engineering Investment in Spark
–  Sort-based shuffle (SPARK-2045)
–  Netty native network transport (SPARK-2468)
–  External shuffle service (SPARK-3796)
Clever Application Level Techniques
–  Avoid Scala/JVM gotchas
–  Cache-friendly & GC-friendly sorting
–  Pipelining
Sort-based Shuffle
Old hash-based shuffle: require R
(#reduce tasks) concurrent streams
with buffers, i.e. limits R.
New sort-based shuffle: sort
records by partition first, and then
write them out. One active stream
at a time.
Went up to 250,000 tasks in PB sort
Network Transport
High performance event-driven architecture using
Netty (SEDA-like)
Explicitly manages pools of memory outside JVM
garbage collection
Zero-copy (avoid user space buffer through
FileChannel.transferTo)
Network Transport
Sustaining ~1.1GB/node/s
What made this possible?
Power of the Cloud
Engineering Investment in Spark
–  Sort-based shuffle (SPARK-2045)
–  Netty native network transport (SPARK-2468)
–  External shuffle service (SPARK-3796)
Clever Application Level Techniques
–  Avoid Scala/JVM gotchas
–  Cache-friendly & GC-friendly sorting
–  Pipelining
Warning
The guidelines here apply only to hyper
performance optimizations
–  Majority of the code should not be perf sensitive
Measure before you optimize
–  It is ridiculously hard to write good benchmarks
–  Use jmh if you want to do that
http://github.com/databricks/style-style-guide
Prefer while loops
array.zipWithIndex.foreach { case (x, i) =>
…
}
vs
var i = 0
while (i < array.length) {
...
i += 1
}
Avoid Collection Library
Avoid Scala collection library
–  Generally slower than java.util.*
Avoid Java collection library
–  Not specialized (too many objects)
–  Bad cache locality
Prefer private[this]
A.scala:
class A {
private val field = 0
def accessor = field
}
scalac –print A.scala
Class A extends Object {
private[this] val field: Int = _;
<stable> <accessor> private def field(): Int = A.this.field;
def accessor(): Int = A.this.field();
def <init>(): A = {
A.super.<init>();
A.this.field = 0;
()
}
};
JIT might not be able to inline this
Prefer private[this]
B.scala:
class B {
private[this] val field = 0
def accessor = field
}
scalac –print B.scala
Class B extends Object {
private[this] val field: Int = _;
def accessor(): Int = B.this.field;
def <init>(): B = {
B.super.<init>();
B.this.field = 0;
()
}
};
What made this possible?
Power of the Cloud
Engineering Investment in Spark
–  Sort-based shuffle (SPARK-2045)
–  Netty native network transport (SPARK-2468)
–  External shuffle service (SPARK-3796)
Clever Application Level Techniques
–  Avoid Scala/JVM gotchas
–  Cache-friendly & GC-friendly sorting
–  Pipelining
Garbage Collection
Hard to tune GC when memory utilization > 60%
Do away entirely with GC: off-heap allocation
GC Take 1: DirectByteBuffer
java.nio.ByteBuffer.allocateDirect(long size);
Limited size: 2GB only
Hard to recycle
–  only claimed by GC
–  possible to control with reflection, but there is a
static synchronize!
GC Take 2: sun.misc.Unsafe
Available on most JVM implementations
(frequently exploited by trading platforms)
Provide C-like explicit memory allocation
class Unsafe {
public native long allocateMemory(long bytes);
public native void freeMemory(long address);
…
}
GC Take 2: sun.misc.Unsafe
class Unsafe {
…
public native copyMemory(
long srcAddress, long destAddress,
long bytes);
public native long getLong(long address);
public native int getInt(long address);
...
}
Caveats when using Unsafe
Maintain a thread local pool of memory blocks to
avoid allocation overhead and fragmentation
(Netty uses jemalloc)
-ea during debugging, -da in benchmark mode
Cache Friendly Sorting
Naïve Scala/Java collection sort is extremely slow
–  object dereferences (random memory access)
–  polymorphic comparators (cannot be inlined)
Spark’s TimSort allows operating on custom data
memory layout
–  Be careful with inlining again
Data Layout Take 1
100B …
Consecutive Memory Block (size N * 100B)
Comparison: uses the 10B key to compare.
Swap: requires swapping 100B.
Too much memcpy!
Record format: 10B key + 90B payload
Data Layout Take 2
Swap: cheap (8B or 4B)
Comparison: requires
dereference 2 pointers
(random access)
Bad cache locality!
…
100bytearray
100bytearray
100bytearray
100bytearray
100bytearray
Array[Array[Byte]]
Data Layout Take 3
14 bytes …
Consecutive Memory Block (size N * 14B)
10B key prefix + 4B position
Sort comparison: using the 10B prefix inline
Swap: only swap 14B
Good cache locality & low memcpy overhead!
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spark and Scala
What made this possible?
Power of the Cloud
Engineering Investment in Spark
–  Sort-based shuffle (SPARK-2045)
–  Netty native network transport (SPARK-2468)
–  External shuffle service (SPARK-3796)
Clever Application Level Techniques
–  Avoid Scala/JVM gotchas
–  Cache-friendly & GC-friendly sorting
–  Pipelining
Pipelining
Add a semaphore to throttle concurrent “reads”
What did we do exactly?
Improve single-thread sorting speed
–  Better sort algorithm
–  Better data layout for cache locality
Reduce garbage collection overhead
–  Explicitly manage memory using Unsafe
Reduce resource contention via pipelining
Increase network transfer throughput
–  Netty (zero-copy, jemalloc)
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spark and Scala
Extra Readings
Databricks Scala Guide (Performance Section)
https://github.com/databricks/scala-style-guide
Blog post about this benchmark entry
http://tinyurl.com/spark-sort
Thanks!
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spark and Scala
Ad

More Related Content

What's hot (20)

Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
Databricks
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
Databricks
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
Databricks
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
Databricks
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Josef A. Habdank
 
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Databricks
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Databricks
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
Databricks
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015
Databricks
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
Databricks
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
Databricks
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
Databricks
 
New directions for Apache Spark in 2015
New directions for Apache Spark in 2015New directions for Apache Spark in 2015
New directions for Apache Spark in 2015
Databricks
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
Databricks
 
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Databricks
 
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Databricks
 
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
Databricks
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Spark Summit
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
Databricks
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
Databricks
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
Databricks
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
Databricks
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
Databricks
 
Strata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache SparkStrata NYC 2015 - Supercharging R with Apache Spark
Strata NYC 2015 - Supercharging R with Apache Spark
Databricks
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Josef A. Habdank
 
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Databricks
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Databricks
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
Databricks
 
Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015Spark streaming State of the Union - Strata San Jose 2015
Spark streaming State of the Union - Strata San Jose 2015
Databricks
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
Databricks
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
Databricks
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
Databricks
 
New directions for Apache Spark in 2015
New directions for Apache Spark in 2015New directions for Apache Spark in 2015
New directions for Apache Spark in 2015
Databricks
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
Databricks
 
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and REnabling Exploratory Analysis of Large Data with Apache Spark and R
Enabling Exploratory Analysis of Large Data with Apache Spark and R
Databricks
 
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell
Databricks
 
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
Databricks
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Spark Summit
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
Databricks
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
Databricks
 

Viewers also liked (20)

5 How The Model Works (With Notes)
5 How The Model Works (With Notes)5 How The Model Works (With Notes)
5 How The Model Works (With Notes)
Abhishek Datta
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012
Skills Matter Talks
 
Introduction to scala for a c programmer
Introduction to scala for a c programmerIntroduction to scala for a c programmer
Introduction to scala for a c programmer
Girish Kumar A L
 
Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)
Denny Lee
 
Apache hive
Apache hiveApache hive
Apache hive
pradipbajpai68
 
Python to scala
Python to scalaPython to scala
Python to scala
kao kuo-tung
 
Scala - A Scalable Language
Scala - A Scalable LanguageScala - A Scalable Language
Scala - A Scalable Language
Mario Gleichmann
 
Indexed Hive
Indexed HiveIndexed Hive
Indexed Hive
NikhilDeshpande
 
Fun[ctional] spark with scala
Fun[ctional] spark with scalaFun[ctional] spark with scala
Fun[ctional] spark with scala
David Vallejo Navarro
 
Scala: Pattern matching, Concepts and Implementations
Scala: Pattern matching, Concepts and ImplementationsScala: Pattern matching, Concepts and Implementations
Scala: Pattern matching, Concepts and Implementations
MICHRAFY MUSTAFA
 
Scala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaScala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in Scala
Alexander Dean
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
ragho
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
Hakka Labs
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFrames
Jen Aman
 
Apache Spark and Online Analytics
Apache Spark and Online Analytics Apache Spark and Online Analytics
Apache Spark and Online Analytics
Databricks
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
nzhang
 
A Basic Hive Inspection
A Basic Hive InspectionA Basic Hive Inspection
A Basic Hive Inspection
Linda Tillman
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
DataWorks Summit
 
Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009
Martin Odersky
 
5 How The Model Works (With Notes)
5 How The Model Works (With Notes)5 How The Model Works (With Notes)
5 How The Model Works (With Notes)
Abhishek Datta
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
Introduction to scala for a c programmer
Introduction to scala for a c programmerIntroduction to scala for a c programmer
Introduction to scala for a c programmer
Girish Kumar A L
 
Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)
Denny Lee
 
Scala - A Scalable Language
Scala - A Scalable LanguageScala - A Scalable Language
Scala - A Scalable Language
Mario Gleichmann
 
Scala: Pattern matching, Concepts and Implementations
Scala: Pattern matching, Concepts and ImplementationsScala: Pattern matching, Concepts and Implementations
Scala: Pattern matching, Concepts and Implementations
MICHRAFY MUSTAFA
 
Scala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in ScalaScala eXchange: Building robust data pipelines in Scala
Scala eXchange: Building robust data pipelines in Scala
Alexander Dean
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
ragho
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
Hakka Labs
 
Spark Meetup TensorFrames
Spark Meetup TensorFramesSpark Meetup TensorFrames
Spark Meetup TensorFrames
Jen Aman
 
Apache Spark and Online Analytics
Apache Spark and Online Analytics Apache Spark and Online Analytics
Apache Spark and Online Analytics
Databricks
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
nzhang
 
A Basic Hive Inspection
A Basic Hive InspectionA Basic Hive Inspection
A Basic Hive Inspection
Linda Tillman
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
DataWorks Summit
 
Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009
Martin Odersky
 
Ad

Similar to Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spark and Scala (20)

Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
Taewook Eom
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark Performance
Tim Ellison
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
Ned Shawa
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
Why Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraWhy Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data Era
Handaru Sakti
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizations
Gal Marder
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin Odersky
Spark Summit
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
Bojan Babic
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache Spark
Databricks
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Jen Aman
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Databricks
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
bhargavi804095
 
Yet another intro to Apache Spark
Yet another intro to Apache SparkYet another intro to Apache Spark
Yet another intro to Apache Spark
Simon Lia-Jonassen
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Helena Edelson
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
Chris Fregly
 
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
Databricks
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
Taewook Eom
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark Performance
Tim Ellison
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
Ned Shawa
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
Why Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraWhy Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data Era
Handaru Sakti
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizations
Gal Marder
 
Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin Odersky
Spark Summit
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
Bojan Babic
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache Spark
Databricks
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Jen Aman
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Databricks
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
bhargavi804095
 
Yet another intro to Apache Spark
Yet another intro to Apache SparkYet another intro to Apache Spark
Yet another intro to Apache Spark
Simon Lia-Jonassen
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Helena Edelson
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
Chris Fregly
 
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
Very Large Data Files, Object Stores, and Deep Learning—Lessons Learned While...
Databricks
 
Ad

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

Recently uploaded (20)

Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
University of Hawai‘i at Mānoa
 
Societal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainabilitySocietal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainability
Jordi Cabot
 
Landscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature ReviewLandscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature Review
Hironori Washizaki
 
How can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptxHow can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptx
laravinson24
 
Automation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath CertificateAutomation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath Certificate
VICTOR MAESTRE RAMIREZ
 
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New VersionPixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
saimabibi60507
 
WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)
sh607827
 
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRYLEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
NidaFarooq10
 
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
Andre Hora
 
Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025
mu394968
 
EASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License CodeEASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License Code
aneelaramzan63
 
The Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdfThe Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdf
drewplanas10
 
Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)
Allon Mureinik
 
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Lionel Briand
 
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Andre Hora
 
Solidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license codeSolidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license code
aneelaramzan63
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...
Eric D. Schabell
 
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
Egor Kaleynik
 
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
University of Hawai‘i at Mānoa
 
Societal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainabilitySocietal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainability
Jordi Cabot
 
Landscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature ReviewLandscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature Review
Hironori Washizaki
 
How can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptxHow can one start with crypto wallet development.pptx
How can one start with crypto wallet development.pptx
laravinson24
 
Automation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath CertificateAutomation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath Certificate
VICTOR MAESTRE RAMIREZ
 
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New VersionPixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
saimabibi60507
 
WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)
sh607827
 
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRYLEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
LEARN SEO AND INCREASE YOUR KNOWLDGE IN SOFTWARE INDUSTRY
NidaFarooq10
 
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
Andre Hora
 
Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025Avast Premium Security Crack FREE Latest Version 2025
Avast Premium Security Crack FREE Latest Version 2025
mu394968
 
EASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License CodeEASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License Code
aneelaramzan63
 
The Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdfThe Significance of Hardware in Information Systems.pdf
The Significance of Hardware in Information Systems.pdf
drewplanas10
 
Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)Who Watches the Watchmen (SciFiDevCon 2025)
Who Watches the Watchmen (SciFiDevCon 2025)
Allon Mureinik
 
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Requirements in Engineering AI- Enabled Systems: Open Problems and Safe AI Sy...
Lionel Briand
 
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)
Andre Hora
 
Solidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license codeSolidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license code
aneelaramzan63
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...
Eric D. Schabell
 
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software DevelopmentSecure Test Infrastructure: The Backbone of Trustworthy Software Development
Secure Test Infrastructure: The Backbone of Trustworthy Software Development
Shubham Joshi
 
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
How Valletta helped healthcare SaaS to transform QA and compliance to grow wi...
Egor Kaleynik
 

Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spark and Scala

  • 1. Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spark and Scala Reynold Xin (@rxin) March 17, 2015
  • 2. Who am I? Reynold Xin (@rxin) Co-founder, Databricks Apache Spark committer & maintainer on core API, shuffle, network, block manager, SQL, GraphX On leave from PhD @ UC Berkeley AMPLab
  • 3. Spark is In-Memory and Fast 0.96 110 0 25 50 75 100 125 Logistic Regression 4.1 155 0 30 60 90 120 150 180 K-Means Clustering Hadoop MR Spark Time per Iteration (s)
  • 4. “Spark is in-memory. It doesn’t work with BIG DATA.” “It is too expensive to buy a cluster with enough memory to fit our data.”
  • 5. Common Misconception About Spark “Spark is in-memory. It doesn’t work with BIG DATA.” “It is too expensive to buy a cluster with enough memory to fit our data.”
  • 6. Spark Project Goals Works well with GBs, TBs, or PBs or data Works well with different storage media (RAM, HDDs, SSDs) Works well from low-latency streaming jobs to long batch jobs
  • 7. Spark Project Goals Works well with KBs, MBs, … or PBs or data Works well with different storage media (RAM, HDDs, SSDs) Works well from low-latency streaming jobs to long batch jobs
  • 8. Sort Benchmark Originally sponsored by Jim Gray to measure advancements in software and hardware in 1987 Participants often used purpose-built hardware/ software to compete –  Large companies: IBM, Microsoft, Yahoo, … –  Academia: UC Berkeley, UCSD, MIT, …
  • 9. Sort Benchmark 1MB -> 100MB -> 1TB (1998) -> 100TB (2009)
  • 10. Sorting 100 TB and 1 PB Participated in the most challenging category –  Daytona GraySort –  Sort 100TB in a fault-tolerant manner –  1 trillion records generated by a 3rd-party harness (10B key + 90B payload) Set a new world record (tied with UCSD) –  Saturated throughput of 8 SSDs and 10Gbps network –  1st time public cloud + open source won Sorted 1PB on our own to push scalability
  • 11. 11 On-Disk Sort Record: Time to sort 100TB 2100 machines2013 Record: Hadoop 2014 Record: Spark Source: Daytona GraySort benchmark, sortbenchmark.org 72 minutes 207 machines 23 minutes Also sorted 1PB in 4 hours
  • 12. Why Sorting? Sorting stresses “shuffle”, which underpins everything from SQL to machine learning (group by, join, sort, ALS, etc) Sorting is challenging because there is no reduction in data. Sort 100TB = 500TB disk I/O + 200TB network
  • 13. What made this possible? Power of the Cloud Engineering Investment in Spark –  Sort-based shuffle (SPARK-2045) –  Netty native network transport (SPARK-2468) –  External shuffle service (SPARK-3796) Clever Application Level Techniques –  Avoid Scala/JVM gotchas –  Cache-friendly & GC-friendly sorting –  Pipelining
  • 14. What made this possible? Power of the Cloud Engineering Investment in Spark –  Sort-based shuffle (SPARK-2045) –  Netty native network transport (SPARK-2468) –  External shuffle service (SPARK-3796) Clever Application Level Techniques –  Avoid Scala/JVM gotchas –  Cache-friendly & GC-friendly sorting –  Pipelining
  • 15. Sort-based Shuffle Old hash-based shuffle: require R (#reduce tasks) concurrent streams with buffers, i.e. limits R. New sort-based shuffle: sort records by partition first, and then write them out. One active stream at a time. Went up to 250,000 tasks in PB sort
  • 16. Network Transport High performance event-driven architecture using Netty (SEDA-like) Explicitly manages pools of memory outside JVM garbage collection Zero-copy (avoid user space buffer through FileChannel.transferTo)
  • 18. What made this possible? Power of the Cloud Engineering Investment in Spark –  Sort-based shuffle (SPARK-2045) –  Netty native network transport (SPARK-2468) –  External shuffle service (SPARK-3796) Clever Application Level Techniques –  Avoid Scala/JVM gotchas –  Cache-friendly & GC-friendly sorting –  Pipelining
  • 19. Warning The guidelines here apply only to hyper performance optimizations –  Majority of the code should not be perf sensitive Measure before you optimize –  It is ridiculously hard to write good benchmarks –  Use jmh if you want to do that http://github.com/databricks/style-style-guide
  • 20. Prefer while loops array.zipWithIndex.foreach { case (x, i) => … } vs var i = 0 while (i < array.length) { ... i += 1 }
  • 21. Avoid Collection Library Avoid Scala collection library –  Generally slower than java.util.* Avoid Java collection library –  Not specialized (too many objects) –  Bad cache locality
  • 22. Prefer private[this] A.scala: class A { private val field = 0 def accessor = field } scalac –print A.scala Class A extends Object { private[this] val field: Int = _; <stable> <accessor> private def field(): Int = A.this.field; def accessor(): Int = A.this.field(); def <init>(): A = { A.super.<init>(); A.this.field = 0; () } }; JIT might not be able to inline this
  • 23. Prefer private[this] B.scala: class B { private[this] val field = 0 def accessor = field } scalac –print B.scala Class B extends Object { private[this] val field: Int = _; def accessor(): Int = B.this.field; def <init>(): B = { B.super.<init>(); B.this.field = 0; () } };
  • 24. What made this possible? Power of the Cloud Engineering Investment in Spark –  Sort-based shuffle (SPARK-2045) –  Netty native network transport (SPARK-2468) –  External shuffle service (SPARK-3796) Clever Application Level Techniques –  Avoid Scala/JVM gotchas –  Cache-friendly & GC-friendly sorting –  Pipelining
  • 25. Garbage Collection Hard to tune GC when memory utilization > 60% Do away entirely with GC: off-heap allocation
  • 26. GC Take 1: DirectByteBuffer java.nio.ByteBuffer.allocateDirect(long size); Limited size: 2GB only Hard to recycle –  only claimed by GC –  possible to control with reflection, but there is a static synchronize!
  • 27. GC Take 2: sun.misc.Unsafe Available on most JVM implementations (frequently exploited by trading platforms) Provide C-like explicit memory allocation class Unsafe { public native long allocateMemory(long bytes); public native void freeMemory(long address); … }
  • 28. GC Take 2: sun.misc.Unsafe class Unsafe { … public native copyMemory( long srcAddress, long destAddress, long bytes); public native long getLong(long address); public native int getInt(long address); ... }
  • 29. Caveats when using Unsafe Maintain a thread local pool of memory blocks to avoid allocation overhead and fragmentation (Netty uses jemalloc) -ea during debugging, -da in benchmark mode
  • 30. Cache Friendly Sorting Naïve Scala/Java collection sort is extremely slow –  object dereferences (random memory access) –  polymorphic comparators (cannot be inlined) Spark’s TimSort allows operating on custom data memory layout –  Be careful with inlining again
  • 31. Data Layout Take 1 100B … Consecutive Memory Block (size N * 100B) Comparison: uses the 10B key to compare. Swap: requires swapping 100B. Too much memcpy! Record format: 10B key + 90B payload
  • 32. Data Layout Take 2 Swap: cheap (8B or 4B) Comparison: requires dereference 2 pointers (random access) Bad cache locality! … 100bytearray 100bytearray 100bytearray 100bytearray 100bytearray Array[Array[Byte]]
  • 33. Data Layout Take 3 14 bytes … Consecutive Memory Block (size N * 14B) 10B key prefix + 4B position Sort comparison: using the 10B prefix inline Swap: only swap 14B Good cache locality & low memcpy overhead!
  • 35. What made this possible? Power of the Cloud Engineering Investment in Spark –  Sort-based shuffle (SPARK-2045) –  Netty native network transport (SPARK-2468) –  External shuffle service (SPARK-3796) Clever Application Level Techniques –  Avoid Scala/JVM gotchas –  Cache-friendly & GC-friendly sorting –  Pipelining
  • 37. Add a semaphore to throttle concurrent “reads”
  • 38. What did we do exactly? Improve single-thread sorting speed –  Better sort algorithm –  Better data layout for cache locality Reduce garbage collection overhead –  Explicitly manage memory using Unsafe Reduce resource contention via pipelining Increase network transfer throughput –  Netty (zero-copy, jemalloc)
  • 40. Extra Readings Databricks Scala Guide (Performance Section) https://github.com/databricks/scala-style-guide Blog post about this benchmark entry http://tinyurl.com/spark-sort