1© Cloudera, Inc. All rights reserved.
Getting Spark Customers to Production
Kostas Sakellis
Me
• Software Engineer at Cloudera
• Contributor to Apache Spark
• Before that, contributed to Cloudera Manager
Our customers
• Various degrees of sophistication with Spark
• In all stages of development
• From POC to production deployments
• 95% use Spark on YARN*
• Biweekly analysis of tickets
WARNING: This is biased!
Building a proof of concept!
Courtesy of: http://www.nefloridadesign.com/mbimages/6.jpg
“Why is my job failing?”
“Why is my job slow?”
Misconfiguration accounts for 20% of job failures
Courtesy of: http://blog.sdrock.com/pastors/files/2013/06/time-clock.jpg
Resource Declaration
• Not easy knowing what you need and how to specify it
• Compute:
• --num-executors vs. --executor-cores
• Memory
• --executor-memory
• Includes JVM overhead
• Need to do the math yourself
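The "math" in question can be sketched in plain Scala. The `Layout` case class below is a hypothetical helper, not part of Spark, and the memory figures are illustrative. It compares the two layouts mentioned in the speaker notes (10 executors with 1 core versus 5 executors with 2 cores), assuming one task per core.

```scala
// Hypothetical helper (not a Spark API): totals implied by an executor layout.
object ResourceDeclaration {
  final case class Layout(numExecutors: Int, coresPerExecutor: Int, executorMemoryMb: Int) {
    // Total task slots across the application, assuming one task per core.
    def totalCores: Int = numExecutors * coresPerExecutor
    // Executor memory is shared by all cores of that executor.
    def heapPerCoreMb: Int = executorMemoryMb / coresPerExecutor
    // Aggregate executor memory requested across the whole application.
    def totalHeapMb: Int = numExecutors * executorMemoryMb
  }

  def main(args: Array[String]): Unit = {
    // Illustrative values only.
    val many = Layout(numExecutors = 10, coresPerExecutor = 1, executorMemoryMb = 2048)
    val few = Layout(numExecutors = 5, coresPerExecutor = 2, executorMemoryMb = 4096)
    Seq(many, few).foreach { l =>
      println(s"$l -> cores=${l.totalCores}, heap/core=${l.heapPerCoreMb} MB, total heap=${l.totalHeapMb} MB")
    }
  }
}
```

Both layouts give the same number of task slots and the same total memory. What differs is how many JVMs share that memory, which is why the choice is not obvious.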
Dynamic Allocation
• Let Spark do the work for you
• Available since Spark 1.2*
• No need to specify compute a priori
• Limitation: Still required to specify cores
• In future:
• Allow specification of “task size”
• Dynamically allocate cores
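As a rough sketch of what enabling this involves, the settings below use property names from the Spark configuration documentation; the values are illustrative, and `asSubmitArgs` is a hypothetical helper that renders them as `--conf` arguments. Cores per executor still appear explicitly, which is the limitation noted above.

```scala
object DynamicAllocationSettings {
  // Property names as documented for Spark; values are illustrative only.
  val settings: Map[String, String] = Map(
    "spark.dynamicAllocation.enabled" -> "true",
    // The external shuffle service lets executors be removed without losing shuffle files.
    "spark.shuffle.service.enabled" -> "true",
    // Cores per executor still have to be chosen up front.
    "spark.executor.cores" -> "2"
  )

  // Hypothetical helper: render settings as `--conf key=value` pairs, sorted by key.
  def asSubmitArgs(conf: Map[String, String]): Seq[String] =
    conf.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }

  def main(args: Array[String]): Unit =
    println(asSubmitArgs(settings).mkString(" "))
}
```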
YARN Configuration mismatch
• Compute:
• yarn.nodemanager.resource.cpu-vcores
• yarn.scheduler.maximum-allocation-vcores
• Memory:
• yarn.nodemanager.resource.memory-mb
• yarn.scheduler.maximum-allocation-mb
YARN Configuration mismatch
• Common to ask for more resources than allowed
• Future work:
• Exposing relevant YARN configurations in Spark UI
• Requires changes to YARN itself
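Until such visibility exists, a pre-flight check is one way to catch a mismatch early. The sketch below is an assumption-laden illustration: `YarnLimits`, `Request`, and `violations` are hypothetical names, the limits would have to be read from the cluster's yarn-site.xml by the caller, and only the scheduler maximums are checked.

```scala
object YarnFitCheck {
  // Cluster-side scheduler maximums (hypothetical holder; values come from yarn-site.xml).
  final case class YarnLimits(maxAllocationMb: Int, maxAllocationVcores: Int)
  // What the job asks for per executor container.
  final case class Request(containerMb: Int, vcores: Int)

  // Returns one message per limit the request exceeds; an empty result means it fits.
  def violations(req: Request, limits: YarnLimits): Seq[String] = {
    val memory =
      if (req.containerMb > limits.maxAllocationMb)
        Seq(s"memory: ${req.containerMb} MB exceeds yarn.scheduler.maximum-allocation-mb (${limits.maxAllocationMb} MB)")
      else Seq.empty[String]
    val cpu =
      if (req.vcores > limits.maxAllocationVcores)
        Seq(s"vcores: ${req.vcores} exceeds yarn.scheduler.maximum-allocation-vcores (${limits.maxAllocationVcores})")
      else Seq.empty[String]
    memory ++ cpu
  }

  def main(args: Array[String]): Unit = {
    // Illustrative numbers only.
    val limits = YarnLimits(maxAllocationMb = 8192, maxAllocationVcores = 4)
    val request = Request(containerMb = 9011, vcores = 2)
    val problems = violations(request, limits)
    if (problems.isEmpty) println("request fits") else problems.foreach(println)
  }
}
```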
Another YARN goodie…
Container [pid=63375,containerID=container_1388158490598_0001_01_000003] is running beyond physical memory limits. Current usage: 2.1 GB of 2 GB physical memory used; 2.8 GB of 4.2 GB virtual memory used. Killing container.
[...]
Memory allocation
• yarn.nodemanager.resource.memory-mb
• Executor container
• spark.yarn.executor.memoryOverhead (7%; 10% in 1.4)
• spark.executor.memory
• spark.shuffle.memoryFraction (0.4)
• spark.storage.memoryFraction (0.6)
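To make the container sizing concrete, here is a minimal sketch of the arithmetic. It assumes the overhead is the larger of a 384 MB floor and a fraction of executor memory (7%, or 10% in 1.4, as on the slide); the floor and the truncating rounding are assumptions drawn from the Spark 1.x documentation rather than from this deck, and the helper names are hypothetical.

```scala
object ExecutorMemory {
  // Assumed floor for the overhead, per the Spark 1.x on YARN documentation.
  val MinOverheadMb = 384

  // Overhead added on top of spark.executor.memory for off-heap and JVM needs.
  def overheadMb(executorMemoryMb: Int, fraction: Double): Int =
    math.max(MinOverheadMb, (executorMemoryMb * fraction).toInt)

  // Size of the container requested from YARN for one executor.
  def containerMb(executorMemoryMb: Int, fraction: Double): Int =
    executorMemoryMb + overheadMb(executorMemoryMb, fraction)

  def main(args: Array[String]): Unit = {
    // Illustrative executor sizes; 0.07 and 0.10 are the fractions from the slide.
    for (mem <- Seq(2048, 8192); fraction <- Seq(0.07, 0.10))
      println(s"executor=$mem MB, fraction=$fraction -> container=${containerMb(mem, fraction)} MB")
  }
}
```

For small executors the floor dominates, so a 2 GB executor needs noticeably more than 2 GB from YARN. A container sized without that headroom is the kind that gets killed for running beyond physical memory limits.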
YARN Overhead
• Future work:
• Better understanding of off heap allocations
• Improve memory usage visibility
Run program through all our data
Courtesy of: https://conniehallscott.files.wordpress.com/2013/01/411748_538971446114753_1125606225_o.jpg
Data dependent tuning
• As data rates change, re-tuning Spark is usually necessary
• Spark is sensitive to shuffle spills
• The most common knob we modify is…
Partitions, Partitions, Partitions!
GC Stalls
Partitions
• Smaller is often better
• Parameterized partition size
• reduceByKey(…, nPartitions)
• Parameterize application
• Future work:
• Dynamically determine # of partitions (SPARK-4630)
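One way to parameterize the application is to take the partition count from the command line, so it can be changed without rebuilding. The sketch below is plain Scala with hypothetical helper names (`partitionsFrom`, `grow`) and an illustrative default of 200. The 1.5 growth factor is the trial-and-error multiplier this deck suggests.

```scala
import scala.util.Try

object PartitionParam {
  // Illustrative default, not a Spark setting.
  val DefaultPartitions = 200

  // The first CLI argument overrides the default when it parses as a positive integer.
  def partitionsFrom(args: Array[String], default: Int = DefaultPartitions): Int =
    args.headOption
      .flatMap(s => Try(s.trim.toInt).toOption)
      .filter(_ > 0)
      .getOrElse(default)

  // Next value to try when the current count still spills or stalls.
  def grow(current: Int, factor: Double = 1.5): Int =
    math.ceil(current * factor).toInt

  def main(args: Array[String]): Unit = {
    val n = partitionsFrom(args)
    // In the job itself the value would be passed along, e.g. rdd.reduceByKey(_ + _, n).
    println(s"using $n partitions; next candidate ${grow(n)}")
  }
}
```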
But for now?
• Easy answer:
• Keep multiplying by 1.5 and see what works
• Harder answer:
Shuffle less!
Shuffles
[Figure: Wide Dependency vs. Narrow Dependencies]
ReduceByKey when Possible
• reduceByKey allows a map-side combine
parsed
  .map { line => (line.level, 1) }
  .reduceByKey { (a, b) => a + b }
  .collect()
• groupByKey transfers all the data
parsed
  .map { line => (line.level, 1) }
  .groupByKey
  .map { case (level, counts) => (level, counts.sum) }
  .collect()
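The difference in shuffled data can be illustrated without a cluster. The sketch below is a simplified model in plain Scala collections, not Spark's implementation: each inner `Seq` stands in for one partition, and the function names are hypothetical. It counts how many records would cross the shuffle boundary with and without a map-side combine.

```scala
object MapSideCombine {
  type Record = (String, Int)

  // groupByKey-style: every input record crosses the shuffle boundary unchanged.
  def shuffledWithoutCombine(partitions: Seq[Seq[Record]]): Seq[Record] =
    partitions.flatten

  // reduceByKey-style: each partition merges its own records per key first,
  // so at most one record per key per partition is shuffled.
  def shuffledWithCombine(partitions: Seq[Seq[Record]])(f: (Int, Int) => Int): Seq[Record] =
    partitions.flatMap { part =>
      part.groupBy(_._1).map { case (k, recs) => k -> recs.map(_._2).reduce(f) }.toSeq
    }

  // Reduce side: merge whatever arrived into the final per-key result.
  def reduceSide(shuffled: Seq[Record])(f: (Int, Int) => Int): Map[String, Int] =
    shuffled.groupBy(_._1).map { case (k, recs) => k -> recs.map(_._2).reduce(f) }

  // Two illustrative partitions of (level, 1) pairs.
  val sample: Seq[Seq[Record]] = Seq(
    Seq("INFO" -> 1, "INFO" -> 1, "WARN" -> 1),
    Seq("INFO" -> 1, "ERROR" -> 1, "ERROR" -> 1)
  )

  def main(args: Array[String]): Unit = {
    val plain = shuffledWithoutCombine(sample)
    val combined = shuffledWithCombine(sample)(_ + _)
    println(s"records shuffled without combine: ${plain.size}")
    println(s"records shuffled with combine: ${combined.size}")
    println(s"same result: ${reduceSide(plain)(_ + _) == reduceSide(combined)(_ + _)}")
  }
}
```

Both paths produce the same counts per level. The combined path simply ships fewer records, and the gap widens as keys repeat more often within a partition.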
ReduceByKey when Possible
• reduceByKey
• groupByKey
Security, now it’s getting serious.
Courtesy of: https://www.iti.illinois.edu/sites/default/files/Cybersecurity_image.jpg
Authentication
• Kerberos – the necessary evil
• Ubiquitous amongst other services
• YARN, HDFS, Hive, HBase, etc.
• Spark utilizes delegation tokens
Encryption
• Control plane: SPARK-6028 (replace with Netty)
• File distribution: replace with Netty
• Block Manager: Spark 1.4
• User UI / REST API: SPARK-2750 (SSL)
• Data-at-rest (shuffle files): SPARK-5682
Authorization
• Enterprises have sensitive data
• Beyond HDFS file permissions
• Partial access to data
• Column level granularity
• Apache Sentry
• HDFS-Sentry synchronization plugin
Customers often have shared infrastructure
Courtesy of: https://radioglobalistic.files.wordpress.com/2011/02/lagos-traffic.jpg
Multi-tenancy
• Cluster utilization is top metric
• Target: 70-80% utilization
• Mixed workloads from mixed customers
• We recommend YARN
• Built-in resource manager
Underutilized Clusters
Courtesy of: http://media.nbclosangeles.com/images/1200*675/60-freeway-repair-dec16-2-empty.JPG
Dynamic Allocation
• Allows jobs to scale to size according to load
• Knobs to control min, max and initial size
• Future Work:
• Target: Dynamic allocation enabled by default
• Data locality & Caching
• Open question with Streaming
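The min, max and initial knobs correspond to `spark.dynamicAllocation.minExecutors`, `spark.dynamicAllocation.maxExecutors` and `spark.dynamicAllocation.initialExecutors` in the Spark configuration documentation. How they bound scaling can be sketched with a deliberately simplified model: the `Bounds` type and `targetExecutors` function are hypothetical, and Spark's real allocation policy also involves timeouts and backlog tracking that are not modeled here.

```scala
object AllocationBounds {
  // Hypothetical holder for the three knobs; validates their ordering.
  final case class Bounds(min: Int, initial: Int, max: Int) {
    require(min >= 0 && min <= initial && initial <= max, "expected 0 <= min <= initial <= max")
  }

  // Simplified model: enough executors to give every pending or running task
  // a core, clamped to the configured bounds.
  def targetExecutors(tasks: Int, coresPerExecutor: Int, bounds: Bounds): Int = {
    val needed = (tasks + coresPerExecutor - 1) / coresPerExecutor // ceiling division
    math.min(bounds.max, math.max(bounds.min, needed))
  }

  def main(args: Array[String]): Unit = {
    // Illustrative bounds and load levels.
    val bounds = Bounds(min = 2, initial = 4, max = 20)
    Seq(0, 15, 1000).foreach { tasks =>
      println(s"tasks=$tasks -> executors=${targetExecutors(tasks, coresPerExecutor = 2, bounds)}")
    }
  }
}
```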
Thank you
We’re Hiring!


Editor's Notes

  • #2: Let's talk about the issues we have seen from our customers as they try to get Spark into production.
  • #3: In scope: focus on operational issues, not on building the code itself. Experience from our customer support tickets.
  • #4: In scope: focus on operational issues, not on building the code itself. Experience from our customer support tickets.
  • #6: Spark makes building a proof of concept with a subset of data relatively easy. But then things go wrong. Plug for my talk at Hadoop Summit.
  • #10: num-executors vs. executor-cores? 10 executors with 1 core, or 5 executors with 2 cores? Memory: this is the aggregate across all cores.
  • #14: This shows up in the YARN NodeManager logs
  • #17: Spark makes building a proof of concept with a subset of data relatively easy.
  • #21: Max partition size is 2 GB. Small partitions help deal with stragglers. Small partitions avoid overhead.
  • #23: Fastest way to shuffle a lot of data: don't shuffle. Second fastest way to shuffle a lot of data: shuffle a small amount of data.
  • #24: Data is merged together before it's serialized and sent over the network, vs. higher serialization and network transfer costs.
  • #25: Data is merged together before it's serialized and sent over the network, vs. higher serialization and network transfer costs.
  • #26: Data is merged together before it's serialized and sent over the network, vs. higher serialization and network transfer costs.
  • #27: Spark makes building a proof of concept with a subset of data relatively easy.
  • #29: Control plane, file distribution, Block Manager, user UI / REST API, data-at-rest (shuffle files).
  • #31: Spark makes building a proof of concept with a subset of data relatively easy.
  • #34: Dynamic allocation: streaming, locality (worked on), making it even better.