1© Cloudera, Inc. All rights reserved.
Apache Spark: Usage and Roadmap in Hadoop
Jai Ranganathan
Spark will replace MapReduce
To become the standard execution engine for Hadoop
The Future of Data Processing on Hadoop
Spark complemented by specialized fit-for-purpose engines
• General Data Processing w/Spark: Fast batch processing, machine learning, and stream processing
• Analytic Database w/Impala: Low-latency, massively concurrent queries
• Full-Text Search w/Solr: Querying textual data
• On-Disk Processing w/MapReduce: Jobs at extreme scale and extremely disk-I/O intensive
Shared:
• Data Storage
• Metadata
• Resource Management
• Administration
• Security
• Governance
Cloudera Leading the Spark Movement
Timeline, 2013 to 2016:
• Identified Spark’s early potential
• Ships and supports Spark with CDH 4.4
• Spark on YARN integration
• Announces initiative to make Spark the standard execution engine
• Launches first Spark training
• Added security integration
• Cloudera engineers publish O’Reilly Spark book
• Leading effort to further performance, usability, and enterprise-readiness
Community Initiative: Spark Supersedes MapReduce
Community development to port components to Spark:
Stage 1
• Crunch on Spark
• Search on Spark
Stage 2
• Hive on Spark (beta)
• Spark on HBase (beta)
Stage 3
• Pig on Spark (alpha)
• Sqoop on Spark
Cloudera Customer Use Cases
Core Spark
Financial Services
• Portfolio Risk Analysis
• ETL Pipeline Speed-Up
• 20+ years of stock data
Health
• Identify disease-causing genes in the full human genome
• Calculate Jaccard scores on health care data sets
ERP
• Optical Character Recognition and Bill Classification
Data Services
• Trend analysis
• Document classification (LDA)
• Fraud analytics
Spark Streaming
Financial Services
• Online Fraud Detection
Health
• Incident Prediction for Sepsis
Retail
• Online Recommendation Systems
• Real-Time Inventory Management
Ad Tech
• Real-Time Ad Performance Analysis
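The Jaccard score mentioned under the Health use cases measures the overlap between two sets as the size of their intersection divided by the size of their union. A minimal plain-Python sketch follows; the code sets are made-up illustration data, not from the deck, and the handling of two empty sets is an assumed convention. A real job over health care data would distribute this computation with Spark rather than run it in a single process.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention for this sketch: two empty sets score 0.0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical diagnosis-code sets for two patients (illustrative only).
patient_x = {"E11", "I10", "J45"}
patient_y = {"I10", "J45", "N18"}

# Two shared codes out of four distinct codes.
score = jaccard(patient_x, patient_y)
print(score)
```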
Apache Spark
Flexible, in-memory data processing for Hadoop
Easy Development
• Rich APIs for Scala, Java, and Python
• Interactive shell
Flexible Extensible API
• APIs for different types of workloads:
  • Batch
  • Streaming
  • Machine Learning
  • Graph
Fast Batch & Stream Processing
• In-Memory processing and caching
The Spark Ecosystem & Hadoop
Hadoop Integration
• Spark-on-YARN integration
• Shares data, metadata, administration, security, & governance
Spark components: Spark Streaming, MLlib, SparkSQL, GraphX, Dataframes, SparkR
Engines: Spark, Impala, MR, Others
RESOURCE MANAGEMENT: YARN
STORAGE: HDFS, HBase
Logistic Regression Performance
(Data Fits in Memory)
[Chart: running time (s) versus number of iterations (1, 5, 10, 20, 30) for MapReduce and Spark; y-axis from 0 to 4000 s]
• MapReduce: 110 s/iteration
• Spark: first iteration = 80 s; further iterations 1 s due to caching
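The per-iteration figures quoted on this slide imply the totals computed below. This is a back-of-the-envelope sketch that assumes a simple linear cost model (constant time per iteration), which the slide does not state explicitly; the constant and function names are illustrative.

```python
MAPREDUCE_SECONDS_PER_ITERATION = 110  # from the slide
SPARK_FIRST_ITERATION_SECONDS = 80     # from the slide (first pass, before caching helps)
SPARK_LATER_ITERATION_SECONDS = 1      # from the slide (data already cached)


def mapreduce_total(iterations):
    """Total seconds if every iteration costs the same (assumed linear model)."""
    return MAPREDUCE_SECONDS_PER_ITERATION * iterations


def spark_total(iterations):
    """Total seconds: one slow first pass, then cached passes (assumed linear model)."""
    if iterations <= 0:
        return 0
    return SPARK_FIRST_ITERATION_SECONDS + SPARK_LATER_ITERATION_SECONDS * (iterations - 1)


# Iteration counts shown on the chart's x-axis.
for n in (1, 5, 10, 20, 30):
    print(n, mapreduce_total(n), spark_total(n))
```

Under these assumptions, 30 iterations cost 3300 s on MapReduce against 109 s on Spark, a gap of roughly 30 times, which is consistent with a y-axis that tops out at 4000 s.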
Apache Spark Streaming
What is it?
• Run continuous processing of data using Spark’s core API
• Extends Spark concepts to fault-tolerant, transformable streams
• Adds “rolling window” operations
  • Example: Compute rolling averages or counts for data over the last five minutes
Benefits:
• Reuse knowledge and code in both contexts
  • Same programming paradigm for streaming and batch
• Simplicity of development
  • High-level API with automatic DAG generation
• Excellent throughput
  • Scale easily to support large volumes of data ingest
• Combine elements like MLlib and Oryx into streaming applications
Common Use Cases:
• “On-the-fly” ETL as data is ingested into Hadoop/HDFS
• Detect anomalous behavior and trigger alerts
• Continuous reporting of summary metrics for incoming data
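The “rolling window” idea from this slide can be illustrated without a cluster. The sketch below is plain Python, not the Spark Streaming API: it keeps only the events from the last five minutes and reports their count and average. The class name, timestamps, and values are hypothetical, and the code assumes events arrive in timestamp order.

```python
from collections import deque


class RollingWindow:
    """Keeps (timestamp, value) events that fall within the last `window_seconds`."""

    def __init__(self, window_seconds=300):  # 300 s = five minutes
        self.window_seconds = window_seconds
        self.events = deque()
        self.total = 0.0

    def add(self, timestamp, value):
        """Record an event, then evict everything that has aged out of the window.

        Assumes events arrive in non-decreasing timestamp order.
        """
        self.events.append((timestamp, value))
        self.total += value
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_value = self.events.popleft()
            self.total -= old_value

    def count(self):
        return len(self.events)

    def average(self):
        return self.total / len(self.events) if self.events else 0.0


# Hypothetical events as (seconds since start, measurement).
window = RollingWindow()
for ts, value in [(0, 10.0), (120, 20.0), (290, 30.0), (400, 40.0)]:
    window.add(ts, value)

print(window.count(), window.average())
```

After the event at 400 s, the event at 0 s has aged out, so the window holds three events. In Spark Streaming itself, windowed DStream operations such as `window` and `reduceByKeyAndWindow` play this role across a cluster.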
Spark Streaming Architectures
• Data Sources
• Ingest Integration Layer: Flume, Kafka
• Spark Stream Processing: Data Prep, Aggregation / Scoring
• HDFS: Spark Long-Term Analytics / Model Building
• HBase: Real-Time Result Serving
SparkSQL + Dataframes
Machine Learning Applications
• Goal:
  • Spark/Java Developers and Data Scientists can inline SQL into Spark apps
• Designed for:
  • Ease of development for Spark developers
  • Handful of concurrent Spark jobs
• Strengths:
  • Ease of embedding SQL into Java or Scala applications
  • SQL for common functionality in developer flow (e.g., aggregations, filters, samples)
Execution Pipeline
SQL → AST → Logical Plan → Optimized Logical Plan → Physical Plans → CBO → Selected Plan → RDDs
Dataframes → Logical Plan
Uniting Spark and Hadoop
The One Platform Initiative
• Management: Leverage Hadoop-native resource management.
• Security: Full support for Hadoop security and beyond.
• Scale: Enable 10k-node clusters.
• Streaming: Support for 80% of common stream processing workloads.
Detailed Roadmap: One Platform Initiative
[Legend: Completed Work / Planned Future Work]
Management
• Spark on YARN Integration
• HBase integration
• Improved metrics for monitoring/troubleshooting
• Dynamic Resource Allocation
• Spark on YARN:
• Container resizing
• Dynamic Resource Allocation for Streaming
• Simplified resource configuration
• Improved WebUI for debugging
• Improved metrics for visibility into resource utilization
• Smart auto-tuning of job parameters
Security
• Kerberos Integration
• HDFS Sync (Sentry)
• Secure data at rest
• Secure data over the wire
• Audit/Lineage (Navigator)
• Spark PCI compliance
• Integration with Intel’s advanced encryption libraries
• Enable column and view level security
Scale
• Revamp Scheduler handling of node failure
• Sort based shuffle improvements
• Task Scheduling based on HDFS data locality and caching
• Scheduler improvements for performance at scale
• Stress test at scale with mixed multi-tenant workloads
• HDFS DDM Integration
• Dynamic resource utilization & prioritization
• Scale Spark History Server for 1000s of jobs
Streaming
• Zero Data Loss with Spark Streaming Resilience
• Flume integration
• Kafka integration
• SQL semantics for expressing streaming jobs (Business Users)
• New streaming specific API extensions
• Streaming application management (pause, update, redeploy) via CM
• Optimized state updates: efficient point lookups and delta updates
Spark Resources
• Learn Spark
• O’Reilly Advanced Analytics with Spark eBook (written by Clouderans)
• Cloudera Developer Blog
• cloudera.com/spark
• Get Trained
• Cloudera Spark Training
• Try it Out
• Cloudera Live Spark Tutorial
Try It With Cloudera Live
cloudera.com/live
Featuring tutorials on:
CDH
Thank You
Jairam Ranganathan
jairam@cloudera.com
