1© Cloudera, Inc. All rights reserved.
Oryx 2 Overview
Sean Owen | Cloudera | @sean_r_owen
Consider the Music Recommender
• Collect Play Data & Do Data Science
• Build Taste Model Offline
• Learn Quickly from New Plays
• Recommend New Songs Now
From Exploratory to Operational?
← Exploratory Analytics | Operational Analytics →
• Explore Data
• Pick Model
• Build Model at Scale, Offline
• Continuously Update Model?
• Score Model in Real-Time?
Large Scale or Real-Time?
Large-Scale, Offline, Batch vs Real-Time, Online, Streaming
Why Don’t We Have Both?
λ!
The Lambda Architecture
• Batch Layer
  • High latency, high throughput
  • Compute official result
• Speed Layer
  • Low latency
  • Compute approximate update to last known result
• Serving Layer
  • Real-time
  • Merge batch/speed results
www.ymc.ch/en/lambda-architecture-part-1
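As a minimal sketch of how the three layers relate, assume a toy "model" that is just a play count per song. The function names and data are illustrative only and do not come from any real API. In this counting example the speed-layer delta happens to be exact; in general the speed layer trades accuracy for latency.

```python
from collections import Counter

def batch_layer(all_plays):
    """High latency, high throughput: recompute the official result from all data."""
    return Counter(all_plays)

def speed_layer(recent_plays):
    """Low latency: a delta relative to the last known batch result."""
    return Counter(recent_plays)

def serving_layer(batch_result, speed_delta, song):
    """Real-time: merge batch and speed results to answer one query."""
    return batch_result[song] + speed_delta[song]

historical = ["a", "b", "a", "c"]   # data the last batch run has seen
recent = ["a", "c", "c"]            # data that arrived since then

official = batch_layer(historical)
delta = speed_layer(recent)
merged_a = serving_layer(official, delta, "a")
merged_c = serving_layer(official, delta, "c")
```

When the next batch run covers the recent plays as well, the delta can be discarded and the new official result takes over.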
λ Architecture fits ML + Hadoop
• Batch Layer
  • Train, evaluate, tune model over all data in hours
• Speed Layer
  • Update model approximately in minutes or seconds
• Serving Layer
  • Make prediction, recommendation from model in milliseconds
Streaming MLlib
www.mwttl.com/wp-content/uploads/2013/11/IMG_5446_edited-2_mwttl.jpg
History (or: 5th time’s a charm)
Taste (2005 – 2009)
- Recommender toolkit in Java
- Local only
- Serves results
Apache Mahout (2009 – 2014)
- Adds Hadoop-based model building at scale
- But no serving
Myrrix (2011 – 2013)
- Mahout recs reimagined
- Adds serving to Hadoop-based model build
Oryx 1 (2013 –)
- Extends to classification, clustering
- PMML
- Merge with cloudera/ml
Oryx 2 (2014 –)
- Same APIs / goals
- Rewrite
- Full lambda architecture
- Kafka + Spark + YARN
Complementary, Not Competitive
Most ML-on-Hadoop tools are for building models only, and excel at this.
Oryx and similar projects do everything else around this: continuous update, serving.
Architecture
[Diagram: external input and the Serving Layers write to the Input Kafka topic. The Batch Layer (Spark Streaming) reads recent input, combines it with historical input on HDFS, and publishes models to the Models + Updates Kafka topic. The Speed Layer (Spark Streaming) reads recent input and models, and publishes model updates to the same topic. Multiple Serving Layer instances read models and model updates and answer queries.]
Data Transport
• Input Kafka topic
  • Any type; usually strings
  • From external or Serving Layer
• Update Kafka topic
  • Serialized models (PMML) produced by Batch Layer
  • Model updates / deltas produced by Speed Layer
[Diagram: the Input (Kafka topic) carrying recent input, and the Models + Updates (Kafka topic) carrying models and model updates]
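One way to picture the update topic is as an ordered log that consumers replay: a full model replaces whatever state they hold, and each later delta is applied on top. The sketch below uses a plain Python list in place of a Kafka topic and dictionaries in place of PMML. The `MODEL` and `UP` keys and the payload shapes are assumptions made for illustration, not a confirmed wire format.

```python
def replay(update_topic):
    """Rebuild model state by reading (key, message) pairs in order."""
    model = {}
    for key, message in update_topic:
        if key == "MODEL":
            # A new model from the Batch Layer replaces the current state.
            model = dict(message)
        elif key == "UP":
            # A delta from the Speed Layer overwrites or adds individual entries.
            for name, value in message.items():
                model[name] = value
    return model

# Hypothetical topic contents: one full model followed by two deltas.
update_topic = [
    ("MODEL", {"song-1": 0.2, "song-2": 0.7}),
    ("UP", {"song-1": 0.3}),
    ("UP", {"song-3": 0.1}),
]
state = replay(update_topic)
```

Because state is derived entirely from the topic, any consumer that starts late can catch up by reading from the most recent full model onward.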
Batch Layer
• Spark Streaming
• Persists input topic data to HDFS from Kafka
• Builds “model” occasionally from historical and new data
• Hours
• ML: can use MLlib
• ML: tunes hyperparameters
• Publishes models as PMML to update topic
[Diagram: Batch Layer (Spark Streaming) combining recent input with historical input from HDFS to produce a model]
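The build-and-tune step can be sketched as follows. This is a toy illustration, not Oryx code: the "model" is a single shrunken mean, the `shrinkage` hyperparameter and all helper names are hypothetical, and a real build would run something like MLlib on Spark.

```python
def train(data, shrinkage):
    """Fit a toy model: the mean of the data, shrunk toward zero."""
    return sum(data) / (len(data) + shrinkage)

def evaluate(model, test_data):
    """Mean squared error of predicting `model` for every test value (lower is better)."""
    return sum((x - model) ** 2 for x in test_data) / len(test_data)

def build_model(historical, new, candidates, test_fraction=0.25):
    """Combine historical and new data, hold out a test set, keep the best candidate."""
    data = historical + new
    split = int(len(data) * (1 - test_fraction))
    train_data, test_data = data[:split], data[split:]
    best = None
    for shrinkage in candidates:
        model = train(train_data, shrinkage)
        error = evaluate(model, test_data)
        if best is None or error < best[0]:
            best = (error, shrinkage, model)
    return best[2], best[1]

model, chosen = build_model(
    historical=[4.0, 5.0, 6.0, 5.0],
    new=[5.0, 4.0, 6.0, 5.0],
    candidates=[0.0, 1.0, 10.0],
)
```

The winning model is what would then be serialized and published to the update topic.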
Speed Layer
• Spark Streaming
• Listens for new PMML models
• Listens to input topic too
• Computes approximate updates to model implied by input and publishes to update topic
• Seconds
[Diagram: Speed Layer (Spark Streaming) taking recent input and the model, producing model updates]
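For a concrete feel of an "approximate update", consider a clustering model. The sketch below assumes the model is a list of centroids with point counts, and that each new input is folded into its nearest centroid as a running mean. The data structures and the shape of the emitted update are assumptions for illustration. The update is approximate because previously assigned points are never reassigned, which a full batch rebuild would do.

```python
import math

def nearest(centroids, point):
    """Index of the centroid closest to `point` by Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda i: math.dist(centroids[i]["center"], point),
    )

def speed_update(centroids, point):
    """Fold one new point into the nearest centroid and return the update to publish."""
    i = nearest(centroids, point)
    current = centroids[i]
    n = current["count"] + 1
    # Running mean: move the center 1/n of the way toward the new point.
    center = [old + (x - old) / n for old, x in zip(current["center"], point)]
    centroids[i] = {"center": center, "count": n}
    return {"cluster": i, "center": center, "count": n}

model = [
    {"center": [0.0, 0.0], "count": 3},
    {"center": [10.0, 10.0], "count": 1},
]
update = speed_update(model, [4.0, 0.0])
```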
Serving Layer
• Tomcat + JAX-RS
• (Can deploy on YARN)
• REST API
• Listens for new PMML models and updates from update topic
• Scores model / answers queries
• Writes to input topic too
• No shared state; scales horizontally
• Milliseconds
[Diagram: several Serving Layer instances receiving models and model updates, answering queries, and emitting input]
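Answering a recommendation query can be sketched as scoring every item against a user's feature vector and keeping the best few. A factored model (one feature vector per user and per item) is assumed here, and the vectors and names are made up for illustration; this is not the project's actual serving code.

```python
import heapq

def recommend(user_vector, item_vectors, how_many=2):
    """Return the top `how_many` (item, score) pairs by dot product with the user vector."""
    def score(item):
        return sum(u * v for u, v in zip(user_vector, item_vectors[item]))
    top = heapq.nlargest(how_many, item_vectors, key=score)
    return [(item, score(item)) for item in top]

# Hypothetical two-feature item vectors.
item_vectors = {
    "song-a": [1.0, 0.0],
    "song-b": [0.0, 1.0],
    "song-c": [0.5, 0.5],
}
recs = recommend([2.0, 1.0], item_vectors)
```

The work per query in this naive form grows with the number of items times the number of features, and each instance holds its own copy of the vectors, so more instances can be added without coordination.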
Logical Architecture
• App Tier: prebuilt recommender, clustering, classification implementations
  • Serving Layer: oryx-app-serving
  • Speed Layer: oryx-app-mllib, oryx-app
  • Batch Layer: oryx-app-mllib, oryx-app
• ML Tier: ML-specific specialization
  • Speed Layer: oryx-ml
  • Batch Layer: oryx-ml
• Lambda Tier: generic Lambda-Architecture support
  • Serving Layer: oryx-lambda-serving
  • Speed Layer: oryx-lambda
  • Batch Layer: oryx-lambda
Recommendation Benchmarks
• Scoring on the fly is not cheap
• 1M user/items ≈ 1GB heap at scale (≈ 200 features)
• Feature, item count determines latency, throughput
• Java 8 + 16-core 2.3GHz Xeon
• Smallish models ≈ 100s QPS, 10s ms latency
• Huge models ≈ Single digit QPS, 100s ms latency
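A rough back-of-the-envelope check of the heap figure, assuming each feature value is stored as a 4-byte float and ignoring object and index overhead (both are assumptions, not stated on the slide):

```python
def raw_vector_bytes(vectors, features, bytes_per_value=4):
    """Bytes needed for the raw feature values alone, with no runtime overhead."""
    return vectors * features * bytes_per_value

raw = raw_vector_bytes(vectors=1_000_000, features=200)
raw_gb = raw / 1e9  # decimal gigabytes
```

That comes to about 0.8 GB of raw values, which is in line with roughly 1 GB of heap once overhead is included.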
Key Technology Roster
• Spark 1.3.1
• MLlib
• Streaming
• Kafka 0.8.2.1
• Hadoop 2.6
• HDFS
• YARN
• JavaEE 7
• JAX-RS 2
• Jersey 2
• Servlet 3.1
• Tomcat 8
• JPMML + PMML 4.2.1
CDH 5.4+
Status
• Cloudera Labs project
• Partial collaboration with Intel
• Not shipped with CDH
• Not supported, no plans to yet
• 2.0.0 beta 3
• Suitable for POCs
• 2.0.0 by end of year
• Best For
• Recommender engines
• Real-time anomaly detection
• Real-time classification
• Problems where both scale and latency are important
• CDH users
Get Started in ~1 Hour
http://oryx.io
Thank you
@sean_r_owen
sowen@cloudera.com
The conference for and by Data Scientists, from startup to enterprise
wrangleconf.com
Public registration is now open!
Who: Featuring data scientists from Salesforce, Uber, Pinterest, and more
When: Thursday, October 22, 2015
Where: Broadway Studios, San Francisco

Lambda architecture on Spark, Kafka for real-time large scale ML
