Making Netflix Machine Learning Algorithms Reliable
Tony Jebara & Justin Basilico
ICML Reliable ML in the Wild Workshop
2017-08-11
2006: 6M members, US only
2017: > 100M members, > 190 countries
Goal: Help members find content to watch and enjoy, to maximize member satisfaction and retention
Algorithm Areas
▪ Personalized Ranking
▪ Top-N Ranking
▪ Trending Now
▪ Continue Watching
▪ Video-Video Similarity
▪ Personalized Page Generation
▪ Search
▪ Personalized Image Selection
▪ ...
Models & Algorithms
▪ Regression (linear, logistic, ...)
▪ Matrix Factorization
▪ Factorization Machines
▪ Clustering & Topic Models
▪ Bayesian Nonparametrics
▪ Tree Ensembles (RF, GBDTs, …)
▪ Neural Networks (Deep, RBMs, …)
▪ Gaussian Processes
▪ Bandits
▪ …
A/B tests validate that an overall approach works in expectation.
But they run in online production, so every A/B-tested model needs to be reliable.
Innovation Cycle
Idea → Offline Experiment → Online A/B Test → Full Online Deployment → (back to Idea)
Batch Learning
1) Collect massive data sets
2) Try billions of hypotheses to find* one(s) with support
*Find with computational and statistical efficiency
[Diagram: users × time; phases Collect Data → Learn Model → A/B Test (cells A vs. B) → Roll-out]
[Diagram: same users × time view; the region where users are still served the inferior option A while the batch cycle (Collect Data → Learn Model → A/B Test → Roll-out) completes is labeled REGRET]
Bandit Learning
Explore and exploit → less regret
Helps cold-start models in the wild
Maintain some exploration for nonstationarity
Adapt reliably to changing and new data
Epsilon-greedy, UCB, Thompson Sampling, etc.
[Diagram: users × time; exploration actions (x) scattered across users and time]
Bandit Learning (e.g. Thompson Sampling)
1) Start with a uniform population of hypotheses
2) Choose a random hypothesis h
3) Act according to h and observe the outcome
4) Re-weight the hypotheses
5) Go to 2
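This loop is Thompson Sampling when rewards are binary. A minimal, self-contained sketch with Beta posteriors follows; the arm count and simulated reward rates are purely illustrative.

```python
import numpy as np

# Minimal Thompson Sampling sketch for Bernoulli rewards (illustrative).
# Each arm keeps a Beta(successes+1, failures+1) posterior over its reward rate.
class ThompsonSampler:
    def __init__(self, n_arms):
        self.successes = np.zeros(n_arms)
        self.failures = np.zeros(n_arms)

    def choose_arm(self):
        # Steps 2-3: sample a hypothesis (a reward rate per arm), act greedily on it.
        sampled_rates = np.random.beta(self.successes + 1, self.failures + 1)
        return int(np.argmax(sampled_rates))

    def update(self, arm, reward):
        # Step 4: re-weight hypotheses by updating the chosen arm's posterior.
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1

# Toy usage: arm 2 has the highest true reward rate and should dominate over time.
true_rates = [0.05, 0.10, 0.20]
sampler = ThompsonSampler(n_arms=3)
for _ in range(1000):
    arm = sampler.choose_arm()
    reward = np.random.rand() < true_rates[arm]   # simulated outcome
    sampler.update(arm, reward)
```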
Bandits for selecting images
Contextual bandits to personalize images
- Different preferences for the genre/theme portrayed
- Different preferences for cast members
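As an illustration of how context can drive the choice, here is a small epsilon-greedy contextual policy with one linear scorer per candidate image. This is a generic sketch, not the production system; the feature vector, candidate count, and update rule are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sketch: epsilon-greedy contextual bandit choosing an image per member.
# Each candidate image has a linear model scoring expected play given member features.
class ContextualImagePolicy:
    def __init__(self, n_images, n_features, epsilon=0.05):
        self.weights = np.zeros((n_images, n_features))
        self.epsilon = epsilon

    def choose_image(self, member_features):
        if rng.random() < self.epsilon:                # explore
            return int(rng.integers(len(self.weights)))
        scores = self.weights @ member_features        # exploit
        return int(np.argmax(scores))

    def update(self, image, member_features, played, lr=0.01):
        # One SGD step on squared error between the score and the observed outcome.
        error = played - self.weights[image] @ member_features
        self.weights[image] += lr * error * member_features

# Toy usage: features might encode genre affinity, cast affinity, etc. (assumed).
policy = ContextualImagePolicy(n_images=4, n_features=8)
x = rng.normal(size=8)             # member context
img = policy.choose_image(x)
policy.update(img, x, played=1.0)
```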
Putting Machine Learning In Production
“Typical” ML Pipeline: A tale of two worlds
[Diagram, Offline (Training Pipeline): Historical Data → Generate Features → Train Models → Validate & Select Models → Publish Model.
Online (Application): Live Data → Application Logic; Load Model; Evaluate Model; Collect Labels feeding back into Historical Data.
Experimentation spans both worlds.]
What needs to be reliable?
Offline: Model trains reliably
Online: Model performs reliably
… plus all involved systems and data

Reliability Approach
Detection
Response
Prevention
Reliability in Training

Retraining
To be reliable, learning must first be repeatable
Automate retraining of models
Akin to continuous deployment in software engineering
How often depends on the application; typically daily
Detect problems and fail fast to avoid using a bad model
Example Training Pipeline
● Periodic retraining to refresh models
● Workflow system to manage pipeline
● Each step is a Spark or Docker job
● Each step has checks in place
○ Response: Stop workflow and send alerts
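A minimal sketch of that check-and-alert pattern; the check functions and the alert hook are hypothetical stand-ins for the real workflow system:

```python
# Fail-fast pipeline step (sketch): run the step, run its checks, and stop
# the workflow with an alert if any check fails.
class CheckFailed(Exception):
    pass

def send_alert(message):
    # Stand-in for a real paging/alerting system.
    print(f"ALERT: {message}")

def run_step(name, step_fn, checks):
    result = step_fn()
    for check in checks:
        ok, detail = check(result)
        if not ok:
            send_alert(f"Step '{name}' failed check: {detail}")
            raise CheckFailed(detail)  # stop the workflow; the model is not published
    return result

# Usage sketch: run_step("train", train_model, [check_sample_count, check_metrics])
```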
What to check?
● For each, check both absolute values and changes relative to previous runs:
● Offline metrics on hold-out set
● Data size for train and test
● Feature distributions
● Feature importance
● Large (unexpected) changes in output between models
● Model coverage (e.g. number of items it can predict)
● Model integrity
● Error counters
● ...
Example 1: number of samples
[Plot: number of training samples over time; when the count drops anomalously, an alarm fires and the model is not published]

Example 2: offline metrics
[Plot: lift w.r.t. baseline over time; when lift drops anomalously, an alarm fires and the model is not published. An absolute threshold also applies: if lift < X, the alarm fires as well]
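Both alarm styles reduce to a few lines of logic. Here is an illustrative sketch with a relative check against recent runs plus an absolute floor; all thresholds and numbers are made up:

```python
import numpy as np

# Illustrative checks: alarm when today's value deviates too far from recent
# runs (relative check) or falls below a hard floor (absolute check).
def relative_anomaly(history, current, max_relative_change=0.25):
    baseline = np.median(history)
    return abs(current - baseline) / baseline > max_relative_change

def absolute_anomaly(current, floor):
    return current < floor

history = [10.2e6, 10.4e6, 10.1e6, 10.3e6]   # sample counts from recent runs
current = 6.0e6                               # today's run: suspicious drop
if relative_anomaly(history, current) or absolute_anomaly(current, floor=8e6):
    print("Alarm: anomalous sample count; do not publish the model")
```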
Preventing offline training failures
Testing
- Unit testing and code coverage
- Integration testing
Improve quality and reliability of upstream data feeds
Reuse the same data and code offline and online to avoid training/serving skew
Run from multiple random seeds
Reliability Online
Basic sanity checks
Check for invariants in inputs and outputs
Can catch many unexpected changes, bad assumptions, and engineering issues
Examples:
- Feature values are out of range
- Output is NaN or Infinity
- Probabilities are < 0, > 1, or don’t sum to 1
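A sketch of such invariant checks; the feature bounds are assumptions an application would supply:

```python
import math

# Illustrative invariant checks on model inputs/outputs.
def check_feature_ranges(features, bounds):
    for name, value in features.items():
        lo, hi = bounds[name]
        assert lo <= value <= hi, f"feature {name}={value} out of range [{lo}, {hi}]"

def check_probabilities(probs, tol=1e-6):
    assert all(math.isfinite(p) for p in probs), "NaN/Inf in output"
    assert all(0.0 <= p <= 1.0 for p in probs), "probability outside [0, 1]"
    assert abs(sum(probs) - 1.0) < tol, "probabilities don't sum to 1"

# Toy usage with an assumed feature and bound.
check_feature_ranges({"days_since_signup": 42}, {"days_since_signup": (0, 10000)})
check_probabilities([0.7, 0.2, 0.1])
```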
Model and input staleness
Online staleness checks can cover a wide range of failures in training and publishing
Example checks:
- How long ago was the most recent model published?
- How old is the data it was trained on?
- How old are the inputs it is consuming?
[Plot: age of model over time; an abnormally old model indicates staleness]
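These staleness checks might look like the following sketch; the age thresholds are illustrative assumptions:

```python
import time

# Illustrative staleness checks; thresholds are assumptions.
MAX_MODEL_AGE_HOURS = 36    # e.g. alarm if no model published in 1.5 days
MAX_DATA_AGE_HOURS = 48

def hours_since(timestamp):
    return (time.time() - timestamp) / 3600.0

def check_staleness(model_published_at, data_snapshot_at):
    alerts = []
    if hours_since(model_published_at) > MAX_MODEL_AGE_HOURS:
        alerts.append("model is stale")
    if hours_since(data_snapshot_at) > MAX_DATA_AGE_HOURS:
        alerts.append("training data is stale")
    return alerts
```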
Online metrics
Track the quality of the model
- Compare predictions to actual behavior
- Online equivalents of offline metrics
For online learning or bandits, reserve a fraction of data for a simple policy (e.g. epsilon-greedy) as a sanity check
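One way to compare predictions to actual behavior is a running calibration gap, sketched below; the metric choice is an illustrative assumption, not the talk's specific implementation:

```python
# Illustrative online-metric tracker: compare predicted play probability to
# observed outcomes via running averages of each.
class CalibrationTracker:
    def __init__(self):
        self.n = 0
        self.sum_predicted = 0.0
        self.sum_actual = 0.0

    def record(self, predicted_prob, played):
        self.n += 1
        self.sum_predicted += predicted_prob
        self.sum_actual += float(played)

    def calibration_gap(self):
        # Near 0 when the model's average prediction matches observed behavior.
        if self.n == 0:
            return 0.0
        return self.sum_predicted / self.n - self.sum_actual / self.n

tracker = CalibrationTracker()
tracker.record(predicted_prob=0.3, played=True)
tracker.record(predicted_prob=0.2, played=False)
print(tracker.calibration_gap())
```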
Response: Graceful Degradation
Your model isn’t working right or your input data is bad: now what?
Common approaches in the personalization space:
- Use the previous model?
- Use the previous output?
- Use a simplified model or heuristic?
- If the calling system is resilient: turn off that subsystem?
Example: Image Precompute
Want to choose personalized images per profile
- Image lookup sees O(10M) requests per second (e.g. “House of Cards” -> Image URL)
Approach:
- Precompute the show-to-image mapping per user in a near-line system
- Store the mapping in a fast distributed cache
- At request time:
  - Look up the user-specific mapping in the cache
  - Fall back to unpersonalized results
    - Store that mapping for the request
  - Secondary fallback to a default image for a missing show
[Diagram: fallback chain Personalized Selection (Precompute) → Unpersonalized → Default Image]
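A sketch of that fallback chain; the cache and the unpersonalized selector are hypothetical stand-ins for the real distributed cache and ranking service:

```python
# Hypothetical sketch of the request-time fallback chain.
DEFAULT_IMAGE_URL = "default.jpg"  # assumed placeholder

def choose_image(cache, user_id, show_id, unpersonalized_image_for):
    # 1) Primary: precomputed user-specific mapping from the cache.
    image = cache.get((user_id, show_id))
    if image is not None:
        return image
    # 2) Fallback: unpersonalized image; store it so this request stays consistent.
    image = unpersonalized_image_for(show_id)
    if image is not None:
        cache[(user_id, show_id)] = image
        return image
    # 3) Secondary fallback: default image for a missing show.
    return DEFAULT_IMAGE_URL

# Toy usage with a plain dict standing in for the distributed cache:
cache = {}
print(choose_image(cache, user_id=1, show_id="house_of_cards",
                   unpersonalized_image_for=lambda s: None))  # -> default.jpg
```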
Prevention: Failure Injection
Netflix runs 100% in the AWS cloud
Needs to be reliable on unreliable infrastructure
Want a service to operate when an AWS instance fails?
→ Randomly terminate instances (Chaos Monkey)
Want Netflix to operate when an entire AWS region is having a problem?
→ Disable entire AWS regions (Chaos Kong)
Failure Injection for ML
What failure scenarios do you want your model to be robust to?
Models are very sensitive to their input data
- It can be noisy, corrupt, delayed, incomplete, missing, or unknown
Train the model to be resilient by injecting these conditions into the training and testing data
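A sketch of injecting such conditions into training data, here by randomly blanking and corrupting feature values; the rates and noise model are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data-corruption injector: randomly drop or corrupt feature
# values in a fraction of training rows so the model learns to tolerate it.
def inject_failures(features, missing_rate=0.01, noise_rate=0.01, noise_scale=3.0):
    corrupted = features.copy()
    mask_missing = rng.random(features.shape) < missing_rate
    mask_noise = rng.random(features.shape) < noise_rate
    corrupted[mask_missing] = np.nan                              # missing inputs
    corrupted[mask_noise] += rng.normal(0, noise_scale,           # corrupt values
                                        size=int(mask_noise.sum()))
    return corrupted

X = rng.normal(size=(1000, 10))
X_train = inject_failures(X)
```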
Example: Categorical features
Want to add a feature for the type of row on the homepage:
Genre, Because You Watched, Top Picks, ...
Problem: the model may see new row types online before they appear in the training data
Solution: add an “unknown” category and perturb a small fraction of the training data to that type
Rule of thumb: apply this to all categorical features unless a new category is impossible (days of the week -> impossible, so OK to skip; countries -> new ones can appear, so don’t skip)
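A sketch of that perturbation for one categorical column; the fraction and category names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
UNKNOWN = "UNKNOWN"

# Illustrative "unknown category" perturbation: relabel a small fraction of a
# categorical training column as UNKNOWN so the model learns to handle
# categories it has never seen before reaching production.
def perturb_to_unknown(categories, fraction=0.01):
    categories = np.asarray(categories, dtype=object)
    mask = rng.random(len(categories)) < fraction
    categories[mask] = UNKNOWN
    return categories

row_types = ["genre", "because_you_watched", "top_picks"] * 1000
row_types = perturb_to_unknown(row_types)
```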
Conclusions

Takeaways
● Consider online learning and bandits
● Build on best practices from software engineering
● Automate as much as possible
● Adjust your data to cover the conditions you want your model to be resilient to
● Detect problems and degrade gracefully
Thank you.
Tony Jebara & Justin Basilico
Yes, we are hiring!