The Forces That Disrupt Netflix
Nov. 7, 2016, Haley Tucker
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/netflix-resilience
Presented at QCon San Francisco
www.qconsf.com
ACROBAT
FLEA
our world
parallel world
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
--Leslie Lamport
ACROBAT
FLEA
our world
parallel world
computing
ENGINEER
Stranger Things: The Forces that Disrupt Netflix
PROLOGUE
DISTRIBUTED SYSTEMS
DECOMPOSING THE MONOLITH
[Diagram: device traffic enters through Proxy/Routing; the single Netflix Service is split into Edge Service and Playback Service instances]
"Distributed systems are different because they fail often."
--Jeff Hodges, Notes on Distributed Systems for Young Bloods
TABLE OF CONTENTS
CHAPTER 1: THE WEIRD DATA IN THE CATALOG
• Metadata impacts on availability
CHAPTER 2: THE VANISHING OF CRITICAL SERVICES
• Crashing services and cascading failures
CHAPTER 3: THE THROTTLE
• Latency spikes and the impact of fallbacks
FORCES AT WORK
Whoops, something went wrong…
Netflix Streaming Error
We’re having trouble playing this title right now. Please try again later or select a different title.
CHAPTER ONE
THE WEIRD DATA IN THE CATALOG
45 MINUTES!!
Clock, by heyyobecky4lyfe, Tumblr
VIDEO METADATA ARCHITECTURE
[Diagram: Source Systems, Video Metadata Service, Amazon S3, Netflix Services (including the Playback Service), Traffic]
{
    String msg = "This should never happen!";
    throw new IllegalStateException(msg);
}
MITIGATION
BLAST RADIUS
Explosion, CC BY 2.0, Andrew Kuznetsov 2008, Flickr
Amazon WS Global Infrastructure
STAGGERED ROLLOUT
Pager
Diagnosis?
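A staggered rollout limits the blast radius by applying a change to one region at a time and stopping at the first sign of trouble. The Java sketch below illustrates that idea only. The class `StaggeredRollout`, the `rollOut` method, the health-check predicate and the region names are hypothetical and are not taken from the talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch: apply a change region by region and halt at the first unhealthy region. */
final class StaggeredRollout {

    /**
     * Visits the regions in order. After each (simulated) deployment the health
     * check runs; if it fails, the rollout stops so later regions never receive
     * the change. Returns the regions that were deployed and stayed healthy.
     */
    static List<String> rollOut(List<String> regions, Predicate<String> healthyAfterDeploy) {
        List<String> completed = new ArrayList<>();
        for (String region : regions) {
            // A real system would push the new data or code to this region here.
            if (!healthyAfterDeploy.test(region)) {
                break; // the damage is limited to this one region
            }
            completed.add(region);
        }
        return completed;
    }

    public static void main(String[] args) {
        // Region names are illustrative only.
        List<String> regions = Arrays.asList("us-west-2", "us-east-1", "eu-west-1");
        // Simulate a problem that only shows up in the second region.
        List<String> healthy = rollOut(regions, region -> !region.equals("us-east-1"));
        System.out.println("Healthy regions: " + healthy);
    }
}
```

In this run the third region is never touched, which leaves a known-good region available while the pager is answered and the problem is diagnosed.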
Canary, CC BY 2.0, Steve P2008 2014, Flickr
PREVENTION
CANARIES
TRADITIONAL CANARY
[Diagram: traffic is sent to both a Canary (new code) and a Baseline (old code)]
[Diagram: Source Systems, Video Metadata Service, Amazon S3, Netflix Services, Traffic]
CONSISTENCY
VALID STATE TRANSITIONS
DATA CANARY
[Diagram: Source Systems, Video Metadata Service, Amazon S3, Data Canary Service, Data Tester, Netflix Services, Traffic]
Australia with AAT, CC BY-SA 2.0, Ssolbergj 2010, Wikimedia
SEEING RETURNS
Verify consistency prior to applying state changes.
…one tool is a data canary.
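As a rough illustration of verifying consistency before a state change is applied, the Java sketch below runs a list of checks against a candidate data set and keeps serving the current data when any check fails. It is a minimal in-process stand-in, not the data canary service from the talk. `DataCanary`, `passes`, `publishIfValid` and the example checks are hypothetical names chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch: gate a data update behind consistency checks. */
final class DataCanary {

    /** Returns true only if every check accepts the candidate data. */
    static <T> boolean passes(T candidate, List<Predicate<T>> checks) {
        for (Predicate<T> check : checks) {
            if (!check.test(candidate)) {
                return false;
            }
        }
        return true;
    }

    /** Returns the candidate when it passes all checks, otherwise the current data. */
    static <T> T publishIfValid(T current, T candidate, List<Predicate<T>> checks) {
        return passes(candidate, checks) ? candidate : current;
    }

    public static void main(String[] args) {
        // Illustrative metadata: title id -> runtime in minutes.
        Map<String, Integer> current = new HashMap<>();
        current.put("title-1", 50);
        Map<String, Integer> candidate = new HashMap<>();
        candidate.put("title-1", -1); // weird data that should never reach consumers

        List<Predicate<Map<String, Integer>>> checks = new ArrayList<>();
        checks.add(data -> !data.isEmpty());
        checks.add(data -> data.values().stream().allMatch(minutes -> minutes > 0));

        Map<String, Integer> serving = publishIfValid(current, candidate, checks);
        System.out.println("Serving: " + serving);
    }
}
```

The bad candidate is rejected before any consumer sees it, so the worst case is stale data and not an unexpected exception in a downstream service.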
CHAPTER TWO
THE VANISHING OF CRITICAL SERVICES
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
--Leslie Lamport
LOG DATA
[Diagram: Devices, Proxy/Routing, Edge Service, Playback Service, Log Data Service, Cassandra, Traffic]
[Diagram: the same components with a Proxy added: Devices, Proxy/Routing, Edge Service, Playback Service, Proxy, Log Data Service, Cassandra, Traffic]
CASCADING FAILURE
[Diagram: Devices, Proxy/Routing, Edge Service, Playback Service, Log Data Service, Cassandra, Traffic]
{
throw new OutOfMemoryError();
}
Log Data Service
Cassandra
Playback Service
PREVENTION
MANAGING RESOURCE CONSTRAINTS
Whatever you ask, CC BY-SA 2.0, Kreg Steppe 2008, Flickr
Astronomical Clock, CC BY 2.0, Andrew Fleming 2011, Flickr
REDUCE SURFACE AREA
1 Keep Only Dependencies which are Necessary
SO MANY JARS!!
2 LIMIT “MAGIC”
Magic, CC BY-ND 2.0, Daniel Lee 2013, Flickr
3 ADD KILL SWITCHES
Medusa Kill Switch, CC BY-NC-ND 2.0, Scott Hart 2013, Flickr
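A kill switch lets operators turn off a non-essential code path at runtime without a redeploy. The Java sketch below shows one minimal way to express that. The `KillSwitch` class and its methods are hypothetical, and a production switch would usually read its state from a dynamic configuration source, which is omitted here.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

/** Hypothetical sketch: a runtime flag that bypasses a non-essential feature. */
final class KillSwitch {
    private final AtomicBoolean enabled = new AtomicBoolean(true);

    void disable() {
        enabled.set(false);
    }

    void enable() {
        enabled.set(true);
    }

    /** Runs the feature while enabled; otherwise returns the safe default without invoking it. */
    <T> T call(Supplier<T> feature, T whenOff) {
        return enabled.get() ? feature.get() : whenOff;
    }

    public static void main(String[] args) {
        KillSwitch logging = new KillSwitch();
        System.out.println(logging.call(() -> "sent log data", "skipped"));
        logging.disable(); // an operator flips the switch during an incident
        System.out.println(logging.call(() -> "sent log data", "skipped"));
    }
}
```

Because the guarded supplier is never invoked while the switch is off, a misbehaving dependency behind it cannot consume threads or memory in the caller.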
4 FAVOR IMMUTABILITY
[Diagram: Playback Service in DEV, TEST and PROD]
try {
    remoteService.call();
} catch (Throwable t) {
    // Oops!
    System.exit(1);
}
[Diagram: Traffic, Proxy/Routing, Playback Service, Log Data Service, Cassandra]
It's Electric, CC BY-ND 2.0, Alan Hochberg 2008, Flickr
MITIGATION
CIRCUIT BREAKERS
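A circuit breaker stops calling a failing dependency for a while and answers with a fallback, so that one struggling service does not drag its callers down. The Java sketch below is a deliberately small, count-based version written for illustration. `SimpleCircuitBreaker`, its constructor parameters and the injected clock are assumptions of this sketch, not the API of any particular library and not the implementation used at Netflix.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical sketch: open after N consecutive failures, retry after a cool-down. */
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // injected so the behavior can be tested
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Simplification: the lock is held during the remote call, so calls are serialized. */
    synchronized <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        boolean open = openedAt >= 0 && clock.getAsLong() - openedAt < openMillis;
        if (open) {
            return fallback.get(); // fail fast without touching the dependency
        }
        try {
            T result = remoteCall.get(); // also serves as the trial call after the cool-down
            consecutiveFailures = 0;
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = clock.getAsLong(); // open, or re-open after a failed trial
            }
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker =
                new SimpleCircuitBreaker(2, 1_000, System::currentTimeMillis);
        Supplier<String> failing = () -> {
            throw new IllegalStateException("dependency down");
        };
        for (int i = 0; i < 3; i++) {
            System.out.println(breaker.call(failing, () -> "fallback"));
        }
    }
}
```

Unlike the `System.exit(1)` handler shown earlier, a failure here is contained: the caller keeps running and the dependency gets time to recover.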
Wrecking Ball in Building, CC BY 2.0, Jason Eppink 2008, Flickr
FAILURE TESTING
[Diagram: Devices, Proxy/Routing, Playback Service, Log Data Service, Cassandra, Traffic]
Automating Chaos Experiments in Production, by Ali Basiri
Applying Failure Testing Research @Netflix, by Kolton Andrus and Peter Alvaro
Manage resource constraints by reducing surface area.
Leverage circuit breakers and rigorously test failures.
CHAPTER THREE
THE THROTTLE
PLAYBACK ARCHITECTURE
[Diagram: Devices, Proxy/Routing, Edge Service instances, Playback Service, URL Service, Traffic]
NETFLIX CLIENT JARS
[Diagram: Playback Service, URL Client, URL Service; labeled parts: circuit breakers and fallbacks, metrics, retries and timeouts, RPC, service discovery]
THROTTLING
[Diagram: Playback Service, Traffic, Concurrent Requests, Throttled Requests (HTTP 503)]
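Throttling on concurrent requests can be sketched with a counting semaphore: a request that cannot get a permit is rejected immediately with HTTP 503 and never queues. The Java example below is a minimal illustration under that assumption. `ConcurrencyThrottle`, `handle` and the status constants are hypothetical names, and the handler is a plain `Runnable` standing in for real request processing.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical sketch: reject requests above a fixed concurrency limit with 503. */
final class ConcurrencyThrottle {
    static final int OK = 200;
    static final int SERVICE_UNAVAILABLE = 503;

    private final Semaphore permits;

    ConcurrencyThrottle(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    /** Runs the handler if a permit is free; otherwise returns 503 without waiting. */
    int handle(Runnable handler) {
        if (!permits.tryAcquire()) {
            return SERVICE_UNAVAILABLE; // shed load and keep the rejection cheap
        }
        try {
            handler.run();
            return OK;
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        ConcurrencyThrottle throttle = new ConcurrencyThrottle(1);
        // The nested call arrives while the only permit is held, so it is throttled.
        int outer = throttle.handle(
                () -> System.out.println("inner status: " + throttle.handle(() -> { })));
        System.out.println("outer status: " + outer);
    }
}
```

The rejection path does no work beyond the failed `tryAcquire`, which matters because a throttled server is already short on resources.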
}
System.gc();
}
[Diagram: Traffic, Proxy/Routing, Edge Service, Playback Service, URL Service]
NETFLIX CLIENT JARS
[Diagram: Playback Service, URL Client, URL Service; labeled parts: circuit breakers, metrics, retries and timeouts, RPC, service discovery, and a heavy fallback]
FALLBACK TESTING
With 100% fallback, CPU held at 90%: 15 RPS
No fallback, CPU held at 90%: 58 RPS
Siege: https://github.com/JoeDog/siege
SELECTING FALLBACKS
CACHE
STATIC
FALLBACK SERVICE
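To make the cache and static options concrete, the Java sketch below falls back first to the last known good value held in memory and then to a constant. Both paths avoid extra remote calls and heavy computation. `LightFallback`, `fetch` and the example keys are hypothetical, and the sketch assumes the primary call returns non-null values. The fallback service option is not shown.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch: cheap fallbacks, last known good value first, then a static default. */
final class LightFallback {
    private final Map<String, String> lastKnownGood = new ConcurrentHashMap<>();
    private final String staticDefault;

    LightFallback(String staticDefault) {
        this.staticDefault = staticDefault;
    }

    /** Calls the primary source; on failure answers from memory without further remote work. */
    String fetch(String key, Function<String, String> primary) {
        try {
            String value = primary.apply(key);
            lastKnownGood.put(key, value); // remember the value for later failures
            return value;
        } catch (RuntimeException e) {
            // One in-memory lookup, then a constant: both are light and fast.
            return lastKnownGood.getOrDefault(key, staticDefault);
        }
    }

    public static void main(String[] args) {
        LightFallback urls = new LightFallback("static-default");
        System.out.println(urls.fetch("movie-1", key -> "live-" + key));
        System.out.println(urls.fetch("movie-1", key -> {
            throw new IllegalStateException("URL service down");
        }));
        System.out.println(urls.fetch("movie-2", key -> {
            throw new IllegalStateException("URL service down");
        }));
    }
}
```

A cached value can be stale and a static value may be generic, so whether either is acceptable depends on what the caller does with the answer.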
[Diagram: Traffic, Proxy/Routing, Edge Service, Playback Service, URL Service]
}
return Response
.status(503)
.build();
}
REQUEST BUCKETING
NON-CRITICAL: Experience or Performance Impact
CRITICAL: Customer Streaming Impact
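One way to act on the two buckets is to give each its own admission limit, so that non-critical requests are shed well before critical ones. The Java sketch below shows that idea only. `RequestBucketing`, `classify`, `admit` and the numeric limits are hypothetical, and the single boolean used for classification is a placeholder for whatever rule a real service would apply.

```java
/** Hypothetical sketch: classify requests and shed the non-critical bucket first. */
final class RequestBucketing {

    enum Bucket { CRITICAL, NON_CRITICAL }

    /**
     * Placeholder rule: a request whose failure would stop the customer from
     * streaming is critical; one that only affects experience or performance is not.
     */
    static Bucket classify(boolean failureBlocksStreaming) {
        return failureBlocksStreaming ? Bucket.CRITICAL : Bucket.NON_CRITICAL;
    }

    /** Admits a request while the in-flight count is below the limit for its bucket. */
    static boolean admit(Bucket bucket, int inFlight, int nonCriticalLimit, int criticalLimit) {
        int limit = (bucket == Bucket.CRITICAL) ? criticalLimit : nonCriticalLimit;
        return inFlight < limit;
    }

    public static void main(String[] args) {
        int inFlight = 80;          // illustrative load
        int nonCriticalLimit = 50;  // illustrative limits
        int criticalLimit = 100;
        System.out.println("critical admitted: "
                + admit(classify(true), inFlight, nonCriticalLimit, criticalLimit));
        System.out.println("non-critical admitted: "
                + admit(classify(false), inFlight, nonCriticalLimit, criticalLimit));
    }
}
```

The same classification could also decide which deployment serves a request, which is where bucketing leads into sharding the application.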
Fire Buckets at Oakworth Station, CC BY 2.0, Tim Greene 2015, Flickr
APPLICATION SHARDING
[Diagram: Devices, Proxy/Routing, Edge Service instances, Critical Playback Service, Non-Critical Playback Service, URL Service, Non-Critical URL Service, Traffic]
CRITICAL
Country Road at Sunrise, CC BY-SA 2.0, Susanne Nilsson 2014, Flickr
NON-CRITICAL
Traffic, CC BY-NC 2.0, jonbgeme 2008, Flickr
APPLICATION SHARDING
[Diagram: Devices, Proxy/Routing, Edge Service instances, Critical Playback Service, Non-Critical Playback Service, URL Service, Non-Critical URL Service, Traffic]
No heavy fallbacks!! Fallbacks should be light and fast.
Shard your application based on operational characteristics.
EPILOGUE
KEY TAKEAWAYS
CHAPTER 1: THE WEIRD DATA IN THE CATALOG
• Verify consistency prior to applying state changes.
• One tool is a data canary.
CHAPTER 2: THE VANISHING OF CRITICAL SERVICES
• Manage resource constraints by reducing surface area.
• Leverage circuit breakers and rigorously test failures.
CHAPTER 3: THE THROTTLE
• No heavy fallbacks!! Fallbacks should be light and fast.
• Shard your application based on operational characteristics.
The unexpected will happen.
Plan to fail.
PARTING THOUGHT
DISTRIBUTED SYSTEMS
SOCIAL
Questions?
Haley Tucker