A NETFLIX ORIGINAL SERVICE
Netflix Keystone - Cloud scale event processing pipeline
[ O’Reilly - Turning big data into knowledge ]
Monal Daxini
● Season 1
○ Keystone pipeline - Why? How? What?
● Season 2
○ Trailer
What to expect?
Monal Daxini
I lead the Stream Processing effort in the
Real Time Data Infrastructure team @ Netflix
@monaldax #Netflix #Keystone
Who?
Netflix Is a Data Driven Company
Content
Product
Marketing
Finance
Business Development
Talent
Infrastructure
← Culture of Analytics →
Netflix Service
Truly Global Internet TV Network
Over 80 Million Members
190 Countries
1000+ devices
125,000,000 hours/day
14,269 years worth / day
37% of Internet traffic at peak
Over
1,000,000,000,000
That’s a huge number!
Keystone Events Processed Every Day
Events Trend
1/2014: 80B / day
1/2015: 300B / day
1/2016: 1T / day
Keystone Scale
Daily Averages
● 700B unique events ingested
● 1T events processed
● 1.5PB / day
● 4K / event
Peak
● 1T unique events ingested/day
● 12.5M / sec
● 35GB / sec
● 10MB / message
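As a rough cross-check of the daily averages against the peak rates, the figures above can be converted to per-second rates. This is a back-of-the-envelope sketch, assuming decimal units (1 PB = 10^15 bytes) and a uniform spread over the day, neither of which the slides state.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(per_day: float) -> float:
    """Convert a per-day total into an average per-second rate."""
    return per_day / SECONDS_PER_DAY


events_per_day = 1e12   # "1T events processed" (daily average)
bytes_per_day = 1.5e15  # "1.5PB / day", assuming decimal petabytes

avg_events_per_sec = per_second(events_per_day)
avg_bytes_per_sec = per_second(bytes_per_day)

print(f"{avg_events_per_sec / 1e6:.1f}M events/sec on average")
print(f"{avg_bytes_per_sec / 1e9:.1f} GB/sec on average")
```

Under these assumptions the averages come out near 11.6M events/sec and 17.4 GB/sec, which sits below the quoted peaks of 12.5M / sec and 35GB / sec.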
● 99.99% / Four 9s
● Initial numbers with VPC
○ 99.9999% / Six 9s
Keystone Availability
Keystone Pipeline Evolution
In the beginning… Chukwa
[Diagram: Event Producers, EMR]
Q4 2014 - Chukwa / Suro
[Diagram: Event Producer, Suro, Suro Proxy, Kafka, Suro Router, Consumer Kafka, EMR, Druid, Stream Consumers]
● Support at-least-once processing
● Scale, Multi-tenancy, Ease of Operations
● Enable future value adds - Stream Processing As a Service
● Replace dormant open source software - Chukwa
Why a new pipeline?
Goal - Migrate 1.3 PB of event data to a new pipeline
in flight, while not losing more than 0.1% of it
Q4 2015 - Keystone
[Diagram: Event Producer, KSProxy, Fronting Kafka, Samza Router, Consumer Kafka, EMR, Stream Consumers, Control Plane]
Want to know more...
Netflix Tech Blog - Pipeline Evolution
Event flow
Keystone Pipeline As a Service
[Diagram: Keystone - Event Producer, Fronting Kafka Clusters, Samza Routers, Consumer Kafka, EMR, Stream Consumers, Control Plane]
Self Service UI (Keystone Management)
Events & Producers
Event Payload is Immutable
At-least-once semantics*
* Once the event makes it to Kafka, there are disaster scenarios where this breaks.
Injected Event Metadata
● GUID
● Timestamp
● Host
● App
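A minimal sketch of how an immutable payload might be wrapped with the injected metadata listed above. The class name `EventEnvelope`, the function `wrap_event`, and the field names are hypothetical and chosen for illustration; the slides only name the four metadata items.

```python
import socket
import time
import uuid
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class EventEnvelope:
    """Hypothetical envelope: the payload plus the injected metadata."""
    payload: bytes      # immutable event payload, never rewritten in flight
    guid: str           # unique id, useful for de-duplication downstream
    timestamp_ms: int   # injection time in epoch milliseconds
    host: str           # host that produced the event
    app: str            # producing application


def wrap_event(payload: bytes, app: str) -> EventEnvelope:
    """Attach metadata to a payload without touching the payload itself."""
    return EventEnvelope(
        payload=bytes(payload),
        guid=str(uuid.uuid4()),
        timestamp_ms=int(time.time() * 1000),
        host=socket.gethostname(),
        app=app,
    )


event = wrap_event(b'{"type": "play"}', app="demo-app")
try:
    event.payload = b"tampered"  # frozen dataclass rejects mutation
except FrozenInstanceError:
    print("payload is immutable")
```

With at-least-once delivery, duplicates are possible, so a per-event GUID is what lets a consumer de-duplicate if it needs to.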
Custom Extensible Wire Protocol
● Backwards and forwards compatibility
● Support multiple serialization formats
○ JSON, AVRO, Protobuf in the works
● Additional metadata
● Efficient - 10 bytes overhead per message
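To make the idea of a fixed, small per-message overhead concrete, here is an illustrative framing sketch. The 10-byte layout below (magic, version, format code, metadata length, payload length) is entirely an assumption made for this example; the slides do not describe the actual Keystone header fields.

```python
import struct

# Hypothetical 10-byte header, big-endian, no padding:
#   2s magic | B version | B format | H metadata length | I payload length
HEADER = struct.Struct(">2sBBHI")
MAGIC = b"KS"
FORMAT_JSON, FORMAT_AVRO = 1, 2  # illustrative serialization format codes


def encode(fmt: int, metadata: bytes, payload: bytes, version: int = 1) -> bytes:
    """Prefix metadata and payload with the fixed-size header."""
    header = HEADER.pack(MAGIC, version, fmt, len(metadata), len(payload))
    return header + metadata + payload


def decode(frame: bytes):
    """Split a frame into (version, format, metadata, payload)."""
    magic, version, fmt, meta_len, payload_len = HEADER.unpack_from(frame)
    if magic != MAGIC:
        raise ValueError("not a recognised frame")
    start = HEADER.size
    metadata = frame[start:start + meta_len]
    payload = frame[start + meta_len:start + meta_len + payload_len]
    return version, fmt, metadata, payload


frame = encode(FORMAT_JSON, b'{"app":"demo"}', b'{"type":"play"}')
print(HEADER.size, decode(frame))
```

Carrying the version and format in the header is one common way to let old and new readers coexist: a reader can branch on the version and pick a deserializer from the format code without inspecting the payload.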
Netflix Kafka Producer
● Configurable - topic to Kafka clusters routing
● Sticky partitioner
● Prefer dropping events over disrupting the producer app
● Best effort delivery, ack = 1
● Buffer size tuning based on traffic
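Two of the producer behaviours above can be sketched without any Kafka client: a bounded buffer that drops instead of blocking the application, and a partitioner that sticks to one partition for a while. Both classes are hypothetical illustrations, not the Netflix producer; in particular, measuring stickiness in number of sends is an assumption.

```python
import queue
import random


class DroppingBuffer:
    """Bounded buffer that drops new events when full instead of blocking."""

    def __init__(self, max_events: int):
        self._queue = queue.Queue(maxsize=max_events)
        self.dropped = 0

    def offer(self, event) -> bool:
        """Enqueue without blocking; count and drop the event if full."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, max_batch: int) -> list:
        """Take up to max_batch events for the sender thread."""
        batch = []
        while len(batch) < max_batch:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch


class StickyPartitioner:
    """Stay on one partition for stick_for sends, then pick a new one."""

    def __init__(self, num_partitions: int, stick_for: int, rng=None):
        self._num_partitions = num_partitions
        self._stick_for = stick_for
        self._rng = rng or random.Random()
        self._current = 0
        self._remaining = 0

    def partition(self) -> int:
        if self._remaining == 0:
            self._current = self._rng.randrange(self._num_partitions)
            self._remaining = self._stick_for
        self._remaining -= 1
        return self._current


buf = DroppingBuffer(max_events=2)
print([buf.offer(i) for i in range(3)], "dropped:", buf.dropped)
```

Sticking to one partition tends to produce larger batches per broker request, and sizing the buffer from observed traffic controls how much burst can be absorbed before drops begin.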
Kafka Clusters
Kafka (prod) Footprint
                      Fronting Kafka   Front Standby Kafka   Consumer Kafka
Number of Clusters    24               24                    8
Number of Instances   3000+            72                    1000+
Retention Period      8 to 24 hrs      1 hr                  2 to 4 hrs
● Independent zookeeper cluster per Kafka cluster
● 5 nodes per ensemble - 160 zookeeper nodes
● 3 ASGs per cluster, 1 ASG per zone
Kafka (prod) Footprint
● Pioneer Tax
● Started with 0.7, went live with 0.8.2
● Done moving to 0.9 & VPC
● Work closely with Confluent to get patches through
○ OSS contributions
Kafka in the Cloud
● No dynamic topic creation
● Two copies
● Rack / Zone aware partition assignment
● Per cluster: stay under 10k partitions & 200 brokers
● Leave approx. 40% free disk space on each broker
Fronting Kafka Topics
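The "two copies" and "rack / zone aware" points can be illustrated with a small assignment routine that never places both replicas of a partition in the same zone. This is a simplified sketch with hypothetical broker ids and zone names, not Kafka's own assignment algorithm.

```python
from typing import Dict, List


def assign_replicas(brokers_by_zone: Dict[str, List[int]],
                    num_partitions: int,
                    copies: int = 2) -> Dict[int, List[int]]:
    """Place each partition's copies on brokers in distinct zones."""
    zones = sorted(brokers_by_zone)
    if copies > len(zones):
        raise ValueError("need at least as many zones as copies")
    assignment = {}
    for partition in range(num_partitions):
        replicas = []
        for copy in range(copies):
            # Rotate the starting zone per partition so load spreads evenly.
            zone = zones[(partition + copy) % len(zones)]
            brokers = brokers_by_zone[zone]
            replicas.append(brokers[(partition // len(zones)) % len(brokers)])
        assignment[partition] = replicas
    return assignment


# Hypothetical cluster: two brokers in each of three zones.
cluster = {"zone-a": [1, 2], "zone-b": [3, 4], "zone-c": [5, 6]}
for partition, replicas in assign_replicas(cluster, num_partitions=6).items():
    print(partition, replicas)
```

With this placement, losing one zone leaves every partition with a surviving copy.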
Want to know more...
Netflix Tech Blog - Kafka in Keystone Pipeline
Routing Service
Samza Job Deployment
● Multiple Samza jobs for one Kafka source topic
● Each job processes messages for one sink
● Each job processes partitions only from one topic
● One checkpoint topic per Kafka source topic and
multiple samza jobs
● Job starts with Immutable Config
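The deployment rules above can be sketched as a planning step that expands a routing table into one job per (source topic, sink) pair, each with a read-only config. The names `RouterJobSpec` and `plan_jobs`, the config keys, and the checkpoint topic naming are hypothetical; the sketch also assumes that jobs reading the same source topic share one checkpoint topic.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Dict, List, Mapping


@dataclass(frozen=True)
class RouterJobSpec:
    source_topic: str       # each job reads partitions of exactly one topic
    sink: str               # each job writes to exactly one sink
    checkpoint_topic: str   # assumed shared by all jobs of the source topic
    config: Mapping         # read-only view, fixed at job start


def plan_jobs(routes: Dict[str, List[str]], base_config: dict) -> List[RouterJobSpec]:
    """Expand {topic: [sinks]} into one immutable job spec per pair."""
    jobs = []
    for topic, sinks in sorted(routes.items()):
        for sink in sinks:
            config = MappingProxyType(
                {**base_config, "source.topic": topic, "sink": sink}
            )
            jobs.append(RouterJobSpec(topic, sink, f"checkpoint-{topic}", config))
    return jobs


routes = {"playback_events": ["s3", "elasticsearch", "consumer_kafka"]}
for job in plan_jobs(routes, {"container.count": 1}):
    print(job.source_topic, "->", job.sink, "via", job.checkpoint_topic)
```

Keeping one sink per job means a slow or failing sink only stalls its own job, and an immutable config means a running job always matches what was deployed.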
● 13,000+ Docker containers (Samza jobs)
● 1,300+ AWS C3-4XL instances
Routing Service Footprint
                       S3      ElasticSearch   Consumer Kafka
Number of containers   7000+   1500+           4500+
Routing Latency
Fronting Kafka to Sinks
          S3      ElasticSearch   Consumer Kafka
Latency   1 sec   13 sec          800 ms
Routing Infrastructure
+
Checkpointing
Cluster
+
0.9.1
Go
Router Job Manager
(Control Plane)
EC2 Instances
Zookeeper
(Instance Id assignment)
Job
Job
Job
ksnode
Checkpointing Cluster
ASG
Custom Go
Executor
./runJob
Logs
Snapshots
Attach Volumes
./runJob
./runJob
Reconcile Loop - 1 min
Health Check
What’s running in ksnode?
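A reconcile loop of the kind named above typically compares the desired set of jobs with what is actually running and acts on the difference. The following is a generic sketch of that pattern, not the Keystone executor; the function name, the job ids, and the health-check callback are hypothetical.

```python
from typing import Callable, List, Set, Tuple

RECONCILE_INTERVAL_SECONDS = 60  # "Reconcile Loop - 1 min"


def reconcile(desired: Set[str],
              running: Set[str],
              is_healthy: Callable[[str], bool]) -> Tuple[List[str], List[str], List[str]]:
    """Return (to_start, to_stop, to_restart) for one reconcile pass."""
    to_start = sorted(desired - running)    # assigned but not running
    to_stop = sorted(running - desired)     # running but no longer assigned
    to_restart = sorted(
        job for job in desired & running if not is_healthy(job)
    )                                       # running but failing health check
    return to_start, to_stop, to_restart


desired = {"job-a", "job-b", "job-c"}
running = {"job-b", "job-c", "job-d"}
unhealthy = {"job-c"}
print(reconcile(desired, running, lambda job: job not in unhealthy))
```

Running such a pass on a fixed interval makes the node converge on the assigned state after crashes or reassignments without a central scheduler pushing every change.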
Zookeeper
(Instance Id assignment)
Logs ZFS Volume
Snapshots
Custom Go
Executor
./runJob
./runJob
./runJob
Go Tools Server
Client
Tools
Stream Logs
Browse through
rotated logs by date
Ksnode Tooling
Yes! You inferred right!
No Mesos & No Yarn
Samza Tweaks to ver 0.9.1
● Using ThreadJobFactory - Simplifies deployment and reduces overhead
● SAMZA-41 - static range-based partition assignment
● SAMZA-775 - size-based prefetch buffer
○ Default was count-based rather than byte-based, which led to OOM
○ Set per topic per job to 60 * peak bytes / sec over the past week.
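Both tweaks reduce to simple arithmetic, sketched below. The helper names are hypothetical and the range split is one straightforward way to divide partitions into contiguous ranges; it is not taken from the SAMZA-41 patch.

```python
from typing import List


def partition_ranges(num_partitions: int, num_containers: int) -> List[range]:
    """Split partitions into contiguous, near-equal static ranges."""
    base, extra = divmod(num_partitions, num_containers)
    ranges = []
    start = 0
    for index in range(num_containers):
        size = base + (1 if index < extra else 0)  # first `extra` get one more
        ranges.append(range(start, start + size))
        start += size
    return ranges


def prefetch_buffer_bytes(peak_bytes_per_sec: int, seconds: int = 60) -> int:
    """Size the buffer in bytes: 60 * peak bytes/sec over the past week."""
    return seconds * peak_bytes_per_sec


print(partition_ranges(10, 3))
print(prefetch_buffer_bytes(2_000_000))  # 2 MB/sec peak -> 120 MB buffer
```

A byte-based limit bounds memory regardless of message size, whereas a count-based limit can exhaust the heap when messages are large.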
Samza Tweaks to ver 0.9.1
● Backported from 0.10
○ SAMZA-655 - environment variable configuration rewriter
■ Pass config from RDS to executor to Docker to Samza Job
○ SAMZA-540 - expose latency related metrics in OffsetManager
■ checkpointed offset gauge
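The idea behind an environment variable config rewriter is that values passed down the chain (RDS to executor to Docker to Samza job) can override the job's config at start-up. The sketch below illustrates the pattern only; the `SAMZA_` prefix and the rule of lower-casing the name and turning underscores into dots are assumptions of this example, so check the Samza documentation for the actual convention.

```python
from typing import Dict, Mapping


def rewrite_config(config: Mapping[str, str],
                   environ: Mapping[str, str],
                   prefix: str = "SAMZA_") -> Dict[str, str]:
    """Overlay prefixed environment variables onto a config map.

    Assumed convention: SAMZA_JOB_NAME -> job.name
    """
    rewritten = dict(config)
    for name, value in environ.items():
        if name.startswith(prefix):
            key = name[len(prefix):].lower().replace("_", ".")
            rewritten[key] = value
    return rewritten


base = {"job.name": "router", "task.inputs": "kafka.topic-a"}
env = {"SAMZA_JOB_NAME": "router-s3", "PATH": "/usr/bin"}
print(rewrite_config(base, env))
```

Passing config through the environment fits containers well, since the same image can run any job and only the environment differs.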
More Info - Monal’s Samza Meetup Slides (10/2015)
Netflix Samza ver 0.9.1 Contributions
Metrics & Monitoring
Customer Facing per topic end-to-end dashboard
Birth of Kafka Kong
A True Story
Keystone went live 10.27.2015
2 days later...
A True Story
● 80% Loss over 6 hour period
● Large Kafka clusters were impacted, smaller ones were fine
● At times things go wrong, and there’s no turning back
● Reduce complexity
● Minimize blast radius
● Find a way to start over fresh
Lessons Learned
Samza Router
Fronting
Kafka
Event
Producer
X
Failover
Stand-In
Kafka
● Cold standby Kafka cluster with 3 instances and different instance type
● Different ZooKeeper cluster with no state
● When failover occurs
○ Scaling up cluster
○ Creating topic
○ Creating new routing jobs for failover cluster
○ Switch producer traffic!
Failover
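The failover steps above form an ordered runbook in which producer traffic is switched last, once the standby is ready to receive it. A minimal sketch of such a sequence follows; the step names mirror the slide, while `run_failover` and the placeholder actions are hypothetical.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_failover(steps: List[Step], log: Callable[[str], None] = print) -> List[str]:
    """Run steps in order; an exception aborts all remaining steps."""
    completed = []
    for name, action in steps:
        log(f"failover step: {name}")
        action()  # raising here prevents later steps, e.g. the traffic switch
        completed.append(name)
    return completed


def placeholder() -> None:
    """Stand-in for a real operation against the standby cluster."""


steps = [
    ("scale up standby cluster", placeholder),
    ("create topics", placeholder),
    ("create routing jobs for failover cluster", placeholder),
    ("switch producer traffic", placeholder),
]
print(run_failover(steps))
```

Because each step is a plain callable, the same sequence can back both a self-service tool and a fully automated trigger.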
Kafka Kong
At least once
a week
Fronting Kafka Failover
Self Service Tool
Fronting Kafka Failover
Fully
Automated
● Time is of the essence - failover as fast as 5 minutes
Culture
What does it have to do with
building a pipeline?
"It may well be the most important document
ever to come out of the Valley." 1
Sheryl Sandberg
COO, Facebook
1 Business Insider, 2013
Netflix Culture Deck
Netflix Culture
Freedom & Responsibility
Open source and community participation is
an integral part of our strategy and culture
Not DevOps, but move towards NoOps
You build it! You run it!
My team has
● No dedicated product or project managers
● No separate devops or operations team
We build and run what you saw today!
● This does not mean we are constantly overworked
○ we make wise and simple choices and
○ lean towards automation & self-healing systems
We build and run what you saw today!
We built a pipeline in a year with
A very small team,
Relevant new technology,
Contributed back to OSS, and
Processed over 1 Trillion messages / day
Culture Impact
Season 2
Trailer
Create DuploⓇ
Blocks:
Let reusability drive new value
Our Philosophy
Evolution
Keystone
Stream Processing
(SPaaS)
Keystone
Management
Keystone
Messaging
Schema Support
Simple and intuitive interface to
manage all Keystone services
Keystone Management
Keystone Stream Processing
Stream Processing As a Service (SPaaS)
Multi-tenant polyglot support for stream processing engines
BETA
Big Data Systems - streaming
Data Pipeline & Stream processing - Keystone - Samza / Flink (poc)
Playback & edge Operations insight - Mantis
Stream Processing - Spark Streaming
* Metrics & monitoring - Atlas *
SPaaS Architecture
SPaaS
Manager
Container
Runtime
Beam API
or
Framework Specific API
[ Dockerized Job ]
1. Create 2. Submit 3. Launch
Runner
Flink / Spark /
Mantis
Running
Job
1. Submit Job
DSL (SQL)
Job Dashboard
BETA
● Starting out Proof Of Concept with Apache Flink
● Exploring Apache Beam
SPaaS - init( )
BETA
Why Apache Beam?
○ Portable API layer for building sophisticated data processing
applications
○ Unified API for processing bounded and unbounded data sources
○ Google lineage - Dataflow model, and reflects Google’s current work
SPaaS - “Beam Me Up, Scotty!”
Why Flink?
○ Flink implements the dataflow model
○ Correctness of results and powerful features for reasoning about time
○ Checkpoints, exactly-once processing
○ Event time, processing time, watermarks, triggers, aligned windows
(fixed, sliding), unaligned windows (dynamic or session windows)
○ Flink’s core is a streaming engine
SPaaS - Flink
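The window types named above differ in how an event timestamp maps to windows. The sketch below shows that mapping in plain Python rather than the Flink API, assuming non-negative integer event times; the function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end)


def fixed_window(ts: int, size: int) -> Window:
    """Aligned fixed (tumbling) window containing ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts: int, size: int, slide: int) -> List[Window]:
    """All aligned sliding windows of the given size that contain ts."""
    windows = []
    start = ts - ts % slide
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def session_windows(timestamps: Iterable[int], gap: int) -> List[Window]:
    """Unaligned session windows: events closer than gap share a session."""
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] < gap:
            sessions[-1][1] = ts       # extend the current session
        else:
            sessions.append([ts, ts])  # start a new session
    return [(first, last + gap) for first, last in sessions]


print(fixed_window(125, 60))
print(sliding_windows(125, 60, 30))
print(session_windows([1, 2, 10, 11, 30], gap=5))
```

Fixed and sliding windows are aligned because their boundaries are the same for every key, while session windows are unaligned because their boundaries depend on when each key's events arrive.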
More brain food...
Netflix OSS
Samza Meetup Presentation
Netflix Tech Blog
Spark Summit 2015 Talk
Thanks!
Q & A
You can find me at
@monaldax
mdaxini@netflix.com
● Special thanks to all the people who made and released these awesome
resources for free:
○ Photographs by Unsplash
Credits
