SlideShare a Scribd company logo
Running Containers at
Scale at Netflix
@aspyker @corindwyer
The Titus Team
● Develop
● Operate
● Support
Netflix’s Container Management Platform
Titus
Scheduling
● Service & batch jobs
● Resource management
Container Execution
● Docker/AWS Integration
● Netflix Infra Support
Service
Job and Fleet Management
Resource Management & Optimization
Container Execution
Integration
Batch
● 1000+ Applications
● Netflix API, NodeJS Backend UI Scripts
● Machine Learning (GPUs) for personalization
● Encoding and Content use cases
● Netflix Studio use cases
● CDN tracking and planning
● Massively parallel CI system
● Data Pipeline routing and SPaaS
● Big Data platform use cases
Growing set of container use cases
Batch
Q4 15
Basic
Service
1Q 16
Production
Service
4Q 16
Customer
Facing
Service
2Q 17
shadow
High Level Titus Architecture
Cassandra
Titus Control Plane
● API
● Scheduling
● Job Lifecycle Control
EC2 Autoscaling
Fenzo
container
container
container
docker
Titus Agents
Mesos agent
Docker
Docker Registry
containercontainerUser Containers
AWS Virtual Machines
Mesos
Titus System ServicesBatch/Workflow
Systems
Service
CI/CD
Q1 2018 Container Usage
Common
Jobs Launched 176K jobs / day
Different applications 1K+ different images
Regional isolated Titus stacks 7
Services
Single App Cluster Size 5K (real), 12K containers (benchmark)
Agents managed 16K VMs
Batch
Containers launched 430K / day
Agents autoscaled 350K VMs / month
Leveraging existing Netflix and AWS Infrastructure
Single consistent cloud environment between VMs and containers
VMVM
EC2
AWSAutoScaler
VMs
Service App
Cloud Platform
(metrics, IPC, health)
VPC
VMVM
Atlas
TitusJobControl
Containers
Service App
Cloud Platform
(metrics, IPC, health)
Eureka Edda
VMVMContainers
Batch App
Cloud Platform
(metrics, IPC, health)
Most Native AWS Container Platform
IP per container
● VPC IP, ENI and security group
● Optimized to share ENIs
● ENI pre-attaching, opportunistic batching of IPs (bursty deploys)
IAM Roles and Metadata Endpoint per container
● Container view of 169.254.169.254
Cryptographic identity per container
● Using Amazon instance identity document
Service job container autoscaling
● Using Native AWS Cloudwatch and Autoscaling policies and engine
Application Load Balancing (ALB)
Advanced Scheduling and
Control Plane Technologies
Scheduling / Placement
Considering the realities of …
● Docker, Linux, Image Pulling, etc.
● Complex resources (ENIs)
● Amazon rate limiting
● Filtering (constraints) and ranking (fitness)
● Different profiles for service | batch, critical | operational, etc.
Reliability
Provisioning
Time
Cost
Trade
offs
Capacity Management
User configures “capacity groups” based on workload type
Critical (RIs)
● Preallocated instances in order to achieve low provisioning time
● Buffer to support temporary extra capacity needs for deployments
Flex (On-Demand)
● Autoscaled instances based on demand
Opportunistic (Spot) - Coming
● Utilize extra instances with the ability to preempt or evict the workload
Centralized Agent Management
Agent Management
Other subsystems
Health checks
Cluster lifecycle
Other signals
Unified component for tracking agent
information, Powers other systems like task
migration, canaries, agent remediations
Cluster B
Agents states =
schedulable
For example: Task Migration
Cluster A
Agent state =
non-schedulable,
drain tasks
Agent Management
Task Migration
Cluster state
Integrations
Titus
External Resources
Operational VisibilityInterfaces
Load Balancers
Autoscaling
Spinnaker (Task Migration)
Telemetry - Atlas
Event Storage - Elasticsearch
REST / gRPC
Streaming updates
Advanced Container
Runtime Technologies
Multi-tenant networking is hard
Decided early on we wanted full IP stacks per container
But what about?
● Security group support
● IAM role support
● Network bandwidth isolation
● Leverage VPC
Virtual Machine Host
Titus Networking
sg=A,B
IP 2
sg=B,C
IP 3
Metadata
service
IPVlan, BPF, IFBs to route app traffic
Container 1 Container 2
sg=A,B
IP 4
Container 3
eth 0
sg=Titus
control plane
eth1
sg=A,B
eth2
sg=B,C
eth-mdeth-md
Titus executor
eth0eth0eth0
IP 2
IP 4
IP 3
IP 1
Metadata
service
eth-md
Metadata
service
169.254.169.254
Next challenge: Speed limits of EC2 Networking
Largest EC2 challenge: speed of networking reconfiguration
Changes in how we work with EC2 API’s
● Work with Amazon to redefine networking related API rate limits, buckets
● Pre-attach all networking interfaces
● IPs are asked for in bulk opportunistically
Also, coordination with scheduler
● Prefer instances with containers already in the same security group
For large scale failovers
● Before … hours, after ... minutes
● Detection - health checks
○ Linux subsystems (systemd, filesystems)
○ Docker aspects (runtime health, registry pulls)
○ Titus processes (networking, GPU, security drivers)
○ Mesos aspects (agent, executor)
● Remediation
○ Local reconciliation
○ Docker image cleanup
Overcoming failures on each agent
Process Model Evolution
Single process containers
● Worked for some time, until we needed system services
System services
● Telemetry, IAM support, log uploading
● Added as host installed daemons; isolation & multi-tenancy concerns
● Currently injecting system services into containers
Composing system services into containers
● Considered pods; lifecycle and usage complexities limited value
● Considering future of both systemd and docker image composability
Resource Isolation
● CPU
○ Started with bursting; was interfering with predictability
○ Resource tiers to avoid interference problems
● Memory
○ Hard limit, OOM kills entire container
● Network
○ Bandwidth throttling
● Disk space
● GPUs
Security Isolation
● Deployed user namespaces
○ Challenging due to shared systems without UID shifting
● Needed ad hoc debugging
○ Titus-ssh for user level access to their container
○ Still required power user access for kernel functions
○ Working to automate through tools like Vector (NetflixOSS)
● Seccomp overhead and complexity is prohibitive
○ Working towards automated policies and BPF driven implementations
Open Sourcing
Currently in private open source collaboration with those who want ...
● The NetflixOSS container solution (Spinnaker + Titus + Netflix RPC)
● A unified batch and service Mesos scheduler
● More robust & native AWS container platform
Hope to fully open source in early Q2
● If you want access now, let us know
● Looking for collaborators, feedback
Q&A
Ad

More Related Content

What's hot (20)

Kubernetes - A Comprehensive Overview
Kubernetes - A Comprehensive OverviewKubernetes - A Comprehensive Overview
Kubernetes - A Comprehensive Overview
Bob Killen
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with Prometheus
Grafana Labs
 
Kubernetes
KubernetesKubernetes
Kubernetes
erialc_w
 
Red Hat OpenStack - Open Cloud Infrastructure
Red Hat OpenStack - Open Cloud InfrastructureRed Hat OpenStack - Open Cloud Infrastructure
Red Hat OpenStack - Open Cloud Infrastructure
Alex Baretto
 
Meetup 23 - 03 - Application Delivery on K8S with GitOps
Meetup 23 - 03 - Application Delivery on K8S with GitOpsMeetup 23 - 03 - Application Delivery on K8S with GitOps
Meetup 23 - 03 - Application Delivery on K8S with GitOps
Vietnam Open Infrastructure User Group
 
Everything You Need To Know About Persistent Storage in Kubernetes
Everything You Need To Know About Persistent Storage in KubernetesEverything You Need To Know About Persistent Storage in Kubernetes
Everything You Need To Know About Persistent Storage in Kubernetes
The {code} Team
 
The Power of GitOps with Flux & GitOps Toolkit
The Power of GitOps with Flux & GitOps ToolkitThe Power of GitOps with Flux & GitOps Toolkit
The Power of GitOps with Flux & GitOps Toolkit
Weaveworks
 
How VXLAN works on Linux
How VXLAN works on LinuxHow VXLAN works on Linux
How VXLAN works on Linux
Etsuji Nakai
 
KFServing - Serverless Model Inferencing
KFServing - Serverless Model InferencingKFServing - Serverless Model Inferencing
KFServing - Serverless Model Inferencing
Animesh Singh
 
Gitops Hands On
Gitops Hands OnGitops Hands On
Gitops Hands On
Brice Fernandes
 
Kubernetes Networking 101
Kubernetes Networking 101Kubernetes Networking 101
Kubernetes Networking 101
Weaveworks
 
Speeding up your team with GitOps
Speeding up your team with GitOpsSpeeding up your team with GitOps
Speeding up your team with GitOps
Brice Fernandes
 
Docker, LinuX Container
Docker, LinuX ContainerDocker, LinuX Container
Docker, LinuX Container
Araf Karsh Hamid
 
Kubernetes 101 for Beginners
Kubernetes 101 for BeginnersKubernetes 101 for Beginners
Kubernetes 101 for Beginners
Oktay Esgul
 
Linux Container Technology 101
Linux Container Technology 101Linux Container Technology 101
Linux Container Technology 101
inside-BigData.com
 
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
Microservices Architectures: Become a Unicorn like Netflix, Twitter and HailoMicroservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
gjuljo
 
Kubernetes Architecture
 Kubernetes Architecture Kubernetes Architecture
Kubernetes Architecture
Knoldus Inc.
 
Exploring the power of OpenTelemetry on Kubernetes
Exploring the power of OpenTelemetry on KubernetesExploring the power of OpenTelemetry on Kubernetes
Exploring the power of OpenTelemetry on Kubernetes
Red Hat Developers
 
Integrating microservices with apache camel on kubernetes
Integrating microservices with apache camel on kubernetesIntegrating microservices with apache camel on kubernetes
Integrating microservices with apache camel on kubernetes
Claus Ibsen
 
Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021
InfluxData
 
Kubernetes - A Comprehensive Overview
Kubernetes - A Comprehensive OverviewKubernetes - A Comprehensive Overview
Kubernetes - A Comprehensive Overview
Bob Killen
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with Prometheus
Grafana Labs
 
Kubernetes
KubernetesKubernetes
Kubernetes
erialc_w
 
Red Hat OpenStack - Open Cloud Infrastructure
Red Hat OpenStack - Open Cloud InfrastructureRed Hat OpenStack - Open Cloud Infrastructure
Red Hat OpenStack - Open Cloud Infrastructure
Alex Baretto
 
Everything You Need To Know About Persistent Storage in Kubernetes
Everything You Need To Know About Persistent Storage in KubernetesEverything You Need To Know About Persistent Storage in Kubernetes
Everything You Need To Know About Persistent Storage in Kubernetes
The {code} Team
 
The Power of GitOps with Flux & GitOps Toolkit
The Power of GitOps with Flux & GitOps ToolkitThe Power of GitOps with Flux & GitOps Toolkit
The Power of GitOps with Flux & GitOps Toolkit
Weaveworks
 
How VXLAN works on Linux
How VXLAN works on LinuxHow VXLAN works on Linux
How VXLAN works on Linux
Etsuji Nakai
 
KFServing - Serverless Model Inferencing
KFServing - Serverless Model InferencingKFServing - Serverless Model Inferencing
KFServing - Serverless Model Inferencing
Animesh Singh
 
Kubernetes Networking 101
Kubernetes Networking 101Kubernetes Networking 101
Kubernetes Networking 101
Weaveworks
 
Speeding up your team with GitOps
Speeding up your team with GitOpsSpeeding up your team with GitOps
Speeding up your team with GitOps
Brice Fernandes
 
Kubernetes 101 for Beginners
Kubernetes 101 for BeginnersKubernetes 101 for Beginners
Kubernetes 101 for Beginners
Oktay Esgul
 
Linux Container Technology 101
Linux Container Technology 101Linux Container Technology 101
Linux Container Technology 101
inside-BigData.com
 
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
Microservices Architectures: Become a Unicorn like Netflix, Twitter and HailoMicroservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
gjuljo
 
Kubernetes Architecture
 Kubernetes Architecture Kubernetes Architecture
Kubernetes Architecture
Knoldus Inc.
 
Exploring the power of OpenTelemetry on Kubernetes
Exploring the power of OpenTelemetry on KubernetesExploring the power of OpenTelemetry on Kubernetes
Exploring the power of OpenTelemetry on Kubernetes
Red Hat Developers
 
Integrating microservices with apache camel on kubernetes
Integrating microservices with apache camel on kubernetesIntegrating microservices with apache camel on kubernetes
Integrating microservices with apache camel on kubernetes
Claus Ibsen
 
Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021
InfluxData
 

Similar to Container World 2018 (20)

NetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & ContainersNetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & Containers
aspyker
 
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
aspyker
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016
aspyker
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016
Sharma Podila
 
How Kubernetes helps Devops
How Kubernetes helps DevopsHow Kubernetes helps Devops
How Kubernetes helps Devops
Sreenivas Makam
 
Netflix Titus WASP October 2017
Netflix Titus WASP October 2017Netflix Titus WASP October 2017
Netflix Titus WASP October 2017
Andrew Leung
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
Ruslan Meshenberg
 
Introduction to rook
Introduction to rookIntroduction to rook
Introduction to rook
Rohan Gupta
 
Herding Kats - Netflix’s Journey to Kubernetes Public
Herding Kats - Netflix’s Journey to Kubernetes PublicHerding Kats - Netflix’s Journey to Kubernetes Public
Herding Kats - Netflix’s Journey to Kubernetes Public
aspyker
 
Open shift and docker - october,2014
Open shift and docker - october,2014Open shift and docker - october,2014
Open shift and docker - october,2014
Hojoong Kim
 
Welcome to icehouse
Welcome to icehouseWelcome to icehouse
Welcome to icehouse
Marcos García
 
Netflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger ThingsNetflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger Things
All Things Open
 
Netflix and Containers: Not A Stranger Thing
Netflix and Containers:  Not A Stranger ThingNetflix and Containers:  Not A Stranger Thing
Netflix and Containers: Not A Stranger Thing
aspyker
 
[KubeCon EU 2021] Introduction and Deep Dive Into Containerd
[KubeCon EU 2021] Introduction and Deep Dive Into Containerd[KubeCon EU 2021] Introduction and Deep Dive Into Containerd
[KubeCon EU 2021] Introduction and Deep Dive Into Containerd
Akihiro Suda
 
Introduction and Deep Dive Into Containerd
Introduction and Deep Dive Into ContainerdIntroduction and Deep Dive Into Containerd
Introduction and Deep Dive Into Containerd
Kohei Tokunaga
 
Introduction to containers, k8s, Microservices & Cloud Native
Introduction to containers, k8s, Microservices & Cloud NativeIntroduction to containers, k8s, Microservices & Cloud Native
Introduction to containers, k8s, Microservices & Cloud Native
Terry Wang
 
GoDocker presentation
GoDocker presentationGoDocker presentation
GoDocker presentation
Olivier Sallou
 
Monitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudMonitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloud
Datadog
 
Disenchantment: Netflix Titus, Its Feisty Team, and Daemons
Disenchantment: Netflix Titus, Its Feisty Team, and DaemonsDisenchantment: Netflix Titus, Its Feisty Team, and Daemons
Disenchantment: Netflix Titus, Its Feisty Team, and Daemons
C4Media
 
The journey to container adoption in enterprise
The journey to container adoption in enterpriseThe journey to container adoption in enterprise
The journey to container adoption in enterprise
Igor Moochnick
 
NetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & ContainersNetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & Containers
aspyker
 
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
aspyker
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016
aspyker
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016
Sharma Podila
 
How Kubernetes helps Devops
How Kubernetes helps DevopsHow Kubernetes helps Devops
How Kubernetes helps Devops
Sreenivas Makam
 
Netflix Titus WASP October 2017
Netflix Titus WASP October 2017Netflix Titus WASP October 2017
Netflix Titus WASP October 2017
Andrew Leung
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
Ruslan Meshenberg
 
Introduction to rook
Introduction to rookIntroduction to rook
Introduction to rook
Rohan Gupta
 
Herding Kats - Netflix’s Journey to Kubernetes Public
Herding Kats - Netflix’s Journey to Kubernetes PublicHerding Kats - Netflix’s Journey to Kubernetes Public
Herding Kats - Netflix’s Journey to Kubernetes Public
aspyker
 
Open shift and docker - october,2014
Open shift and docker - october,2014Open shift and docker - october,2014
Open shift and docker - october,2014
Hojoong Kim
 
Netflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger ThingsNetflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger Things
All Things Open
 
Netflix and Containers: Not A Stranger Thing
Netflix and Containers:  Not A Stranger ThingNetflix and Containers:  Not A Stranger Thing
Netflix and Containers: Not A Stranger Thing
aspyker
 
[KubeCon EU 2021] Introduction and Deep Dive Into Containerd
[KubeCon EU 2021] Introduction and Deep Dive Into Containerd[KubeCon EU 2021] Introduction and Deep Dive Into Containerd
[KubeCon EU 2021] Introduction and Deep Dive Into Containerd
Akihiro Suda
 
Introduction and Deep Dive Into Containerd
Introduction and Deep Dive Into ContainerdIntroduction and Deep Dive Into Containerd
Introduction and Deep Dive Into Containerd
Kohei Tokunaga
 
Introduction to containers, k8s, Microservices & Cloud Native
Introduction to containers, k8s, Microservices & Cloud NativeIntroduction to containers, k8s, Microservices & Cloud Native
Introduction to containers, k8s, Microservices & Cloud Native
Terry Wang
 
Monitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudMonitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloud
Datadog
 
Disenchantment: Netflix Titus, Its Feisty Team, and Daemons
Disenchantment: Netflix Titus, Its Feisty Team, and DaemonsDisenchantment: Netflix Titus, Its Feisty Team, and Daemons
Disenchantment: Netflix Titus, Its Feisty Team, and Daemons
C4Media
 
The journey to container adoption in enterprise
The journey to container adoption in enterpriseThe journey to container adoption in enterprise
The journey to container adoption in enterprise
Igor Moochnick
 
Ad

More from aspyker (20)

Season 7 Episode 1 - Tools for Data Scientists
Season 7 Episode 1 - Tools for Data ScientistsSeason 7 Episode 1 - Tools for Data Scientists
Season 7 Episode 1 - Tools for Data Scientists
aspyker
 
CMP376 - Another Week, Another Million Containers on Amazon EC2
CMP376 - Another Week, Another Million Containers on Amazon EC2CMP376 - Another Week, Another Million Containers on Amazon EC2
CMP376 - Another Week, Another Million Containers on Amazon EC2
aspyker
 
NetflixOSS Meetup S6E2 - Spinnaker, Kayenta
NetflixOSS Meetup S6E2 - Spinnaker, KayentaNetflixOSS Meetup S6E2 - Spinnaker, Kayenta
NetflixOSS Meetup S6E2 - Spinnaker, Kayenta
aspyker
 
SRECon Lightning Talk
SRECon Lightning TalkSRECon Lightning Talk
SRECon Lightning Talk
aspyker
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Source
aspyker
 
Netflix OSS Meetup Season 5 Episode 1
Netflix OSS Meetup Season 5 Episode 1Netflix OSS Meetup Season 5 Episode 1
Netflix OSS Meetup Season 5 Episode 1
aspyker
 
Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17
aspyker
 
Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4
aspyker
 
Netflix Open Source: Building a Distributed and Automated Open Source Program
Netflix Open Source:  Building a Distributed and Automated Open Source ProgramNetflix Open Source:  Building a Distributed and Automated Open Source Program
Netflix Open Source: Building a Distributed and Automated Open Source Program
aspyker
 
Velocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ NetflixVelocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ Netflix
aspyker
 
Netflix Open Source Meetup Season 4 Episode 3
Netflix Open Source Meetup Season 4 Episode 3Netflix Open Source Meetup Season 4 Episode 3
Netflix Open Source Meetup Season 4 Episode 3
aspyker
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
aspyker
 
Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016
aspyker
 
Netflix Open Source Meetup Season 4 Episode 1
Netflix Open Source Meetup Season 4 Episode 1Netflix Open Source Meetup Season 4 Episode 1
Netflix Open Source Meetup Season 4 Episode 1
aspyker
 
CS80A Foothill College Open Source Talk
CS80A Foothill College Open Source TalkCS80A Foothill College Open Source Talk
CS80A Foothill College Open Source Talk
aspyker
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015
aspyker
 
Netflix Open Source Meetup Season 3 Episode 2
Netflix Open Source Meetup Season 3 Episode 2Netflix Open Source Meetup Season 3 Episode 2
Netflix Open Source Meetup Season 3 Episode 2
aspyker
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Source
aspyker
 
Netflix Cloud Platform and Open Source
Netflix Cloud Platform and Open SourceNetflix Cloud Platform and Open Source
Netflix Cloud Platform and Open Source
aspyker
 
NetflixOSS and ZeroToDocker Talk
NetflixOSS and ZeroToDocker TalkNetflixOSS and ZeroToDocker Talk
NetflixOSS and ZeroToDocker Talk
aspyker
 
Season 7 Episode 1 - Tools for Data Scientists
Season 7 Episode 1 - Tools for Data ScientistsSeason 7 Episode 1 - Tools for Data Scientists
Season 7 Episode 1 - Tools for Data Scientists
aspyker
 
CMP376 - Another Week, Another Million Containers on Amazon EC2
CMP376 - Another Week, Another Million Containers on Amazon EC2CMP376 - Another Week, Another Million Containers on Amazon EC2
CMP376 - Another Week, Another Million Containers on Amazon EC2
aspyker
 
NetflixOSS Meetup S6E2 - Spinnaker, Kayenta
NetflixOSS Meetup S6E2 - Spinnaker, KayentaNetflixOSS Meetup S6E2 - Spinnaker, Kayenta
NetflixOSS Meetup S6E2 - Spinnaker, Kayenta
aspyker
 
SRECon Lightning Talk
SRECon Lightning TalkSRECon Lightning Talk
SRECon Lightning Talk
aspyker
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Source
aspyker
 
Netflix OSS Meetup Season 5 Episode 1
Netflix OSS Meetup Season 5 Episode 1Netflix OSS Meetup Season 5 Episode 1
Netflix OSS Meetup Season 5 Episode 1
aspyker
 
Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17
aspyker
 
Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4
aspyker
 
Netflix Open Source: Building a Distributed and Automated Open Source Program
Netflix Open Source:  Building a Distributed and Automated Open Source ProgramNetflix Open Source:  Building a Distributed and Automated Open Source Program
Netflix Open Source: Building a Distributed and Automated Open Source Program
aspyker
 
Velocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ NetflixVelocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ Netflix
aspyker
 
Netflix Open Source Meetup Season 4 Episode 3
Netflix Open Source Meetup Season 4 Episode 3Netflix Open Source Meetup Season 4 Episode 3
Netflix Open Source Meetup Season 4 Episode 3
aspyker
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
aspyker
 
Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016
aspyker
 
Netflix Open Source Meetup Season 4 Episode 1
Netflix Open Source Meetup Season 4 Episode 1Netflix Open Source Meetup Season 4 Episode 1
Netflix Open Source Meetup Season 4 Episode 1
aspyker
 
CS80A Foothill College Open Source Talk
CS80A Foothill College Open Source TalkCS80A Foothill College Open Source Talk
CS80A Foothill College Open Source Talk
aspyker
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015
aspyker
 
Netflix Open Source Meetup Season 3 Episode 2
Netflix Open Source Meetup Season 3 Episode 2Netflix Open Source Meetup Season 3 Episode 2
Netflix Open Source Meetup Season 3 Episode 2
aspyker
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Source
aspyker
 
Netflix Cloud Platform and Open Source
Netflix Cloud Platform and Open SourceNetflix Cloud Platform and Open Source
Netflix Cloud Platform and Open Source
aspyker
 
NetflixOSS and ZeroToDocker Talk
NetflixOSS and ZeroToDocker TalkNetflixOSS and ZeroToDocker Talk
NetflixOSS and ZeroToDocker Talk
aspyker
 
Ad

Recently uploaded (20)

Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 

Container World 2018

  • 1. Running Containers at Scale at Netflix @aspyker @corindwyer
  • 2. The Titus Team ● Develop ● Operate ● Support
  • 3. Netflix’s Container Management Platform Titus Scheduling ● Service & batch jobs ● Resource management Container Execution ● Docker/AWS Integration ● Netflix Infra Support Service Job and Fleet Management Resource Management & Optimization Container Execution Integration Batch
  • 4. ● 1000+ Applications ● Netflix API, NodeJS Backend UI Scripts ● Machine Learning (GPUs) for personalization ● Encoding and Content use cases ● Netflix Studio use cases ● CDN tracking and planning ● Massively parallel CI system ● Data Pipeline routing and SPaaS ● Big Data platform use cases Growing set of container use cases Batch Q4 15 Basic Service 1Q 16 Production Service 4Q 16 Customer Facing Service 2Q 17 shadow
  • 5. High Level Titus Architecture Cassandra Titus Control Plane ● API ● Scheduling ● Job Lifecycle Control EC2 Autoscaling Fenzo container container container docker Titus Agents Mesos agent Docker Docker Registry containercontainerUser Containers AWS Virtual Machines Mesos Titus System ServicesBatch/Workflow Systems Service CI/CD
  • 6. Q1 2018 Container Usage Common Jobs Launched 176K jobs / day Different applications 1K+ different images Regional isolated Titus stacks 7 Services Single App Cluster Size 5K (real), 12K containers (benchmark) Agents managed 16K VMs Batch Containers launched 430K / day Agents autoscaled 350K VMs / month
  • 7. Leveraging existing Netflix and AWS Infrastructure Single consistent cloud environment between VMs and containers VMVM EC2 AWSAutoScaler VMs Service App Cloud Platform (metrics, IPC, health) VPC VMVM Atlas TitusJobControl Containers Service App Cloud Platform (metrics, IPC, health) Eureka Edda VMVMContainers Batch App Cloud Platform (metrics, IPC, health)
  • 8. Most Native AWS Container Platform IP per container ● VPC IP, ENI and security group ● Optimized to share ENIs ● ENI pre-attaching, opportunistic batching of IPs (bursty deploys) IAM Roles and Metadata Endpoint per container ● Container view of 169.254.169.254 Cryptographic identity per container ● Using Amazon instance identity document Service job container autoscaling ● Using Native AWS Cloudwatch and Autoscaling policies and engine Application Load Balancing (ALB)
  • 9. Advanced Scheduling and Control Plane Technologies
  • 10. Scheduling / Placement Considering the realities of … ● Docker, Linux, Image Pulling, etc. ● Complex resources (ENIs) ● Amazon rate limiting ● Filtering (constraints) and ranking (fitness) ● Different profiles for service | batch, critical | operational, etc. Reliability Provisioning Time Cost Trade offs
  • 11. Capacity Management User configures “capacity groups” based on workload type Critical (RIs) ● Preallocated instances in order to achieve low provisioning time ● Buffer to support temporary extra capacity needs for deployments Flex (On-Demand) ● Autoscaled instances based on demand Opportunistic (Spot) - Coming ● Utilize extra instances with the ability to preempt or evict the workload
  • 12. Centralized Agent Management Agent Management Other subsystems Health checks Cluster lifecycle Other signals Unified component for tracking agent information, Powers other systems like task migration, canaries, agent remediations Cluster B Agents states = schedulable For example: Task Migration Cluster A Agent state = non-schedulable, drain tasks Agent Management Task Migration Cluster state
  • 13. Integrations Titus External Resources Operational VisibilityInterfaces Load Balancers Autoscaling Spinnaker (Task Migration) Telemetry - Atlas Event Storage - Elasticsearch REST / gRPC Streaming updates
  • 15. Multi-tenant networking is hard Decided early on we wanted full IP stacks per container But what about? ● Security group support ● IAM role support ● Network bandwidth isolation ● Leverage VPC
  • 16. Virtual Machine Host Titus Networking sg=A,B IP 2 sg=B,C IP 3 Metadata service IPVlan, BPF, IFBs to route app traffic Container 1 Container 2 sg=A,B IP 4 Container 3 eth 0 sg=Titus control plane eth1 sg=A,B eth2 sg=B,C eth-mdeth-md Titus executor eth0eth0eth0 IP 2 IP 4 IP 3 IP 1 Metadata service eth-md Metadata service 169.254.169.254
  • 17. Next challenge: Speed limits of EC2 Networking Largest EC2 challenge: speed of networking reconfiguration Changes in how we work with EC2 API’s ● Work with Amazon to redefine networking related API rate limits, buckets ● Pre-attach all networking interfaces ● IPs are asked for in bulk opportunistically Also, coordination with scheduler ● Prefer instances with containers already in the same security group For large scale failovers ● Before … hours, after ... minutes
  • 18. ● Detection - health checks ○ Linux subsystems (systemd, filesystems) ○ Docker aspects (runtime health, registry pulls) ○ Titus processes (networking, GPU, security drivers) ○ Mesos aspects (agent, executor) ● Remediation ○ Local reconciliation ○ Docker image cleanup Overcoming failures on each agent
  • 19. Process Model Evolution Single process containers ● Worked for some time, until we needed system services System services ● Telemetry, IAM support, log uploading ● Added as host installed daemons; isolation & multi-tenancy concerns ● Currently injecting system services into containers Composing system services into containers ● Considered pods; lifecycle and usage complexities limited value ● Considering future of both systemd and docker image composability
  • 20. Resource Isolation ● CPU ○ Started with bursting; was interfering with predictability ○ Resource tiers to avoid interference problems ● Memory ○ Hard limit, OOM kills entire container ● Network ○ Bandwidth throttling ● Disk space ● GPUs
  • 21. Security Isolation ● Deployed user namespaces ○ Challenging due to shared systems without UID shifting ● Needed ad hoc debugging ○ Titus-ssh for user level access to their container ○ Still required power user access for kernel functions ○ Working to automate through tools like Vector (NetflixOSS) ● Seccomp overhead and complexity is prohibitive ○ Working towards automated policies and BPF driven implementations
  • 22. Open Sourcing Currently in private open source collaboration with those who want ... ● The NetflixOSS container solution (Spinnaker + Titus + Netflix RPC) ● A unified batch and service Mesos scheduler ● More robust & native AWS container platform Hope to fully open source in early Q2 ● If you want access now, let us know ● Looking for collaborators, feedback
  • 23. Q&A