CMP376 - Another Week, Another Million Containers on Amazon EC2
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Another Week, Another Million
Containers on Amazon EC2
Andrew Spyker
Software Engineering Manager
Netflix
CMP376
Joe Hsieh
Principal Technical Account Manager
Amazon Web Services
Why containers?
Given our VM architecture comprised of …
Amazingly resilient
Microservice driven
Cloud native
CI/CD DevOps enabled
Elastically scalable
Do we really need containers?
What was missing from our VM environment?
Packaging
• Simple-to-customize, application-focused artifacts
• Especially for growth of polyglot environments
• Notably for platforms with OS level dependencies
Local development
• Ability to run applications locally on developer laptops
Simple way to manage compute resources
• Especially for ad hoc batch processing
Titus, Netflix’s container management platform
Scheduling
• Service & batch job lifecycle
• Resource management
Container execution
• AWS Integration
• Netflix Ecosystem Support
Job and Fleet Management
Batch
Resource Management & Optimization
Container Execution
Service
The Titus team
• Design
• Develop
• Operate
• Support
Titus and containers product strategy
• Ordered priority focus on
• Developer velocity
• Reliability
• Cost efficiency
Easy migration from VMs to containers
Easy container integration with VMs and Amazon Services
Focus on just what Netflix needs
“Our focus is to leverage EC2 deeply in Titus,
not abstract it away or implement similar
features. We see this as a differentiator of
Titus versus other container management
solutions.”
Mesos
High level architecture
Titus Control Plane
• API
• Scheduling
• Job Lifecycle Control
Fenzo
Titus Agents
User Containers
Docker
Mesos Agent
Netflix System Services
AWS Virtual Machines
Docker Registry
Cassandra
AWS Auto Scaling
EC2 virtual machine portability
Early on we decided a container MUST …
• Natively integrate with VPC for networking
• Natively integrate with security groups for firewalling
• Work with IAM based Amazon Web Services
Key leverage points
Amazon EC2
GPUs - 10s of p2.8xlarge instances
Memory optimized - 100s of r4.16xlarge instances
General purpose - 1000s of m4.16xlarge instances
VPC and security groups
EC2 VM
ENI0
(to control plane)
ENI1
SG = w
ENI2
SG = x
ENIn
SG = z
Container 1
SG = w
ENI1 IP1
Container 2
SG = w
ENI1 IP2
Container 3
SG = y
ENI3 IP1
Titus
Container
Mgmt
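A minimal sketch of the ENI sharing shown in the diagram above, under a simplified model: containers that use the same security group share an ENI and each take their own IP slot, while a container with a new security group claims an ENI that has no security group yet. The Eni class, the place function, and the limit of three IPs per ENI are hypothetical names and values chosen for illustration, not Titus code. ENI0 is left out because the slide reserves it for the control plane, and the sketch picks the first unconfigured ENI, so ENI indexes may differ from the diagram.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Eni:
    """One elastic network interface on the host (hypothetical model)."""
    index: int
    security_group: Optional[str] = None  # None means not configured yet
    max_ips: int = 3                       # assumed per-ENI IP limit
    containers: Dict[str, int] = field(default_factory=dict)  # container id -> IP slot

    def free_slot(self) -> Optional[int]:
        """Return the lowest unused IP slot, or None if the ENI is full."""
        used = set(self.containers.values())
        for slot in range(1, self.max_ips + 1):
            if slot not in used:
                return slot
        return None


def place(enis: List[Eni], container_id: str, security_group: str) -> Tuple[int, int]:
    """Return (ENI index, IP slot) for a new container."""
    # First choice: an ENI that already carries this security group and has room.
    for eni in enis:
        if eni.security_group == security_group:
            slot = eni.free_slot()
            if slot is not None:
                eni.containers[container_id] = slot
                return eni.index, slot
    # Otherwise claim an unconfigured ENI and give it this security group.
    for eni in enis:
        if eni.security_group is None:
            eni.security_group = security_group
            eni.containers[container_id] = 1
            return eni.index, 1
    raise RuntimeError("no ENI available for security group " + security_group)


enis = [Eni(1), Eni(2), Eni(3)]
placements = [
    place(enis, "container-1", "w"),
    place(enis, "container-2", "w"),
    place(enis, "container-3", "y"),
]
print(placements)
```

Containers 1 and 2 share one ENI because they use the same security group, and container 3 gets an ENI of its own, which mirrors the grouping on the slide.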
IAM based services
EC2 VM
ENI0
Container 1
eth0 ethMD
ENI1
Titus
Metadata
Proxy
Normal networking
169.254.169.254
Amazon Metadata Service and
Security Token Service (STS)
Titus Host
Instance cryptographic identity
Metatron
Service
User
Container
All I really needed to know about
containers, I learned from Titus …
Choices for Auto Scaling Titus applications
Use the two existing Netflix autoscaling engines we already had
• Pro: Code existed
• Con: Lacking features, we’d have to operate
Write a new one
Look for one from Amazon Web Services
Choices for Auto Scaling Titus applications
Use the two existing Netflix autoscaling engines we already had
Write a new one
• Pro: Would be specific to our needs
• Con: Would be lacking features, we’d have to operate
Look for one from Amazon Web Services
Choices for Auto Scaling Titus applications
Use the two existing Netflix autoscaling engines we already had
Write a new one
Look for one from Amazon Web Services
• Pro: Already well understood for VMs, feature-rich
• Con: Only works for VMs
A true story
A product manager introduction,
development team interchanges, and
multiple iterations later …
Application Auto Scaling with custom resources
Configuring Auto Scaling in Spinnaker
Titus and Application Auto Scaling integration
User Containers
Control Plane
Titus API call pattern
Charts: total vs. throttled calls for CreateNetworkInterface, AttachNetworkInterface, ModifyNetworkInterfaceAttribute, and AssignPrivateIpAddresses
An infrastructure view of applications
Auto Scaling group (x3)
An infrastructure view of applications
Auto Scaling group
VPC
API calls
RunInstances
CreateNetworkInterface
AttachNetworkInterface
AssignPrivateIpAddresses
ModifyNetworkInterfaceAttribute
Netflix regional failover
Kong evacuation of us-east-1
Traffic diverted to other regions
Fail back to us-east-1
Traffic moved back to us-east-1
us-east-1
eu-west-1
Infrastructure challenge
• Increase capacity during scale up of savior region
• Launch 1000s of containers in seven minutes
Easy, right?
“we reduced time to schedule 30,000
pods onto 1,000 nodes from
8,780 seconds to 587 seconds”
Titus can do this by …
• Dynamically changeable scheduling behavior
• Fleet wide networking optimizations
Normal scheduling
VM1
App 1
App 2
ENI 1 App 2
IP1 IP1
VM2
App 1
ENI 1
IP1
VMn
App 1
App 2
ENI 1 App 2
IP1 IP1
Trade-off for reliability
Failover scheduling
VM1
App 1
App 2
ENI 1 App 2
IP1 IP1
VM2
App 1
ENI 1
IP1
VMn
App 1
App 2
ENI 1 App 2
IP1 IP1
App 1
App 1
App 1
App 1
App 1
App 2
App 2
IP2, IP3 IP2, IP3, IP4 IP2, IP3
Trade-off for speed
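The contrast between the normal and failover diagrams can be sketched as one placement function with a switchable mode. This is an illustration under assumptions, not the Titus or Fenzo scheduler: pick_host, schedule, fleet, the mode names "spread" and "pack", and the capacity of four containers per host are all hypothetical. In "spread" mode each new container goes to the host with the fewest copies of the app (the reliability trade-off); in "pack" mode it goes to the host that already has the most copies and still has room (the speed trade-off).

```python
from collections import Counter
from typing import Dict, List


def pick_host(hosts: Dict[str, List[str]], app: str, mode: str, capacity: int = 4) -> str:
    """Choose a host for the next container of `app`.

    hosts maps a host name to the apps of the containers already running there.
    """
    candidates = [name for name, apps in hosts.items() if len(apps) < capacity]
    if not candidates:
        raise RuntimeError("no host has room for another container")

    def copies(name: str) -> int:
        return Counter(hosts[name])[app]

    if mode == "spread":
        # Fewest copies of the app first; host name breaks ties deterministically.
        return min(candidates, key=lambda name: (copies(name), name))
    if mode == "pack":
        # Most copies of the app first; host name breaks ties deterministically.
        return min(candidates, key=lambda name: (-copies(name), name))
    raise ValueError("unknown mode: " + mode)


def schedule(hosts: Dict[str, List[str]], app: str, count: int, mode: str) -> List[str]:
    """Place `count` containers one at a time and return the chosen hosts."""
    placed = []
    for _ in range(count):
        name = pick_host(hosts, app, mode)
        hosts[name].append(app)
        placed.append(name)
    return placed


def fleet() -> Dict[str, List[str]]:
    """Starting state loosely based on the slide: three VMs with a few apps each."""
    return {"vm1": ["app1", "app2"], "vm2": ["app1"], "vm3": ["app1", "app2"]}


spread = schedule(fleet(), "app1", 3, "spread")
packed = schedule(fleet(), "app1", 3, "pack")
print(spread)  # one new container per VM
print(packed)  # new containers stacked on as few VMs as possible
```

Because the mode is an argument rather than a fixed policy, the behavior can be switched at run time, which is the property the "dynamically changeable scheduling behavior" bullet describes.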
On each host
Change when ENIs are created and attached
• Moved this to instance start time
• No longer needed on demand
Need to burst allocate IP addresses
• Opportunistically batch allocate at container launch time
• If one container was launched, more are likely coming
• Garbage collect unused addresses later
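The burst allocation described above can be sketched as a small per-host address pool. This is a hedged illustration, not Titus code: IpPool, make_fake_allocator, the batch size of four, and the 10.0.0.x addresses are hypothetical. The allocate callable stands in for whatever EC2 request assigns private IPs to an ENI; it is faked here so the example runs without AWS access.

```python
from typing import Callable, Dict, List


class IpPool:
    """Hands out IPs to containers, refilling in batches to save API calls."""

    def __init__(self, allocate: Callable[[int], List[str]], batch_size: int = 4):
        self._allocate = allocate          # stand-in for the real EC2 request
        self._batch_size = batch_size
        self._free: List[str] = []
        self._in_use: Dict[str, str] = {}  # container id -> IP
        self.api_calls = 0

    def acquire(self, container_id: str) -> str:
        # One launch usually means more are coming, so request a whole batch.
        if not self._free:
            self._free.extend(self._allocate(self._batch_size))
            self.api_calls += 1
        ip = self._free.pop(0)
        self._in_use[container_id] = ip
        return ip

    def release(self, container_id: str) -> None:
        """Return a stopped container's IP to the free list."""
        self._free.append(self._in_use.pop(container_id))

    def garbage_collect(self, keep: int = 0) -> List[str]:
        """Drop free IPs beyond `keep` and return them so they can be unassigned."""
        surplus = self._free[keep:]
        self._free = self._free[:keep]
        return surplus


def make_fake_allocator(prefix: str = "10.0.0.") -> Callable[[int], List[str]]:
    """Return a function that hands out sequential fake addresses."""
    next_host = [10]

    def allocate(count: int) -> List[str]:
        ips = [prefix + str(next_host[0] + i) for i in range(count)]
        next_host[0] += count
        return ips

    return allocate


pool = IpPool(make_fake_allocator(), batch_size=4)
ips = [pool.acquire("c" + str(i)) for i in range(3)]
unused = pool.garbage_collect()
print(ips, pool.api_calls, unused)
```

Three container launches cost one allocation request instead of three, and the one address that was never used is handed back by the later garbage collection pass.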
Titus API pattern
Charts: total vs. throttled calls for ModifyNetworkInterfaceAttribute and AssignPrivateIpAddresses
Results
us-east-1 / prod
containers started per minute
7,500 launched in 5 minutes
Netflix load balancing
IP based Application Load Balancing
Configuring EC2 load balancers in Spinnaker
Titus and Load Balancing integration
User Containers
Control Plane
IP Target
Group
Use cases on Titus
• Netflix API, Node.js Backend UI Scripts
• Machine Learning (GPUs) for personalization
• Encoding and Content use cases
• Netflix Studio use cases
• CDN tracking and planning
• Massively parallel CI system
• Data Pipeline routing and SPaaS
• Big Data platform use cases
Batch - 4Q 15
Basic Services - 1Q 16
Production Services - 4Q 16
Customer Facing Services - 2Q 17
Q4 2018 container usage
Common
• Jobs launched: 255K jobs / day
• Different applications: 1K+ different images
• Isolated Titus deployments: 7 stacks
Services
• Single app cluster size: 5K (real), 12K containers (benchmark)
• Hosts managed: 7K VMs (435,000 CPUs)
Batch
• Containers launched: 450K / day (750K / day peak)
• Hosts managed (autoscaled): 55K VMs / month
Open Source
Open sourced April 2018
Help other communities by sharing our approach
Current and future work
• Advanced CPU Isolation
• Opportunistic Workloads
• Nitro and Bare Metal Instances
• Next Amazon and Netflix Partnership
Thank you!
Andrew Spyker
@aspyker
Joe Hsieh
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 

CMP376 - Another Week, Another Million Containers on Amazon EC2

  • 2. Another Week, Another Million Containers on Amazon EC2 (CMP376). Andrew Spyker, Software Engineering Manager, Netflix; Joe Hsieh, Principal Technical Account Manager, Amazon Web Services. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 3. Why containers? Given our VM architecture comprised of … Amazingly resilient; Microservice driven; Cloud native; CI/CD DevOps enabled; Elastically scalable. Do we really need containers?
  • 4. What was missing from our VM environment? Packaging: simple to customize application focused artifacts; especially for growth of polyglot environments; notably for platforms with OS level dependencies. Local development: ability to run applications locally on developer laptops. Simple way to manage compute resources: especially for ad hoc batch processing.
  • 5. Titus, Netflix’s container management platform. Scheduling: service & batch job lifecycle; resource management. Container execution: AWS integration; Netflix ecosystem support. Diagram labels: Job and Fleet Management; Batch; Service; Resource Management & Optimization; Container Execution.
  • 6. The Titus team: Design; Develop; Operate; Support.
  • 7. Titus and containers product strategy. Ordered priority focus on: developer velocity; reliability; cost efficiency. Easy migration from VMs to containers. Easy container integration with VMs and Amazon services. Focus on just what Netflix needs.
  • 8. “Our focus is to leverage EC2 deeply in Titus, not abstract it away or implement similar features. We see this as a differentiator of Titus versus other container management solutions.”
  • 9. High level architecture. Diagram labels: Mesos; Titus Control Plane (API, Scheduling, Job Lifecycle Control); Fenzo; Titus Agents; User Containers; Docker; Mesos Agent; Netflix System Services; AWS Virtual Machines; Docker Registry; Cassandra; AWS Auto Scaling.
  • 11. EC2 virtual machine portability. Early on we decided a container MUST …: natively integrate with VPC for networking; natively integrate with security groups for firewalling; work with IAM based Amazon Web Services.
  • 12. Key leverage points
  • 13. Amazon EC2. GPUs: 10s of p2.8xlarges. Memory optimized: 100s of r4.16xlarges. General purpose: 1000s of m4.16xlarges.
  • 14. VPC and security groups. Diagram labels: EC2 VM; ENI0 (to control plane); ENI1 (SG = w); ENI2 (SG = x); ENIn (SG = z); Container 1 (SG = w, ENI1, IP1); Container 2 (SG = w, ENI1, IP2); Container 3 (SG = y, ENI3, IP1); Titus Container Mgmt.
  • 15. IAM based services. Diagram labels: EC2 VM; ENI0; Container 1 (eth0, ethMD); ENI1; Titus Metadata Proxy; normal networking; 169.254.169.254; Amazon Metadata Service and Security Token Service (STS).
  • 16. Instance cryptographic identity. Diagram labels: Titus Host; Metatron Service; User Container.
  • 18. All I really needed to know about containers, I learned from Titus …
  • 20. Choices for Auto Scaling Titus applications. Option 1: use the two existing Netflix autoscaling engines we already had. Pro: code existed. Con: lacking features, and we’d have to operate them. Option 2: write a new one. Option 3: look for one from Amazon Web Services.
  • 21. Choices for Auto Scaling Titus applications. Option 1: use the two existing Netflix autoscaling engines we already had. Option 2: write a new one. Pro: would be specific to our needs. Con: would be lacking features, and we’d have to operate it. Option 3: look for one from Amazon Web Services.
  • 22. Choices for Auto Scaling Titus applications. Option 1: use the two existing Netflix autoscaling engines we already had. Option 2: write a new one. Option 3: look for one from Amazon Web Services. Pro: already well understood for VMs, feature-rich. Con: only works for VMs.
  • 23. A true story
  • 24. A product manager introduction, development team interchanges, and multiple iterations later …
  • 25. Application Auto Scaling with custom resources
  • 26. Configuring Auto Scaling in Spinnaker
  • 27. Titus and Application Auto Scaling integration. Diagram labels: User Containers; Control Plane.
  • 29. Titus API call pattern. Chart series (total and throttled calls for each): CreateNetworkInterface; AttachNetworkInterface; ModifyNetworkInterfaceAttribute; AssignPrivateIpAddresses.
  • 30. An infrastructure view of applications. Diagram labels: Auto Scaling group (shown three times).
  • 31. An infrastructure view of applications. Diagram labels: Auto Scaling group; VPC.
  • 32. API calls: RunInstances; CreateNetworkInterface; AttachNetworkInterface; AssignPrivateIpAddress; ModifyNetworkInterface.
  • 33. Netflix regional failover. Chart annotations: Kong evacuation of us-east-1; traffic diverted to other regions; fail back to us-east-1; traffic moved back to us-east-1. Regions shown: us-east-1; eu-west-1.
  • 34. Infrastructure challenge: increase capacity during scale up of savior region; launch 1000s of containers in seven minutes.
  • 35. Easy, right? “we reduced time to schedule 30,000 pods onto 1,000 nodes from 8,780 seconds to 587 seconds”
  • 36. Easy, right? “we reduced time to schedule 30,000 pods onto 1,000 nodes from 8,780 seconds to 587 seconds”
  • 37. Titus can do this by …: dynamically changeable scheduling behavior; fleet wide networking optimizations.
  • 38. Normal scheduling: trade-off for reliability. Diagram labels: VM1 (App 1, App 2; ENI 1; IP1); VM2 (App 1; ENI 1; IP1); VMn (App 1, App 2; ENI 1; IP1).
  • 39. Failover scheduling: trade-off for speed. Diagram labels: the same VM1, VM2, and VMn as in normal scheduling (each with ENI 1 and IP1), now with additional App 1 and App 2 containers and additional IPs per VM (IP2, IP3; IP2, IP3, IP4; IP2, IP3).
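One plausible reading of slides 38 and 39 is a scheduler that can switch between spreading an application's containers across VMs (reliability) and packing containers onto VMs that are already in use (speed). The sketch below illustrates that idea only; it is not Titus or Fenzo code, and the `place` function, the mode names, and the per-host capacity of 4 are all hypothetical.

```python
def place(app, hosts, mode, capacity=4):
    """Choose a host for one new container of `app` and record the placement.

    hosts maps host name -> list of app names already running there.
    mode "spread": pick the host with the fewest copies of `app`
                   (a lost VM takes out fewer copies of any one app).
    mode "pack":   pick the fullest host that still has room
                   (fewer hosts and network interfaces to touch).
    `capacity` is an assumed per-host container limit.
    """
    candidates = [h for h, apps in hosts.items() if len(apps) < capacity]
    if not candidates:
        raise RuntimeError("no host has free capacity")
    if mode == "spread":
        chosen = min(candidates, key=lambda h: (hosts[h].count(app), len(hosts[h]), h))
    elif mode == "pack":
        chosen = min(candidates, key=lambda h: (-len(hosts[h]), h))
    else:
        raise ValueError(f"unknown mode: {mode}")
    hosts[chosen].append(app)
    return chosen


def demo(mode):
    """Place three containers of one app on three empty hosts."""
    hosts = {"vm1": [], "vm2": [], "vm3": []}
    return [place("app1", hosts, mode) for _ in range(3)]
```

Calling `demo("spread")` puts one container on each VM, while `demo("pack")` puts all three on the same VM, which is the kind of behavior change a "dynamically changeable" scheduler could make during a failover.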
  • 40. On each host. Change when create and attach ENIs is performed: moved this to instance start time; no longer needed on-demand. Need to burst allocate IP addresses: opportunistically batch allocate at container launch time; likely if one container was launched more are coming; garbage collect unused later.
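The IP address strategy on slide 40 (batch allocate when a container launches, then garbage collect what is unused) can be sketched as a small per-host pool. This is an illustration under stated assumptions, not the Titus implementation: the `IpPool` class, its method names, the batch size, and the fake allocator are hypothetical. On EC2 the batch step would presumably map to a single `AssignPrivateIpAddresses` request asking for several secondary addresses, but no AWS call is made here.

```python
import itertools


class IpPool:
    """Per-host pool that allocates IPs in batches and frees the surplus later."""

    def __init__(self, allocate_batch, release, batch_size=4):
        self._allocate_batch = allocate_batch  # hypothetical: n -> list of n IPs
        self._release = release                # hypothetical: list of IPs -> None
        self._batch_size = batch_size
        self._free = []
        self._in_use = set()

    def acquire(self):
        """Return an IP for a new container, batch allocating if the pool is empty."""
        if not self._free:
            # One launch suggests more are coming, so fetch several at once.
            self._free.extend(self._allocate_batch(self._batch_size))
        ip = self._free.pop()
        self._in_use.add(ip)
        return ip

    def give_back(self, ip):
        """Return a stopped container's IP to the free list for reuse."""
        self._in_use.remove(ip)
        self._free.append(ip)

    def garbage_collect(self, keep=0):
        """Release free IPs beyond `keep` and return the released list."""
        surplus = self._free[keep:]
        del self._free[keep:]
        if surplus:
            self._release(surplus)
        return surplus


def make_fake_allocator():
    """Stand-in for the cloud API: hands out fake addresses and logs batch sizes."""
    calls = []
    counter = itertools.count(10)

    def allocate(n):
        calls.append(n)
        return [f"10.0.0.{next(counter)}" for _ in range(n)]

    return allocate, calls
```

With a batch size of 4, launching three containers costs one allocation call instead of three, and a later `garbage_collect()` hands back the one address that was never used.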
  • 41. Titus API pattern. Chart series (total and throttled calls for each): ModifyNetworkInterfaceAttribute; AssignPrivateIpAddresses.
  • 42. Results. Chart: us-east-1 / prod containers started per minute; 7500 launched in 5 minutes.
  • 44. Netflix load balancing
  • 45. IP based Application Load Balancing
  • 46. Configuring EC2 load balancers in Spinnaker
  • 47. Titus and Load Balancing integration. Diagram labels: User Containers; Control Plane; IP Target Group.
  • 49. Use cases on Titus: Netflix API, Node.js backend UI scripts; machine learning (GPUs) for personalization; encoding and content use cases; Netflix Studio use cases; CDN tracking and planning; massively parallel CI system; data pipeline routing and SPaaS; big data platform use cases. Timeline: Batch (Q4 15); Basic Services (1Q 16); Production Services (4Q 16); Customer Facing Services (2Q 17).
  • 51. Q4 2018 container usage. Common. Jobs launched: 255K jobs / day; different applications: 1K+ different images; isolated Titus deployments: 7 stacks. Services. Single app cluster size: 5K (real), 12K containers (benchmark); hosts managed: 7K VMs (435,000 CPUs). Batch. Containers launched: 450K / day (750K / day peak); hosts managed (autoscaled): 55K VMs / month.
  • 52. Open Source. Open sourced April 2018. Help other communities by sharing our approach.
  • 53. Current and future work: advanced CPU isolation; opportunistic workloads; Nitro and bare metal instances; next Amazon and Netflix partnership.
  • 54. Thank you! Andrew Spyker (@aspyker); Joe Hsieh.