Lessons learned running large real-world Docker environments
Oct 27th 2015
Alois Mayr
@mayralois
alois.mayr@ruxit.com
Dec 3rd 2015
Source: http://www.schoonoart.de/
What is a “large” environment?
Campfire stories
#1 – The Death Star of Service Dependencies
Load-balanced service
System-wide service dependencies
Reverse proxies are essential
App #1
App #2
App #1 depends on App #2
Where is this specified?
Unwanted dependencies break architecture
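One way to answer "Where is this specified?" is to keep a declared dependency list per app and compare it with the calls actually observed. The following is a minimal Python sketch under stated assumptions: the app names, the `declared` mapping, and the `observed` call list are hypothetical illustrations and do not come from the talk.

```python
def undeclared_dependencies(declared, observed):
    """Return observed (caller, callee) pairs that are not declared.

    declared: dict mapping an app name to the set of apps it may call
              (hypothetical format, chosen for this sketch)
    observed: iterable of (caller, callee) pairs seen at runtime
    """
    violations = set()
    for caller, callee in observed:
        # A call is unwanted if the caller never declared the callee.
        if callee not in declared.get(caller, set()):
            violations.add((caller, callee))
    return sorted(violations)


# Hypothetical example: app1 declares app2, but app2 also calls back into app1.
declared = {"app1": {"app2"}, "app2": set()}
observed = [("app1", "app2"), ("app2", "app1")]
print(undeclared_dependencies(declared, observed))
```

Keeping the declaration next to the service definition makes an unwanted dependency show up as a reviewable difference rather than as a surprise in production.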
Use proper versioning for services, APIs, and images
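For images, proper versioning can start with refusing references that are not pinned. The sketch below assumes that "pinned" means an explicit tag other than `latest`, or a digest; that rule and the image names are assumptions made for illustration, not a policy stated in the talk.

```python
def is_pinned(image_ref):
    """Return True if the image reference has an explicit non-'latest' tag or a digest."""
    if "@" in image_ref:
        # Digest reference such as repo@sha256:...
        return True
    # Look only at the last path component so a registry port is not mistaken for a tag.
    last_part = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_part:
        # Without a tag, Docker falls back to 'latest'.
        return False
    tag = last_part.rsplit(":", 1)[1]
    return tag != "" and tag != "latest"


# Hypothetical image references.
for ref in ["registry.example.com:5000/shop/cart:1.4.2", "shop/cart", "shop/cart:latest"]:
    print(ref, is_pinned(ref))
```

A check like this could run in a deployment pipeline before an app definition is submitted to the orchestrator.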
#2 – The Network Retransmission Episode
[Figure: Retransmissions]
What was the problem?
• Hardware defect in a single network interface card
• NIC worked well under low load
• Retransmissions only under heavy load
• Affected communications to other machines in datacenter
• Still not sure about exact defect on NIC
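A defect that only shows under heavy load is easier to spot when the retransmission ratio is tracked per host. The sketch below parses the `Tcp:` header and value lines in the format Linux exposes in `/proc/net/snmp` and divides `RetransSegs` by `OutSegs`. The sample text and its numbers are made up for illustration, and this is not the tooling used in the talk.

```python
def tcp_retransmission_ratio(snmp_text):
    """Compute RetransSegs / OutSegs from text in /proc/net/snmp format."""
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("expected a Tcp header line and a Tcp value line")
    # First Tcp line names the counters, second one carries the values.
    header, values = tcp_lines[0][1:], tcp_lines[1][1:]
    counters = dict(zip(header, (int(v) for v in values)))
    out_segs = counters["OutSegs"]
    return counters["RetransSegs"] / out_segs if out_segs else 0.0


# Made-up sample in the same layout as /proc/net/snmp.
SAMPLE = (
    "Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
    "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts\n"
    "Tcp: 1 200 120000 -1 10 5 0 0 3 900 1000 50 0 2\n"
)
print(tcp_retransmission_ratio(SAMPLE))
```

The kernel counters are cumulative since boot, so a load-dependent problem is better seen by sampling twice and comparing the differences between the two readings.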
Co-locate related containers.
Check network infrastructure.
#3 – The Hungry Container Breakdown
Low disk space
What was the problem?
• Shared /logs partition on host
• No log rotation, no archiving for app logs
• No proper log management used for Docker environment
• Shared /logs partition on a single host ran out of space
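A partition filling up can be caught early with a simple free-space check. The sketch below uses Python's standard `shutil.disk_usage`; the 10% threshold and the function name are assumptions chosen for illustration.

```python
import shutil


def low_space_paths(paths, min_free_fraction=0.10):
    """Return the paths whose filesystem has less free space than the given fraction."""
    low = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        if usage.total and usage.free / usage.total < min_free_fraction:
            low.append(path)
    return low


# Demo against the current directory so the sketch runs anywhere.
print(low_space_paths(["."]))
```

In the scenario described here, the paths to watch would be /logs and /var/lib/docker on each host.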
How the problem evolved over time
• Container health checks failed
• Marathon terminated the task and rescheduled a new one
• Still no free space on /logs
• Termination and rescheduling repeated
• /var/lib/docker ran out of space
• Mesos slave unable to run Docker tasks
How the problem could have been avoided
• Log management tools for app logs, e.g. Fluentd and Logstash
--log-driver=none|syslog
• Remove container
--rm=true
• Run Mesos slave with
--docker_remove_delay=VALUE
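The Docker flags above can be assembled in one place so that every container is started with the same settings. The sketch below only builds the argument list and does not invoke Docker. The function name and the image name are hypothetical, and the allowed log drivers are limited to the two named on the slide.

```python
def docker_run_command(image, log_driver="syslog", remove_on_exit=True):
    """Build a 'docker run' argument list using the flags from the slide."""
    if log_driver not in ("none", "syslog"):
        raise ValueError("this sketch only covers the drivers named on the slide")
    cmd = ["docker", "run", f"--log-driver={log_driver}"]
    if remove_on_exit:
        # Remove the container automatically once it exits.
        cmd.append("--rm=true")
    cmd.append(image)
    return cmd


# Hypothetical image name.
print(" ".join(docker_run_command("example/app:1.0")))
```

The Mesos slave flag is left out of the sketch because the slide does not give a value for it.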
Use log management tools
Empty /var/lib/docker
#4 – The Day Orchestration Stood Still
Queue and deployment methods are slow
What was the problem?
• Marathon 0.8.x keeps all versions of applications for recovery (by default)
• High frequency of microservices deployments
• Slowdown through ZooKeeper (zk) overload
How the problem could have been avoided
• The relevant parameter (zk_max_versions) was not set to a proper limit
--zk_max_versions=20
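The effect of a version limit can be illustrated with a small pruning function: with frequent deployments the stored history grows without bound unless only the newest entries are kept. This is a sketch of the idea, not Marathon's implementation; the function name and the timestamp-style version strings are assumptions.

```python
def prune_versions(versions, max_versions=20):
    """Keep only the newest max_versions entries.

    versions: iterable of ISO 8601 timestamp strings, which sort chronologically
    """
    if max_versions < 1:
        raise ValueError("max_versions must be at least 1")
    return sorted(versions)[-max_versions:]


# Hypothetical history: 25 deployments of one app, one per day.
history = [f"2015-10-{day:02d}T00:00:00Z" for day in range(1, 26)]
kept = prune_versions(history, max_versions=20)
print(len(kept), kept[0], kept[-1])
```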
Track orchestration layer performance
Separate Mesos clusters
#5 – The Mushroom Cloud Effect
Way too many components involved
820 BILLION dependencies!
What was the problem?
• Massive load testing in preparation for Black Friday
• Tests ran for 3 days
• No impact on real users, only backend services affected
• Many components to take into account
[Figure: Service, Container, Host diagram with multiplicities 1, 1..*, *, 1; labels 174 / 3.4k and 22 / 13.3k]
Automation needed for problem analysis in large environments
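A first step toward automated problem analysis is a queryable model of services, containers, and hosts. The sketch below assumes that each container belongs to exactly one service and runs on exactly one host, which is one reading of the relationship diagram rather than something the slides state outright. All names and the inventory are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Container:
    container_id: str
    service: str  # assumption: one service per container
    host: str     # assumption: one host per container


def services_on_host(containers, host):
    """Services that have at least one container on the given host."""
    return sorted({c.service for c in containers if c.host == host})


def hosts_per_service(containers):
    """Map each service to the sorted list of hosts it runs on."""
    mapping = defaultdict(set)
    for c in containers:
        mapping[c.service].add(c.host)
    return {service: sorted(hosts) for service, hosts in mapping.items()}


# Hypothetical inventory.
inventory = [
    Container("c1", "checkout", "host-a"),
    Container("c2", "checkout", "host-b"),
    Container("c3", "search", "host-a"),
]
print(services_on_host(inventory, "host-a"))
print(hosts_per_service(inventory))
```

With such a model, a failing host can be translated into the list of affected services without manual lookup.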
Free trial - https://ruxit.com/docker-monitoring/
Blog - https://blog.ruxit.com/
@ruxit
What lessons have you learned?
