Embracing collaborative chaos
Running chaos days on large platforms
Lyndsay Prewer | @equalexperts
Photo by Darius Bashar on Unsplash
What is chaos engineering and why should we care?
Building vital, high traffic services, fast
Google Cloud Dataflow In the Smart Home Data Pipeline
● Delivered 10 days early!
● Built in 4 weeks.
● 140,000 claims processed on launch day.
● No production incidents
Building cool, planet-scale, services, fast
Operating on the edge of chaos
http://bit.ly/2ZavoyP
http://bit.ly/2QVeWzA
“Two normally-benign misconfigurations, and a specific software bug, combined to initiate the outage”
How can your system fail?
● What are the component parts?
● How are they connected?
● How reliable is each part?
● How reliable are the connections?
● What happens when X fails?
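One way to start answering "what happens when X fails?" is to write the component map down and walk it. The sketch below is illustrative only: the service names and the DEPENDS_ON map are hypothetical, and it assumes a failure simply propagates to every caller, which real systems with retries, caches or fallbacks may not do.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it calls.
DEPENDS_ON = {
    "frontend": ["auth", "claims"],
    "claims": ["database", "queue"],
    "auth": ["database"],
    "database": [],
    "queue": [],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can ask "who calls this service?"
    callers = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, []).append(service)

    # Breadth-first walk up the caller graph from the failed component.
    impacted, pending = set(), deque([failed])
    while pending:
        current = pending.popleft()
        for caller in callers.get(current, []):
            if caller not in impacted:
                impacted.add(caller)
                pending.append(caller)
    return impacted


if __name__ == "__main__":
    for name in DEPENDS_ON:
        print(name, "->", sorted(impacted_by(name, DEPENDS_ON)))
```

The result is only a first guess at the impact zone; a chaos experiment is what tells you whether the guess holds.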
Addressing the risk of unexpected failure
[Diagram: a system of two components (A, B) and a larger system of ten components (A to I, and Z)]
● Address risk by deliberately inducing failure
● Observe, reflect and improve
● Build resilience in (like quality)
● Think about production (and failure) all the time
Simples Hard
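A rough illustration of why larger systems are harder to reason about: the arithmetic below counts the possible pairwise links between n components and the number of distinct sets of components that could be down at the same time. It is a simplification (real systems are not fully connected, and failures are not simply up or down), and the function names are made up for this sketch.

```python
from math import comb


def pairwise_connections(n):
    """Maximum number of distinct links between n components."""
    return comb(n, 2)


def failure_combinations(n):
    """Number of non-empty sets of components that could be down at once."""
    return 2 ** n - 1


if __name__ == "__main__":
    for n in (2, 10, 20):
        print(n, pairwise_connections(n), failure_combinations(n))
    # prints:
    # 2 1 3
    # 10 45 1023
    # 20 190 1048575
```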
What do we mean by resilience?
Four chaos engineering approaches
Manual
In process
Automated
Manual chaos
● Chaos Days
● AWS Game Days
● Change specific chaos
Automated chaos
● Chaos Monkey
● AWS spot instances / GCP Preemptible VMs
● Randomised pod killer
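The core of a randomised pod killer is small: pick a few victims at random, skip anything protected, and cap how many can go in one round. The sketch below shows only that selection logic. It does not talk to Kubernetes or any cloud API; the pod names, the protected set and the `kill` callback are all hypothetical, and a real tool would supply its own way of terminating a pod.

```python
import random


def pick_victims(pods, protected, max_victims, rng=None):
    """Choose up to `max_victims` pods at random, never touching protected ones."""
    if rng is None:
        rng = random.Random()
    candidates = [pod for pod in pods if pod not in protected]
    count = min(max_victims, len(candidates))
    return rng.sample(candidates, count)


def run_round(pods, protected, max_victims, kill, rng=None):
    """Select victims and hand each one to `kill`, a caller-supplied callback."""
    victims = pick_victims(pods, protected, max_victims, rng)
    for pod in victims:
        kill(pod)
    return victims


if __name__ == "__main__":
    killed = []
    pods = ["api-1", "api-2", "db-0", "worker-1"]
    # killed.append stands in for a real termination call.
    run_round(pods, {"db-0"}, 2, killed.append, random.Random(42))
    print("terminated:", killed)
```

Passing in a seeded `random.Random` makes a round repeatable, which helps when you want to replay an experiment after a fix.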
In process chaos engineering
● Part of normal engineering process
● Focus for all roles in the team
● Production thinking / building resilience in
Product Owner
Dev QA Dev Ops
Focus on: Quality AND Production AND Resilience
Define Build Explore Deploy
(Unplanned chaos)
● Every day is a school day
● Handle incidents well
● Learn from incidents - post incident reviews
● Start simple then incorporate tooling
[Diagram: system of ten components (A to I, and Z)]
How does it help?
People
Process
Product
Knowledge
Behaviour
Expertise
Managing incidents
Learning from incidents
Engineering approach
Simplification
Observability
Runbooks
Resilience
Running a Chaos Day - when and how?
Our context
Legacy systems
x100 million internal requests (busiest day)
x100 million log messages (busiest day)
x850 microservices
x100M Customers
60 Delivery teams
~1000 Microservices
6 Platform teams
(AWS PaaS)
When were we ready for chaos?
2013 2014
Cloud
Docker
Scala
Mongo
ELK
Fast
growth
(teams,
services,
traffic)
When were we ready for chaos?
2013 2014 2015 2016
Cloud
Docker
Scala
Mongo
ELK
Fast
growth
(teams,
services,
traffic)
Multi
active WIP
Multi
active
When were we ready for chaos?
2013 2014 2015 2016 2017 2018
Cloud
Docker
Scala
Mongo
ELK
Fast
growth
(teams,
services,
traffic)
Multi
active WIP
Multi
active
More multi
active
(to AWS)
Self serve
deploys
AWS
Ready
for
Chaos
Who, where and exactly how?
Agents of chaos
● Virtual, closed team
● Draw from component teams
● Experts / veterans
● Highest bus factor
Chaos scope - know thyself
● Know your architecture
● Know your steady state
● Know your constraints
○ What’s in your control?
○ What’s not?
○ What needs protecting?
X00 million internal requests (busiest day)
X00 million log messages (busiest day)
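"Know your steady state" becomes useful once it is written down as something you can check during the chaos. A minimal sketch, assuming the steady state is expressed as a baseline rate of HTTP 5xx responses plus an agreed tolerance; the function names and the numbers in the usage example are hypothetical, and a real platform would read status codes from its own telemetry.

```python
def error_rate(status_codes):
    """Fraction of responses that are server errors (HTTP 5xx)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def within_steady_state(status_codes, baseline_rate, tolerance):
    """True if the observed 5xx rate is no more than baseline + tolerance."""
    return error_rate(status_codes) <= baseline_rate + tolerance


if __name__ == "__main__":
    sample = [200] * 9 + [503]
    # Assumed baseline of 1% errors with 0.5% tolerance, for illustration only.
    print(error_rate(sample), within_steady_state(sample, 0.01, 0.005))
```

Error rate is only one signal; throughput, latency and alert volume can be checked in the same way.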
Chaos scope - trust the brains-storm
http://bit.ly/2XzR7Q9
Chaos scope - brainstorm, then plan the detail
Team X
Team Y
Team Z
Chaos scope - hack the chaos
Team X
Team Y
Team Z
Deciding where
● Production or closest to it
● Production (like) load
● Production (like) telemetry
● Decide the blast radius
● Decide comms channel(s)
Production
Staging
QA
Development
Execution
Deciding when
● To warn or not
● It was just another ordinary day …
● What else is going on?
● Chaos cut-off
Keep calm and chaos on (agents)
● (Virtually) co-locate the agents
● Collaborate and coordinate well
● Time-box, cover ground
● (Self) document well
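Self-documenting is easier when each experiment has the same shape: hypothesis, intervention, expected response, actual response. The sketch below is one possible record for that, not a description of any tool used on the day. The class and field names are hypothetical, and the summary line is just plain text that could be pasted into whatever channel or board the agents use.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ChaosExperiment:
    hypothesis: str
    intervention: str
    expected_response: str
    actual_response: str = ""
    hypothesis_held: Optional[bool] = None  # None until a result is recorded
    notes: List[str] = field(default_factory=list)

    def log(self, message):
        """Append a UTC-timestamped note while the experiment is running."""
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.notes.append(f"{stamp} {message}")

    def record_result(self, actual_response, hypothesis_held):
        """Store what actually happened and whether the hypothesis held."""
        self.actual_response = actual_response
        self.hypothesis_held = hypothesis_held

    def summary(self):
        """One-line summary suitable for a chat message or card title."""
        verdict = {None: "PENDING", True: "HELD", False: "REFUTED"}[self.hypothesis_held]
        actual = self.actual_response or "n/a"
        return (f"[{verdict}] {self.hypothesis} | did: {self.intervention} "
                f"| expected: {self.expected_response} | actual: {actual}")


if __name__ == "__main__":
    exp = ChaosExperiment(
        hypothesis="Losing one instance does not cause user-facing errors",
        intervention="Terminate one instance of a service",
        expected_response="Traffic shifts to the remaining instances",
    )
    exp.log("instance terminated")
    exp.record_result("Brief spike in 5xx responses, then recovery", False)
    print(exp.summary())
```

Keeping the expected response next to the actual one gives the retrospective something concrete to compare.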
Keep calm and chaos on (everyone else)
● It was just another ordinary day ...
● Also (self) document well
● Pretend it’s Production on
Retrospection
Divide and conquer, then regroup
● Component team retros / incident reviews first
● Major on engineering improvements (people, process, product)
● Then team-of-teams retro
● Minor on chaos day improvements
People
Process
Product
Team X
Team Y
Team Z
Team of teams
What did we learn?
● Start small
● Manage/limit the pain
● Production is a tough step
● Production-like is also hard!
● Have fun!
What next?
What’s your next chaos step?
Manual
In process
Automated
Unplanned
● Where are you at in the journey?
● What’s the next (baby) step?
● Need any help?
○ Talk to us
○ Check out our playbooks
Thank You
Simple solutions to big business problems.
Contact us
Our experienced teams deliver software
all around the globe.
London
+44 203 603 7830
helloUK@equalexperts.com
Manchester
+44 203 603 7830
helloUK@equalexperts.com
Pune
+91 20 6687 2400
helloIndia@equalexperts.com
Bengaluru
+91 99 7298 0224
helloIndia@equalexperts.com
Lisbon
+351 211 378 414
helloPortugal@equalexperts.com
New York
+1 866-943-9737
helloUSA@equalexperts.com
Calgary
+1 403 775-4861
helloCanada@equalexperts.com
Berlin
helloDE@equalexperts.com
Sydney
+612 8999 6661
helloAUS@equalexperts.com
Cape Town
+27 21 680 5252
helloSA@equalexperts.com

Embracing collaborative chaos (April 2020) by Lyndsay Prewer


Editor's Notes

  • #2: Hello, my name is Lyndsay Prewer. Over the last couple of years, I’ve been leading a group of teams that develop and operate a Platform-as-a-Service for a very large public sector client. In this talk I’ll describe how we’ve used Chaos Days to improve the resilience of our platform, and the ability of our platform and its teams to handle catastrophic failures gracefully.
  • #4: Chaos engineering is particularly relevant to distributed systems, as their scale and complexity make it impossible to determine their emergent properties and behaviour, let alone every possible failure mode, its impact and possible mitigation. Although distributed systems have been around for decades, recent advances in technology, such as serverless, combined with agile and lean practices, have let teams get more complex stuff into production faster and at lower cost. We can build really cool applications like Nest XYZ, so we can do ABC. What could possibly go wrong!?
  • #5: Chaos engineering is particularly relevant to distributed systems, as their scale and complexity make it impossible to determine their emergent properties and behaviour, let alone every possible failure mode, its impact and possible mitigation. Although distributed systems have been around for decades, recent advances in technology, such as serverless, combined with agile and lean practices, have let teams get more complex stuff into production faster and at lower cost. We can build really cool applications like Nest XYZ, so we can do ABC. What could possibly go wrong!?
  • #6: We can build really cool applications like Nest XYZ, so we can do ABC. What could possibly go wrong!? Complex/distributed systems will fail - not if but when - our systems operate on the edge of chaos
  • #7: Consider your own system...
  • #8: As component parts and connections increase we get an exponential increase in the complexity of the emergent behaviour and thus the number of possible failure modes. This equates to a decrease in our ability to predict failures and their impact zone. Building resilience in is similar to building quality in. Production thinking: “It’s a mindset, not a toolset: you don’t need to be running EKS on AWS to benefit from ….”
  • #9: It doesn’t mean we build systems that never fail, that are perfect and indestructible. It means we build systems that cope with failure well, that recover well, that are elastic.
  • #11: Chaos Days (focus on what, not why, as why comes later) Chaos testing (focus is very narrow/local to new/changed components)
  • #12: Chaos Monkey, Simian Army et al. AWS and GCP alternatives (spot instances, etc.) (Semi-automated) - Super K8S Chaos Bro
  • #13: Making this part of normal flow - link back to Production thinking / Building resilience in
  • #14: Reference https://medium.com/@NetflixTechBlog/introducing-dispatch-da4b8a2a8072
  • #15: It’s not just about more resilient components. It starts with people: their knowledge, their expertise, their behaviours. It covers process: how we respond to and manage incidents, how we learn from them, how we fold these learnings into our engineering practices. On the product front, it’s more than just resilience improvements. It’s also making systems easier to observe, easier to understand and reason about. Systems that automatically heal and tolerate failure are the goal, but improvements in things such as telemetry, alerting and runbooks also count.
  • #17: Describe size, scale and architecture of Public sector client. At various other clients, ranging from retail to payment systems, we’ve set up and run kube-monkey in all environments, opted for preemptible VMs, and run Game Days to help teams learn how to diagnose and debug Production issues.
  • #22: For large platforms, each owning team should provide a Chaos Agent to plot and scheme in secret with the others. Who knows your system the best? Who do you turn to when the shit hits the fan? Should be a high bus factor person.
  • #23: Map out your architecture and dependencies. Define steady state. What’s normal load/throughput? How do you know the system is healthy? (heart rate, VO2-Max, metrics, 5XX / 499 (check this) responses, alerts) What do you have control over? What services / teams do you want to protect?
  • #24: Apollo 13 picture. Map out your architecture and dependencies. Doesn’t need to be a big diagram - just get the experts together and brainstorm. Give them a clear intent, a goal, a direction and some constraints, then leave them to figure it out.
  • #25: Define hypothesis for specific interventions and expected response, e.g. Instance failures, app failures, AZ failures, volumes filling up, connections failing/slowing, database failing. Security attacks (break-the-bank approaches, malicious engineer) Map out sequencing, e.g. what should go together, what kept apart, what can be done independently. How will normal service be resumed?
  • #26: Chaos Days are a perfect time to also run security attacks (break-the-bank approaches, malicious engineer)
  • #27: Production or not? If not, how production-like are things (cookie cutter environments, telemetry)? How will load be generated? Who will be impacted if chaos does reign? What comms channel is normally used?
  • #29: Some warning? Anything else happening at that time (e.g. peak loads, major releases) How will you ensure normal service is resumed - story from our first day
  • #30: [Photo from 1st chaos day?] Co-locate agents of chaos, plus comms channel. Collaborate and coordinate in response to chaos and how it’s handled. Timebox to ensure enough chaos variants are covered and normal service is resumed. [Slack and trello screen shots?] Record what you’re doing (slack, trello - hypothesis, expected response, actual)
  • #31: Just an ordinary day (i.e. all teams working as normal). Record what you’re doing (slack). Treat chaos environment as production
  • #33: Team based retros, then team of teams. Separate resilience improvements (e.g. tech, process, people) from chaos day improvements
  • #34: [Slide, check our own list] Lessons learnt What’s not worked well Things we’d do next time
  • #35: What’s your next step?
  • #36: Describe various possible contexts, and possible next steps for each