Making Disaster Routine
Anticipating and Practicing Failures Using Active Monitoring and Chaos Engineering
Peter Varhol and Gerie Owen
About me
• International speaker and writer
• Graduate degrees in Math, CS, Psychology
• Technology communicator
• Former university professor, tech journalist
• Cat owner and distance runner
• peter@petervarhol.com
Gerie Owen
• QA Evangelist, test manager
• Subject matter expert on testing for
TechTarget’s SearchSoftwareQuality.com
• International and domestic conference
presenter
• Marathon runner & running coach
• gowen@qualitestgroup.com
Agenda
• DevOps and disaster
• Preparing for disaster
• Principles of chaos
• Monitoring for disaster
• Getting back on your feet
• Conclusions
What is DevOps?
• Containerized development, rapid iteration with real-time
performance insights, intelligent feedback, diagnostic services, an
integrated DevOps pipeline, and deployment to the cloud
• Bozhe moi! (“My God!”)
• In layman’s terms:
• We automatically integrate and build every time there is a valid check-in
• We run automated tests at all stages, including production
• We send the app to production when it has been integrated and tested
• Automation makes it all work like a Swiss watch
What is a Disaster?
• A serious disruption, occurring over a relatively short time, that causes
loss and impacts exceeding the ability of the affected community or
society to cope using its own resources.
• Disruption
• Short timeframe
• Exceeds the ability to cope
What is a Disaster?
• Consistency becomes uncertain
• Automated workflow breaks down
• Build fails; smoke tests are blocked
• Server farm goes offline
• Application won’t start again
• Showstopper bug in production
• Anything that disrupts consistency
Preparing for Disaster
• We don’t react well when things go wrong
• Disbelief
• Uncertainty
• Panic
• How can we prepare for the unknown?
We Can Learn from Aircrews
• US Airways Flight 1549
• Sullenberger and Skiles had never met before that day
• Yet worked from established procedures
• Practiced for hundreds of hours
• Immediately turned to checklists
• 90 seconds after the bird strike, they were in the Hudson
• You have to practice this
We Can Learn From Aircrews
• Indecision and panic are killers
• Checklists drive decision-making by focusing on essentials
• Courses of action are defined fast
• Practice makes disasters just another day at the office
• Clear and structured communication is essential
We Can Also Practice Disaster
• Chaos engineering
• Failure scenarios
• Application health monitoring
Chaos Engineering
• Distributed systems at scale
• Experiments to uncover systemic weaknesses
• Defining normal behavior
• Set your null and alternative hypotheses
• Introduce variables that reflect real world events
• servers crash
• hard drives malfunction
• network connections drop
• Try to disprove the null hypothesis
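The experiment loop above can be sketched in a few lines of Python. This is a toy simulation, not a production tool: the cluster is a list of booleans, and the names `run_requests`, `chaos_experiment`, `attempts`, and `tolerance` are all assumptions made for illustration. The null hypothesis is that losing one replica leaves the success rate within `tolerance` of the steady state.

```python
import random


def run_requests(healthy, n_requests, attempts, rng):
    """Return the fraction of simulated requests that succeed.

    Each request tries up to `attempts` distinct replicas chosen at random
    and succeeds if any of them is healthy.
    """
    ok = 0
    for _ in range(n_requests):
        tried = rng.sample(range(len(healthy)), k=min(attempts, len(healthy)))
        if any(healthy[i] for i in tried):
            ok += 1
    return ok / n_requests


def chaos_experiment(n_replicas, attempts, tolerance=0.01, seed=0):
    """Kill one replica and report whether the null hypothesis survives."""
    rng = random.Random(seed)
    cluster = [True] * n_replicas

    # Define normal behavior: measure the steady state first.
    baseline = run_requests(cluster, 1000, attempts, rng)

    # Introduce a real-world event: one server crashes.
    cluster[rng.randrange(n_replicas)] = False
    observed = run_requests(cluster, 1000, attempts, rng)

    return {
        "baseline": baseline,
        "observed": observed,
        "null_holds": baseline - observed <= tolerance,
    }


if __name__ == "__main__":
    print(chaos_experiment(n_replicas=3, attempts=2))  # client retries once
    print(chaos_experiment(n_replicas=3, attempts=1))  # client never retries
```

In this sketch a client that retries on a second replica survives the crash, while a client that never retries loses roughly a third of its requests, which disproves the null hypothesis and exposes the weakness.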
Chaos Engineering
• Practice in production
• Vary real world events
• Yes, there could be customer impact
• It is incumbent upon the chaos engineer to minimize customer impact
• But that is the point of the experiment
Chaos Monkey
• Now part of the Simian Army
• Developed by Netflix
• Causes breakdowns in their production environment
• The Simian Army consists of a variety of tools
• It’s all about resiliency
• Can our application survive?
Practice Failure Scenarios
• Each team member contributes one or more scenarios
• The more unlikely, the better
• Write up the scenarios
• Only the team leader sees them beforehand
• They can be real failures experienced or thought exercises
Practice Failure Scenarios
• Describe the scenarios to the team
• “Load is remaining constant but performance is gradually
deteriorating. We’re starting to get 404 and related errors. The server
farm seems to be operating correctly; it’s an application issue. Pings
are slowing down, but not drastically.”
• How do we diagnose and address?
We Don’t Need Another Hero
• Heroes use superhuman efforts to fix a disaster
• In doing so, they break with team conventions
• Teams function better together
• If a team has a hero:
• the team may not try as hard in the future
• the hero is not replicable
• the hero can’t solve every problem
Monitor Application Health in Production
• Ping just doesn’t cut it anymore
• Availability and performance data
• Synthetic testing
• Health over time
• Track trends of performance, page painting, database calls
• Whatever might give you health trends
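One simple way to track health over time is to fit a straight line to recent samples and watch its slope. The sketch below assumes evenly spaced measurements (for example, one response time per synthetic test run); `trend_slope`, `is_degrading`, and `max_slope` are hypothetical names, and a least-squares line is only one of many possible trend measures.

```python
def trend_slope(samples):
    """Least-squares slope of evenly spaced samples, in units per interval."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def is_degrading(samples, max_slope):
    """True when the metric is rising faster than `max_slope` per interval."""
    return trend_slope(samples) > max_slope


if __name__ == "__main__":
    response_ms = [100, 110, 120, 130]  # made-up synthetic test results
    print(trend_slope(response_ms), is_degrading(response_ms, max_slope=5))
```

The same function works for any metric that is supposed to stay flat, such as database calls per request or page paint time.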
Directions for Monitoring
• Watermarks for action
• E.g., 25 percent of pages take longer than 2 seconds to load
• AI for prediction
• Based on similar results in the past, the application is likely to fail in six hours
• Analytics for trends
• A combination of six measures indicates unhealthy trends
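The watermark example above translates directly into a check. This is a minimal sketch: `watermark_breached` and its parameter names are assumptions, the defaults mirror the 25 percent and 2 second figures from the slide, and treating an empty sample as "no breach" is a design choice rather than a rule.

```python
def watermark_breached(load_times_s, slow_after_s=2.0, max_slow_fraction=0.25):
    """True when the share of slow page loads reaches the watermark.

    A page load counts as slow when it takes longer than `slow_after_s`.
    """
    if not load_times_s:
        return False  # no data, so no basis for action (assumption)
    slow = sum(1 for t in load_times_s if t > slow_after_s)
    return slow / len(load_times_s) >= max_slow_fraction


if __name__ == "__main__":
    recent = [0.5, 1.0, 1.5, 2.5]  # made-up page load times in seconds
    if watermark_breached(recent):
        print("Watermark reached: start the checklist")
```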
The Power of Checklists
• Checklists are part of our daily lives
• They
• relieve the cognitive load of remembering to-dos
• organize complicated decision-making
• reduce risk in complicated activities by ensuring that critical tasks are not
overlooked.
Using Checklists in DevOps
• Checklists can be used to:
• Replace Test Cases
• Supplement Test Cases
• Verify Entry and Exit Criteria
• Sanity Testing
• Ambiguity Reviews
• Dev Estimates
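A checklist used to verify entry or exit criteria can be kept as a small data structure, so that a pipeline step can refuse to proceed while items remain open. The sketch below is one possible shape; the `Checklist` class and the example items are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Checklist:
    name: str
    items: list[str]
    done: set[str] = field(default_factory=set)

    def check(self, item):
        """Mark an item complete; unknown items are rejected."""
        if item not in self.items:
            raise KeyError(f"{item!r} is not on checklist {self.name!r}")
        self.done.add(item)

    def missing(self):
        """Items still open, in their original order."""
        return [i for i in self.items if i not in self.done]

    def passed(self):
        return not self.missing()


if __name__ == "__main__":
    exit_criteria = Checklist("exit criteria", ["build green", "smoke tests pass"])
    exit_criteria.check("build green")
    print(exit_criteria.passed(), exit_criteria.missing())
```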
Types of Checklists
• Project Set Up
• Application Specific Regression
• Process type specific
• Website Graphics
• Browser Dependencies
• Usability checks
What Does Thinking Of Failure Accomplish?
• Failure doesn’t come as a surprise
• Without preparation, it surprises us all too often
• We have procedures to deal with failure
• We have practice dealing with failure
• Failure is just another day at the office
A Final Lesson
• You are not alone
Conclusions
• Things will go wrong
• Don’t yell or panic
• Practice non-conforming situations regularly
• Make up unlikely scenarios; chances are they will happen
• Structured practices and communications may make work boring, but
they help when things start going wrong
• Ease into chaos engineering for resiliency
• Use your experiences to create checklists