Making Disaster Routine
Anticipating and Practicing Failures Using Active Monitoring and Chaos Engineering
Peter Varhol and Gerie Owen

About me
• International speaker and writer
• Graduate degrees in Math, CS, Psychology
• Technology communicator
• Former university professor, tech journalist
• Cat owner and distance runner
• peter@petervarhol.com

Gerie Owen
• QA Evangelist, test manager
• Subject matter expert on testing for TechTarget’s SearchSoftwareQuality.com
• International and domestic conference presenter
• Marathon runner & running coach
• gowen@qualitestgroup.com

Agenda
• DevOps and disaster
• Preparing for disaster
• Principles of chaos
• Monitoring for disaster
• Getting back on your feet
• Conclusions

What is DevOps?
• Containerized development, rapid iteration with real-time performance insights, intelligent feedback, diagnostic services, an integrated DevOps pipeline, and deployment to the cloud
• Bozhe moi! (My God!)
• In layman’s terms:
• We automatically integrate and build every time there is a valid check-in
• We run automated tests at all stages, including production
• We send the app to production when it has been integrated and tested
• Automation makes it all work like a Swiss watch

What is a Disaster?
• A serious disruption, occurring over a relatively short time and involving loss and impacts, that exceeds the ability of the affected community or society to cope using its own resources
• Disruption
• Short timeframe
• Exceeds the ability to cope

What is a Disaster?
• Consistency becomes uncertain
• Automated workflow breaks down
• Build fails; smoke tests are blocked
• Server farm goes offline
• Application won’t start again
• Showstopper bug in production
• Anything that disrupts consistency

Preparing for Disaster
• We don’t react well when things go wrong
• Disbelief
• Uncertainty
• Panic
• How can we prepare for the unknown?

We Can Learn from Aircrews
• US Airways Flight 1549
• Sullenberger and Skiles had first met only days earlier
• Yet they worked from established procedures
• Practiced for hundreds of hours
• Immediately turned to checklists
• About three and a half minutes after the bird strike, they were in the Hudson
• You have to practice this

We Can Learn from Aircrews
• Indecision and panic are killers
• Checklists drive decision-making by focusing on essentials
• Courses of action are defined fast
• Practice makes disasters just another day in the office
• Clear, structured communication is essential

We Can Also Practice Disaster
• Chaos engineering
• Failure scenarios
• Application health monitoring

Chaos Engineering
• Distributed systems at scale
• Experiments to uncover systemic weaknesses
• Defining normal behavior
• Set your null and alternative hypotheses
• Introduce variables that reflect real world events
• servers crash
• hard drives malfunction
• network connections are lost
• Try to disprove the null hypothesis
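The experiment loop above can be sketched in a few lines. This is a toy simulation, not a production tool: the `Replica` class, the failover logic, and the 99 percent threshold are all assumptions made for illustration. The null hypothesis is that the success rate stays at or above the threshold after a failure is injected.

```python
import random

class Replica:
    """Hypothetical stand-in for one server behind a load balancer."""
    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return "ok"

def send_request(replicas):
    """Try each replica in turn; succeed if any live replica answers."""
    for replica in replicas:
        try:
            return replica.handle() == "ok"
        except ConnectionError:
            continue
    return False

def success_rate(replicas, n_requests=100):
    """Fraction of requests that were answered successfully."""
    ok = sum(send_request(replicas) for _ in range(n_requests))
    return ok / n_requests

def run_experiment(n_replicas=3, n_failures=1, threshold=0.99, seed=0):
    """Measure steady state, crash some replicas, then measure again."""
    rng = random.Random(seed)
    replicas = [Replica(f"replica-{i}") for i in range(n_replicas)]
    baseline = success_rate(replicas)           # define normal behavior
    for victim in rng.sample(replicas, n_failures):
        victim.alive = False                    # real-world event: server crash
    observed = success_rate(replicas)
    return {
        "baseline": baseline,
        "observed": observed,
        "hypothesis_holds": observed >= threshold,
    }

print(run_experiment(n_failures=1))
print(run_experiment(n_failures=3))
```

With one of three replicas down, the simulated system still answers every request; with all three down, the null hypothesis is disproved, which is exactly the kind of weakness the experiment is meant to surface.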

Chaos Engineering
• Practice in production
• Vary real world events
• Yes, there could be customer impact
• It is incumbent upon the chaos engineer to minimize customer impact
• But surfacing real-world weaknesses is the point of the experiment

Chaos Monkey
• Became part of the larger Simian Army
• Developed by Netflix
• Deliberately causes failures in Netflix’s production environment
• Now consists of a variety of tools
• It’s all about resiliency
• Can our application survive?

Practice Failure Scenarios
• Each team member contributes one or more scenarios
• The more unlikely, the better
• Write up the scenarios
• Only the team leader sees them beforehand
• They can be real failures experienced or thought exercises

Practice Failure Scenarios
• Describe the scenarios to the team
• “Load is remaining constant but performance is gradually deteriorating. We’re starting to get 404 and related errors. The server farm seems to be operating correctly; it’s an application issue. Pings are slowing down, but not drastically.”
• How do we diagnose and address?

We Don’t Need Another Hero
• Heroes use superhuman efforts to fix a disaster
• In doing so, they break with team conventions
• Teams function better together
• If a team has a hero:
• the team may not try as hard in the future
• the hero is not replicable
• the hero can’t solve every problem

Monitor Application Health in Production
• Ping just doesn’t cut it anymore
• Availability and performance data
• Synthetic testing
• Health over time
• Track trends in performance, page paint times, and database calls
• Whatever might give you health trends
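One way to track health over time is to fit a straight line to recent measurements and watch the slope. The sketch below assumes evenly spaced response-time samples in milliseconds; the function names and the 5 ms-per-sample limit are hypothetical choices, not values from the talk.

```python
def trend_slope(samples):
    """Least-squares slope of evenly spaced samples (units per sample)."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def is_degrading(samples, max_slope=5.0):
    """Flag a trend when response time grows faster than max_slope per sample."""
    return trend_slope(samples) > max_slope

# Response times (ms) from successive synthetic checks: hypothetical data.
steady = [100, 102, 99, 101, 100, 98]
creeping = [100, 110, 120, 130, 140, 150]

print(is_degrading(steady))    # flat trend
print(is_degrading(creeping))  # rising by 10 ms per sample
```

A single slow sample would not trip this check; a sustained upward drift would, which is the difference between a ping and a health trend.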

Directions for Monitoring
• Watermarks for action
• E.g., 25 percent of pages take longer than 2 seconds to load
• AI for prediction
• Based on similar results in the past, the application is likely to fail in six hours
• Analytics for trends
• A combination of six measures indicates unhealthy trends
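The watermark example above (25 percent of pages taking longer than 2 seconds) reduces to a simple check. This is a minimal sketch; the function name and the sample data are assumptions, and a real monitor would evaluate a sliding window of recent page loads.

```python
def watermark_breached(load_times_s, slow_after_s=2.0, max_slow_fraction=0.25):
    """True when the share of slow page loads reaches the watermark."""
    if not load_times_s:
        return False  # no data: nothing to act on
    slow = sum(1 for t in load_times_s if t > slow_after_s)
    return slow / len(load_times_s) >= max_slow_fraction

# Hypothetical page load times in seconds.
healthy = [0.5] * 9 + [3.0]        # 10 percent slow
unhealthy = [0.5, 1.0, 2.5, 3.0]   # 50 percent slow

print(watermark_breached(healthy))
print(watermark_breached(unhealthy))
```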

The Power of Checklists
• Checklists are part of our daily lives
• They:
• relieve the cognitive load of remembering to-dos
• organize complicated decision-making
• reduce risk in complicated activities by ensuring that critical tasks are not overlooked

Using Checklists in DevOps
• Checklists can be used to:
• Replace test cases
• Supplement test cases
• Verify entry and exit criteria
• Perform sanity testing
• Conduct ambiguity reviews
• Support dev estimates
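As one illustration, verifying exit criteria with a checklist can be expressed as data plus a gate. The item descriptions and names below are invented for the example; a team would substitute its own criteria.

```python
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    description: str
    done: bool = False

def unmet_items(checklist):
    """Descriptions of every item that has not been completed."""
    return [item.description for item in checklist if not item.done]

def exit_criteria_met(checklist):
    """The gate passes only when no item is left open."""
    return not unmet_items(checklist)

# Hypothetical release exit criteria.
release_checklist = [
    ChecklistItem("Smoke tests pass", done=True),
    ChecklistItem("No open showstopper bugs", done=True),
    ChecklistItem("Rollback procedure rehearsed", done=False),
]

print(exit_criteria_met(release_checklist))
print(unmet_items(release_checklist))
```

Reporting the open items, rather than only a pass or fail, keeps the checklist useful as a communication tool when something is missed.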

Types of Checklists
• Project Set Up
• Application Specific Regression
• Process type specific
• Website Graphics
• Browser Dependencies
• Usability checks

What Does Thinking Of Failure Accomplish?
• Failure doesn’t come as a surprise
• Today it surprises us all too often
• We have procedures to deal with failure
• We have practice dealing with failure
• Failure is just another day at the office

A Final Lesson
• You are not alone

Conclusions
• Things will go wrong
• Don’t yell or panic
• Practice non-conforming situations regularly
• Make up unlikely scenarios; chances are they will happen
• Structured practices and communications may make work boring, but they help when things start going wrong
• Ease into chaos engineering for resiliency
• Use your experiences to create checklists
