More Aim,
Less Blame!
How do you feel after a
conference?
Inspired!
Empowered!
More Aim, Less Blame: How to use postmortems to turn failures into something valuable for your team
Blame
Feeling Down
Guilt
About Me - @dvkanchev
Chief Enterprise Architect @ SiteGround
DevOps Engineer/SRE
Adrenaline Junkie (snowboarding, sailing, parenting)
Safety Enthusiast
About This Talk
Focuses on culture, not technology
Based on examples of technical failures
But valid for handling all types of failures
What Is A
Post-Mortem?
“A postmortem is a written record of an
incident, its impact, the actions taken to
mitigate or resolve it, the root cause(s),
and the follow-up actions to prevent the
incident from recurring.”
Website
Downtime
Site Is Broken.
What Do You Do?
1. I just fix it - 82%
2. Fix it + postmortem - 12%
3. Someone else fixes such problems for me - 6%
“Successful Software
Never Gets Simpler”
Blame, Sanctions And Accountability
Issue (root cause): “A backup script was run against the production
database. It locked all tables and caused downtime.”
BAD!
Issue (root cause): “I (Daniel) ran a backup script against the production
database. It locked all tables and caused downtime. Why is this script not
configured to use --single-transaction for InnoDB tables?”
BETTER!
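The question in the better write-up points at a concrete fix. The sketch below is one way it might look, assuming the backup script shells out to `mysqldump` (the slide does not name the tool). The helper `build_dump_command` and the database name `shop_production` are hypothetical. It only builds and prints the command, so nothing runs against a real database.

```python
import shlex


def build_dump_command(database: str, single_transaction: bool = True) -> list[str]:
    """Build a mysqldump argument list (hypothetical helper, not from the talk)."""
    cmd = ["mysqldump"]
    if single_transaction:
        # Dump InnoDB tables from a consistent snapshot instead of locking them.
        cmd.append("--single-transaction")
    cmd.append(database)
    return cmd


if __name__ == "__main__":
    # Print the command instead of executing it, so the sketch is safe to try.
    print(shlex.join(build_dump_command("shop_production")))
```

Making the safe flag the default means a later run of the script does not depend on the operator remembering it.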
Blame, Sanctions And Accountability
“Safety requires prevention, prevention requires
honesty, honesty requires absence of fear.”
A Pinch Of
Blameless
“Focus on the situational aspects of a failure’s
mechanism AND the decision-making process of
individuals proximate to the failure.”
Post-mortem template.
The obvious stuff.
1. Describe the incident and the impact?
2. How was it solved?
3. Complete timeline of events.
Table Timeline Example
03:45 AM  Monitoring system detected high rate of 5xx errors
03:46 AM  Monitoring system paged engineer on call
03:47 AM  Incident was confirmed
03:53 AM  Graphs were checked and 10 times increase in traffic towards Redis was observed
04:25 AM  Issue was escalated to a senior engineer
04:52 AM  WordPress plugin was downgraded to fix the issue
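A timeline like this one becomes more useful once the gaps between entries are visible. The sketch below copies the six entries from the example and computes, for each one, the minutes since the previous entry and since detection. The helper names `to_minutes` and `gaps` are illustrative, and the code assumes all entries fall on the same day.

```python
# Timeline entries copied from the example: (clock time, event).
TIMELINE = [
    ("03:45 AM", "Monitoring system detected high rate of 5xx errors"),
    ("03:46 AM", "Monitoring system paged engineer on call"),
    ("03:47 AM", "Incident was confirmed"),
    ("03:53 AM", "Graphs were checked and 10 times increase in traffic towards Redis was observed"),
    ("04:25 AM", "Issue was escalated to a senior engineer"),
    ("04:52 AM", "WordPress plugin was downgraded to fix the issue"),
]


def to_minutes(clock: str) -> int:
    """Convert a 12-hour clock string such as '03:45 AM' to minutes after midnight."""
    time_part, meridiem = clock.split()
    hours, minutes = map(int, time_part.split(":"))
    hours = hours % 12 + (12 if meridiem.upper() == "PM" else 0)
    return hours * 60 + minutes


def gaps(timeline):
    """Return (minutes since previous entry, minutes since first entry, event) per entry."""
    start = to_minutes(timeline[0][0])
    previous = start
    rows = []
    for clock, event in timeline:
        now = to_minutes(clock)
        rows.append((now - previous, now - start, event))
        previous = now
    return rows


if __name__ == "__main__":
    for since_previous, since_start, event in gaps(TIMELINE):
        print(f"+{since_previous:>3} min  (T+{since_start:>3})  {event}")
```

For these entries the longest gap is the 32 minutes between checking the graphs and escalating, and the fix lands 67 minutes after detection. Those are the kinds of numbers a review can ask about.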
Post-mortem template. The obvious stuff.
1. Describe the incident and the impact?
2. How was it solved?
3. Complete timeline of events.
4. Root Cause(s) Analysis?
5. Lessons learned.
6. Action Item List.
7. Post-Mortem Review and Approval.
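The seven sections above can be kept as a reusable skeleton so that every post-mortem starts from the same outline. This is a minimal sketch, not a format the talk prescribes. The class `PostMortem`, its fields, and the `TODO` marker are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Section names follow the seven items of the template above.
SECTIONS = [
    "Describe the incident and the impact",
    "How was it solved",
    "Complete timeline of events",
    "Root cause(s) analysis",
    "Lessons learned",
    "Action item list",
    "Post-mortem review and approval",
]


@dataclass
class PostMortem:
    """Hypothetical container for one post-mortem document."""

    title: str
    answers: dict[str, str] = field(default_factory=dict)

    def missing_sections(self) -> list[str]:
        """List the sections that have no text yet."""
        return [s for s in SECTIONS if not self.answers.get(s, "").strip()]

    def render(self) -> str:
        """Render a plain-text document, marking empty sections with TODO."""
        lines = [self.title]
        for number, section in enumerate(SECTIONS, start=1):
            lines.append(f"{number}. {section}")
            lines.append(self.answers.get(section, "").strip() or "TODO")
        return "\n".join(lines)


if __name__ == "__main__":
    pm = PostMortem("Website downtime", {"How was it solved": "WordPress plugin was downgraded"})
    print(pm.render())
    print("Still missing:", len(pm.missing_sections()), "sections")
```

Listing the missing sections gives the review step something concrete to check before approval.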
Post-mortem template. The hidden gems.
1. Different Triggers/Contributors.
2. Mitigators.
3. Additions to the Timeline of Events.
Timeline diagram: one lane per role along a Time axis.
Lanes: escalations, on-call dev, DBA, customer service, network engineer, security engineer
Annotations on the lanes:
Called for assistance
On poor conference wifi
Starts checking backups and preparing for restore
Unrelated alerts for connected systems
Working on a theory related to load balancing as new data is obtained
Post-mortem template. The hidden gems.
1. Different Triggers/Contributors.
2. Mitigators.
3. Additions to the Timeline of Events.
4. Islands of Knowledge.
5. Open discussions.
Step Back
Example Time
“The best way to find out if you can
trust somebody is to trust them!”
Ernest Hemingway
Resources
github.com/dkanchev
Questions
Thank You!
@dvkanchev
