/ Robert Treat
Less Alarming Alerts
Saturday, April 9, 16
Hello /@robtreat2
Former
WebDev SysAdmin DBA
I have now been promoted to where I can do the least damage
Hello /@robtreat2
Now
CEO @OMNITI
Hello /@robtreat2
Who Cares What Some Suit Thinks?
Hello /@robtreat2
Phantom Pages
Memory Lane /@robtreat2
Benny
Memory Lane /@robtreat2
MyFirstPager
Memory Lane /@robtreat2
Multiple Rotations
Memory Lane /@robtreat2
always available, phone only
no pager for years
Hello /@robtreat2
Phantom Pages
Hello /@robtreat2
I manage the SRE team at OmniTI
we manage multiple sites
24x7
millions of users
(omniti.com/is/hiring)
Why God Why?
paging is useful
“broken systems should not be
just another day at the office”
-- me
Why God Why?
paging is useful
Who has ever gotten an alert and ignored it?
(/me looks at alert, says “oh, it’ll probably recover, no need to look further”)
Why God Why?
paging is useful
How many alerts were received in the past
week that were not actionable?
(no human action was required)
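One way to put a number on that question is to tally a week of pages and count how many needed no human action. The sketch below is only illustrative: the `Page` record and its `action_taken` flag are hypothetical names, not part of any particular paging tool.

```python
from dataclasses import dataclass


@dataclass
class Page:
    name: str
    action_taken: bool  # True if a human actually had to do something


def non_actionable_ratio(pages):
    """Fraction of pages that required no human action (0.0 for an empty week)."""
    if not pages:
        return 0.0
    ignored = sum(1 for page in pages if not page.action_taken)
    return ignored / len(pages)


week = [
    Page("site down", True),
    Page("disk 80% full", False),
    Page("load average high", False),
    Page("swap in use", False),
]
print(f"{non_actionable_ratio(week):.0%} of pages needed no action")
```

A ratio that stays high week after week suggests most pages are candidates for removal or demotion to a notice.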
Why God Why?
paging CAN BE useful
Can We Fix It?
how to improve?
Can We Fix It?
hello@omniti.com
we offer operationally focused services to
help build and manage your infrastructure
:-)
Terms
• Metrics
• (anything which can be measured)
• Graphs
• (trending systems)
• Notices
• (notification of event; email)
• ALERTS
• (wake’n you up; pages)
Onward and Upward
If you want to improve
your alerts
use systems thinking to reason about your
“system”
Onward and Upward
alerts should be seen as evidence
that your system is behaving in a way
outside of your existing understanding
Onward and Upward
If you want to improve
your alerts
think in terms your business can get on
board with
Onward and Upward
for every alert you receive
What is the business impact of this alert?
Onward and Upward
for every alert you receive
What is the remediation for this alert?
Onward and Upward
remediation:
• Summarize the problem
• What was done to solve the problem?
• Who was notified?
• Can this be prevented?
Onward and Upward
send the answer to these questions
to everyone on the team
every time
Onward and Upward
link to this documentation
from your alerting system
Onward and Upward
• Knowledge Transfer
• Gaps Exposed
• Patterns will emerge
Onward and Upward
you might be a bad alert
• cannot determine business impact
• no remediation necessary
• no one needs to be told
• workarounds are available
Onward and Upward
if you can’t fix it, you don’t
need to wake up for it
Onward and Upward
if it can wait until morning,
you don’t need to wake up
for it
Onward and Upward
in case of bad alert
• remove the alert
• convert the alert to a notice
• implement fixes
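The bad-alert signs and the responses to them can be read as a small decision procedure. The sketch below is one possible reading, not the talk's own tooling: `AlertReview` and `triage` are hypothetical names, and the choice of which sign leads to removal and which to a notice is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AlertReview:
    business_impact: str          # "" if nobody can state one
    remediation: str              # "" if no human action is needed
    can_wait_until_morning: bool


def triage(review):
    """Return 'remove', 'notice', or 'page' for a reviewed alert."""
    if not review.business_impact:
        # Assumption: an alert with no known business impact is dropped.
        return "remove"
    if not review.remediation or review.can_wait_until_morning:
        # Nothing to do right now, so email is enough.
        return "notice"
    return "page"


print(triage(AlertReview("checkout fails", "fail over to standby", False)))
```

The third response, implementing fixes, happens outside a function like this: it is the engineering work that makes the alert unnecessary.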
Onward and Upward
pro tip:
never let anyone add an alert
unless they can answer these
questions first
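A gate like this can be enforced in code rather than by convention. The following is a minimal sketch, assuming an in-memory registry; `AlertRegistry` and its `add` method are hypothetical and do not correspond to any real monitoring product's API.

```python
class AlertRegistry:
    """Hypothetical registry that refuses alerts nobody can explain."""

    def __init__(self):
        self.alerts = {}

    def add(self, name, business_impact, remediation):
        # Both questions must have a non-empty answer before the alert exists.
        answers = {"business impact": business_impact, "remediation": remediation}
        missing = [label for label, text in answers.items() if not text.strip()]
        if missing:
            raise ValueError(f"alert {name!r} rejected, missing: {', '.join(missing)}")
        self.alerts[name] = answers


registry = AlertRegistry()
registry.add("checkout_down", "customers cannot pay", "fail over to standby")
```

Storing the answers next to the alert also gives the alerting system something to link to when the page fires.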
Can We Really Do This?
this is partially an organizational issue
Can We Really Do This?
thought exercise:
if you launched a new web site today,
you really only need one alarm
Can We Really Do This?
“I don’t care if my servers are on fire,
as long as I am still making money”
-- Kevin, actual OmniTI customer
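Taken literally, the one alarm watches the money rather than the servers. The sketch below assumes that completed orders are the business signal and that 15 minutes without one is worth a page; both the signal and the threshold are assumptions for illustration, and `should_page` is a hypothetical name.

```python
def should_page(order_timestamps, now, max_quiet_seconds=900):
    """Page when no order has completed within the last max_quiet_seconds.

    order_timestamps and now are plain numbers in seconds (e.g. Unix time).
    """
    if not order_timestamps:
        return True  # no orders on record at all
    return now - max(order_timestamps) > max_quiet_seconds


print(should_page([1000, 1600], now=1900))  # last order 300 seconds ago
```

A real site would pick the threshold from its own traffic pattern, since a quiet quarter hour at 3 AM may be normal.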
This sounds good but...
Most SA/SRE types want to be
proactive, not reactive.
i.e. they want to alert on leading
indicators, not on problems
This sounds good but...
Carrie: I-I'm just making sure we don't get hit again.
Saul: Well, I'm glad someone's looking out for us, Carrie.
Carrie: I'm serious. I-I missed something once before, I
won't... I can't let that happen again.
Saul: It was ten years ago. Everyone missed something
that day.
Carrie: Yeah, everyone's not me.
Based On A True Story
site down: monitor was checking for a 200
response code.
it failed to notice the absence of a response
code.
easily fixed, but reactive
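One reading of this failure is that the check only reacted to a wrong status code and so stayed quiet when no status came back at all. A minimal sketch of a check that treats silence as failure, using only the Python standard library; `fetch_status` and `site_is_up` are hypothetical names, and the 5 second timeout is an assumption.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code, or None when no response arrives."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except OSError:
        return None      # refused, DNS failure, timeout: no response at all


def site_is_up(status):
    # Only an explicit 200 counts as healthy; None (no response) is a failure.
    return status == 200
```

Usage would be `site_is_up(fetch_status(url))`, so a timeout and a 500 both count as down.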
Based On A True Story
“root cause” ==> OOM
why don’t we alert on OOM?
OOM does not consistently cause outages
Saturday, April 9, 16
Based On A True Story
too many false positives lead to
ignoring alarms
Digression
Friedman, Naparstek, Taussig-Rubbo
Alarmingly Useless: The Case For Banning Car Alarms In NYC
http://transalt.org/files/news/reports/caralarms/report.pdf
Blackstone, Buck, Hakim
Evaluation of alternative policies to combat false emergency calls
http://isc.temple.edu/economics/wkpapers/Pubs/FalsePolicy.pdf
Wickens, Rice, Keller, Hutchins, Hughes, Clayton
False Alerts in Air Traffic Control Conflict Alerting System: Is There A Cry Wolf Effect?
http://www.tc.faa.gov/LOGISTICS/grants/pdf/2007/07-G-002.pdf
Görges M, Markewitz BA, Westenskow DR
Improving Alarm Performance In The Medical Intensive Care Unit Using Delays and Clinical Context
http://www.ncbi.nlm.nih.gov/pubmed/19372334
“In an intensive care unit, alarms are used to call attention to a patient, to
alert a change in the patient's physiology, or to warn of a failure in a medical
device; however, up to 94% of the alarms are false.”
Digression
AESOP
The Boy Who Cried Wolf
Based On A True Story
• send notice of OOM?
• fix the cause of OOM?
• make a useful alert?
Based On A True Story
useful alerting
• script that checks for OOM
• restart app server when found
• find offending process; kill it
• spin up new node; kill old node
in the event all of these fail, send an alert?
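The escalation described above can be sketched as a loop over automated fixes that pages a human only when every fix has been tried. This is an illustrative outline, not the script used at OmniTI: `remediate`, the step names, and the `is_healthy` and `send_alert` callables are all hypothetical, and the step bodies here are stand-ins.

```python
def remediate(steps, is_healthy, send_alert):
    """Try each automated fix in order; alert a human only if all fail.

    steps is a list of (name, callable) pairs. Returns the name of the
    step that restored health, or None if a human was alerted.
    """
    for name, action in steps:
        try:
            action()
        except Exception:
            continue  # a fix that errors out just moves us to the next one
        if is_healthy():
            return name
    send_alert("automated remediation exhausted")
    return None


# Stand-in usage: the second fix "works", so nobody is paged.
state = {"healthy": False}
alerts = []
steps = [
    ("restart app server", lambda: None),
    ("kill offending process", lambda: state.update(healthy=True)),
    ("replace node", lambda: state.update(healthy=True)),
]
fixed_by = remediate(steps, lambda: state["healthy"], alerts.append)
```

Ordering the steps from cheapest to most disruptive keeps the common case fast and leaves node replacement as the last automated resort.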
Based On A True Story
thought exercise:
if you launched a new web site today,
you really only need one alert
In Conclusion
if we need software that runs 24x7, we should
design resiliency into our software,
not human intervention
In Conclusion
thinking doesn’t scale
especially at 2AM
In Conclusion
thanks!
more:
Surge 2016
http://surge.omniti.com
@robtreat2
@omniti
Less Alarming Alerts - SRECon 2016
