SlideShare a Scribd company logo
SITE RELIABILITY
ENGINEERING
FOR GROWING ORGANIZATIONS
My company in 20s
• End-to-end payments platform
• API-First
• Docker, C#, ASP.NET, Java,
Powershell, SQL
• #31 Nilsen top merchant acquirer
• Inc. 5000 fastest growing
company
• STL Top Place to Work
• Sound fun? It is. Come see me.
2
WHAT IS SRE?
• “Ops, if everything is treated as a
software problem”
• Typically experienced software
devs with a passion for
automation & infrastructure
• Sort of like devops, but with more
of a focus on production
automation, resiliency and
scalability
• Google wrote this book – It is
being adopted and explored by
many companies
• SRE for Google won't be SRE for
your team!
3
GROWING COMPANY PROBLEMS
4
• If you don't set a service level expectation, they will form around 100% uptime
• Keeping everything running as-is gets treated as a sunk cost
• Code atrophies or gets frozen, but the business keeps changing
EXPECTATIONS OFTEN OUTPACE CAPACITY
5
• How many more users can we
support at our current growth rate?
• If you buy X, will that let us scale?
• If we agree to buy X, can we wait
until next year's budget?
These are very hard questions to
answer without data and documented
expectations for performance &
uptime.
GROWING COMPANY PROBLEMS
FINANCES GET MORE FORMAL - YOU NEED METRICS TO JUSTIFY ENHANCEMENTS
6
You will need more automation to keep
it running, not more people
• People are an ongoing cost,
automation is a capitalizable
investment
• With a bigger customer base, five
minute outages become damaging
experiences. Machines can react
faster.
• 4 nines (99.99%) is < 5 minutes
downtime per month. How quickly
can you triage an alert?
GROWING COMPANY PROBLEMS
COMPLEXITY IS EVER INCREASING
7
• Improve reaction time to incidents
‒ Focus can be spent on documenting tribal knowledge, minimizing mistakes and
improving RTO
• Learn from mistakes, turn them into opportunities
‒ SRE teams can focus on blameless postmortems, extracting as much marrow as
possible from incidents, then being a champion for change
• Raise awareness for system behavior, weaknesses & strengths
‒ SRE can be an independent consulting agency or PR firm for dev teams
‒ SRE will create and/or publicize metrics to show facts
• Bandwidth dedicated to forward-looking system behaviors.
‒ Usually this is done as time permits (which is limited when companies grow fast).
WHY YOU WANT A DEDICATED SRE TEAM
SOUNDS GOOD! HOW IS IT DONE?
8
‒ Monitor externally the way your customers see you AND the way you see you
‒ There will be false alarms so not everyone should see these
AUTOMATED MONITORING
SOUNDS GOOD! HOW IS IT DONE?
9
LOG INDEXING AND AGGREGATION
10
• Build self-healing systems when we can
‒ Service health checks & automated recovery actions
‒ Desired state configuration
‒ Service Orchestration
• Document procedures/playbooks/runbooks when we can't
SOUNDS GOOD! HOW IS IT DONE?
11
• More than just a socket connection
‒ Does a typical request return a 200-OK?
‒ How many 200/300 Responses vs 400/500?
‒ Can you connect to your downstream
dependencies?
‒ How long have you been up?
• Provide rich info, but quickly
‒ Other endpoints can give more expensive
info
HEALTH CHECKS
12
• SLOs – Service Level Objectives
‒ Where you’d like to be
• SLAs – Service Level Agreements
‒ Where you tell your customers you’ll be
‒ Penalties
‒ More liberal than your SLOs
• Error Budgets
‒ Based on your SLO, how much risk
can you tolerate?
SERVICE LEVELS
SIGNAL VS NOISE
13
• Alert fatigue is real. Keep your alerts actionable.
• Rare errors can be the most interesting, but error velocity is an indicator.
• Strengthen the signal-noise ratio to combat fatigue.
14
MY EXPERIENCES
• Team was formed from various departments
• Carried forward some SRE-related projects from dev
• Matured & documented processes
‒ Playbooks
‒ Dependencies, metrics, app catalog
• Sharing responsibility for prod incidents with operations and
dev teams
• Finding ways to consult on app design & rollout
• We are first-responders, but the dev & ops teams are on call
STORIES FROM THE FIELD
15
5,124 HOURS
AKA CISCO FIELD NOTICE FN-64291
AUTO-IMMUNE DISORDER
AGGRESSIVE HEALTH CHECKING
STORIES FROM THE FIELD
WHAT’S THE STRANGEST PLACE YOU’VE
WORKED A PRODUCTION INCIDENT?
THANK YOU!
• Twitter: jmloeffler
• G+: jmloeffler
• Github: jmloeffler
19
Ad

More Related Content

What's hot (20)

SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
Tori Wieldt
 
SRE in Startup
SRE in StartupSRE in Startup
SRE in Startup
Ladislav Prskavec
 
SRE-iously! Reliability!
SRE-iously! Reliability!SRE-iously! Reliability!
SRE-iously! Reliability!
New Relic
 
Building an SRE Organization @ Squarespace
Building an SRE Organization @ SquarespaceBuilding an SRE Organization @ Squarespace
Building an SRE Organization @ Squarespace
Franklin Angulo
 
How to SRE when you have no SRE
How to SRE when you have no SREHow to SRE when you have no SRE
How to SRE when you have no SRE
Squadcast Inc
 
Site reliability engineering - Lightning Talk
Site reliability engineering - Lightning TalkSite reliability engineering - Lightning Talk
Site reliability engineering - Lightning Talk
Michae Blakeney
 
What is Site Reliability Engineering (SRE)
What is Site Reliability Engineering (SRE)What is Site Reliability Engineering (SRE)
What is Site Reliability Engineering (SRE)
jeetendra mandal
 
How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)
Setyo Legowo
 
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
DevOpsDays Tel Aviv
 
SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...
DevClub_lv
 
DevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE ConceptsDevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE Concepts
Rauno De Pasquale
 
SRE vs DevOps
SRE vs DevOpsSRE vs DevOps
SRE vs DevOps
Levon Avakyan
 
SRE 101
SRE 101SRE 101
SRE 101
Diego Pacheco
 
SRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil EliminationSRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil Elimination
Dr Ganesh Iyer
 
The Next Wave of Reliability Engineering
The Next Wave of Reliability EngineeringThe Next Wave of Reliability Engineering
The Next Wave of Reliability Engineering
Michael Kehoe
 
SRE From Scratch
SRE From ScratchSRE From Scratch
SRE From Scratch
Grier Johnson
 
A Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityA Crash Course in Building Site Reliability
A Crash Course in Building Site Reliability
Acquia
 
DevOps & SRE at Google Scale
DevOps & SRE at Google ScaleDevOps & SRE at Google Scale
DevOps & SRE at Google Scale
Kaushik Bhattacharya
 
SRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLASRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLA
Dr Ganesh Iyer
 
What's an SRE at Criteo - Meetup SRE Paris
What's an SRE at Criteo - Meetup SRE ParisWhat's an SRE at Criteo - Meetup SRE Paris
What's an SRE at Criteo - Meetup SRE Paris
Clément Michaud
 
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
Tori Wieldt
 
SRE-iously! Reliability!
SRE-iously! Reliability!SRE-iously! Reliability!
SRE-iously! Reliability!
New Relic
 
Building an SRE Organization @ Squarespace
Building an SRE Organization @ SquarespaceBuilding an SRE Organization @ Squarespace
Building an SRE Organization @ Squarespace
Franklin Angulo
 
How to SRE when you have no SRE
How to SRE when you have no SREHow to SRE when you have no SRE
How to SRE when you have no SRE
Squadcast Inc
 
Site reliability engineering - Lightning Talk
Site reliability engineering - Lightning TalkSite reliability engineering - Lightning Talk
Site reliability engineering - Lightning Talk
Michae Blakeney
 
What is Site Reliability Engineering (SRE)
What is Site Reliability Engineering (SRE)What is Site Reliability Engineering (SRE)
What is Site Reliability Engineering (SRE)
jeetendra mandal
 
How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)
Setyo Legowo
 
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
DevOpsDays Tel Aviv
 
SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...
DevClub_lv
 
DevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE ConceptsDevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE Concepts
Rauno De Pasquale
 
SRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil EliminationSRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil Elimination
Dr Ganesh Iyer
 
The Next Wave of Reliability Engineering
The Next Wave of Reliability EngineeringThe Next Wave of Reliability Engineering
The Next Wave of Reliability Engineering
Michael Kehoe
 
A Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityA Crash Course in Building Site Reliability
A Crash Course in Building Site Reliability
Acquia
 
SRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLASRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLA
Dr Ganesh Iyer
 
What's an SRE at Criteo - Meetup SRE Paris
What's an SRE at Criteo - Meetup SRE ParisWhat's an SRE at Criteo - Meetup SRE Paris
What's an SRE at Criteo - Meetup SRE Paris
Clément Michaud
 

Similar to Site reliability engineering (20)

Value Driven Development by Dave Thomas
Value Driven Development by Dave Thomas Value Driven Development by Dave Thomas
Value Driven Development by Dave Thomas
Naresh Jain
 
DevOps Transformation - Another View
DevOps Transformation - Another ViewDevOps Transformation - Another View
DevOps Transformation - Another View
Agron Fazliu
 
How Duct Tape and Bubblegum are Hurting Your Business
How Duct Tape and Bubblegum are Hurting Your BusinessHow Duct Tape and Bubblegum are Hurting Your Business
How Duct Tape and Bubblegum are Hurting Your Business
OmniVue
 
Patching is Your Friend in the New World Order of EPM and ERP Cloud
Patching is Your Friend in the New World Order of EPM and ERP CloudPatching is Your Friend in the New World Order of EPM and ERP Cloud
Patching is Your Friend in the New World Order of EPM and ERP Cloud
Datavail
 
DevOps Best Practices and Implementation Roadmap
DevOps Best Practices and Implementation RoadmapDevOps Best Practices and Implementation Roadmap
DevOps Best Practices and Implementation Roadmap
Jason Montgomery
 
ISACA Ireland Keynote 2015
ISACA Ireland Keynote 2015ISACA Ireland Keynote 2015
ISACA Ireland Keynote 2015
Shannon Lietz
 
Webinar: Demonstrating Business Value for DevOps & Continuous Delivery
Webinar: Demonstrating Business Value for DevOps & Continuous DeliveryWebinar: Demonstrating Business Value for DevOps & Continuous Delivery
Webinar: Demonstrating Business Value for DevOps & Continuous Delivery
XebiaLabs
 
UCAAS portals make or buy
UCAAS portals make or buyUCAAS portals make or buy
UCAAS portals make or buy
AbdelKander
 
DevSecCon KeyNote London 2015
DevSecCon KeyNote London 2015DevSecCon KeyNote London 2015
DevSecCon KeyNote London 2015
Shannon Lietz
 
DevSecCon Keynote
DevSecCon KeynoteDevSecCon Keynote
DevSecCon Keynote
Shannon Lietz
 
GWAVACon 2013: Novell Service Desk Design, Deployment
GWAVACon 2013: Novell Service Desk Design, DeploymentGWAVACon 2013: Novell Service Desk Design, Deployment
GWAVACon 2013: Novell Service Desk Design, Deployment
GWAVA
 
Bert Heitink - Technical Insights for the SOC as Technical Centre for IT Secu...
Bert Heitink - Technical Insights for the SOC as Technical Centre for IT Secu...Bert Heitink - Technical Insights for the SOC as Technical Centre for IT Secu...
Bert Heitink - Technical Insights for the SOC as Technical Centre for IT Secu...
NoNameCon
 
To Deliver, Discover We Must - A value-driven approach to agile planning
To Deliver, Discover We Must - A value-driven approach to agile planningTo Deliver, Discover We Must - A value-driven approach to agile planning
To Deliver, Discover We Must - A value-driven approach to agile planning
Raj Indugula
 
The Evolution of the Enterprise Operating Model - Ryan Lockard
The Evolution of the Enterprise Operating Model - Ryan LockardThe Evolution of the Enterprise Operating Model - Ryan Lockard
The Evolution of the Enterprise Operating Model - Ryan Lockard
agilemaine
 
7 tips for better requirements management
7 tips for better requirements management7 tips for better requirements management
7 tips for better requirements management
Willy Marroquin (WillyDevNET)
 
AgileCamp 2014 Track 1: Accelerating Agile Enterprise Adoption with Scaled Ag...
AgileCamp 2014 Track 1: Accelerating Agile Enterprise Adoption with Scaled Ag...AgileCamp 2014 Track 1: Accelerating Agile Enterprise Adoption with Scaled Ag...
AgileCamp 2014 Track 1: Accelerating Agile Enterprise Adoption with Scaled Ag...
Hyperdrive Agile Leadership (powered by Bratton & Company)
 
Brighttalk understanding the promise of sde - final
Brighttalk   understanding the promise of sde - finalBrighttalk   understanding the promise of sde - final
Brighttalk understanding the promise of sde - final
Andrew White
 
Turn Your Business Vision into Reality with “Jewel – ERP"
Turn Your Business Vision into Reality with “Jewel – ERP"Turn Your Business Vision into Reality with “Jewel – ERP"
Turn Your Business Vision into Reality with “Jewel – ERP"
Faruk Shah
 
What to consider when monitoring microservices
What to consider when monitoring microservicesWhat to consider when monitoring microservices
What to consider when monitoring microservices
Particular Software
 
DevOps & Site Reliability Engineering (SRE).pptx
DevOps & Site Reliability Engineering (SRE).pptxDevOps & Site Reliability Engineering (SRE).pptx
DevOps & Site Reliability Engineering (SRE).pptx
abiguimeleroy
 
Value Driven Development by Dave Thomas
Value Driven Development by Dave Thomas Value Driven Development by Dave Thomas
Value Driven Development by Dave Thomas
Naresh Jain
 
DevOps Transformation - Another View
DevOps Transformation - Another ViewDevOps Transformation - Another View
DevOps Transformation - Another View
Agron Fazliu
 
How Duct Tape and Bubblegum are Hurting Your Business
How Duct Tape and Bubblegum are Hurting Your BusinessHow Duct Tape and Bubblegum are Hurting Your Business
How Duct Tape and Bubblegum are Hurting Your Business
OmniVue
 
Patching is Your Friend in the New World Order of EPM and ERP Cloud
Patching is Your Friend in the New World Order of EPM and ERP CloudPatching is Your Friend in the New World Order of EPM and ERP Cloud
Patching is Your Friend in the New World Order of EPM and ERP Cloud
Datavail
 
DevOps Best Practices and Implementation Roadmap
DevOps Best Practices and Implementation RoadmapDevOps Best Practices and Implementation Roadmap
DevOps Best Practices and Implementation Roadmap
Jason Montgomery
 
ISACA Ireland Keynote 2015
ISACA Ireland Keynote 2015ISACA Ireland Keynote 2015
ISACA Ireland Keynote 2015
Shannon Lietz
 
Webinar: Demonstrating Business Value for DevOps & Continuous Delivery
Webinar: Demonstrating Business Value for DevOps & Continuous DeliveryWebinar: Demonstrating Business Value for DevOps & Continuous Delivery
Webinar: Demonstrating Business Value for DevOps & Continuous Delivery
XebiaLabs
 
UCAAS portals make or buy
UCAAS portals make or buyUCAAS portals make or buy
UCAAS portals make or buy
AbdelKander
 
DevSecCon KeyNote London 2015
DevSecCon KeyNote London 2015DevSecCon KeyNote London 2015
DevSecCon KeyNote London 2015
Shannon Lietz
 
GWAVACon 2013: Novell Service Desk Design, Deployment
GWAVACon 2013: Novell Service Desk Design, DeploymentGWAVACon 2013: Novell Service Desk Design, Deployment
GWAVACon 2013: Novell Service Desk Design, Deployment
GWAVA
 
Bert Heitink - Technical Insights for the SOC as Technical Centre for IT Secu...
Bert Heitink - Technical Insights for the SOC as Technical Centre for IT Secu...Bert Heitink - Technical Insights for the SOC as Technical Centre for IT Secu...
Bert Heitink - Technical Insights for the SOC as Technical Centre for IT Secu...
NoNameCon
 
To Deliver, Discover We Must - A value-driven approach to agile planning
To Deliver, Discover We Must - A value-driven approach to agile planningTo Deliver, Discover We Must - A value-driven approach to agile planning
To Deliver, Discover We Must - A value-driven approach to agile planning
Raj Indugula
 
The Evolution of the Enterprise Operating Model - Ryan Lockard
The Evolution of the Enterprise Operating Model - Ryan LockardThe Evolution of the Enterprise Operating Model - Ryan Lockard
The Evolution of the Enterprise Operating Model - Ryan Lockard
agilemaine
 
Brighttalk understanding the promise of sde - final
Brighttalk   understanding the promise of sde - finalBrighttalk   understanding the promise of sde - final
Brighttalk understanding the promise of sde - final
Andrew White
 
Turn Your Business Vision into Reality with “Jewel – ERP"
Turn Your Business Vision into Reality with “Jewel – ERP"Turn Your Business Vision into Reality with “Jewel – ERP"
Turn Your Business Vision into Reality with “Jewel – ERP"
Faruk Shah
 
What to consider when monitoring microservices
What to consider when monitoring microservicesWhat to consider when monitoring microservices
What to consider when monitoring microservices
Particular Software
 
DevOps & Site Reliability Engineering (SRE).pptx
DevOps & Site Reliability Engineering (SRE).pptxDevOps & Site Reliability Engineering (SRE).pptx
DevOps & Site Reliability Engineering (SRE).pptx
abiguimeleroy
 
Ad

Recently uploaded (20)

Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
Procurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptxProcurement Insights Cost To Value Guide.pptx
Procurement Insights Cost To Value Guide.pptx
Jon Hansen
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Ad

Site reliability engineering

  • 2. My company in 20s • End-to-end payments platform • API-First • Docker, C#, ASP.NET, Java, Powershell, SQL • #31 Nilsen top merchant acquirer • Inc. 5000 fastest growing company • STL Top Place to Work • Sound fun? It is. Come see me. 2
  • 3. WHAT IS SRE? • “Ops, if everything is treated as a software problem” • Typically experienced software devs with a passion for automation & infrastructure • Sort of like devops, but with more of a focus on production automation, resiliency and scalability • Google wrote this book – It is being adopted and explored by many companies • SRE for Google won't be SRE for your team! 3
  • 4. GROWING COMPANY PROBLEMS 4 • If you don't set a service level expectation, they will form around 100% uptime • Keeping everything running as-is gets treated as a sunk cost • Code atrophies or gets frozen, but the business keeps changing EXPECTATIONS OFTEN OUTPACE CAPACITY
  • 5. 5 • How many more users can we support at our current growth rate? • If you buy X, will that let us scale? • If we agree to buy X, can we wait until next year's budget? These are very hard questions to answer without data and documented expectations for performance & uptime. GROWING COMPANY PROBLEMS FINANCES GET MORE FORMAL - YOU NEED METRICS TO JUSTIFY ENHANCEMENTS
  • 6. 6 You will need more automation to keep it running, not more people • People are an ongoing cost, automation is a capitalizable investment • With a bigger customer base, five minute outages become damaging experiences. Machines can react faster. • 4 nines (99.99%) is < 5 minutes downtime per month. How quickly can you triage an alert? GROWING COMPANY PROBLEMS COMPLEXITY IS EVER INCREASING
  • 7. 7 • Improve reaction time to incidents ‒ Focus can be spent on documenting tribal knowledge, minimizing mistakes and improving RTO • Learn from mistakes, turn them into opportunities ‒ SRE teams can focus on blameless postmortems, extracting as much marrow as possible from incidents, then being a champion for change • Raise awareness for system behavior, weaknesses & strengths ‒ SRE can be an independent consulting agency or PR firm for dev teams ‒ SRE will create and/or publicize metrics to show facts • Bandwidth dedicated to forward-looking system behaviors. ‒ Usually this is done as time permits (which is limited when companies grow fast). WHY YOU WANT A DEDICATED SRE TEAM
  • 8. SOUNDS GOOD! HOW IS IT DONE? 8 ‒ Monitor externally the way your customers see you AND the way you see you ‒ There will be false alarms so not everyone should see these AUTOMATED MONITORING
  • 9. SOUNDS GOOD! HOW IS IT DONE? 9 LOG INDEXING AND AGGREGATION
  • 10. 10 • Build self-healing systems when we can ‒ Service health checks & automated recovery actions ‒ Desired state configuration ‒ Service Orchestration • Document procedures/playbooks/runbooks when we can't SOUNDS GOOD! HOW IS IT DONE?
  • 11. 11 • More than just a socket connection ‒ Does a typical request return a 200-OK? ‒ How many 200/300 Responses vs 400/500? ‒ Can you connect to your downstream dependencies? ‒ How long have you been up? • Provide rich info, but quickly ‒ Other endpoints can give more expensive info HEALTH CHECKS
  • 12. 12 • SLOs – Service Level Objectives ‒ Where you’d like to be • SLAs – Service Level Agreements ‒ Where you tell your customers you’ll be ‒ Penalties ‒ More liberal than your SLOs • Error Budgets ‒ Based on your SLO, how much risk can you tolerate? SERVICE LEVELS
  • 13. SIGNAL VS NOISE 13 • Alert fatigue is real. Keep your alerts actionable. • Rare errors can be the most interesting, but error velocity is an indicator. • Strengthen the signal-noise ratio to combat fatigue.
  • 14. 14 MY EXPERIENCES • Team was formed from various departments • Carried forward some SRE-related projects from dev • Matured & documented processes ‒ Playbooks ‒ Dependencies, metrics, app catalog • Sharing responsibility for prod incidents with operations and dev teams • Finding ways to consult on app design & rollout • We are first-responders, but the dev & ops teams are on call
  • 15. STORIES FROM THE FIELD 15
  • 16. 5,124 HOURS AKA CISCO FIELD NOTICE FN-64291
  • 18. STORIES FROM THE FIELD WHAT’S THE STRANGEST PLACE YOU’VE WORKED A PRODUCTION INCIDENT?
  • 19. THANK YOU! • Twitter: jmloeffler • G+: jmloeffler • Github: jmloeffler 19