Jorge Salamero Sanz <jsalamero@serverdensity.com>
CfgMgmtCamp 1 Feb 2016
War Games - Flight training for DevOps
The Cost of Uptime
$ 3.55bn 2015 Q4
$ 1.21bn 2015 Q2
$ 4.1bn 2015 Q1
How much do you spend?
● Infrastructure automation
● Configuration automation
● Continuous testing
● Continuous deployment / delivery
● Monitoring
● Logs, error handling
● Feedback
● Human Ops
DevOps lifecycle
● Prepare
● Respond
● Postmortem
Expect downtime
Flight training for DevOps
● Power failure to half of our servers
● Automated failover unavailable
(known failure condition)
● Manual DNS switch required
● Expected impact: 20 min
● Actual impact: 43 min
Incident example
● Unfamiliarity with the process
● Pressure of time-sensitive event
(panic effect)
● Escalation introduces delays
The Human Factor
● Extended use of checklists
● Not to be followed blindly: use knowledge
and experience
● Independent system
● Searchable
● List of known issues and
documented workarounds/fixes
Documented procedures
● Realistic incident simulation
● Practice general response process
● Practice specific incident response
● Deficiencies: practice and improve
the process
Practiced procedures
● First responder, acknowledge alert
● Load incident response checklist
● Log into #ops-war-room in Slack
● Log incident into JIRA
● Begin investigation
General response process
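The general response process above can be sketched as a small checklist runner. This is only an illustration: the step wording follows the slide, while `ChecklistRun` and its methods are hypothetical names, not part of any tool mentioned in the talk.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Steps taken from the slide; the class and method names are hypothetical.
GENERAL_RESPONSE_STEPS = [
    "First responder acknowledges the alert",
    "Load the incident response checklist",
    "Log into #ops-war-room in Slack",
    "Log the incident in JIRA",
    "Begin investigation",
]


@dataclass
class ChecklistRun:
    """Tracks progress through an ordered checklist, one step at a time."""
    steps: list
    completed: list = field(default_factory=list)  # (step, UTC timestamp) pairs

    def next_step(self):
        """Return the next unfinished step, or None when the list is done."""
        if len(self.completed) < len(self.steps):
            return self.steps[len(self.completed)]
        return None

    def complete(self, when=None):
        """Mark the next step as done and record when it happened."""
        step = self.next_step()
        if step is None:
            raise ValueError("checklist already finished")
        self.completed.append((step, when or datetime.now(timezone.utc)))
        return step

    @property
    def done(self):
        return len(self.completed) == len(self.steps)
```

Keeping a timestamp per step means the same record can later feed the postmortem timeline.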
● The “limits of human memory and
attention”
○ Complexity
○ Stress and fatigue
○ Ego
● Pilots, doctors, divers:
Bruce Willis Ruins All Films
(BCD, weights, releases, air, final)
Pre-flight checklists
Flight training for DevOps
● Increase confidence
● Reduce panic
● Better coordination
● Trust relationships
● Improves time to resolution
Humans
● Replica environment
● or mock command line
● Record actions and timing
● Multiple failures
● Unexpected results
Realistic scenarios
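"Record actions and timing" could look roughly like the sketch below, which logs each action during a drill and compares the elapsed time with the expected impact. `DrillLog` and its methods are hypothetical names chosen for illustration; the talk does not describe a specific tool.

```python
from dataclasses import dataclass, field


@dataclass
class DrillLog:
    """Records what responders did during a simulated incident, and when."""
    expected_minutes: float
    actions: list = field(default_factory=list)  # (minutes since start, action)

    def record(self, minute, action):
        """Append one action with its offset from the start of the drill."""
        self.actions.append((minute, action))

    def actual_minutes(self):
        """Elapsed time up to the last recorded action (0 if none recorded)."""
        return max((m for m, _ in self.actions), default=0)

    def overrun_minutes(self):
        """How far the drill ran past the expected impact (negative if under)."""
        return self.actual_minutes() - self.expected_minutes
```

Fed with the numbers from the incident example (expected impact 20 min, actual 43 min), `overrun_minutes()` would report 23 minutes, which is the gap the review session then has to explain.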
● Team and individual test of response
● Run real commands
● Training the people
● Training the procedures
● Training the tools
Simulation goals
● Objective review
● Suggestions for improvements
● Do it again
● Scenario evolves
● People forget
loop(): review and repeat
● Failure sucks
● Fearless, blameless
● Significant learning
● Restores confidence
● Increases credibility
Postmortem
● Short regular updates
● Even “we’re still looking into it”
● ~1 week to publish full version
○ follow-up incidents
○ check with 3rd party providers
○ timeline for required changes
Postmortem Timing
● Root cause
● Turn of events that led to failure
● Steps to identify & isolate the cause
● Services affected
● How we fixed it
● What we have learned and changed
Postmortem Content
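One way to keep a postmortem draft honest is to check it against the section list above before publishing. The sketch below is an assumption about how that might be done; `POSTMORTEM_SECTIONS` mirrors the slide, and `missing_sections` is a hypothetical helper.

```python
# Section names follow the slide; the helper name is hypothetical.
POSTMORTEM_SECTIONS = [
    "Root cause",
    "Turn of events that led to failure",
    "Steps to identify & isolate the cause",
    "Services affected",
    "How we fixed it",
    "What we have learned and changed",
]


def missing_sections(report):
    """Return the sections that are absent or blank in a draft postmortem.

    `report` is a dict mapping section name to its text.
    """
    return [s for s in POSTMORTEM_SECTIONS if not report.get(s, "").strip()]
```

A draft is ready for the "~1 week" full version once `missing_sections(draft)` returns an empty list.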
Jorge Salamero Sanz
Chief Developer Advocate
@bencerillo
@serverdensity
our DevOps stories, no product spam
blog.serverdensity.com