Principles of System Safety Engineering and Management
SYSTEM SAFETY
ENGINEERING AND MANAGEMENT
Felix Redmill
Redmill Consultancy, London
[email protected]
[Figure: risk scale. Unacceptable region (risk cannot be justified except in extraordinary circumstances), bounded by the limit of tolerability threshold]
P Controller
(c) Felix Redmill, 2011
QUESTIONS
How could the accident have been avoided?
Better algorithm
How could the software designer have known that a better
algorithm was required?
Domain knowledge
But we can't be sure that such a fault won't be made, so
how can we find and correct such faults?
Risk analysis techniques
FUNCTIONAL SAFETY:
ACHIEVING UTILITY AND CREATING RISK
We are concerned with safety that depends on the correct
functioning of equipment
[Figure: control system, equipment under control, utility (plus risks)]
FUNCTIONAL SAFETY 2
Functional safety depends on hardware, software,
humans, data, and interactions between all of these
[Figure: control system, equipment under control, utility (plus risks)]
SAFETY - A DEFINITION
Safety: freedom from unacceptable risk (IEC)
Safety is not directly measurable
But it may be addressed via risk
VOCABULARY
In common usage, 'risk' is used to imply:
Likelihood (e.g. there's a high risk of infection)
Consequence (e.g. infection carries a high risk)
A combination of the two
Something, perhaps unspecified, to be avoided (e.g.
going out into the night is risky)
RISK - DEFINITIONS
Risk: A combination of the probability of occurrence and the
SOME PRINCIPLES
Absolute safety (zero risk) cannot be achieved
Doing it well does not guarantee safety
Correct functionality ≠ safety
We must address safety as well as functionality
Reliability is not a guarantee of safety
We require confidence of safety in advance, not
retrospectively
We must not only achieve safety but also demonstrate it
A RISK-BASED APPROACH
If safety is addressed via risk
We must base our safety-management actions on risk
We must understand the risks in order to manage safety
R = f(P.C) or f(L.C)
A SIMPLE CALCULATION
Probability of a 100-year flood = 0.01/year
Expected damage = 50M
R (financial, expected value)
= 0.01 x 50,000,000 = 500,000/year
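The flood calculation can be sketched in a few lines of Python. This is a minimal illustration of risk as an expected value, not a general risk model; the function name and the range check are assumptions made for the example.

```python
def expected_annual_loss(probability_per_year: float, consequence: float) -> float:
    """Risk as an expected value: probability of the event times its cost."""
    if not 0.0 <= probability_per_year <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    return probability_per_year * consequence


# Figures from the slide: a 100-year flood and 50M of expected damage.
flood_risk = expected_annual_loss(0.01, 50_000_000)
print(f"{flood_risk:,.0f} per year")
```

Setting either argument to zero returns zero, which is the arithmetic behind the statement that risk is eliminated if either probability or consequence is reduced to zero.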
COMMON-MODE FAILURES
Identifying common-mode failures is a crucial part of
traditional risk analysis
ASSUMPTIONS
We never have full knowledge
CONTROL OF RISK
Risk is eliminated if either Pr or C is reduced to zero
Avoid
Eliminate
Reduce
Minimise - within defined constraints
Transfer or share
- Financial risks may be insured
- Technical risks may be transferred to experts, maintainers
Hedge
- Make a second investment, which is likely to succeed if the
first fails
Accept
- Must do this in the end, when risks are deemed tolerable
- Need contingency plans to reduce consequences
Voluntary or involuntary
Control in hands of self or another
Statistical or personal
Level of knowledge, uncertainty
Level of dread or fear evoked
Short-term or long-term view
Severity of outcome
Value of the prize
Level of excitement
Status quo bias
UNINTENDED CONSEQUENCES
Actions are always likely to have some unintended results
e.g. downsizing, but losing the wrong staff
RISK COMMUNICATION
The risk analyst is usually not the decision maker
Risk information must be transferred (liver biopsy)
Usually only results are communicated
e.g. 'There's an 80% chance of success'
But are they correct (Bristol Royal Infirmary)?
Overconfidence bias
Managers rely heavily on risk information that is collected,
analysed, and packaged by other staff
But what confidence do the staff have in it?
What were the information sources, analysis methods,
and framing choices?
What uncertainties exist?
What assumptions were made?
RISK IS TRICKY
We manage risks daily, with reasonable success
Our risk management is intuitive
[Figure: diagram relating Safe state (in any mode), Failure or unsafe deviation, Danger, Accident, Disaster, Restoration, and Recovery]
[Figure: overall safety lifecycle; phase numbers as shown in the diagram]
Overall scope definition
Overall safety requirements
Safety requirements allocation
7 Safety validation
8 Installation & commissioning
Realisation of:
9 Safety-related E/E/PES
10 Other tech. safety-related systems
11 External risk reduction facilities
12 Overall installation and commissioning
13 Overall safety validation
14 Overall operation, maintenance & repair
15 Overall modification and retrofit
16 Decommissioning or disposal
CERN, May '11
GENERALIZED EXAMPLE
Claim: System S is acceptably safe when used in Application A
Claims are presented as structured arguments
EVIDENCE PLANNING - 1
Any system-related documentation (project, operational,
maintenance) may be required as evidence
EVIDENCE PLANNING - 2
The safety case structure should be designed early
Evidential requirements should be identified early and
planned for
Safety case development should be commenced early
Evidence of software adequacy may be derived from
Analysis
Testing
Proven-in-use
Process
GOLF-SHOT RISK
What is the probability of hitting your golf ball into a bunker?
What is the consequence (potential consequence) of hitting
your ball into a bunker?
[Figure: risk diagram. Labels: Causal analysis; Consequence analysis; Risk; Threat of damage; Potential for unreliability; Potential for unavailability]
PRELIMINARY REQUIREMENTS
Knowledge of the subject of risk
Understanding of the current situation and context
Knowledge of the purpose of the intended analysis
The questions (e.g. tolerability) that it must answer
Such knowledge and understanding are essential to
searching for the appropriate information
BOTTOM-UP ANALYSIS
(Herald of Free Enterprise)
Bosun asleep in his cabin when ship is due to depart
Ship capsizes
Lives lost
TOP-DOWN ANALYSIS
(Herald of Free Enterprise)
[Figure: fault tree. Top event: Ship puts to sea with bow doors open. Contributing events shown: Bosun not on ship; Bosun on board but not at station; Bosun asleep in cabin; Bosun in bar; Door or hinge problem; Problem with power supply]
DEFINITION OF SCOPE
Types of risks to be studied
e.g. safety, security, financial
Risks to whom or what
e.g. employees, all people, environment, the company,
the mission
Study boundary
Plant boundary
Region
Admissible sources of information
e.g. stakeholders (which?), experts, public
SCOPE OF STUDY
How will the results be used?
Scope definition
Influences the nature and direction of the analysis
Is a predisposing factor on its results
Hazard analysis,
Risk analysis
Risk assessment,
Risk evaluation
Risk mitigation,
Risk reduction,
Risk management
AN APPROPRIATE CONVENTION?
[Figure: proposed convention. Stages: Scope definition; Hazard identification; Hazard analysis; Risk assessment; Risk communication; Risk mitigation; Emergency planning. Grouping labels: Risk analysis; Risk management]
HAZARD IDENTIFICATION
The foundation of risk analysis
- Identify the hazards (what could go wrong)
- Deduce their causes
- Determine whether they could lead to undesirable
outcomes
Knowing chains of cause and effect facilitates decisions
on where to take corrective action
But many accidents are caused by unexpected
interactions rather than by failures
HAZARD ID METHODS
Checklists
Brainstorming
Expert judgement
What-if analysis
Audits and reports
Site inspections
Formal and informal staff interviews
Interviews with others, such as customers, visitors
Specialised techniques
(O'Halloran) For the Navy, Captain Hurford concedes that the possibility that this critical
section of pipework might fail was never even considered in the many years that these 12
submarines of the Swiftsure and Trafalgar classes have been in service.
(Hurford) "This component was analysed against its duty that it saw in service and was
supposed never to crack and so the fact that this crack had occurred in this component in the
way that it did and caused a leak before we had detected it, is a serious issue."
(O'Halloran) How big a question mark does this place over your general risk probability
assumptions about the whole working of one of these nuclear reactors?
(Hurford) "It places a question on the surveillance that we do when the submarines are in
refit and maintenance, unquestionably."
(O'Halloran) How long have these various submarines been in service?
(Hurford) "The oldest of the Swiftsure class came into service in the early seventies."
(O'Halloran) So has this area of the pipework ever been looked at in any of the submarines,
the 12 hunter killer submarines now in service?
(Hurford) "No it hasn't, because the design of the component was understood and the
calculations showed and experience showed that there would be no problem."
(O'Halloran) But the calculations were wrong?
(Hurford) "Clearly there is something wrong with that component that caused the crack and
we don't know if it was the calculations or whether it was the way it was made and that's what
is being found out in the analysis at the moment."
HAZARD ANALYSIS
Analyse the identified hazards to determine
Potential consequences
Worst credible
Most likely
Ways in which they could lead to undesirable outcomes
Decisions - risk analysis is not carried out for its own sake
but to inform decisions (usually made by others)
Hazard identification and analysis may be carried out
concurrently
RISK ASSESSMENT
To determine the tolerability of analysed risks
So that risk-management decisions can be taken
Need tolerability criteria
Tolerability differs according to circumstance
e.g. medical
TOLERABLE RISK
Risk accepted in a given context based on the current
values of society
Not trivial to determine
Differs across industry sectors
May change with time
Depends on perception
Should be determined by discussion among parties,
including
Those posing the risks
Those to be exposed to the risks
Other stakeholders, e.g. regulators
TECHNIQUES
Techniques support risk analysis
They should not govern it
TECHNIQUES TO BE CONSIDERED
[Figure: process schematic. Components: Fluid A, pump P1, valves V1 and V2, Vat R, pump P2, Fluid B, valve V3]
Local effects
System-level effects
Corrective action may be proposed
Best done by a team of experts with different viewpoints
Integration of components
Installation
Item | Failure mode | Possible causes | Local effects | System-level effects | Proposed correction
Pump P1 | Fails to start | 1. No power; 2. Burnt out | Fluid A does not flow | Excess of Fluid B in Vat R | Monitor pump operation
Pump P1 | Burns out | 1. Loss of lubricant; 2. Excessive temperature | Fluid A does not flow | Excess of Fluid B in Vat R | Add alarm to pump monitor
Valve V1 | Sticks closed | 1. No power; 2. Jammed | Fluid A cannot flow | Excess of Fluid B in Vat R | Monitor valve operation
Valve V1 | Sticks open | 1. No power; 2. Jammed | Cannot stop flow of Fluid A | Danger of excess of Fluid A in Vat R | Introduce additional valve in series
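An FMEA worksheet of this kind is easy to hold as structured records, which makes it simple to ask which failure modes share a system-level effect. The sketch below is illustrative only: the class name, field names, and grouping helper are assumptions, and the row contents are as read from the worksheet for pump P1 and valve V1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FmeaRow:
    item: str
    failure_mode: str
    causes: tuple
    local_effect: str
    system_effect: str
    proposed_correction: str


ROWS = [
    FmeaRow("Pump P1", "Fails to start", ("No power", "Burnt out"),
            "Fluid A does not flow", "Excess of Fluid B in Vat R",
            "Monitor pump operation"),
    FmeaRow("Pump P1", "Burns out", ("Loss of lubricant", "Excessive temperature"),
            "Fluid A does not flow", "Excess of Fluid B in Vat R",
            "Add alarm to pump monitor"),
    FmeaRow("Valve V1", "Sticks closed", ("No power", "Jammed"),
            "Fluid A cannot flow", "Excess of Fluid B in Vat R",
            "Monitor valve operation"),
    FmeaRow("Valve V1", "Sticks open", ("No power", "Jammed"),
            "Cannot stop flow of Fluid A", "Danger of excess of Fluid A in Vat R",
            "Introduce additional valve in series"),
]


def group_by_system_effect(rows):
    """Collect failure modes under the system-level effect they lead to."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row.system_effect, []).append(row)
    return grouped


for effect, rows in group_by_system_effect(ROWS).items():
    modes = ", ".join(f"{r.item} {r.failure_mode.lower()}" for r in rows)
    print(f"{effect}: {modes}")
```

Grouping by effect gives low-level causes that could be compared with those found by a top-down analysis of the same system.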
HAZOP guide words (meanings not shown): No, More, Less, As well as, Part of, Reverse, Other than, Early, Late, Before, After
HAZOP OVERVIEW
Introductions
Presentation of design representation
Examine design representation methodically
Possible deviation from design intent?
- Yes: examine consequences and causes, document results, define follow-up work
- No: carry on
Time up, or completed study?
- No: continue examining the design representation
- Yes: agree documentation, sign off meeting
HAZOP SUMMARY
HAZOP is a powerful technique for hazard identification and analysis
It requires a team of carefully chosen members
It depends on planning, preparation, and leadership
A study usually requires several meetings
Study proceeds methodically
Guide words are used to focus attention
Outputs are the identified hazards, recommendations,
questions
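The way guide words focus attention can be sketched by pairing each guide word with each attribute of the design under study, so that every pairing becomes a prompt for the team. This is a minimal sketch: the attribute names are hypothetical, and in a real study the team would discard pairings that are not meaningful.

```python
from itertools import product

GUIDE_WORDS = [
    "No", "More", "Less", "As well as", "Part of", "Reverse",
    "Other than", "Early", "Late", "Before", "After",
]


def candidate_deviations(attributes, guide_words=GUIDE_WORDS):
    """Pair every attribute with every guide word to prompt discussion."""
    return [f"{word}: {attribute}"
            for attribute, word in product(attributes, guide_words)]


# Hypothetical attributes of a two-fluid mixing process.
prompts = candidate_deviations(["flow of Fluid A", "flow of Fluid B"])
for prompt in prompts[:3]:
    print(prompt)
```

The code only generates the questions. Judging whether a deviation is possible, and what its causes and consequences would be, remains the work of the study team.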
[Figure: event tree. Branch events, each Yes or No: Valve operates; Valve monitor O.K.; Alarm relay O.K.; Claxon O.K.; Operator responds. Outcomes: one safe outcome and several unsafe outcomes]
[Figure: event tree. Branch events, each Yes or No: Valve monitor functions; Alarm relay operates; Claxon sounds; Operator responds]
[Figure: event tree. Initiating event: Fire starts. Branch events: Fire spreads quickly; Sprinkler fails to work; People cannot escape. Branch probabilities shown: Yes (P=0.4) / No (P=0.6); Yes (P=0.2) / No (P=0.8); Yes (P=0.1) / No (P=0.9). Resulting events: Multiple fatalities; Damage and loss; Fire controlled; Fire contained]
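The fire event tree can be quantified by multiplying the branch probabilities along each path. The sketch below rests on two assumptions that the diagram layout does not settle: that the probabilities pair with the events as P(spreads quickly) = 0.1, P(sprinkler fails) = 0.2, and P(cannot escape) = 0.4, and that the four resulting events map onto the paths as written in the function. All results are conditional on a fire having started.

```python
import math

# Assumed pairing of the branch probabilities with the branch events.
P_SPREADS_QUICKLY = 0.1
P_SPRINKLER_FAILS = 0.2
P_CANNOT_ESCAPE = 0.4


def fire_outcomes(p_spread, p_sprinkler_fail, p_no_escape):
    """Probability of each resulting event, given that a fire has started."""
    return {
        "Multiple fatalities": p_spread * p_sprinkler_fail * p_no_escape,
        "Damage and loss": p_spread * p_sprinkler_fail * (1 - p_no_escape),
        "Fire controlled": p_spread * (1 - p_sprinkler_fail),
        "Fire contained": 1 - p_spread,
    }


outcomes = fire_outcomes(P_SPREADS_QUICKLY, P_SPRINKLER_FAILS, P_CANNOT_ESCAPE)

# The paths are exhaustive and mutually exclusive, so they must sum to 1.
assert math.isclose(sum(outcomes.values()), 1.0)

for name, probability in outcomes.items():
    print(f"{name}: {probability:.3f}")
```

Multiplying along a path assumes that each branch probability is conditional on the branches before it, which is the usual reading of an event tree.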
BOTTOM-UP ANALYSIS
(Herald of Free Enterprise)
Bosun asleep in his cabin when ship is due to depart
Ship capsizes
Lives lost
TOP-DOWN ANALYSIS
(Herald of Free Enterprise)
[Figure: fault tree. Top event: Ship puts to sea with bow doors open. Contributing events shown: Bosun not on ship; Bosun on board but not at station; Bosun asleep in cabin; Bosun in bar; Door or hinge problem; Problem with power supply]
[Figure: AND gate. Inputs: dangerous failure frequency of equipment, 10^-3 / hour; reliability of safety function, 10^-4 / hour]
COMPLEMENTARITY OF TECHNIQUES
Compare results of FMEA with low-level causes from FTA
Carry out HAZOP on a sub-system identified as risky by a
high-level FMEA
Carry out ETA on low-level items identified as risky by FTA
A RISK MATRIX
(An example of qualitative risk analysis)
[Figure: risk matrix. Likelihood or frequency: High, Medium, Low. Consequence: Negligible, Moderate, High, Catastrophic]
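A qualitative risk matrix amounts to a lookup from a likelihood category and a consequence category to a risk class. The sketch below uses the axis labels from the matrix, but the scoring rule, thresholds, and class names A to C are hypothetical; a real matrix assigns a class to each cell by agreement among the parties, not by formula.

```python
LIKELIHOODS = ["Low", "Medium", "High"]
CONSEQUENCES = ["Negligible", "Moderate", "High", "Catastrophic"]


def risk_class(likelihood: str, consequence: str) -> str:
    """Return a hypothetical risk class for one cell of the matrix."""
    # Position on each axis, starting from 0 for the mildest category.
    score = LIKELIHOODS.index(likelihood) + CONSEQUENCES.index(consequence)
    if score >= 4:
        return "A"  # hypothetical: most serious class
    if score >= 2:
        return "B"  # hypothetical: intermediate class
    return "C"      # hypothetical: least serious class


for likelihood in reversed(LIKELIHOODS):
    cells = [risk_class(likelihood, c) for c in CONSEQUENCES]
    print(f"{likelihood:<7}", " ".join(cells))
```

An unknown label raises `ValueError` from `list.index`, which is a reasonable default for a sketch because a misspelt category should not be silently classified.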
[Figure: risk matrix of likelihood (High, Medium, Low) against consequence (Negligible, Moderate, High, Catastrophic), with hazards H1 to H6 plotted in its cells]
[Figure: risk matrix of likelihood (High, Medium, Low) against consequence (Negligible, Moderate, High, Catastrophic)]
Root causes: 1000s
Hazards: <100
Accidents: <20
SAFETY INTEGRITY
If risk is not tolerable it must be reduced
High-level requirement of safety function is 'to reduce the
risk'
Analysis leads to the functional requirements
The safety function becomes part of the overall system
Safety depends on it
So, will it reduce the risk to (at least) a tolerable level?
We try to ensure that it does by defining the reliability with
which it performs its safety function
In IEC 61508, this is expressed as the probability of dangerous failure
This is referred to as 'safety integrity'
Continuous/High-demand Mode of Operation (Pr. of dangerous failure per hour)
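For the continuous or high-demand mode, a target failure rate can be mapped to a safety integrity level by a simple band lookup. The band limits in the sketch are those commonly quoted from IEC 61508 for this mode; they do not appear in the text here, so treat them as an assumption and check them against the standard. The function name is illustrative.

```python
# (SIL, lower bound inclusive, upper bound exclusive), dangerous failures per hour.
# Assumed from IEC 61508 high-demand/continuous mode; verify before use.
SIL_BANDS_HIGH_DEMAND = [
    (4, 1e-9, 1e-8),
    (3, 1e-8, 1e-7),
    (2, 1e-7, 1e-6),
    (1, 1e-6, 1e-5),
]


def sil_for_rate(dangerous_failures_per_hour: float):
    """Return the SIL whose band contains the rate, or None if outside all bands."""
    for sil, lower, upper in SIL_BANDS_HIGH_DEMAND:
        if lower <= dangerous_failures_per_hour < upper:
            return sil
    return None


for rate in (5e-9, 5e-8, 5e-7, 5e-6, 1e-4):
    print(f"{rate:.0e} per hour -> SIL {sil_for_rate(rate)}")
```

A result of `None` means the rate lies outside the four bands, for example a rate too high for any SIL claim.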
[Figure: EUC with a control system and a protection system, each labelled with safety functions]
[Figure: AND gate. EUC dangerous failure frequency: 10^-3, 10^-4. Reliability of safety function: 10^-4, 10^-3]
BEWARE
Note that a protection system claim requires total
independence of the safety function from the protected
function
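The AND-gate arithmetic behind a protection system claim can be sketched as follows. Two assumptions are made: the safety function figure is treated as a dimensionless probability of failure when called upon, and the two failures are treated as fully independent, which is exactly the condition the warning above concerns. The function name is illustrative.

```python
import math


def protected_hazard_rate(euc_failures_per_hour: float,
                          safety_function_failure_probability: float) -> float:
    """AND gate: the hazard needs both the EUC failure and the safety function failure.

    Valid only if the two failures are independent.
    """
    if not 0.0 <= safety_function_failure_probability <= 1.0:
        raise ValueError("failure probability must lie between 0 and 1")
    return euc_failures_per_hour * safety_function_failure_probability


# The two combinations shown on the AND-gate figure.
case_a = protected_hazard_rate(1e-3, 1e-4)
case_b = protected_hazard_rate(1e-4, 1e-3)

# Both allocations reach the same overall target.
assert math.isclose(case_a, case_b)
print(f"{case_a:.1e} per hour")
```

If the safety function shares a cause of failure with the protected function, the product understates the true hazard rate, so the independence claim needs its own evidence.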
[Figure: scale of increasing risk showing Risk 1 and Risk 2, the tolerable level of risk 1, the tolerable level of risk 2, and residual risk 1]
BUT NOTE
IEC 61508 emphasises process evidence but does not
exclude the need for product or analysis evidence
It is impossible for process evidence to be conclusive
It is unlikely that conclusive evidence can be derived for
complex systems in which systematic failures dominate
Evidence of all types should be sought and assembled
SIL ALLOCATION - 2
(from IEC 61508)
Where a safety-related system is to implement both safety
and non-safety functions, all the hardware and software
shall be treated as safety-related unless it can be shown
that the implementation of the safety and non-safety
functions is sufficiently independent (i.e. that the failure of
any non-safety-related functions does not affect the safety-related functions). Wherever practicable, the safety-related
functions should be separated from the non-safety-related
functions
Industry guideline
Integrity level table (paired entries as shown; integrity level values not shown):
Uncontrollable | Extremely improbable
Difficult to control | Very remote
Debilitating | Remote
Distracting | Unlikely
Nuisance only | Reasonably possible
THREE QUESTIONS
Does good process necessarily lead to good product?
Instead of using a safety function, why not simply improve
the basic system (EUC)?
Can the SIL concept be applied to the basic system?
SIL SUMMARY
The SIL concept specifies the tolerable rate of dangerous
failures of a safety-related system
It defines a safety-reliability target
Evidence of achievement comes from product and process
IEC 61508 SIL defines constraints on development
processes
Different standards use the concept differently
SIL derivation may be sector-specific
It can be misleading in numerous ways
The SIL concept is a tool
It should be used only with understanding and care