UJ Module Lecture 3 Rev3

The document outlines the Design for Reliability (DfR) process, emphasizing its importance in achieving high long-term reliability and reducing lifecycle costs by integrating reliability engineering throughout the product development cycle. It details the DfR process, which includes defining reliability requirements, identifying risks, analyzing designs, quantifying improvements, validating products, and monitoring reliability during manufacturing. Additionally, it highlights the significance of understanding customer expectations and utilizing various reliability assessment tools to prevent failures and enhance product performance.


Reliability Management – Lecture 3

Designing for Reliability

24 August 2020
Module Index
Designing for Reliability 3
The Design For Reliability (DfR) Process 14
Methods to Identify Failures in the DfR Process 24
Design of Experiments 30
Physics of Failure 35
Analysis of Variance 40
Taguchi Method 43
Using Reliability Techniques to support the Design Process 47
Reliability Assessments in the Design Process 48
Safety Factors in Design 55
Formal Design Reviews 63
Requirements Management 68
Dealing with Non-Functional Requirements Specifications 83
Requirements Identification Methods 86
Techniques to set System Reliability Goals 93
Importance of Configuration Management 104
Reliability Baselines 109
Software Reliability in Engineering Systems 114
History of RE in Software and Engineering Systems 115
Software Reliability in Engineering Systems 125
Software in Engineering Systems 131
Preventing Software Errors 138
Data Reliability and Integrity 147
Software Checking and Testing 153
Software Reliability Prediction and Measurement 159
Hardware and Software Interfaces 171
IT Hardware and Software Lifecycle Management Considerations 173

2
Designing for Reliability
Product Cost in Lifecycle Phases
Total product cost is largely locked in during the Design Stage. If reliability is not designed into the product, the chances of improving it in later asset lifecycle phases are slim.

Source: Architectural Design for Reliability, R. Cranwell and R. Hunter, Sandia Labs, 1997 4
What is Design for Reliability (DfR)?
• DfR is a process.
• DfR describes the entire set of tools that support product and process design (typically from early
in the concept stage all the way through to product obsolescence) to ensure that customer
expectations for reliability are fully met throughout the life of the product with low overall life-cycle
costs.
• DfR is a systematic, streamlined, concurrent engineering program in which reliability engineering is woven into the total development cycle.
• It relies on an array of reliability engineering tools along with a proper understanding of when and
how to use these tools throughout the design cycle.
• This process encompasses a variety of tools and practices and describes the overall order of
deployment that an organisation needs to follow in order to design reliability into its products.

5
DfR is an entire set of tools and processes, not just a single design activity.
Why is Design for Reliability (DfR) Important?
• The answer to this question is quite simple... warranty costs and customer
satisfaction.
• Rework and Field failures are generally also very costly and need to be
prevented. The 10x Rule
• Design for Reliability is a process specifically geared toward achieving
high long-term reliability.
• This process attempts to identify and prevent design issues early in the
development phase, instead of having these issues found in the hands of
the customer.
• The common area between DFSS and DFR includes tools such as Voice
of the Customer (VOC), Design of Experiments(DOE) and Failure Modes
and Effects Analysis (FMEA), which are essential elements in any kind of
product improvement program.
DFR – Reliability Focus: Design for Reliability is the implementation of a holistic, integrated PROGRAM of reliability activities. DfR describes the entire set of tools that support product and process design (from FEED/concept stage all the way through to product obsolescence) to ensure that customer expectations for reliability are fully met throughout the life of the product with low overall life-cycle costs.
DFSS – Quality Focus: Design for Six Sigma (DFSS) has been quite successful in achieving higher quality, reducing variation and cutting down the number of non-conforming products; however, the methodologies are primarily focused on product quality, and many organizations are starting to realize that they do not adequately support the achievement of high reliability.

6
DfR vs. DFSS

DFR – Reliability Focus: Design for Reliability (DfR) is based on modelling the life of the product, understanding the operating stresses and the physics of failure.
DFSS – Quality Focus: Design for Six Sigma (DFSS) aims at avoiding manufacturing problems by taking a more proactive approach to problems – it is thus focused on solving existing manufacturing issues. The primary goal of DFSS is to achieve a significant reduction in the number of nonconforming units and in production variation.

There are overlapping processes and methods between the two methods, as indicated above. 7
Design Sensitivity of Reliability
• Reliability is the probability that a product will continue to work normally over a specified interval of
time, under specified conditions.
• A more reliable product spends less of its time being maintained, so there is often a design trade-
off between reliability and maintainability.
• Reliability is extremely design-sensitive. Very slight changes to the design of a component can cause profound changes in reliability, which is why it is important to specify product reliability and maintainability targets before any design work (or modification to existing products) is undertaken.
• This in turn requires early knowledge of the anticipated service life of the product, and the degree to which parts of the product are to be made replaceable.
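As a rough illustration of the reliability definition and the reliability/maintainability trade-off above, the sketch below computes mission reliability for an assumed constant failure rate and the corresponding steady-state availability. The MTBF, MTTR and mission time are illustrative assumptions, not values from this module.

```python
import math

# Minimal sketch: reliability over a specified interval for a constant
# failure rate, plus the availability figure that links reliability (MTBF)
# and maintainability (MTTR). All numbers are assumed examples.

mtbf_hours = 5000.0          # assumed mean time between failures
mttr_hours = 8.0             # assumed mean time to repair
mission_hours = 1000.0       # assumed mission (specified interval of time)

failure_rate = 1.0 / mtbf_hours                          # lambda
reliability = math.exp(-failure_rate * mission_hours)    # R(t) = exp(-lambda * t)
availability = mtbf_hours / (mtbf_hours + mttr_hours)    # steady-state availability

print(f"R({mission_hours:.0f} h) = {reliability:.3f}")
print(f"Steady-state availability = {availability:.4f}")
```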
8
Design – The Requirements Dictate the Approach
The requirements for the design of a ballpoint pen could be:
1. Disposable. It will be reliable until the ink is exhausted, at which point it is discarded. Neither the
ink nor parts of the pen body are replaceable, so the pen body needs to last no longer than the
ink. The product has a short service life.
2. Refillable. It will be designed for routine replacement of ink (usually as an ink cartridge), but pen
body parts will not be replaceable. The body must be reliable enough to outlast the specified
number of ink replacement cycles. The product has a moderate service life.
3. Repairable (fully maintainable). The pen is refillable and all body parts are replaceable. The
product has an extendable service life (until the spare parts are no longer available).
This example demonstrates the challenge if fundamental requirements are not clarified and resolved
BEFORE design commences. The end-product may miss the requirement in totality resulting in
customer non-acceptance.
It also demonstrates that the requirements will impact design calculations and assumptions
made.

Note: Product service life is not the same as market life. The market life (also known as the design life) of a product is the length of time the product will continue to be sold in the shops and supported before being withdrawn. For example, a particular brand of disposable razor may have a service life of ‘3 shaves’, but a market (shelf) life of 10 years.
Typical Make-up of Production Costs
70% of a product’s total cost is determined by its design.

Source: Six Sigma, by M Harry and R Schroeder. Published by Doubleday. 10
Elements of Integrated Product Support (IPS)
Designing for Reliability must also consider the entire
lifecycle and elements that affect reliability in the O&M
(and thus support) phase.
1 Product Support Management
2 Design Interface (Product improvement/enhancement)
3 Sustaining Engineering in O&M
4 Supply Chain Support
5 Maintenance Planning and Management
6 Packaging, Handling, Storage, and Transportation
7 Technical Data Management
8 Support Equipment
9 Training & Training Support
10 Manpower & Personnel
11 Facilities and Infrastructure
12 Computer Resources
Source: https://www.dau.mil/guidebooks/Shared%20Documents/IPS_Element_Guidebook.pdf 11
Integrated Product Support Elements and Planning

ILS is the integrated planning and action of a number of disciplines in concert with one another to assure system availability.
• The planning of each element of ILS is ideally developed in coordination with the system engineering effort and with each other.
• Trade-offs may be required between elements in order to acquire a system that is: affordable (lowest life cycle cost), operable, supportable, sustainable, transportable, and environmentally sound.
• In some cases, a deliberate process of Logistics Support Analysis will be used to identify tasks within each logistics support element.
Source: https://www.dau.mil/guidebooks/Shared%20Documents/IPS_Element_Guidebook.pdf 12
Design for Reliability – Why it reduces Total Lifecycle Cost

• Many industries make significant effort to improve the RAM of their assets. Various methods are introduced to reduce the number of field failures.
• It is seen as the most feasible way to ensure the lowest Total Asset Cost!

13
The Design For Reliability (DfR) Process
Why Reliability Matters
Product Lifecycle Costs can be reduced by introducing and improving reliability
upfront.

As the project progresses through the asset lifecycle, the impact of unreliability, and the need to change the product or design because of it, becomes more significant, and the impact is spread over a much wider scope of business activities.

Source: DfR Solutions, www.dfrsolutions.com 15
DFR Best Practices
• Build a solid DfR team and provide the team with the right tools.
• Understand the primary wear-out failure modes in electronics (incorporate Physics of Failure
methodologies).
• Understand the environment the product will operate in and the conditions it needs to withstand
(qualification/shipping/storage/user).
• The Product must be designed with consideration of test conditions, transportation, storage,
possible user environments and how these influence various failure mechanisms.
• Perform modelling of these failure mechanisms in the expected environments based on known
algorithms.
• Focus on testing (virtual and real) which accelerates the primary failure mechanisms to identify
design weaknesses as fast as possible.
• Perform failure analysis of higher risk components after testing (whether or not they pass basic
verification and validation activities).

16
The DfR Process

17
Source: http://www.reliasoft.com
The DFR Process (Cont.)
The process can be broken down into six key activities, which are:
1) Define: The purpose of this stage is to clearly and quantitatively define the reliability
requirements and goals for a product as well as the end-user product environmental/usage
conditions.
2) Identify: In this stage, a clearer picture of what the product is supposed to do starts to develop. The more the design or application changes, the more reliability risks are introduced that threaten the success of the product and the company. Best practice benchmarking is an important part of this process step.
3) Analyse, Design and Assess: It is highly important to estimate the product's reliability, even with
a rough first cut estimate, early in the design phase. Design Failure Modes Effects and Criticality
Analysis (FMECA), analysis of the existing product issues (lessons learned), fault tree analysis,
critical items list, human factors, hazard and operability study (HAZOPS), and design reviews are
key activities to be undertaken during this phase. Load protection and non-material failure modes should also be considered as part of the FMECA process.
Note: Even though this process is presented in a linear sequence, in reality some activities would be performed in parallel and/or in a loop, based on the knowledge gained as a project moves forward.
The DFR Process (Cont.)
4) Quantify and Improve: In this stage, we will start quantifying all of the previous work based on
test results. By this stage, prototypes should be ready for testing and more detailed analysis.
Typically, this involves an iterative process where different types of tests are performed, the
results are analysed, design changes are made, and tests are repeated.
5) Validate: In the Validate stage, a Demonstration Test can be used to make sure that the product is ready for high volume production. Statistical methods can be used to develop a test plan (i.e., a combination of test units, test time and acceptable failures) that will demonstrate the desired goal with the least expenditure of resources (a sample-size sketch follows this list).
6) Monitor and Control: Various methods are implemented to monitor and control the reliability and quality of a product or service during the manufacturing and assembly processes. Measurement System Analysis and record keeping are implemented according to requirements to provide an audit trail of quality.
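As an illustration of the Validate-stage test-plan idea, the sketch below computes the classic zero-failure (success-run) sample size, i.e. how many units must survive the test to demonstrate a reliability target at a given confidence. The target reliability and confidence values are assumed examples, not figures from this module.

```python
import math

# Minimal sketch of a zero-failure (success-run) demonstration test plan,
# one common way to trade off test units against the reliability goal.

def success_run_sample_size(target_reliability: float, confidence: float) -> int:
    """Units that must survive the test with zero failures to demonstrate
    the target reliability at the given confidence level."""
    n = math.log(1.0 - confidence) / math.log(target_reliability)
    return math.ceil(n)

print(success_run_sample_size(0.90, 0.90))  # -> 22 units
print(success_run_sample_size(0.95, 0.90))  # -> 45 units
```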

19
Reliability Engineering – Key tools for Designing for Reliability
These tools are applied to achieve the RAM Targets set for the design:
• System Analytics and Weibull Analysis – to develop more complex models of the systems or subsystems in order to get a better understanding of how systems interface and affect each other. This may require more sophisticated modelling or system simulation techniques. The main focus is still to achieve the overall performance and RAM requirements for the systems by starting at the system level and then working down to individual components.
• FMECA/FMEA – to identify, assess and classify the critical failure modes that could have an impact on safety and reliability. ‘Design in’ the RAM requirements by considering alternative design options and more reliable equipment, as well as adding/removing redundancy. Also used to assess maintenance strategy options.
• Life Cycle Cost Analysis – to perform Life Cycle Cost analysis early in the life cycle to ensure that an optimal design has been achieved. This analysis provides important information for the decision-making process when carrying out trade-off studies. Determine what the expected optimal spares holding and replenishment policies for the designs should be.
• Risk Analysis – to conduct risk assessments to ensure that issues such as safety, reliability, quality and performance have not been compromised. The risk assessments should be made from an entire life cycle perspective. Identify mitigating actions that can be employed.

Source: Reliability Engineering Manual – E Pininski 20
Systems Engineering – V-model and RE Techniques

Reliability engineering
analysis techniques are
also considered key
activities in holistic
systems engineering and
project execution.

21
Using Change Point Analysis in DfR
• A thorough change point analysis should reveal changes in design, material, parts, manufacturing,
supplier design or process, usage environment, system's interface points, system's upstream and
downstream parts, specifications, interface between internal departments, performance
requirements, etc.
• This is very valuable when changes to a product, system or process is proposed.
• The purpose of this exercise is to identify and prioritize the Key Reliability Risk items and their
corresponding Risk Reduction Strategy. Designers should consider reducing design complexity
and maximizing the use of standard (proven) components.

Key Reliability Risk Items: The prioritized list of risks identified in the design, product or process, from highest to lowest priority risk.
Risk Reduction Strategy: A good tool to assess risk early in the DfR program is the FMEA. FMEAs identify potential failure modes for a product or process, assess the risk associated with those failure modes, prioritize issues for corrective action, and identify and carry out corrective actions to address the most serious concerns.

22
Change Point Analysis Example

https://dataingovernment.blog.gov.uk/2015/02/20/using-sentiment-scoring-and-changepoint-analysis-to-spot-problems-in-services/
23
Methods to Identify Failures in the DfR Process
Reliability Growth

• Reliability growth is generally associated only with the reduction of the effects of systematic weaknesses.
• The sequence of events from the initial weakness to its elimination is shown here for both systematic and residual cases.
• Residual weaknesses are normally related to manufacturing of the item or of its parts. Residual weaknesses are found only in hardware. Unlike systematic weaknesses, their effects are restricted to single items.
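One common quantitative way to track the elimination of systematic weaknesses is a power-law (Crow-AMSAA / Duane type) reliability growth model. The sketch below fits such a model to assumed failure times; the data, test duration and resulting parameter values are illustrative assumptions only.

```python
import numpy as np

# Minimal sketch of a Crow-AMSAA (power-law NHPP) reliability growth fit
# for a time-terminated test. Failure times are illustrative assumptions.

failure_times = np.array([25., 70., 160., 310., 520., 800., 1150., 1600.])
T = 2000.0                      # total accumulated test time (hours)
n = len(failure_times)

beta = n / np.sum(np.log(T / failure_times))   # MLE shape parameter
lam = n / T**beta                              # MLE scale parameter

# beta < 1 indicates reliability growth (failure intensity decreasing with time)
intensity_at_T = lam * beta * T**(beta - 1.0)
print(f"beta = {beta:.2f}, lambda = {lam:.4f}")
print(f"Instantaneous MTBF at {T:.0f} h ~ {1.0 / intensity_at_T:.0f} h")
```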

25
Preventing Infant Mortality and Unnecessary Failures

• Burn-in and Screening are DfR tools that can be useful in preventing infant mortality failures, which
are typically caused by manufacturing-related problems, from happening in the field.
• The appropriate burn-in time can be derived from Quantitative Accelerated Life Testing (QALT) and/or Life Data Analysis (LDA).
• Also, manufacturability challenges might force some design changes that would trigger many of the
DfR activities already mentioned.
• Protection against extreme loads is not always possible, but should be considered whenever
practicable – extreme loading is known to rapidly decrease design life and result in early failures.
• Protection against Strength Degradation should be taken into account. Combined stresses may
accelerate damage or reduce the fatigue limit.
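To show how burn-in screens out infant mortality, the sketch below uses a Weibull model with shape parameter beta < 1 (the infant-mortality region) and compares the conditional reliability of units that have survived different burn-in durations. The Weibull parameters, mission time and burn-in durations are assumed values for illustration.

```python
import math

# Minimal sketch: screening out infant mortality with a Weibull model.
# A shape parameter beta < 1 signals early-life (infant mortality) failures.

beta = 0.6        # assumed Weibull shape (infant mortality region)
eta = 20000.0     # assumed Weibull scale, hours
mission = 1000.0  # field mission time of interest, hours

def weibull_reliability(t: float) -> float:
    return math.exp(-((t / eta) ** beta))

for burn_in in (0.0, 24.0, 168.0):
    # Conditional reliability of a unit that survived the burn-in period
    r_mission = weibull_reliability(burn_in + mission) / weibull_reliability(burn_in)
    print(f"burn-in {burn_in:5.0f} h -> R(mission | survived burn-in) = {r_mission:.4f}")
```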

Quantitative Accelerated Life Testing Highly Accelerated Test


https://slideplayer.com/slide/1596048/ https://slideplayer.com/slide/1596048/ 26
https://www.youtube.com/watch?v=chcMII5AIK0 https://www.emitech.fr/en/reliability-HALT-HASS-testing
DfR Test Methods
• Parametric Binomial: The binomial test is an exact test of the statistical significance of deviations from a theoretically expected distribution of observations into two categories. It tests whether a proportion from a dichotomous variable is equal to a presumed population value; the focus is on the difference between means.

• Non-Parametric Binomial: A non-parametric statistical test is a test whose model does NOT specify
conditions about the parameters of the population from which the sample was drawn. These tests generally
focus on order or ranking of test data. The focus is on the difference between medians.
• With testing comes data, such as failure times and censoring times. Test results can be analysed with Life
Data Analysis (LDA) techniques to statistically estimate the reliability of the product and calculate various
reliability-related metrics with a certain confidence interval.
• Applicable metrics may include reliability after a certain time of use, conditional reliability, B(X) information,
failure rate, MTBF, median life, etc. These calculations can help in verifying whether a product meets its
reliability goals, comparing designs, projecting failures and warranty returns, etc.
Parametric vs Non-Parametric Binomials
https://www.youtube.com/watch?v=3bcYLj11uME 27
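To make the Life Data Analysis step concrete, the sketch below fits a 2-parameter Weibull distribution to a set of assumed, complete (uncensored) failure times and derives a few of the metrics mentioned above (B10 life, mean life, reliability at a given time). The failure times are invented for illustration; real test data would normally include censored units and need dedicated LDA handling.

```python
import numpy as np
from scipy.stats import weibull_min
from scipy.special import gamma

# Minimal LDA sketch: fit a 2-parameter Weibull to complete failure times.
failure_hours = np.array([410., 760., 905., 1130., 1290., 1450., 1720., 2100.])

shape, loc, scale = weibull_min.fit(failure_hours, floc=0)  # beta, 0, eta

b10_life = scale * (-np.log(0.90)) ** (1.0 / shape)   # time by which 10% fail
mean_life = scale * gamma(1.0 + 1.0 / shape)          # mean life for the fitted model
r_500 = np.exp(-((500.0 / scale) ** shape))           # reliability at 500 h

print(f"beta = {shape:.2f}, eta = {scale:.0f} h")
print(f"B10 life ~ {b10_life:.0f} h, mean life ~ {mean_life:.0f} h, R(500 h) = {r_500:.3f}")
```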
Alternative Methods
• As an alternative to testing under normal use conditions and LDA, Quantitative Accelerated Life Testing (QALT) can also be employed to cut down on the testing time. By carefully elevating the stress levels applied during testing, failures occur faster and thus failure modes are revealed (and statistical life data analysis can be applied) more quickly (an acceleration-factor sketch follows this list).
• Highly Accelerated Tests (HALT/HASS) are qualitative accelerated tests used to reveal possible failure modes
and complement the physics of failure knowledge about the product. However, data from qualitative tests
cannot be used to quantitatively project the product's reliability.
• A very important aspect of the DFR process also includes performing Failure Analysis (FA) or Root Cause
Analysis (RCA). FA relies on careful examination of failed devices to determine the root cause of failure and to
improve product reliability. This is where the engineers come face-to-face with the failure, see what a failure
actually looks like and study the processes that lead to it. FA provides better understanding of physics of
failure and can discover issues not foreseen by techniques used prior to testing (such as FMEA). FA helps
with developing tests focused on problematic failure modes. It can also help with selecting better materials
and/or designs and processes, and with implementing appropriate design changes to make the product more
robust.
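As a hedged illustration of how elevated stress shortens test time, the sketch below evaluates an Arrhenius acceleration factor, one common QALT model for temperature-driven failure mechanisms. The activation energy and the use/test temperatures are assumed example values, not figures from this module.

```python
import math

# Minimal sketch of an Arrhenius acceleration factor between a field (use)
# temperature and an elevated test temperature.

BOLTZMANN_EV = 8.617e-5   # Boltzmann constant, eV/K

def arrhenius_af(ea_ev: float, use_temp_c: float, test_temp_c: float) -> float:
    """Acceleration factor between use and elevated test temperatures."""
    t_use = use_temp_c + 273.15
    t_test = test_temp_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_test))

af = arrhenius_af(ea_ev=0.7, use_temp_c=40.0, test_temp_c=85.0)
print(f"Acceleration factor ~ {af:.0f}")
print(f"1000 h at 85 C is roughly equivalent to {1000 * af:.0f} h at 40 C")
```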

28
Alternative Methods (Cont.)
• System Reliability Analysis with Reliability Block Diagrams (RBDs) can be used in lieu of testing an entire system, by relying on the information and probabilistic models developed at the component or subsystem level to model the overall reliability of the system. It can also be used to identify weak areas of the system, find optimum reliability allocation schemes, compare different designs and perform auxiliary analyses such as availability analysis (by combining maintainability and reliability information); a series/parallel sketch follows this list.
• Fault Tree Analysis (FTA) may be employed to identify defects and risks and the combination of events that
lead to them. This may also include an analysis of the likelihood of occurrence for each event.
• Reliability Growth (RG) testing and analysis is an effective methodology to discover defects and improve the
design during testing. Different strategies can be employed within the reliability growth program, namely: test-
find-test (to discover failures and plan delayed fixes), test-fix-test (to discover failures and implement fixes
during the test) and test-fix-find-test (to discover failures, fix some and delay fixes for some). RG analysis can
track the effectiveness of each design change and can be used to decide if a reliability goal has been met and
whether, and how much, additional testing is required.
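The sketch below shows the basic RBD arithmetic referred to in the first bullet: a series system in which one stage is a 1-out-of-2 active-parallel (redundant) block. The component reliabilities and the system layout are illustrative assumptions.

```python
# Minimal Reliability Block Diagram sketch: two pumps in active redundancy,
# in series with a controller and a sensor. All reliabilities are assumed
# values for a common mission time.

def series(*blocks: float) -> float:
    r = 1.0
    for b in blocks:
        r *= b
    return r

def parallel(*blocks: float) -> float:
    q = 1.0
    for b in blocks:
        q *= (1.0 - b)        # probability that every redundant path fails
    return 1.0 - q

r_pump = 0.95
r_controller = 0.99
r_sensor = 0.97

r_system = series(parallel(r_pump, r_pump), r_controller, r_sensor)
print(f"System reliability = {r_system:.4f}")   # ~0.958 vs ~0.912 with a single pump
```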

RAM and Reliability Block Diagrams:


https://www.youtube.com/watch?v=SvvpT7rrYwI 29
https://slideplayer.com/slide/13006356/
Design of Experiments (DoE)
Design of Experiments (DOE)
• Design of Experiments (DOE) is a systematic method to determine the relationship between factors affecting a process and the
output of that process. In other words, it is used to find cause-and-effect relationships.
• Design of Experiments (DOE) provides a methodology to create organised test plans to identify important variables, to estimate their
effect on a certain product characteristic and to optimize the settings of these variables to improve the design robustness.

• Within the DfR concept, the designer’s interest lies in the effect of stresses on the test units. DoEs play an important role in DfR
because they assist in identifying the factors that are significant to the life of the product, especially when the physics of
failure are not well understood. Knowing the significant factors results in more realistic reliability tests and more efficient
accelerated tests (since resources are not wasted on including insignificant stresses in the test).
DoE Terms and Concepts:
• Controllable input factors, or x factors, are those input parameters that can be modified in an experiment or process. Controllable input factors can be modified to optimize the output, e.g. the quantity of items or the quality of an item.
• Uncontrollable input factors are those parameters that cannot be changed, e.g.
ambient temperature. These factors need to be recognized in order to understand
how they may affect the response.
• Responses, or output measures, are the elements of the process outcome that
gauge the desired effect.
Source: K Sundararajan, Design of Experiments – A Primer 31
https://www.youtube.com/watch?v=hfdZabCVwzc
https://www.youtube.com/watch?v=tZWAYbKYVjM
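To show how a designed experiment separates significant from insignificant stresses, the sketch below builds a 2-level full factorial design (2^3 = 8 runs) and estimates the main effect of each factor on an assumed life response. The factor names and response values are invented for illustration.

```python
import itertools
import numpy as np

# Minimal sketch of a 2-level full factorial experiment and main-effect
# estimation. Factors and responses are illustrative assumptions.

factors = ["temperature", "vibration", "humidity"]
design = np.array(list(itertools.product([-1, 1], repeat=len(factors))))  # 8 runs

# Assumed measured response for each run (e.g. hours to failure)
response = np.array([520., 480., 430., 400., 300., 150., 260., 120.])

for i, name in enumerate(factors):
    high = response[design[:, i] == 1].mean()
    low = response[design[:, i] == -1].mean()
    print(f"Main effect of {name:<11}: {high - low:+.1f}")
# A large negative effect flags a stress that strongly shortens life and is
# therefore a good candidate for inclusion in accelerated testing.
```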
Considerations for DOE
• Hypothesis testing helps determine the significant factors using statistical methods. There are two
possibilities in a hypothesis statement: the null and the alternative. The null hypothesis is valid if
the status quo is true. The alternative hypothesis is true if the status quo is not valid. Testing is
done at a level of significance, which is based on a probability.
• Blocking and replication:
• Blocking is an experimental technique to avoid any unwanted variations in the input or experimental
process. For example, an experiment may be conducted with the same equipment to avoid any equipment
variations.
• Practitioners also replicate experiments, performing the same combination run more than once, in order to
get an estimate for the amount of random error that could be part of the process.
• Interaction: When an experiment has three or more variables, an interaction is a situation in which
the simultaneous influence of two variables on a third is not additive.
• Design of Experiments is also a powerful tool in the Six Sigma methodology to manage the
significant input factors in order to optimise the desired output.
The following resources can be helpful in learning more about DOEs:
32
DOE Simplified Practical Tools for Effective Experimentation (Productivity Inc., 2000)
Design and Analysis of Experiments (John Wiley and Sons, 1997)
The DOE Process – V&V is critical!
• It is CRUCIAL to interpret models correctly.
• Validation and verification of the model and findings
must be done.
• The whole idea of a DOE is to obtain VALID and
OBJECTIVE Conclusions!!!

DoE Methodology: 33
https://www.keysight.com/find/eesof-how-to-doe
http://GembaAcademy.com
DoE Approach – An Example
The design of experiment (DoE) approach enables estimation of parameters (e.g. sample size, level
of tests, test duration, stress values like temperature, humidity, radiation, etc.) based on sound
statistical principles.
• The physics-of-failure approach is based on
the fundamental principles of science and
engineering.
• The associated component failure
mechanisms are evaluated, considering
basic phenomenon involved in degradation /
failure of components. These models
recreate the life of the component with
operating stresses and load profiles.
• Data includes estimation of life and root causes of failure, incorporation of the operational load profiles of the component, evaluation of the associated failure mechanisms, detailed modelling of the identified dominant failure mechanisms, and modelling of wear-out phenomena as part of the life cycle loads.
Some DoE Examples:
https://www.itl.nist.gov/div898/handbook/pri/section4/pri471.htm 34
https://www.itl.nist.gov/div898/handbook/pri/section4/pri472.htm
https://www.itl.nist.gov/div898/handbook/pri/section4/pri473.htm
Physics of Failure
Physics of Failure (PoF)
• PoF analysis provides much needed insights into the failure risks and mechanics that lead to them
(especially when actual test data is not available yet).
• PoF utilizes knowledge of the life-cycle load profile, package architecture, material properties, relevant geometry, processes, technologies, etc. to identify potential Key Process Input Variables (KPIVs) for failure mechanisms.
• This approach very effectively models the root causes of failure such as fatigue, fracture, wear, and
corrosion.
• It can also be used to identify design margins and failure prevention actions as well as to focus
reliability testing.
• There are some limitations with the use of physics of failure in design assessments and reliability
prediction.
• Physics of failure algorithms typically assume a ‘perfect design’.
• Attempting to understand the influence of defects can be challenging and often leads to Physics of Failure (PoF)
predictions limited to end of life behaviour (as opposed to infant mortality or useful operating life).

https://www.lifetime-reliability.com/cms/physics-of-failure-explained/ 36
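As one hedged example of a physics-of-failure model for a fatigue-type wear-out mechanism, the sketch below applies a Coffin-Manson style relationship between thermal-cycling range and cycles to failure. The exponent, temperature swings and observed test life are illustrative assumptions; real values come from material data or testing.

```python
# Minimal physics-of-failure sketch: Coffin-Manson style acceleration between
# field and test thermal cycling ranges (e.g. for solder-joint fatigue).

def coffin_manson_af(delta_t_use: float, delta_t_test: float, exponent: float = 2.5) -> float:
    """Acceleration factor between field and test thermal cycling ranges."""
    return (delta_t_test / delta_t_use) ** exponent

n_cycles_to_fail_test = 1200.0     # assumed cycles to failure observed in test
af = coffin_manson_af(delta_t_use=30.0, delta_t_test=100.0)

n_cycles_to_fail_field = n_cycles_to_fail_test * af
print(f"Acceleration factor ~ {af:.0f}")
print(f"Predicted field life ~ {n_cycles_to_fail_field:.0f} thermal cycles")
```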
Deterministic vs. Empirical PoF

• Deterministic:
• This PoF approach starts with the identification of the potential failure mechanism; the approach attempts to model the path the item takes to failure using a mathematical model.
• This method also requires knowledge of the stress at each failure site, such as loading conditions, structural geometry and material properties. The models may describe degradation, erosion, diffusion or corrosion phenomena leading to sudden or eventual failure.
• Failures may occur due to accumulated damage or deterioration of the item's ability to withstand the applied stress.
• Empirical:
• Empirical models reflect the time-to-failure behaviour of a system related to specific stress conditions.
• Empirical models mostly use test or field data and do not attempt to model the detailed interactions between stress and failure mechanism(s).
• An empirical model may be little more than a curve fitted to the experimental data, yet often the design of the testing and the model rely on a detailed understanding of the failure mechanisms involved.
• It may be a deterministic model based on the activation energy of the molecular rate reaction, or the equation form may be used for empirical modelling.

37
Physics of Failure (Cont.)
• When the failure mechanism of a product or component is understood, it is possible to create a mathematical model of the effect of stress or load on the time-to-failure behaviour.
• Such a model may take different forms, yet it is the ability to relate the conditions surrounding the use of a device to its eventual demise that is essential. The specifics include a molecular level of detail in some cases.
• Physics of Failure models focus on the particular relationships between stresses and materials. PoF modelling employs knowledge of life-cycle stress applications, loading profiles, and an in-depth understanding of failure mechanisms to craft mathematical models. These models are useful to:
• Model time to failure
• Perform design trade-off analysis
• Select the most appropriate materials for the design
• Determine mitigation strategies
• Minimise demonstration or accelerated testing needs
• Improve prognostics during use of the product/component.
• PoF modelling uses scientific theory and research to create rigorous mathematical models.
38
Physics of Failure Example

H. Wang, M. Liserre, F. Blaabjerg, P. P. Rimmen, J. B. Jacobsen, T. Kvisgaard and J. Landkildehus, "Transitioning to physics-of-failure as a reliability driver in power electronics," IEEE Journal of Emerging and Selected Topics in Power Electronics, accepted.
Analysis of Variance (ANOVA)
Analysis of Variance (ANOVA)
• The analysis of variance (ANOVA) technique was developed by R. A. Fisher, and is a very
elegant, economical and powerful method for determining the significant effects and
interactions in multivariable situations.
• Analysis of variance is used widely in such fields as market research, optimization of chemical
and metallurgical processes, agriculture and medical research. It can provide the insights
necessary for optimizing product designs and for preventing and solving quality and reliability
problems.
• Possible Methods
• Analysis of Single Variables
• Analysis of Multiple Variables (Factorial Experiments)
• Non-Normally Distributed Variables
• Two-Level Factorial Experiments
• Fractional Factorial Experiments

Refer to Handbook, Page 305 for more information and examples
41
https://www.spss-tutorials.com/anova-what-is-it/
https://www.youtube.com/watch?v=x6F9uvaviEc
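As a small hedged illustration of a one-way (single-variable) ANOVA, the sketch below asks whether a design factor (three supplier batches, in this invented example) has a significant effect on a measured life characteristic. The sample data are assumptions, not data from the handbook.

```python
import numpy as np
from scipy.stats import f_oneway

# Minimal one-way ANOVA sketch: compare the mean life of three batches.
batch_a = np.array([1020., 980., 1110., 1050., 990.])
batch_b = np.array([880., 910., 850., 930., 900.])
batch_c = np.array([1005., 1040., 995., 1060., 1010.])

f_stat, p_value = f_oneway(batch_a, batch_b, batch_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("At least one batch mean differs significantly (5% level).")
else:
    print("No significant difference detected between batch means.")
```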
Analysis of Variance (ANOVA) Example
• The t-test is a method that determines whether two populations are statistically different from each other, whereas
ANOVA determines whether three or more populations are statistically different from each other. Both of them look at
the difference in means and the spread of the distributions (i.e., variance) across groups; however, the ways that they
determine the statistical significance are different.
• These tests are performed when 1) the samples are independent of each other and 2) they have (approximately) normal distributions, or when the sample number is high (e.g., > 30 per group). More samples are better, but the tests can be performed with as few as 3 samples per condition.
t-test Example
• We want to determine whether the concentrations of Proteins 1 – 4 in serum are significantly different between healthy and diseased patients. A t-test is performed, which can be visually explained by plotting the protein concentration on the X-axis and the frequency along the Y-axis for the two groups on the same graph (Figures 1 – 4).
• Proteins 1 & 2 have the same difference in protein concentration means but different group variances. Alternatively, Proteins 3 & 4 have similar variances but Protein 4 has a larger difference in protein concentration means between the patient groups.
• A t-test assigns a “t” test statistic value to each biomarker. A good differential biomarker, represented by little to no overlap of the distributions and a large difference in means, would have a high “t” value.
• Which is a better biomarker of disease: Protein 1 or Protein 2? Protein 1
• Which is a better biomarker of disease: Protein 3 or Protein 4? Protein 4

ANOVA TUTORIALS
42
https://www.youtube.com/watch?v=CS_BKChyPuc
https://www.youtube.com/watch?v=-yQb_ZJnFXw
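The sketch below runs the two-sample t-test described in the example, using invented protein-concentration values standing in for one of the biomarkers; the numbers are assumptions, chosen only so that the healthy and diseased groups are well separated.

```python
import numpy as np
from scipy.stats import ttest_ind

# Minimal two-sample t-test sketch with assumed biomarker data.
healthy = np.array([2.1, 2.4, 1.9, 2.2, 2.3, 2.0, 2.2])
diseased = np.array([3.0, 3.3, 2.9, 3.1, 3.4, 3.2, 3.0])

t_stat, p_value = ttest_ind(healthy, diseased)
print(f"t = {t_stat:.2f}, p = {p_value:.5f}")
# A large |t| (and small p) corresponds to well-separated distributions,
# i.e. a good differential biomarker in the sense of the example above.
```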
Taguchi Method
The Taguchi Method
Genichi Taguchi (1986), developed a framework for statistical design of experiments adapted to
the particular requirements of engineering design. Taguchi suggested that the design process
consists of three phases: system design, parameter design and tolerance design.

• In the system design phase the basic concept is decided, using theoretical knowledge and experience to
calculate the basic parameter values to provide the performance required.
• Parameter design involves refining the values so that the performance is optimized in relation to factors
and variation which are not under the effective control of the designer, so that the design is ‘robust’ in
relation to these.
• Tolerance design is the final stage, in which the effects of random variation of manufacturing processes
and environments are evaluated, to determine whether the design of the product and the production
processes can be further optimized, particularly in relation to cost of the product and the production
processes.

• Note that the design process is considered to explicitly include the design of the production methods and
their control. Parameter and tolerance design are based on statistical design of experiments.

See Handbook Page 318 for examples and more information


44
The Taguchi Method
The Taguchi Method is a process/product optimization method based on eight steps of planning, conducting and evaluating the results of matrix experiments to determine the best levels of control factors. The primary goal is to keep the variance in the output very low, even in the presence of noise inputs.

Taguchi separates variables into two types:
• Control factors are those variables which can be practically and economically controlled, such
as a controllable dimensional or electrical parameter.
• Noise factors are the variables which are difficult or expensive to control in practice, though
they can be controlled in an experiment, for example ambient temperature, or parameter
variation within a tolerance range.
The objective is then to determine the combination of control factor settings (design and process
variables) which will make the product have the maximum ‘robustness’ to the expected variation
in the noise factors.

The measure of robustness is the signal-to-noise ratio, which is analogous to the term as used in
control engineering.
http://www.ecs.umass.edu/mie/labs/mda/fea/sankar/chap2.html
45
https://www.isixsigma.com/methodology/robust-design-taguchi-method/introduction-robust-design-taguchi-method/
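To make the signal-to-noise measure of robustness concrete, the sketch below computes the three standard Taguchi S/N forms and compares two candidate control-factor settings under assumed noise-driven response values. The settings and responses are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of Taguchi signal-to-noise (S/N) ratios used to compare
# control-factor settings for robustness.

def sn_larger_the_better(y: np.ndarray) -> float:
    return -10.0 * np.log10(np.mean(1.0 / y**2))

def sn_smaller_the_better(y: np.ndarray) -> float:
    return -10.0 * np.log10(np.mean(y**2))

def sn_nominal_the_best(y: np.ndarray) -> float:
    return 10.0 * np.log10(np.mean(y)**2 / np.var(y, ddof=1))

setting_1 = np.array([101., 99., 103., 97., 100.])   # responses under noise
setting_2 = np.array([108., 91., 112., 88., 101.])

for name, y in (("setting 1", setting_1), ("setting 2", setting_2)):
    print(f"{name}: nominal-the-best S/N = {sn_nominal_the_best(y):.1f} dB")
# The higher S/N setting is the more robust choice against the noise factors.
```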
The Taguchi Method

https://slideplayer.com/slide/2813463/
46
Using Reliability Techniques to support the Design Process
Reliability Assessments in the Design Process
Reliability Engineering in the Plant Asset Lifecycle

Reliability Assessments
are an ongoing process
conducted throughout the
entire asset lifecycle.

49
Reliability Validation
• In reliability engineering, validation usually deals with both functional and environmental specifications, and it is also set up to ensure that all the reliability requirements of the system are met.
• The goal of validation is to successfully resolve design and manufacturing issues in case they were overlooked in the previous design phases.
• Validation usually involves functional and environmental testing at a system level with the purpose of
ensuring that the design is production-ready.
• These activities may include test to failure or test to success, and are usually conducted at field stress
levels or as accelerated life testing (ALT).
• Reliability requirements also need to be demonstrated at this stage.
• Product validation is often done in two phases - design validation (DV) and process validation (PV).
• DV activities usually include the environmental, durability, capability and functional tests and are executed on a prototype.
• The PV tasks are similar to DV but are executed on pilot or production parts, preferably manufactured at the intended production
facilities. The intent is to validate that the production processes are fully capable of repeatedly producing products that meet
specifications.

50
Design vs. Process Validation

Design Validation: Testing aimed at ensuring that a product or system fulfils the defined user needs and specified requirements, under specified operating conditions.
Process Validation: The analysis of data gathered throughout the design and manufacturing of a product in order to confirm that the process can reliably output products of a determined standard.

51
Design FMEA (DFMEA)
• A properly applied Design FMEA (DFMEA) takes requirements, customer usage and environment
information as inputs and, through its findings, initiates and/or informs many reliability-centred
activities such as
• Physics of Failure (PoF),
• System Analysis,
• Reliability Prediction,
• Standards-based Reliability Prediction (using common military or commercial libraries, such as MIL-217, Bellcore and
Telcordia, to come up with rough MTBF estimates or to compare different design concepts when failure data is not yet
available).
• Life Testing and
• Accelerated Life Testing.
• When a design is evolutionary and does not involve many changes, a technique called Design
Review Based on Failure Modes (DRBFM) can be applied.

52
Process FMEA (PFMEA)
• Process FMEAs (PFMEAs) can be used to examine the ways the reliability and quality of a product
or service can be jeopardized by the manufacturing and assembly processes.
• Control Plans can be used to describe the actions that are required at each phase of the process to
assure that all process outputs will be in a state of control.
• Factory Audits are necessary to ensure that manufacturing activities (such as inspections, supplier
control, routine tests, storing finished products, Measurement System Analysis and record keeping)
are being implemented according to requirements.
• The manufacturing process is also prone to deviations. The reliability engineer ought to communicate to the
production engineer the specification limits on the KPIVs that would define a "reliability conforming" unit. The
production engineer is then responsible for ensuring that the manufacturing process does not deviate from the
specifications.
• Statistical Process Control (SPC) methods can be useful in this regard (discussed more in Lecture 6).
53
Human Reliability Considerations
• The term ‘human reliability’ is used to cover the situations in which people, as operators or maintainers, can
affect the correct or safe operation of systems. In these circumstances people are fallible, and can cause
component or system failure in many ways.
• Human reliability must be considered in any design in which human fallibility might affect reliability or safety.
• Design analyses such as FMECA and FTA should include specific consideration of human factors, such as the
possibility of incorrect operation or maintenance, ability to detect and respond to failure conditions, and
ergonomic or other factors that might influence them.
• Also, where human operation is involved, product design should be made in full consideration of physiological
and psychological factors in order to minimise the probability of human error in system operation.
• Human error probability can be minimised by training, supervision and motivation, so these must be
considered in the analysis.
• Human factor analyses can be used to highlight the need for specific training, independent checks, or operator
and maintainer instructions and warnings.
• A source for consideration on human factors would be incident and occurrence investigations – this can
provide very valuable information regarding “lessons learnt” in previous designs or product/process uses.

54
Safety Factors in Design
The Stress-Strength Interference Principle
• The Stress-Strength Interference principle states that a product fails when
the stress experienced by the product exceeds its strength.
• In order to reduce the failure probability (and thus increase the reliability),
we must reduce the interference between stress and strength.
• For distributed load and strength, two factors are defined. The safety margin (SM) is the relative separation of the mean values of load and strength, and the loading roughness (LR) is the standard deviation of the load; both are relative to the combined standard deviation of the load and strength distributions.
• SM and LR, in theory, allow analysis of the way in which the load and strength distributions interfere, and so generate a probability of failure.
Safety Factor in Design
In design, if strength exceeds load, there should not be failures. This is the normal approach to design, in which the designer considers the likely extreme values of load and strength, and ensures that an adequate safety factor is provided. Additional factors of safety may be applied, e.g. as defined in pressure vessel design codes or electronic component de-rating rules. This approach is usually effective. Nevertheless, some failures do occur which can be represented by the load-strength model. By definition, either the load was then too high or the strength too low.
Source: www.reliasoft.com 56
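The sketch below carries out the stress-strength interference calculation for normally distributed load and strength, returning the safety margin and the resulting probability of failure. The distribution parameters are assumed example values, not figures from this module.

```python
from scipy.stats import norm

# Minimal stress-strength interference sketch for normal load and strength.
mu_load, sd_load = 300.0, 30.0          # assumed load distribution (e.g. MPa)
mu_strength, sd_strength = 450.0, 40.0  # assumed strength distribution

# Failure occurs when load exceeds strength; for normal distributions the
# failure probability follows from the distribution of (strength - load).
safety_margin = (mu_strength - mu_load) / (sd_strength**2 + sd_load**2) ** 0.5
p_failure = norm.cdf(-safety_margin)
reliability = 1.0 - p_failure

print(f"Safety margin SM = {safety_margin:.2f}")
print(f"Probability of failure = {p_failure:.2e}, reliability = {reliability:.6f}")
```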
The Impact of Failure probability on Safety Margin Curve
These curves illustrate the sensitivity of reliability to safety margin, loading
roughness and the load and strength distributions.
• Once the safety margin exceeds a value of 3 to 5, depending upon the value of loading
roughness, the failure probability becomes infinitesimal. The item can then be said to be
intrinsically reliable.
• There is an intermediate region in which failure probability is very sensitive to changes in
loading roughness or safety margin.
• Conversely, at low safety margins the failure probability is high.

Figures: Failure probability–safety margin curves for asymmetric distributions, shown for loading roughness = 0.3 and loading roughness = 0.9.

Source: Carter, 1997 57
Probability and Risk of Failure - Example

Source: https://ascelibrary.org/doi/10.1061/%28ASCE%291090-0241%282008%29134%3A12%281691%29 58
Considerations when “Safety Critical Design” is relevant
If a product is performing a safety-critical role, then failure of a key component can have dire consequences.
There are several approaches to minimizing the risk of catastrophic failure:
1) Over-specification: For product applications in the building and construction industry, it is standard practice to include a
‘x5’ safety factor in all material strength calculations. For example: a suspension bracket for a 10kg light fitting will be
designed to carry at least 50kg.*
2) Redundancy (parallel): Multiple identical components are used simultaneously, any one of which would be capable of supporting normal product function. For example, a passenger lift has 4 cables carrying the lift cabin, all sharing the load. Any one cable would be capable of carrying the full passenger lift load, so a failure of up to 3 cables will not endanger the lift occupants (see the sketch after this list). Flight control and instrument systems in some aircraft adopt a similar strategy.
3) Redundancy (standby): A back-up system is held in reserve and comes into operation only when the main system fails,
for example stand-by generators in hospitals, and reserve parachutes. Standby redundancy is often described as a ‘belt &
braces’ approach.
4) Fail-safe design: Assumes an inherent risk of failure for which the cost of any of the above three strategies would be
prohibitively high. The product or system is designed to drop into a safe condition in the event of partial or total failure. For
example:
i. The gas supply to a domestic central heating boiler is shut off in the absence of a ‘healthy’ signal from the water pump, flame sensor,
water pressure sensor, or exhaust fan.
ii. Toys can be designed to fracture at pre-determined weak points so as to leave no sharp projections that would injure a child.
iii. Railway train brakes are released by vacuum, and applied by admitting air. If a brake pipe bursts, the admitted air automatically
applies the train brakes.

Source: A. Tayleo, B.SC. MA FRSA. Art and Engineering in Product Design - Designing for Reliability. 59
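The sketch below quantifies the parallel-redundancy argument from approach 2 above: the system survives as long as at least k of n identical items survive. The per-cable reliability is an assumed value for illustration.

```python
from math import comb

# Minimal k-out-of-n redundancy sketch for identical, independent items.
def k_out_of_n_reliability(k: int, n: int, r: float) -> float:
    """Reliability of a system needing at least k of n identical items."""
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))

r_cable = 0.99           # assumed reliability of one cable over the interval
print(f"Single cable:      {k_out_of_n_reliability(1, 1, r_cable):.6f}")
print(f"1-out-of-4 cables: {k_out_of_n_reliability(1, 4, r_cable):.10f}")
# With four cables, all four must fail (probability 0.01**4 = 1e-8) before
# the lift load is endangered.
```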
Understanding Design Margins

WHY is it important to understand Design Margins?
• Creates the ability to explore operating impact beyond operating point limits
• Know where the problem plant is (where failures are more than expected)
• Reverse engineering V&V
• Modification & scenario testing
• Adverse condition simulation
• Use the Design Basis to identify opportunities for reliability and availability improvements
• Test new configurations or alternative designs/options

Source: Parlour, 2007 60
Load Strength Analysis
Load–Strength analysis (LSA) is a procedure to ensure that all
load and strength aspects have been considered in deriving the
design, and if necessary, in the planning of tests. Load-Strength
analysis may begin at the early stages of the DESIGN phase and
continue through most of the DfR process as more data about
system characteristics become available.
The LSA should include the following:
• Determine the most likely worst case values and patterns of variation of load and strength.
• Evaluate the safety margin for intrinsic reliability.
• Determine protection methods (load limit, derating, screening, other quality control methods).
• Identify and analyse strength degradation modes.
• Test to failure to corroborate; analyse results.
• Correct or control (redesign, safe life, quality control, maintenance, etc.).

Source: www.reliasoft.com 61
Learning reference: https://www.youtube.com/watch?v=gN6MfUe7x1o
Designing against Fatigue
Design for reliability under potential fatigue conditions means either ensuring that the distributed
load does not exceed the critical load or designing for a limited ‘safe life’, beyond which the item is
not likely to be used or will be replaced in accordance with a maintenance schedule. The following
list gives the most important aspects that must be taken into account:
• Knowledge must be obtained on the material fatigue properties, from the appropriate data sources and, where necessary, by test. This should also consider processes (such as machining) which might affect fatigue.
• Stress distributions must be controlled, by careful attention to design of stress concentration areas such as holes,
fixings and corners and fillets. Location of resonant anti-nodes in items subject to vibration must be identified.
• Design for ‘fail safe’, that is, the load can be taken by other members or the effect of fatigue failure otherwise
mitigated, until the failed component can be detected and repaired or replaced. This approach is common in aircraft
structural design.
• Design for ease of inspection to detect fatigue damage (cracks), and for ease of repair.
• Use of protective techniques, such as surface treatment to relieve surface stresses, increasing surface toughness,
or provision of ‘crack stoppers’, fillets added to reduce the stress at crack tips.
• Care in manufacture and maintenance to ensure that surfaces are not damaged by scratches, nicks, or impact.
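When designing for a limited 'safe life', a common book-keeping device is Miner's cumulative damage rule, sketched below under an assumed load spectrum and assumed S-N (cycles-to-failure) figures; all numbers are illustrative, and any real safe-life estimate would then carry an additional safety factor.

```python
# Minimal sketch of Miner's cumulative damage rule for a 'safe life' estimate.
# (cycles applied per year, cycles to failure at that stress level)
load_spectrum = [
    (2.0e5, 5.0e6),   # low-stress cycles
    (3.0e4, 8.0e5),   # medium-stress cycles
    (1.0e3, 5.0e4),   # high-stress cycles
]

damage_per_year = sum(n_applied / n_to_failure
                      for n_applied, n_to_failure in load_spectrum)

safe_life_years = 1.0 / damage_per_year   # failure predicted when damage sums to 1
print(f"Damage accumulated per year = {damage_per_year:.3f}")
print(f"Predicted fatigue life ~ {safe_life_years:.1f} years (before any safety factor)")
```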

62
Formal Design Reviews
Importance of Design Reviews
• Peer review format is an ideal approach, where the engineers performing the reviews work closely with the
designers. The reviewer should ideally be a reliability engineer who can be respected by the designer as a
competent member of a team whose joint objective is the excellence of the design.
• A team approach makes it possible for designs to be adequately reviewed and analysed before drawings are
signed off, beyond which stage it is always more difficult and more expensive to incorporate changes.
• Each formal review must be based upon analyses of the design as it stands and supported by test data, parts
assessments, and so on.
• The reviews should be planned well in advance, and the designers must be fully aware of the procedure.
• All people attending must also be briefed in advance, so they do not waste review time by trying to understand
basic features. To this end, all attendees must be provided with a copy of all formal analysis reports (reliability
prediction, load–strength analysis, PMP review, maintainability analysis, critical items list, FMECA, FTA) and a
description of the item, with appropriate design data such as drawings.
• The designer should give a short presentation of the design and clear up any general queries.
• The formal Design Review meeting to freeze a system/product design should be a decision-making forum, and should not be bogged down with discussion of trivial points.

64
Importance of Parts, Materials and Processes (PMP) Review
• All new parts, materials and processes called up in the design should be identified. The designer can assume
that a part or material will perform as specified in the brochure and that processes can be controlled to
comply with the design. The reliability and quality assurance (QA) staff must ensure that this belief is well-
founded.
• New parts, materials and processes must be formally approved for production and added to the approved
lists.
• Materials and processes must be assessed in relation to reliability. This includes personnel-induced failures
during operation, storage, handling and maintenance.
• Non-material failure modes should be considered as well (e.g. wear, protection mechanism failure, etc).
• Field return and warranty data analysis can also be an invaluable source in identifying and addressing
potential reliability problems
• The main reliability considerations include:
• Cyclical loading. Whenever loading is cyclical, including frequent impact loads, fatigue must be considered.
• External environment. The environmental conditions of storage and operation must be considered in relation to factors
such as corrosion and extreme temperature effects.
• Wear. The wear properties of materials must be considered for all moving parts in contact.

65
Scope of Formal Design Reviews
• Formal design reviews should be conducted at significant check-points (e.g. at “Baseline Stage Gates”) with the
purpose of ensuring that the product being developed will meet reliability requirements.
• Design reviews should cover all phases of the life cycle. The reviews should be formal, independent and objective
assessments of the product and its intended support conditions and should be carried out by appropriate experts.
• The information used for design reviews should include:
• The Requirements specification
• Current reliability predictions;
• Identified potential design or support weaknesses;
• Fault mode and effects, fault tree, stress and load, human factors and trade-off analyses;
• Status of previous review actions;
• Verification and test results.
• Design reviews should be conducted and carefully documented at the following stages:
• Preliminary design review;
• Detailed design review;
• Final design review;
• Manufacturing design review;
• Installation design review;
• Use (Hand-over) design review.

Experts in the various fields of reliability, maintainability and maintenance support should participate in design review activities. 66
The supplier may either be requested, or directed, by the customer to have the customer participate in the supplier’s design review activities.
Maintaining the Integrity of the Plant Design & Knowledge Base
Formal design reviews are a primary mechanism used to maintain, validate and verify the integrity of the plant Design Base. Design review and Design Base integrity audits typically consider the following elements:
Sufficient
(What is good enough to confirm Design Base integrity?
By what criteria is this measured and is it accepted by
the regulatory authority?)

Verifiable
(what is necessary to demonstrate that the Design Base is
Correct and Complete?)

Maintainable
(What is necessary to maintain the Design Base and who is
responsible for Design Base and acts as Design Authority?)

Sustained
(How do we ensure/confirm the Design Base integrity is intact?)

67
Requirements Management
Why Requirements Management Matters!

69
Requirements Definition
• Requirement – a statement which translates or expresses a need and its associated constraints and conditions. (ISO/IEC/IEEE 29148:2011, Item 4.1.17 Requirement)
• Design requirements – an engineering requirement reflected in design output information (document and/or data) that defines the form, fit and function, including capabilities, capacities, physical sizes and dimensions, limits and set points, specified by the design authority for a structure, system or component of the facility. Each design requirement has a design basis, documented or not. (IAEA-TECDOC-1335 Annex А)
• Requirement – a part of the text which identifies needs in terms of operational/functional parameters, characteristics or limitations of the product or of a project process.
• Project Requirements – the scope of all requirements from the contract and the regulatory/normative documentation that apply to the NPP project, together with the project management processes for obtaining licences and permits from the regulator of the country where the NPP is being built.

70
The Requirements Eco-System

• Requirements Definition and Identification: Requirements identification and definition is the process of sourcing, documenting, analyzing, prioritizing and agreeing on requirements. It is a continuous process throughout a project, as requirements may change as the product or asset progresses through the asset lifecycle.
• Requirements Analysis & Decomposition: Requirements analysis, also called requirements engineering, is the process of determining user expectations for a new or modified product. These features, called requirements, must be quantifiable, relevant and detailed. In software engineering, such requirements are often called functional specifications.
• Requirements Commitments: Requirements commitments are the actions planned/undertaken by the Project or Facility to meet the requirements put forward by stakeholders.
• Requirements Realisation: Requirements realisation represents how one or more requirements are fulfilled in the design. This can take various forms. It may include, for example, a design document or drawing, test and inspection reports, etc. that show how a requirement was met.
• Requirements Tracing: Requirements tracing is the ability to describe and follow the life of a defined requirement, in both a forward and backward direction, through the entire Project/Asset's lifecycle.

Requirements management is the process of documenting, analyzing, tracing, prioritizing and agreeing on requirements and then controlling change and communicating to 71
relevant stakeholders.
Requirements Criteria
Requirements that are defined should meet the following criteria:
• Completeness: All possible scenarios through the system are described, including exceptions.
• Consistency: There are no contradicting requirements.
• Clarity/Unambiguity: The specification can only be interpreted one way.
• Correctness: Requirements represent accurately the system or product that the client needs.
• Realism: The system or product can be implemented within constraints.
• Verifiability: Tests can be designed to
demonstrate that the system or product fulfils
its requirements.
• Trace-ability:
• Requirements can be traced to system/product
functions.
• System/product functions can be traced to
requirements.
• Dependencies among requirements, system and
product functions; and everything else in between
these objects can be tracked (forward & backward).
72
Requirements Definition – Vehicle Design Example
Requirements can cover a wide variety of criteria as shown below. It will typically cover aspects of:
• Functionality and Desired Features
• Reliability
• Supportability
• Usability
• Performance
• Regulatory/Legal
• Security
• Useful Life
• Constraints (environmental, design, safety,
user, client)
• Risk Mitigation and Management

73
Requirements Management Model
• The model clearly depicts
how there is a process of
requirements identification,
followed by a realization
phase and finally a V&V
phase to prove that
requirements were met.

• It follows the classical V-Model concept.

74
All Projects start with a set of Requirements
[Process flow diagram: requirements management (URS/URM) drives document management, creation/update of the Design Basis (DBS) and plant design classification (PDC); reliability engineering analysis (REL) – RAM analysis (RAM), flow simulation analysis (FLS), fault tree analysis (FLT), FMECA/FMEA analysis (FMA), HAZOP analysis (HAZ), Weibull analysis (WBLS) and lifecycle analysis (LCA) – supports the plant design (SP PID, SPEL, SPI, SP 3D), the plant maintenance basis (PMB) and the operating tech specs (OTS). New engineering documents and drawings are generated and an engineering baseline (PBD) is declared; the design review process, interface reviews (IRV) and risk analysis reviews (RSK) then determine whether a change is required, in which case the change management process is initiated.]

Sources of Requirements:
• Requirements of the Contract
• Normative and regulatory documentation
• Internal requirements of the Design organisation

Classification of Requirements:
• Safety class
• Influence on the environment
• Discipline
• Responsible organisation
• Priority

75
Typical Asset Lifecycle Challenges caused by poor Requirements
Management
URS, FEED AND DESIGN STAGE
 Poorly defined requirements and project scope definition
 Complexity of Regulatory approvals and timelines involved not considered
 No Lifecycle Information Management or Handover Strategy, “weak” Contracts
 Engineering and Design Information delivery standards and consistency
 Scope and design creep – undocumented changes/client requests
 Poor Requirements Management – relationships not identified or maintained
CONSTRUCTION
 Management of Design Changes (especially across multiple similar units)
 Weak Design Basis document/drawing control specifications in Contracts (not issued during Tender phase)
 Updating Design Basis drawings and documents with field changes made
 Poor interface & LOSS management (can wreak havoc on 3D model and integration)
 Supply Chain challenges not understood (lead times, delivery and construction sequences)
 Relationships between project/contract and engineering/design artefacts not managed
 Poor Risk Management practices (analysis and preventive/remedial action implementation)
COMMISSIONING
 Performance and compliance requirements not fully documented
 Poorly defined commissioning and decommissioning procedures and methods
 Poor management of data-books, compliance documentation and certificates (hardcopy)

76
Typical Asset Lifecycle Challenges caused by poor Requirements
Management (Cont.)
O&M
 Documents accepted at handover that do not reflect “as built”
 Handover done as “big bang” and not incremental – information too much to properly QC or establish completeness
 All information needed for the O&M lifecycle phase not handed over
 Duplication of information/data across multiple systems (duplicate instead of integrate)
 Design Basis not managed adequately during O&M – field changes allowed but not documented

LIFEX
 Poorly defined original Design Basis and supporting documents
 Changes made to the Design Basis not maintained or captured – flawed new design
 A lot of reverse engineering required to get to the original design requirements
 Equivalence Management – virtually impossible without the Design Basis information

DECOMMISSIONING
 Poorly defined environmental disposal requirements (or not captured at all)
 Funding and planning for decommissioning scope of work not considered adequately in the asset lifecycle
 Changing environmental and legislative requirements not documented/communicated or fully understood
 Disposal of documents prematurely done – not understanding the need to have them for decommissioning
 Decommissioning procedures not always maintained and stored for end-of-life use

77
Why Requirements Management Matters…a day in the life of a Mega-Project
[Diagram: a single change on a mega-project touches the Design Basis and its artefacts (URS, VDSS/CDSS, PBS/Tags, Contract, Design Manual, Design Basis Package, 3D Model, WBS, Discipline Databooks), the engineering processes (design review and release for use, transmittals to all EPC’s, contract information requests (CIR), technical queries, compensation events/claims, change scope-of-work with feasibility study/concept design options, risk and SHEQ assessments, modification project registration with its own WBS, project plan, resources and budget, design interface impact assessments, ECN/FCN, implementation and progress tracking, commissioning packs, close-out and compliance/conformance verification) and the secondary business relationships (benefits realization monitoring, compliance certificates, quality records, non-conformance management, guarantee/warranty management, invoices, retention, progress/earned value monitoring, claim/compensation event review and approval). An object-based information management system is used to manage all these relationships to satisfy the requirements set – and there is never only one change happening at a time.]

78
Complexity of Requirements in Nuclear Plants - A Case Study
[Diagram: a new requirements submission (document) is analysed in a “Requirements Impact Analysis” to identify new or updated requirements, which flow into a new/updated Site Safety Report, design commitments and Design Basis documents. A new requirements-management object (a “placeholder/dossier” for the new requirement) is created in the IPI system, together with a new Requirements Compliance Commitment (RCC) and its acceptance criteria; it is linked to existing RCC’s, supporting documents, compliance V&V evidence, other site requirements objects and new/existing Safety Case objects, all held in an object-orientated relational database.]

An object-orientated relational database is the best possible solution to manage the complexity of requirements management in a Nuclear Facility.

79
Requirements Definition
• Requirements can be determined in many different ways, or through a combination of those
different ways.
• Project Requirements define how work will be managed – usually covered in Project Management Plan.
• Product Requirements provide high level features and capabilities that were committed for delivery to the customer.
These requirements will not specify how the features or capabilities will be designed.

• The system reliability requirement goal can be allocated


to the assembly level, component level or even down to
the failure mode level.
• Once the requirements have been defined, they must be
translated into design requirements and then into
manufacturing requirements. Requirements can be based on
• contracts,
• benchmarks,
• competitive analysis,
• customer expectations,
• cost,
• safety,
• best practices, etc.
Requirements Management – Processes, procedures and tools to record, consolidate, store, analyse and track regulatory requirements, design and licensing bases, and other process plant documentation in order to safely and cost effectively manage changes.
80
What are documented design requirements? The information to understand the design includes, amongst others:
accuracy, uncertainty, thermo-hydraulic, seismic, functional, interference, material, simulation, installation, lay-out, regulatory, standards, diversity, stress, human factors, pressure, ALARA, commissioning, design life, design constraints and assumptions, SAR, isolation, QA, event frequency limit, traceability, security, cost, 3D-CAD, design decision rationale, PSA, safety analysis, load, safety, validation, system specification, physics, HAZOPS, interface, component specification, labeling, reliability, operational, performance, civil/structural, electrical, system, in-service test, redundancy, FMEA, qualification, I&C, dynamic response, independence, mechanical, accessibility, maintenance, fire protection, chemistry, dimensional, defense in depth, decommissioning, operating limits and conditions, and emergency response requirements.

Requirements/Technical Specifications are codified and contain (explicit) design knowledge supported by:
• Artifacts (document outputs, reports)
• Information (facts, figures, records, tests)
• Data (parameters, measurements, dimensions, uncertainties, etc.)

81
Best Practices for Requirements Definition/Identification
• Identify and involve all stakeholders
• Requirements management is an ongoing, iterative process conducted throughout the product/asset
lifecycle.
• Defined Requirements should be reviewed and formally approved by the business owners or customer.
• Defined Requirements should be centrally and very clearly documented in a controlled tracking system or Requirements Log. Where there are unclear requirements, processes should exist to discuss these between all stakeholders to achieve common agreement on the requirement.
• Each Requirement identified should have a unique identifier and should be recorded as a single entry
(against which commitments and realization of commitments can be tracked and measured).
• Trace-ability should be centrally documented in a control system or Requirements Log.
• Regular reviews should be conducted on Requirements and their traceability. Depending on the
complexity of the project, the review can be daily, but generally happens at least weekly.
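To illustrate the bullets above on unique identifiers, single entries and trace-ability, a minimal sketch of a requirements-log record is shown below. The field names, statuses and example values are illustrative assumptions only, not a prescribed schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Requirement:
    """One entry in a controlled Requirements Log (illustrative structure only)."""
    req_id: str                 # unique identifier, e.g. "REQ-REL-0042"
    text: str                   # the agreed requirement statement
    source: str                 # contract clause, regulation, customer need, etc.
    owner: str                  # business owner / customer who approved it
    status: str = "Draft"       # Draft -> Approved -> Realised -> Verified
    commitments: List[str] = field(default_factory=list)   # planned actions
    realisations: List[str] = field(default_factory=list)  # design docs, drawings
    verifications: List[str] = field(default_factory=list) # test/inspection reports

    def is_traceable(self) -> bool:
        # A forward trace exists once at least one commitment and realisation is linked
        return bool(self.commitments) and bool(self.realisations)

# Example: record a reliability requirement and link its trace objects
r = Requirement(
    req_id="REQ-REL-0042",
    text="Pump train availability shall be at least 98% over 8760 h.",
    source="Contract Annex B, clause 4.2",
    owner="Plant Owner/Operator",
)
r.commitments.append("RAM analysis RAM-017")
r.realisations.append("Design report DR-112 rev 2")
print(r.req_id, "traceable:", r.is_traceable())
```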
82
Dealing with Non-Functional Requirements Specifications
Non-Functional Requirements (NFR)
• In systems engineering and requirements engineering, a non-functional requirement (NFR) is
a requirement that specifies criteria that can be used to judge the operation of a system, rather
than specific behaviours. They are contrasted with functional requirements that define specific
behaviour or functions.
• Non-functional requirements describe how the system works, while functional requirements
describe what the system should do.
• Non-functional requirements essentially specify how the system should behave and act as constraints upon the system’s behaviour. One could also think of non-functional requirements as quality attributes of a system.
• Typically, Functional Requirements will specify a behaviour or function, e.g.:
“Display the name, total size, available space and format of a flash drive connected to the USB port.”

84
Functional vs. Non-Functional Requirements – An IT Example
Some of the more typical functional requirements include:
• Business Rules
• Transaction corrections, adjustments and cancellations
• Administrative functions
• Authentication
• Authorization levels
• Audit Tracking
• External Interfaces
• Certification Requirements
• Reporting Requirements
• Historical Data
• Legal or Regulatory Requirements

Some typical non-functional requirements are:
• Performance
• Response Time
• Throughput and Utilisation
• Scalability
• Capacity
• Availability
• Reliability
• Recoverability
• Maintainability
• Serviceability
• Security
• Regulatory
• Manageability
• Environmental
• Data Integrity
• Usability
• Interoperability

85
Requirements Identification Methods
Quality Function Deployment (QFD) in Requirements Management
• A commonly used methodology is the Quality Function Deployment (QFD) approach using what is
commonly called the House of Quality tool. It provides
• A requirements Planning capability
• A tool for graphic and integrated thinking
• A means to capture and preserve engineering thought processes during the product development cycle
• A means to communicate the thought processes to new members of the QFD team
• A means to inform management regarding inconsistencies between requirements, risks and the needs of the customer.

• Quality professionals refer to QFD by many names, including matrix product planning, decision
matrices, and customer-driven engineering.
• This is a systematic tool to translate customer requirements into functional requirements, physical
characteristics and process controls and then effectively responding to those needs and
expectations.
• In QFD, quality is a measure of customer satisfaction with a product or a service.

87
Quality Management Planning Tools used in QFD
The seven quality control tools used in Quality Function Deployment (QFD), listed in an order that moves from
abstract analysis to detailed planning, are:
• Affinity Diagram: organizes a large number of ideas into their natural relationships.
• Relations Diagram: shows cause-and-effect relationships and helps you analyse the natural links between different
aspects of a complex situation.
• Tree Diagram: breaks down broad categories into finer and finer levels of detail, helping you move your thinking step by
step from generalities to specifics.
• Matrix Diagram: shows the relationship between two, three or four groups of information and can give information about
the relationship, such as its strength, the roles played by various individuals, or measurements.
• Matrix Data Analysis: a complex mathematical technique for analysing matrices, often replaced in this list by the similar
prioritization matrix. One of the most rigorous, careful and time-consuming of decision-making tools, a prioritization matrix is
an L-shaped matrix that uses pairwise comparisons of a list of options to a set of criteria in order to choose the best
option(s).
• Arrow Diagram: shows the required order of tasks in a project or process, the best schedule for the entire project, and
potential scheduling and resource problems and their solutions.
• Process Decision Program Chart (PDPC): systematically identifies what might go wrong in a plan under development.

Source: Nancy R. Tague. The Quality Toolbox, Second Edition, ASQ Quality Press, 2004. 88
House of Quality Method – An Example
• All requirements are mapped into the “house” in order to derive
a view on how customer satisfaction can best be achieved.

Quality function deployment (QFD) makes use of the Kano model in terms of the structuring of the comprehensive QFD matrices. Mixing Kano types in QFD matrices can lead to distortions in the customer weighting of product characteristics. Kano's model provides the insights into the dynamics of customer preferences to understand these methodology dynamics.
89
The Voice of the Customer (VoC) – the KANO Model
The Kano model is a theory of product development and customer satisfaction developed in the
1980s by Professor Noriaki Kano, which classifies customer preferences into five categories.

• Must-be Quality: requirements that the customers expect and are taken
for granted - these are considered “Must-be” criteria because they are the
requirements that must be included and are the price of entry into a market
• One-dimensional Quality: These attributes result in satisfaction when
fulfilled and dissatisfaction when not fulfilled.
• Attractive Quality : These are attributes that are not normally expected.
Since these types of attributes of quality unexpectedly delight customers,
they are often unspoken.
• Indifferent Quality : Aspects that are key to the design and manufacturing
of a product, but consumers are not even aware of the requirement and
may not affect the customer directly.
• Reverse Quality: A high degree of achievement resulting in dissatisfaction
and directly relates to the fact that not all customers are alike. E.g., some
customers prefer high-tech products, while others prefer the basic model of
a product and will be dissatisfied if a product has too many extra features.

90
The Voice of the Customer – Affinity Diagrams
• An affinity diagram (sometimes also called a KJ method diagram) is a useful way to group tasks,
facts or ideas according to themes (and to evaluate whether there are natural patterns or groupings
in the information). It is particularly useful when you have a large and complex problem that you
want to understand.
• The steps of creating an affinity diagram are:
• Write each item on a card.
• Group related cards into themes. Continue until all the cards are grouped (some may be in groups of 1 card).
If this is being done as a team, it is usually recommended that it is done in silence (to avoid unduly influencing each
other).
• Discuss any patterns that have arisen.
• For best results it is recommended that this activity is
carried out by a team of people with varied
backgrounds and functions.

91
The Voice of the Customer – Pair-wise Comparisons

• A pairwise comparison starts with preferential voting, which is an election method that requires
voters to rank all the requirements in order of their preference/importance.
• This is a very popular customer survey method.
• Some considerations:
• It can be very subjective.
• Questions can be designed to pressure the consumer towards
a specific outcome
• If not well-defined, its application in real life can be limited.
• A principal weakness is that the method fails to take into
account other preferences beyond the first choice.
• It also does not always provide clear results.
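A minimal sketch of how pairwise-comparison votes can be tallied into a ranking follows; the ballots, requirement names and simple win-count scoring rule are illustrative assumptions, not a prescribed survey design.

```python
from itertools import combinations
from collections import Counter

# Hypothetical voter rankings of four requirements (most to least important)
ballots = [
    ["Reliability", "Safety", "Usability", "Cost"],
    ["Safety", "Reliability", "Cost", "Usability"],
    ["Reliability", "Usability", "Safety", "Cost"],
]

requirements = sorted({req for ballot in ballots for req in ballot})
wins = Counter({req: 0 for req in requirements})

# For every pair, count which requirement is ranked higher on more ballots
for a, b in combinations(requirements, 2):
    a_pref = sum(ballot.index(a) < ballot.index(b) for ballot in ballots)
    b_pref = len(ballots) - a_pref
    if a_pref > b_pref:
        wins[a] += 1
    elif b_pref > a_pref:
        wins[b] += 1

# Rank by number of pairwise wins; ties remain unresolved, which mirrors the
# "does not always provide clear results" weakness noted above
for req, score in wins.most_common():
    print(req, score)
```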

92
Techniques to set System Reliability Goals
Managing Customer Expectations

• Managing requirements is an
ongoing process.
• It is critical to keep on
engaging customers or users
of end product to ensure that
expectations and
requirements will be met.
• The Technology Hype Curve
shows the typical customer
behaviour when a new system
or product is introduced.

94
Reliability Allocation
• The first step in the design process is to translate the overall system reliability requirement into
reliability requirements for each of the subsystems. This process is known as reliability allocation.
• The problem is to establish a procedure that yields a unique or limited number of solutions by
which consistent and reasonable reliabilities may be allocated.
• The reliability parameters apportioned to the subsystems are used as guidelines to determine
design feasibility.
• If the allocated reliability for a specific subsystem cannot be achieved at the current state of
technology, then the system design must be modified and the allocations reassigned.
• This procedure is repeated until an allocation is achieved that satisfies the system level
requirement and all constraints, and results in subsystems that can be designed within the
reliability parameters defined for the design.
• It should be noted that the allocation process can, in turn, be performed at each of the lower levels
of the system hierarchy, e.g. equipment, module, component.

95
Reliability Allocation (Cont.)
In the event that it is found that even with re-allocation some of the individual subsystem
requirements cannot be met within the current design, then the designer must use one or any
number of the following approaches (assuming that they are not mutually exclusive) in order to
achieve the desired reliability:
1. Find more reliable component parts to use
2. Simplify the design by using fewer component parts, if this is possible without degrading performance
3. Apply component derating techniques to reduce the failure rates below the averages
4. Use redundancy for those cases where 1, 2 and 3 above do not apply

96
Equal Technique
• In the absence of definitive information on the system, other than the fact that n subsystems are to
be used in series, equal apportionment to each subsystem would seem reasonable.
• In this case, the nth root of the system reliability requirement would be apportioned to each of the n
subsystems.
• The equal apportionment technique assumes a series of n subsystems, each of which is to be
assigned the same reliability goal.
• A prime weakness of the method is that the subsystem goals are not assigned in accordance with
the degree of difficulty associated with achievement of these goals.
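The apportionment formula itself did not survive the slide extraction; consistent with the variable definitions below, the standard equal apportionment relation is:

```latex
\prod_{i=1}^{n} R_i^{*} = R^{*}
\quad\Longrightarrow\quad
R_i^{*} = \left(R^{*}\right)^{1/n}, \qquad i = 1, 2, \ldots, n
```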

where
R* is the required system reliability
R*i is the reliability requirement apportioned to
subsystem i

97
AGREE Technique

• A method of apportionment for electronic systems, outlined by the Advisory Group on the Reliability of Electronic Equipment (AGREE), takes into consideration both the complexity and importance of each subsystem.
• The importance, wi, is the probability that the system fails given that a module, i, is critical and fails.
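The AGREE allocation formula is not rendered in the extracted slide. Written per module, and assuming for this sketch that each of the n modules carries equal complexity weight, the allocated failure rate is commonly expressed as:

```latex
\lambda_i = \frac{-\ln R^{*}(t)}{n \, w_i \, t_i}, \qquad i = 1, 2, \ldots, n
```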
Where
i = a counter representing each module, i = 1, 2, 3 …, n
t = system operating time
R*(t) = system reliability requirement at time t
ti = operating time of module i
λi = failure rate of module i
wi = probability that the system fails given that module i is critical and fails

98
Feasibility of Objectives Technique
This technique was developed primarily as a method of allocating reliability without repair for
mechanical-electrical systems. In this method, subsystem allocation factors are computed as a
function of numerical ratings of system intricacy, state of the art, performance time, and
environmental conditions.
These ratings are estimated by the engineer on the basis of his experience. Each rating is on a
scale from 1 to 10, with values assigned as discussed:
• System Intricacy: Intricacy is evaluated by considering the probable number of parts or components making up the
system and also is judged by the assembled intricacy of these parts or components. The least intricate system is rated at
1, and a highly intricate system is rated at 10.
• State of the Art: The state of present engineering progress in all fields is considered. The least developed design or
method is a value of 10, and the most highly developed is assigned a value of 1.
• Performance Time: The element that operates for the entire mission time is rated 10, and the element that operates the
least time during the mission is rated at 1.
• Environment: Environmental conditions are also rated from 10 through 1. Elements expected to experience harsh and
very severe environments during their operation are rated as 10, and those expected to encounter the least severe
environments are rated as 1.
99
Feasibility of Objectives Technique (Cont.)
• The ratings are assigned by the design engineer based upon his engineering know-how and
experience. They may also be determined by a group of engineers using a voting method such as the
Delphi technique.
• An estimate is made of the types of parts and components likely to be used in the new system and what
effect their expected use has on their reliability. If particular components had proven to be unreliable in
a particular environment, the environmental rating is raised.
• The four ratings for each subsystem are multiplied together to give a rating for the subsystem. Each
subsystem rating will be between 1 and 100,000. The subsystem ratings are then normalized so that
their sum is 1.
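The allocation formulas did not survive the slide extraction; consistent with the variable definitions below, the usual relations are:

```latex
W_i = \prod_{k=1}^{4} r_{ik}, \qquad
C_i = \frac{W_i}{\sum_{j=1}^{N} W_j}, \qquad
\lambda_i = C_i \, \lambda_S
```

The operating duration T in the legend can then be used, if a reliability value is needed, via the exponential relation $R_i = e^{-\lambda_i T}$.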
where:
T is the operating duration.
λS is the system failure rate.
λi is the allocated subsystem i failure rate.
Ci is the percent weighting factors of the ith subsystem.
Wi is the composite rating for the ith subsystem.
N is the total number of subsystems.
rik is the kth rating result for the ith subsystem.

100
ARINC Technique

• The ARINC apportionment method was designed by ARINC Research Corporation, a subsidiary of
Aeronautical Radio, Inc.
• The method assumes that all subsystems are in series and have an exponential failure distribution.
• From the present failure rates of the subsystems, improved (allocated) subsystem failure rates are derived based on weighting factors.
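The weighting relations are not rendered in the extracted slide; consistent with the definitions below, they take the standard form:

```latex
w_i = \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j}, \qquad
\lambda_i' = w_i \, \lambda_S, \qquad
\sum_{i=1}^{n} \lambda_i' = \lambda_S
```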

where:
n is the total number of subsystems.
λi is the present failure rate of the ith subsystem.
λS is the required system failure rate.
λ′i is the failure rate allocated to the ith subsystem.

101
Repairable Systems Allocation Technique

• The Repairable Systems apportionment technique is designed specifically for repairable systems.


• Repairable Systems apportionment allocates subsystem failure rates to allow the system to meet
an availability objective for a repairable system.
• This technique assumes all subsystems to be in series, with exponential failure distributions and
constant repair rates.
• By determining the ratio of the allocated failure rate to the repair rate for each subsystem based on
a steady-state availability calculation, the failure rate allocated to each subsystem can be
determined.
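The slide's equations were not extracted; a sketch of the usual relations, assuming the availability target is shared equally across the n series subsystems, is:

```latex
A_i = A_s^{1/n}, \qquad
A_i = \frac{u_i}{u_i + \lambda_i}
\;\Rightarrow\;
\theta_i = \frac{\lambda_i}{u_i} = \frac{1 - A_i}{A_i}, \qquad
\lambda_i = \theta_i \, u_i
```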

where:
As is the required system availability.
Ai is the allocated availability for the ith subsystem.
n is the total number of subsystems.
θi is the ratio of allocated failure rate to the repair rate for the ith subsystem.
ui is the repair rate for the ith subsystem.
λi is the allocated failure rate for the ith subsystem.

102
Cost-Based RS-Allocation Technique/Method
• System reliability optimisation is possible through reliability allocation at the component level,
approaching it as a nonlinear programming problem using a general cost equation. This cost
function is easy to use since it is simple in its form, with parameters that can be easily quantified.
• The advantage of the model is that it can be applied to any system with high complexities. The
technique is effective for small and large-scale systems.
• The primary aim is to obtain a relationship for the cost of
each component as a function of its reliability. It therefore
closely resembles activity based costing (ABC).
• Once the reliability requirement for each component is
estimated, one can then decide whether to achieve this
reliability by fault tolerance or fault avoidance.
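The cost function is not reproduced in the extracted slide. The exponential cost-versus-reliability function proposed in the cited Mettas (2000) reference is commonly quoted as below; treat the exact form as recalled from that reference rather than taken from this slide:

```latex
c_i(R_i) = \exp\!\left[(1 - f_i)\,\frac{R_i - R_{i,\min}}{R_{i,\max} - R_i}\right]
```

Here f_i expresses the feasibility of improving component i, and R_{i,min} and R_{i,max} are its current and maximum achievable reliabilities.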

Source: Reliability Allocation and Optimization for Complex Systems, A. Mettas (2000). 103
Importance of Configuration Management
What is Configuration Management (CM)
“An integrated set of processes for the operation & support of (nuclear) plants. CM encompasses the
physical configuration change, design requirements change, and facility configuration information
change activities. It [requires that] the processes … to construct, operate, & maintain the facility are
controlled & managed.” - INPO
• Establishment of design control measures, processes and procedures to obtain a
construction and operating license. To achieve this Facility Configuration Information (FCI) in
the form of documents, databases, 2D and 3D models and other records must be:
• Identified, collected, and managed to maintain the design basis, licensing basis and overall safe
operation of the NPP
• Linked to Structures, Systems, & Components (SSCs) to satisfy the facility’s design and licensing
requirements
• Correlated with operating software, ISI/IST, Equipment Qualification, Motor-operated Valve testing, etc.
• The Organisation must have the ability to perform impact analyses for evaluating
how proposed physical design or regulatory changes will affect SSCs in the plant
• A CM culture should be ingrained in the organisation and requires early and
continued involvement by the Owner/Operator:
• Implementing CM early in the design/licensing phases
• Determining what data is to be handed over, use of the information, format, storage systems,
metadata schema to organize and search the data, etc.
• Actual progressive electronic FCI handover process leading to plant turnover.

Requirements Management and Design Basis is at the Heart of Regulatory Compliance 105
Configuration Management as enabler for Requirements Management
[Diagram: across the asset lifecycle (R&D → Design → Construction → Operation; calculations, concept design, detail design, procurement, commissioning, R&M) the pump is tracked through four configuration views:]
• Requirements to Pump – i.e. why the designer chose the Planned Pump and Typical Pump. Parameter examples: Pump Type; Requirement to Pressure.
• Planned Pump – the pump planned within a functional location to meet the design requirement (i.e. what the designer plans to use). Parameter examples: Tag/FLOC (KKS); Design Pressure.
• Typical Pump – what meets the design requirements (the manufacturer's pump from the catalogue). Parameter examples: Model ID (Commodity Code); Maximum Pressure.
• Actual Pump – the COTS pump actually installed on the plant, with a unique serial number. Parameter examples: Serial Number; Operational Pressure.

Requirements Management and the Design Basis are at the heart of Regulatory Compliance – without a well-established configuration management programme, it would be extremely difficult to prove compliance to requirements throughout the entire asset lifecycle.
106
Proper Plant Configuration Management is mission critical for
Maintenance Basis Management
[Diagram: the relationships between the MODEL (catalogue – e.g. a pump catalogue – with its Rated Value), the TAG/functional location (FLOC, with its Design Specification Value) and the ASSET (with its Operating Value/Capability). If this goes wrong, the ability to manage the Design Basis and requirements is seriously jeopardized.]

Managing this information makes sure we put the correct piece of equipment into the plant design functional location.
It helps us in cases where the old equipment can no longer be bought and we have to find an alternative.
Flexibility to put lower cost compatible units into plant that still meet the design specification.
Confirms that we are not operating our plant outside of its capability as per the Design Basis.
107
Lifecycle Impact of Inadequate CM

The main CM challenges are usually the result of the long-term operation of any process plant.
Issues are caused primarily by aging plant technology, plant modifications, the application of new
safety and operational requirements and in general by human factors arising from migration of plant
personnel and possible human failures.
Poor CM has the following impacts:
• Procurement is not able to procure the right materials when needed. This impacts project cost and
completion schedule.
• Installation (Fabrication & Construction) is not able to install items and packages when needed. This has
an impact on cost and schedule to complete the plant.
• Commissioning is delayed in the plant start-up phase. This impacts the plant project schedule and
immediate revenue opportunity.
• Operations is not able to perform maintenance actions, which impacts plant reliability with potential safety
consequences.
108
Reliability Baselines
Baseline Management

• A Baseline is a line that is a base for measurement or for construction, thus a specific point of
reference at a specific point in time.
• Baseline management is a key activity undertaken as part of the Configuration Management
Process. Configuration management is the process of managing change in plant, systems,
hardware, software, firmware, documentation, measurements, etc.
• As change requires an initial state and next state, the marking of significant states within a series of
several changes becomes important.
• The identification of significant states within the revision history of a configuration item is the
central purpose of baseline identification and management.
• Significant states are those that receive a formal approval status, either explicitly or implicitly and
approval is usually recognised publicly.
• Many projects even include Financial Baselines, which usually act as stage gates to proceed with a project/product development (or not).

110
Baseline Reviews
A large project will typically undergo a number of formal
reviews over the asset lifecycle to ensure that
requirements have been met.
• Concept Design Review
• System User Requirements Review
• Design Definition Review
• Preliminary Design Review
• Critical Design Review
• Production Readiness Review
• Test Readiness Review
• System Acceptance Review
• Commissioning Readiness Review
• Operational Readiness Review

The graphical example demonstrates the baseline reviews (technical) as well as the financial stage gates referred to in the previous slide. This model suggests that both technical and financial acceptance criteria must be met before the project/product can move to the next stage in the asset lifecycle.
111
Typical Engineering Baselines
• Pre-Feasibility Baseline: An early engineering/project baseline that ensures a basic agreement between the
Client and the Engineer/Designer on what is deemed to be feasible options to investigate. This is typically used in cases
where there is potentially more than one solution to the Client requirements. This is also known as the “Feasibility Study”
phase. The Client will be presented with a number of options to solve their requirements, which will vary on technical
scope, cost, complexity, timelines, etc. The Client will make a decision at this point, given the financial, technical, time
constraints and other criteria on which option should be further explored.
• Concept Design Baseline: This is a basic design baseline where the engineer will perform some basic design
and generate a preliminary high level design for the Client based on the option chosen from the feasibility study. This will
outline the basic design and set the stage for the detail design requirements, and will contain high level drawings like PFD’s,
SLD’s, etc. and the typical expected performance criteria that will be achieved.
• Detail Design Baseline: A detail design baseline is one where the plant conceptual design is taken to the detailed
level required to fully build and successfully implement the design in reality. It will contain detailed level drawings and will
contain all the required design specifications to build and/or procure the plant, system and/or component. This is one of the
most critical baselines for the Engineering function, as a “Design Freeze” needs to be obtained at the end of this stage that
defines the way forward for all baselines that follow. Once Design Freeze is obtained, it also means that any changes to
the approved design must then go through a formal change management process where such changes have to be
reviewed and approved before it can be implemented.

112
Typical Engineering Baselines
• Procurement Baseline: A commercial engineering baseline that provides the Commercial/Procurement group with a fixed and specified listing of the plant, equipment, parts and resources that are required to successfully
execute the project.
• Construction/Build Baseline: A plant construction (EPC) baseline that defines a specific point in time where the
design are considered complete enough to commence construction and building of the plant, system and/or components.
This is usually the baseline where numerous design reviews have taken place that confirmed that the plant design is
construct-able, maintainable and operable.
• Commissioning/Start-up Baseline: A plant construction (EPC) baseline that defines a specific point in time
where construction activities are considered complete enough to commission a part of plant (within defined boundaries).
This baseline usually confirms that all critical plant components and plant isolations are in place to allow for successful
commissioning of the unit or system. It will also confirm that the plant is ready for start-up and thus handover to the
Owner/Operator of the plant.
• Owner Handover Baseline: A construction baseline point where the plant is deemed ready for hand-over to the
Owner/operator to use the plant/system for production purposes. At this point, the plant performance baseline has
been captured and confirmed to be aligned with the original performance criteria specified in the URS. This
handover generally requires not only a hand-over of the PHYSICAL asset, but also the INFORMATION asset that was
generated over the project lifecycle.
113
Software Reliability in Engineering Systems

“In architecting a new software program, all the serious mistakes are
made on the first day.”
Robert Spinrad, 1988
History of RE in Software and Engineering Systems
History of Software in Engineering Systems

• Software is now part of the operating system of a very wide range of products and systems, and
this trend continues to accelerate with the opportunities presented by low cost micro-controller
devices.
• Software is relatively inexpensive to develop, costs very little to copy, weighs nothing, and does
not fail in the ways that hardware does.
• The software ‘technology’ used today is the same basic sequential digital logic first applied in
the earliest computers. The only significant changes have been in the speed and word length
capability of processors and the amount of memory available, which in turn have enabled the
development of high-level computer languages and modern operating systems.

116
Requirements Management in Software

As with any other product, management of requirements in software development is critical to ensure that an end-state solution is delivered that meets the end-user requirements.

117
Typical common software system requirements

The basic requirements that need to be considered as part of design interface include:
• Reliability
• Maintainability
• Standardisation
• Interoperability
• Safety
• Security
• Usability
• Environmental (specifically in disposal phase of ICT infrastructure) and HAZMAT (some IT
hardware have highly toxic materials within the product)
• Privacy, particularly for computer systems
• Legal

118
System Development Lifecycle (SDLC)

119
The Waterfall Model

The Waterfall model is a classic SDLC model, which was first proposed by H.D. Benington at a symposium on advanced programming methods for digital computers in 1956. The original waterfall model has 5 stages: requirements analysis, design, implementation, verification, maintenance. (https://tech-talk.org/2015/01/21/system-development-life-cycle-sdlc-approaches)
120
Software System Analysis and Design Approaches

The Systems (or Software) Development Life Cycle (SDLC) is a domain of competency used in systems engineering, information systems and software engineering to describe a process for planning, creating, testing, and deploying an information system.
121
Validation and Verification – the V-Model Approach in Software

122
Validation and Verification – The Prototyping Approach
• Software prototype based approaches fall into the scope of iterative software development paradigms. It focuses on iteratively improving a software prototype with inputs from the stake-holders while fine tuning it.
• It is similar to the RAD approach to software development, where there is less emphasis on planning tasks and more emphasis on development.
• Both are iterative in nature, in how they develop the final system.
123
Design Trade-off in Software – Example Software Languages

• As with many other engineering applications, software engineering also entails design trade-offs to achieve the desired reliability.
• The example shows how different program languages will impact reliability and cost of developing software.

124
Source: Addison-Wesley. 2012
Software Reliability in Engineering Systems
Software System Items/Components
Software item data is fundamental to reliability assessment based on software properties or on
stochastic reliability models, and is also important for assessment based on process models. A
software item may be any of the following:
• System: a complete software entity such as an application program, utility program, tool, operating system, embedded
control program, etc. It is free-standing in that it is not essentially part of a larger system (although it will have interfaces to
other hardware and software items). It may be regarded by the developer as a separate commercial product.
• Sub-system: a self-contained part of a larger system, with a defined function and a defined interface to other sub-
systems. For some purposes a sub-system may be regarded as a system in its own right, and may consist of smaller sub-
systems. A system may therefore have a hierarchical structure, consisting of many levels of sub-system.
• Module: the smallest self-contained unit of software. A module is the lowest level of sub-system. It may possess its own
internal structure but is regarded as atomic for purposes of reliability assessment. The level of structure below which the
system is not further decomposed depends on the level of detail at which it has been decided to perform the assessment.
• For some types of software, e.g. object-oriented, client-server based, window-based, etc. the parts may be described
differently, e.g. “class of object” may be used in place of sub-system or module.
• Certain software documents, e.g. functional specifications, user manuals, etc., are part of the software product and
should be managed and recorded as such in the overall software configuration management system, with appropriate
identifiers, version numbers, titles, etc. Documents may be regarded as “sub-system” for recording purposes.
126
Basic Software Concepts
• Every copy of a computer program is identical to the original, so failures due to variability cannot
occur.
• Also, software does not degrade, except in a few special cases, and when it does it is easy to
restore it to its original standard. Therefore, a correct program will run indefinitely without failure,
and so will all copies of it.
• However, software can fail to perform the function intended, due to undetected errors.
• When a software error (‘bug’) does exist, it exists in all copies of the program, and if it is such as to
cause failure in certain circumstances, the program will always fail when those circumstances
occur.
• Software failures can also occur as a function of the machine environment or the fact that some
software was not upgraded with patches or fixes.
• When software is an integral part of a hardware-software system, system failures might be caused
by hardware failures or by software errors. When humans are also part of the system they can also
cause failures.
127
Software Reliability and Obsolescence
• Software reliability, does not show the same
characteristics as IT hardware (which
follows the usual Weibull “bathtub” curve).
• Software does not have an increasing
failure rate as hardware does. Towards the
end-of-life phase, software is approaching obsolescence; there is no motivation for
any upgrades or changes to the software.
Therefore, the failure rate will not change.
• In the useful-life phase, software will
experience a drastic increase in failure rate
each time an upgrade is made.
• The failure rate levels off gradually, partly
because of the defects found and fixed after
the upgrades.

https://users.ece.cmu.edu/~koopman/des_s99/sw_reliability/ 128
Hardware and Software Reliability Characteristics
Hardware: Failures can be caused by deficiencies in design, production, use and maintenance.
Software: Failures are primarily due to design faults.

Hardware: Failures can be due to wear or other energy-related phenomena.
Software: There are no wear-out phenomena. Software failures occur without warning.

Hardware: No two items are identical. Failures can be caused by variation.
Software: There is no variation: all copies of a program are identical (there may be software patches/fixes and potential version issues that can create variation).

Hardware: Repairs can be made to make equipment more reliable.
Software: There is no repair. The only solution is redesign (reprogramming).

Hardware: Reliability can depend on burn-in or wear-out phenomena; that is, failure rates can be decreasing, constant or increasing with respect to time.
Software: Reliability is not so time-dependent.

Hardware: Reliability may be time-related, with failures occurring as a function of operating time.
Software: Reliability is not time related. Failures occur when a specific program step or path is executed or a specific input condition is encountered, which triggers a failure.

Hardware: Reliability may be related to environmental factors (temperature, vibration, humidity, etc.).
Software: The external environment does not affect reliability except insofar as it might affect program inputs.

Hardware: Reliability can be predicted, in principle but mostly with large uncertainty, from knowledge of design, parts, usage, and environmental stress factors.
Software: Reliability cannot be predicted from any physical bases, since it entirely depends on human factors in design.

129
Hardware and Software Reliability Characteristics
Hardware: Reliability can be improved by redundancy. The successful use of redundancy presumes ready detection, isolation, and switching of assets.
Software: Reliability cannot be improved by redundancy if the parallel paths are identical, since if one path fails, the other will have the error.

Hardware: Failures can occur in components of a system in a pattern that is, to some extent, predictable from the stresses on the components and other factors. Reliability critical lists are useful to identify high risk items.
Software: Failures are rarely predictable from analyses of separate statements. Errors are likely to exist randomly throughout the program, and any statement may be in error. Most errors lie on the boundary of the program or in its exception handling. Reliability critical lists are not appropriate.

Hardware: Hardware interfaces are visual.
Software: Software interfaces are conceptual rather than visual.

Hardware: Computer-aided design systems exist that can be used to create and analyse designs.
Software: There are no (or very limited) computerized methods for software design and analysis.

Hardware: Hardware products use standard components as basic building blocks.
Software: There are no standard parts in software, although there are standardised logic structures. Software reuse is being deployed, but on a limited basis.

130
Software in Engineering Systems
Engineering Software vs. other Software Applications
The software that forms an integral part or sub-system of an engineering system is in some
important ways different from software in other applications, such as banking, airline booking,
logistics, etc. Differences are:
• Engineering programs are ‘real time’: they must operate in the system timescale, as determined by the system
clock and as constrained by signal propagation and other delays (switches, actuators, etc.).
• Engineering programs share a wider range of interfaces with the system hardware. In addition to basic
hardware, other interfaces include measurement sensors, A/D and D/A converters, signal analysers, switches,
connectors, etc.
• Engineering programs might be ‘embedded’ at different levels within a system.
• There is often scope for alternative solutions to design problems, involving decisions on which tasks should be
performed by hardware (or humans) and which by software.
• Engineering software must sometimes work in electrically ‘noisy’ environments and is thus exposed to high
electro-magnetic fields, so that data might be corrupted.
• Engineering programs are generally, though not always, rather smaller and simpler than most other
applications.

132
Fallacies of Distributed ICT systems
• The network is reliable
• Latency is zero
• Bandwidth is infinite
• The network is secure
• Topology doesn't change
• There is one administrator
• Transport cost is zero
• The network is homogeneous

Many of these fallacies drive the design of Chaos Engineering experiments such as “packet-loss attacks” and
“latency attacks”. For example, network outages can cause a range of failures for applications that severely
impact customers. Applications may stall while they wait endlessly for a packet. Applications may permanently
consume memory or other Linux system resources. And even after a network outage has passed, applications
may fail to retry stalled operations, or may retry too aggressively. Applications may even require a manual
restart. Each of these examples needs to be tested and prepared for.
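A minimal sketch of the kind of latency-injection check described above follows. The injected delays, timeout and retry policy are illustrative assumptions for a self-contained simulation, not Gremlin's API or a real fault-injection tool.

```python
import random
import time

def flaky_dependency() -> float:
    """Simulated downstream call; occasionally a 'latency attack' makes it slow."""
    injected = random.choice([0.01, 0.02, 0.5])  # 0.5 s models injected latency
    time.sleep(injected)
    return injected

def call_with_timeout_and_retry(timeout_s: float = 0.1, max_retries: int = 3) -> str:
    """Behaviour under test: bounded waits, limited retries with backoff, no hanging."""
    for attempt in range(1, max_retries + 1):
        elapsed = flaky_dependency()
        if elapsed <= timeout_s:
            return f"success on attempt {attempt}"
        # Classify the slow response as a timeout and back off before retrying,
        # rather than retrying aggressively or waiting forever
        time.sleep(0.05 * attempt)
    return "gave up cleanly after retries (raise an alert, don't hang or leak resources)"

print(call_with_timeout_and_retry())
```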
https://www.gremlin.com/community/tutorials/chaos-engineering-the-history-principles-and-practice/ 133
Product-Based Software Reliability Assessment
Various product-based methods may be used to assess the contribution of software to system
reliability; these make use of three main types of information: software properties, fault data, and
failure data.
• Software Properties: Assessment based on software product properties analyses the form, structure, content and
complexity of the software itself. Knowledge of the software and system structure together with estimates of the reliability of
individual parts can be used to produce a combined assessment of total software and system reliability.
• Fault Data: Fault data, like software property data, can be available early in software development, and has been found
useful by many organisations for predicting the fault and failure levels likely to be experienced later in development, and in
the use of the software following delivery. E.g., an abnormally high number of faults found in design review of a software
component could result from a complex design, which could in turn lead to further problems at later stages. The use of fault
data methods is dependent on the availability of data from previous projects.
• Failure Data: Once the software has reached the stage of being executable, statistical methods can be used to assess
current reliability and predict future reliability growth from records of system failure and the extent of use of the system.
This can be done during UAT/FAT and also in service. Failure data methods can only be used when a system already exists
and is exhibiting failure. They are therefore unsuitable during the early life of a system and on very highly reliable systems
since the failure sample is likely to be too small to permit meaningful analysis. This is one of the reasons that confidence
building throughout development is so important.
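As an illustration of such statistical reliability-growth assessment (the slide does not prescribe a particular model, so this is only an example), the Goel–Okumoto NHPP model expresses the expected cumulative number of failures observed by time t as:

```latex
\mu(t) = a\left(1 - e^{-bt}\right), \qquad
\lambda(t) = \frac{d\mu(t)}{dt} = a\,b\,e^{-bt}
```

Here a is the expected total number of faults in the code and b the per-fault detection rate; both are estimated from the recorded failure times.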
134
IT System Reliability Considerations
A further complexity with IT system reliability is that it consists of several layers of architecture – each
which in their own right can significantly impact reliability. This is captured in the OSI Reference Model.
• The upper layers of the OSI model deal with application
issues and generally are implemented only in software.
The highest layer, application, is closest to the end user.
Both users and application-layer processes interact with
software applications that contain a communications
component. The term upper layer is sometimes used to
refer to any layer above another layer in the OSI model.
• The lower layers of the OSI model handle data transport
issues. The physical layer and data-link layer are
implemented in hardware and software. The other lower
layers generally are implemented only in software. The
lowest layer, the physical layer, is closest to the physical
network medium (the network cabling, for example), and is
responsible for actually placing information on the medium.

135
IT Application Architecture
Reliability of IT Applications will also be affected by the various levels of the IT application architecture.

E.g. poor process flow through the system can result in slow end-user response when performing transactions in the system.

Other areas of consideration that will affect application reliability would also be:
• Business Architecture
• Data Architecture
• Technology Architecture

136
Engineering Software Configuration Management
The configuration management of engineering software is of crucial importance. Configuration
management is the set of procedures and practices concerned with keeping track of the modification
of the software system and its components.
• It is necessary to be able to identify clearly each software entity whose reliability is to be measured.
• In the course of development, each sub-system or module will be modified repeatedly, and several versions of each
component will exist. It is important to know which version of each component is included in a given version of the total
system.
• It is essential to know which version of a system has been released for use, and to ensure that it incorporates the
appropriate version of each component and the corresponding version of documentation.
• An important consideration is that the reliability of a system will usually change when it is modified, so that any
measurement of the reliability of a system applies to a specific version.
• It is therefore necessary to define a baseline version (e.g. V5.R7) whose identity only changes following a substantial
modification. Effectively the version identifier has two parts, the baseline identifier (In the Example it is Version 5) and a
further identifier (R7 in the example) to depict the detailed modification state. It is important that the identification of
baselines is kept simple to permit easy referencing, e.g. in failure reports.
• Engineering judgement should be used to determine what constitutes a baseline change. The modification of a large
part of the software or the addition of new modules to enhance its function would almost certainly necessitate a new
baseline.
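As a small illustration of the two-part identifier described above (baseline plus detailed modification state), a parsing sketch follows; the "V5.R7" convention is taken from the example in the text, while the function names and regular expression are assumptions for illustration.

```python
import re
from typing import Tuple

def split_version(identifier: str) -> Tuple[str, str]:
    """Split an identifier such as 'V5.R7' into (baseline, modification state)."""
    match = re.fullmatch(r"(V\d+)\.(R\d+)", identifier.strip())
    if not match:
        raise ValueError(f"Unrecognised version identifier: {identifier!r}")
    return match.group(1), match.group(2)

def needs_new_baseline(old: str, new: str) -> bool:
    """Reliability measurements apply per version; a baseline change resets them."""
    return split_version(old)[0] != split_version(new)[0]

print(split_version("V5.R7"))                # ('V5', 'R7')
print(needs_new_baseline("V5.R7", "V5.R8"))  # False - same baseline, new mod state
print(needs_new_baseline("V5.R8", "V6.R1"))  # True  - substantial modification
```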
137
Preventing Software Errors
Mistakes, faults and Errors - Relationship
Many failures of modern complex digital
systems are associated with software failures.
Mistake: A human error by performing (or not) a required
action at a specific point in time in the activity sequence,
leading to a non-conformance between input product and
output product.
Fault: A design fault located in a software component. A
software fault remains latent until a particular combination
of inputs, operator actions, other environmental
circumstances and internal states referred to as the trigger
coincide during test, trial or
operation and activate the fault.
Error: A software bug is an error, flaw, failure or fault in a
computer program or system that causes it to produce an
incorrect or unexpected result, or to behave in unintended
ways.
Failure: System failure due to the activation of a design
fault in a software component – thus an event of an item
ceasing to perform a required function or provide a
required service in full or in part.

139
Software Failure Characteristics
• They are due to latent design faults in software components of the system. These design faults are
caused by human error during development or maintenance of the system.
• They are transient. If the trigger is removed the system can recover and resume normal service.
• They are systematic since until the latent fault is removed by corrective maintenance the system will fail
again in a similar mode whenever the trigger is encountered.
• They are random since the trigger for each fault is encountered at random.
• They tend to be infrequent since the trigger is usually a very rare combination of operating circumstances.
• Their modes and effects tend to be unpredictable so that it is difficult to provide fault-tolerant design
features or safety devices to guard against them automatically.
• Although they tend to be infrequent their consequences can be catastrophic.
• Software failures are systematic because they can be reproduced at will by replicating the trigger.
• Systematic failures are often considered to be purely deterministic.
• IT system physical hardware failures generally occur at random and hardware reliability can therefore be
measured using a probabilistic or stochastic approach, e.g. by estimating a mean time to failure (MTTF)
or failure rate for the stochastic process of failure.
• Software reliability can be measured in the same way only if software failures can be considered to occur
randomly. Activation of any given software fault is a unique event in the operational life of the system.
Therefore, in order to describe software failure as a random process, the Bayesian approach is appropriate (a small illustrative sketch follows this slide).
Debugging frequently involves reproducing a failure condition by deliberately subjecting the system to the same operating conditions that were established when a failure was observed during operation.
140
https://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/
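To make the Bayesian idea concrete, the following is a small illustrative sketch (not taken from the lecture) of a conjugate Gamma–Poisson update of a software failure rate from observed operating experience. The prior parameters and observation figures are invented for illustration only.

```python
# Bayesian (Gamma-Poisson) update of a software failure rate.
# Prior belief about the failure rate lambda: Gamma(shape=a0, rate=b0).
a0, b0 = 2.0, 1000.0           # prior: roughly 2 failures per 1000 h (invented)

# Observed operating experience on the current baseline (invented figures).
failures_observed = 3
hours_observed = 4000.0

# Conjugate update: posterior is Gamma(a0 + k, b0 + T).
a1, b1 = a0 + failures_observed, b0 + hours_observed

posterior_mean_rate = a1 / b1                      # failures per hour
print(f"posterior mean failure rate ~ {posterior_mean_rate:.5f} per hour")
print(f"implied MTTF ~ {1.0 / posterior_mean_rate:.0f} hours")
```

As more operating hours and failure reports accumulate, the posterior progressively dominates the prior, which is the sense in which software failure can be treated as a random process even though each underlying fault is systematic.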
Physical vs Design Failures
Complex systems can fail for two fundamentally different reasons.
• Physical failure: a hardware component fails, for example a resistor ‘shorts’ or a logic gate ‘sticks’. After its
individual failure, the component is faulty; there is a fault in the system. Repair consists of replacing the
faulty component to restore the system to its previous functioning state. After repair the system should continue to
function and should not necessarily fail again on encountering the same circumstances as those that led to the
previous failure. From a statistical viewpoint, failures of this kind are often called “random failures”.
• Design failure: a fault in the design of the system is activated in response to certain conditions. The fault may
have been present for some time, although latent. Repair consists of a modification to the design of the system to
remove the fault. Although such “corrective maintenance” may introduce new faults, it generally improves the
design (and increases the reliability) of the system. This kind of failure is often called “systematic” because unless
the design is changed to remove the fault, the same failure will recur if the same circumstances arise.
Failures in manufacture are normally attributable to hardware and are therefore generally not
considered in the context of software.
A software failure is a system design failure due to a fault that is located in a software component.
Since software is a part of the design of the system (although once compiled and loaded, it has a
physical representation), it can only undergo design failure.
Many failures of modern complex digital systems are associated with software failures.

141
Typical Software Errors
Software System Design:
• Errors can occur as a result of incorrect interpretation of the specification, or incomplete or incorrect logic.
• Errors can also occur if the software cannot handle data inputs that are incorrect but possible, such as
missing or incorrect bits.
• An important reliability feature of software system design is robustness, the term used to describe the
capability of a program to withstand error conditions without serious effect, such as becoming locked in a
loop or ‘crashing’.
• The robustness of the program will depend upon the design, since it is at this stage that the paths to be
taken by the program under error conditions are determined.

Software Code Generation:


Code generation is a prime source of errors, since a typical program involves a large number of code statements. Changes to code can have dire consequences: the likelihood of injecting new faults can run as high as 50 %, and tends to be highest for small code changes (a defensive-coding sketch addressing two of the errors below follows this slide). Typical errors can be:
• Typographical errors.
• Incorrect numerical values, for example, 0.1 for 0.01.
• Omission of symbols, for example, parentheses.
• Inclusion of variables which are not declared, or not initialized at the start of program run.
• Inclusion of expressions which can become indeterminate, such as division by a value which can become zero.
• Accidental shared use of memory locations.

142
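As a purely illustrative sketch (an assumed example, not taken from the slides), the function below shows the kind of defensive coding that guards against two of the errors listed above: missing or incorrect input data, and an expression that can become indeterminate through division by zero. The function name and range limits are hypothetical.

```python
from typing import Optional

def scale_reading(raw: Optional[float], span: float) -> float:
    """Convert a raw sensor reading to percent of span, defensively."""
    # Guard against missing input (the "missing or incorrect bits" case).
    if raw is None:
        raise ValueError("no reading supplied")
    # Guard against an expression that can become indeterminate:
    # the divisor might (incorrectly) be configured as zero.
    if span == 0:
        raise ValueError("span must be non-zero")
    value = 100.0 * raw / span
    # Reject obviously corrupt values instead of propagating them.
    if not 0.0 <= value <= 100.0:
        raise ValueError(f"reading out of range: {value:.1f} %")
    return value
```

Raising an explicit, descriptive error rather than continuing with a corrupt value is one simple expression of the robustness property described on the previous slide.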
Software Error Propagation
This example shows how a system
invokes a sub-system environment (SS1),
which in turn invokes a system
module (M1.2).
If M1.2 contains a latent error, it will be activated
under a particular set of circumstances. It will
then be unable to perform its required function
on behalf of SS1.
This results in an internal environment state fault
in SS1, and SS1 cannot execute the instruction
issued from system level.
If the error propagates to system level, it will
result in the system having an interface error
between itself and its sub-system environments.
If the system encounters the same operational
circumstances again a similar failure will recur,
unless the fault from module M1.2 is removed.

This example assumes that the system is not fault tolerant. It might be possible to design a system so that it can detect and contain internal errors and recover from them automatically, so preventing a system-level failure although local failures of one or more components might occur.
143
Preventing Software Errors
Specification Errors: Typically more than half the errors recorded during software development
originate in the specification. Typical criteria to be considered are:
• The software specification must describe fully and accurately the requirements of the program.
• The program must reflect the requirements exactly.
• There are no safety margins in software design as in hardware design.
• The specification must be logically complete.
• A software specification must cover all the possible input conditions and output requirements, and
this usually requires much more detailed consideration than for a hardware specification.
• The specification must be consistent. It must not give conflicting information or use different
conventions in different sections.
• The specification must not include requirements that are not testable, for example, accuracy or speed
requirements that are beyond the capability of the hardware.
• The specification should be more than just a description of program requirements. It should describe
the structure to be used, the program test requirements and documentation needed during
development and test, as well as basic requirements such as the programming language, and inputs
and outputs.

144
Improving Software Reliability
Structure:
• Structured programming is an approach that constrains the programmer to using certain clear, well-defined
approaches to program design, rather than allowing total freedom to design ‘clever’ programs which might be
complex, difficult to understand or inspect, and prone to error.
• Structured programming leads to fewer errors, and to clearer, more easily maintained software. On the other
hand, structured programs might be less efficient in terms of speed or memory requirements.
Modularity:
• Modular programming breaks the program requirement down into separate, smaller program requirements,
or modules, each of which can be separately specified, written and tested.
• The overall problem is thus made easier to understand and this is a very important factor in reducing the
scope for error and for easing the task of checking.
• The separate modules can be written and tested in a shorter time.
• The optimum size of a module depends upon the function of the module and is not solely determined by the
number of program elements. The size will usually be determined to some extent by where convenient
interfaces can be introduced.
Replication:
• Sometimes existing software, for example from a different or previous application, can be used, rather than
having to write a new program or module.
• This approach can lead to savings in development costs and time – but it is important to test its use to be
“fit for purpose” for the new requirements.
145
Improving Software Reliability (Cont.)
Programming Style:
• Programming style is an expression used to cover the general approach to program design and coding.
• A disciplined programming style can have a great influence on software reliability and maintainability, and it
is therefore important that style is covered in software design guides and design reviews, and in
programmer training.
Fault Tolerance:
• Programs can be written so that errors do not cause serious problems or complete failure of the program.
• A program should be able to find its way gracefully out of an error condition and indicate the error source.
• Where safety is a factor, it is important that the program sets up safe conditions when an error occurs.
Redundancy:
• Fault tolerance can also be provided by program redundancy.
• For high integrity systems separately coded programs can be arranged to run simultaneously on separate
but connected controllers, or in a time-sharing mode on one controller.
• A voting or selection routine can be used to select the output to be used; this approach is also called
program diversity (a minimal majority-voting sketch follows this slide).
Language:
• The selection of the computer language to be used can affect the reliability of software.
• Programmable logic controllers (PLCs) are often used in place of processors. PLC-based systems also
avoid the need for the other requirements of processor-based systems, such as operating system
software, memory, and so on, so they can be simpler and more robust, and easier to test.
146
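The majority-voting idea mentioned under Redundancy can be sketched as follows. This is a deliberately simplified, hypothetical illustration of program diversity, not a high-integrity implementation; the tolerance, function names and the three example "versions" are all invented.

```python
from typing import Sequence

def voted_output(results: Sequence[float], tol: float = 1e-6) -> float:
    """Return the value agreed by a majority of redundant program versions."""
    for candidate in results:
        agreeing = [r for r in results if abs(r - candidate) <= tol]
        if len(agreeing) > len(results) // 2:
            return candidate
    raise RuntimeError(f"no majority among versions: {results}")

# Three diversely implemented versions of the same requirement (invented).
v1 = lambda x: x * x
v2 = lambda x: x ** 2
v3 = lambda x: 0.0          # version containing a latent fault

x = 3.0
print(voted_output([v1(x), v2(x), v3(x)]))   # 9.0 -- the faulty version is out-voted
```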
Data Reliability and Integrity

“Without data, you’re just another person with an opinion.” – W.E. Deming
The Importance of Data to Make Decisions
[Figure: data value pyramid – from discrete data of limited value at the bottom (DCS and historians: data capture and consolidation, post-event analysis; system and condition monitoring: plant and system health), through plant performance analytics (applied information and basic analytics), engineering applications, expert systems and advanced analytics for plant and process optimisation, business intelligence (combined information from multiple business information sources) and visualisation/system dashboards (high-level picture of business analytics), up to actionable information and data interpretation (KPIs) for asset and fleet optimisation.]
The data model shows how asset data is handled at discrete levels at PLC/SCADA system level, and how its value increases to information and knowledge as it is further used, analysed and interpreted for business decision making.
148
Data vs. Knowledge
The DIKW pyramid refers loosely to a class of models for representing purported
structural and/or functional relationships between data, information, knowledge, and
wisdom. "Typically information is defined in terms of data, knowledge in terms of
information, and wisdom in terms of knowledge"
The following example describes a military decision support system, but the
architecture and underlying conceptual idea are transferable to other
application domains:

• The value chain starts with data quality describing the information within
the underlying command and control systems.
• Information quality tracks the completeness, correctness, currency,
consistency and precision of the data items and information statements
available.
• Knowledge quality deals with procedural knowledge and information
embedded in the command and control system such as templates for
adversary forces, assumptions about entities such as ranges and
weapons, and doctrinal assumptions, often coded as rules.
• Awareness quality measures the degree of using the information and
knowledge embedded within the command and control system. Awareness
is explicitly placed in the cognitive domain.
Source: https://en.wikipedia.org/wiki/DIKW_pyramid & US Army Knowledge Managers 149
Industrial Internet – A Business Model Game Changer

Inherent to the new Industrial Internet is the design of an IoT infrastructure that is highly
reliable and interconnected.

Data quality and availability is also deemed mission critical for success in the new emerging
economy.

Gartner further identifies the emergence of an entirely new class of business models based on smart machine technologies, advanced analytics and big data, and views this as one of the big “game changers” for all businesses operating in the new industrial revolution space of the “Internet of Things (IoT)”.
150
Data Handling and Data Integration
• Every plan, process and product has technical data associated with it.
• Technical data includes engineering data, product data, contract data, and logistics data.
• Technical data management includes identification and control of data requirements; the timely
and economical acquisition of all system/product related data; the assurance of the adequacy of
data for its intended use; distribution or communication of the data to the point of use; and actual
data analysis.
• The integration of technical data into all aspects of the system/product configuration management
program occurs both because of the efforts of program managers and technical experts. The
challenge is to ensure that technical data is appropriately and correctly acquired, shared, used,
and disposed of.
• Typical technical data integration activities would be:
• The development of a technical data rights strategy with specific focus on intellectual property (IP) rights;
• Attention to security and access of technical data – both to prevent unauthorised usage and to ensure personnel with
the need for access the information have the correct user rights and access control rights in the information
system(s);
• Processes to integrate engineering data with logistics data to allow for feedback on operational and support
information, as well as on failures of the product/system occurring in the field;
• Establishing procedures to integrate the program’s performance based life cycle metrics to the appropriate technical
data which can be used to improve outcomes.
151
Data Reliability in Software Systems
Data reliability (or information integrity) is an important aspect of the reliability of software-based
systems. When digitally coded data are transmitted, there are two sources of degradation:
• The data might not be processed in time, so that processing errors are generated. This can arise,
for example, if data arrive at a processing point (a ‘server’, e.g. a microprocessor or a memory
address decoder) at a higher rate than the server can process.
• The data might be corrupted in transmission or in memory by digital bits being lost or inverted, or
by spurious bits being added. This can happen if there is noise in the transmission system, for
example from electromagnetic interference, or defects in memory (a simple checksum sketch follows this slide).

CARAT Principle for Data


• Complete: Completeness is routinely monitored informally as well as more formally – the intent is to ensure that all meta-data and
records fields are completed.
• Available (Accessible): This is a measure of how easy it is to access or retrieve data.
• Relevant: Evaluating data that is relevant to the purpose it is intended to be used for.
• Accurate (Valid): This is usually achieved by auditing data against intended use. This requires physical examination of data records.
• Timeous: Measuring if data can be provided at the point in time it is needed for analysis and decision making.

152
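A simple, assumed illustration of detecting corrupted bits is shown below, using a CRC-32 check appended to each data frame. The framing layout (payload followed by a 4-byte checksum) is an arbitrary choice for the sketch, not a prescribed protocol.

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Append a CRC-32 so the receiver can detect lost, inverted or spurious bits."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check(framed: bytes) -> bytes:
    payload, received = framed[:-4], int.from_bytes(framed[-4:], "big")
    if zlib.crc32(payload) != received:
        raise ValueError("data corrupted in transmission or storage")
    return payload

msg = frame(b"pump 7 running")
print(check(msg))                              # b'pump 7 running'

corrupted = bytes([msg[0] ^ 0x01]) + msg[1:]   # invert a single bit
try:
    check(corrupted)
except ValueError as err:
    print(err)                                  # corruption is detected
```

Checks of this kind detect corruption but do not correct it; error-correcting codes or retransmission would be needed to restore the original data.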
Software Checking and Testing
Software Testing
• The objectives of software testing are to ensure that the system complies with the requirements
and to detect remaining errors.
• Testing that a program will operate correctly over the range of system conditions is an essential
part of the software and system development process.
• Software testing must be planned and executed in a disciplined way since, even with the most
careful design effort, errors are likely to remain in any reasonably large program, due to the
impracticability of finding all errors by checking. Integration testing is essential where the
software comprises various modules, to ensure that the total software solution works as designed.
• Tests to be performed must be selected carefully to verify correct operation under the likely range
of operating and input conditions, whilst being economical.
• The software test process should be iterative while code is being produced. Code should be
tested as soon as it is written, so that errors can be corrected quickly by the programmer
who wrote it; it is also easier to devise effective tests for smaller, well-specified sections of code
than for large programs (a minimal module-level test sketch follows this slide).
154
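As an assumed illustration of testing a small, well-specified section of code as soon as it is written, the sketch below exercises nominal, extreme and ‘must not do’ conditions for a hypothetical clamping function. The function and the chosen limits are invented for the example.

```python
import unittest

def throttle(setpoint: float) -> float:
    """Example module under test: clamp a setpoint to the range 0..100 %."""
    return max(0.0, min(100.0, setpoint))

class ThrottleTests(unittest.TestCase):
    # "Must do": nominal values pass through unchanged.
    def test_nominal(self):
        self.assertEqual(throttle(42.0), 42.0)

    # Operation at extreme input values (boundaries and beyond).
    def test_extremes(self):
        self.assertEqual(throttle(-1e9), 0.0)
        self.assertEqual(throttle(1e9), 100.0)
        self.assertEqual(throttle(0.0), 0.0)
        self.assertEqual(throttle(100.0), 100.0)

    # "Must not do": never return a value outside the legal range.
    def test_never_out_of_range(self):
        for sp in (-5.0, 0.0, 50.0, 100.0, 105.0, float("inf")):
            self.assertTrue(0.0 <= throttle(sp) <= 100.0)

if __name__ == "__main__":
    unittest.main()
```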
Test Criteria and Methods
• The software tests must include:
• All requirements defined in the specification (‘must do’ and ‘must not do’ conditions).
• Operation at extreme conditions (timing, input parameter values and rates of change, memory utilization).
• Ranges of possible input sequences.
• Fault tolerance (error recovery).
• The test specifications (for modules, integration/verification, validation) must state every test condition to be applied, and the
test reports must indicate the result of each test.
• Formal configuration control should be started when integration testing commences, to ensure that all changes are
documented and that all program copies at the current version number are identical.
• Software can be tested at different levels:
• White box testing involves testing at the detailed structural level, for aspects such as data and control flow, memory
allocation, look-ups, etc. It is performed on modules or small system elements to demonstrate correctness at these levels.
• Verification is the term sometimes used to cover all testing in a development or simulated environment, e.g. using a host or
lab computer. Verification can include module and integration testing, and is often referred to as Factory Acceptance Testing (FAT).
• Validation or black box testing covers testing in the real environment, including running on the target computer and using the
operational input and output devices, other components and connections. Validation is applicable only to integration testing,
covers the hardware/software interface aspects, and is often referred to as User Acceptance Testing (UAT).
155
Use of Software Sneak Analysis (SA)
• Most software sneak patterns are related to branching instructions, such as GOTO or IF
THEN/ELSE statements. The conditions leading to and deriving from such statements, as well as
the statements themselves, are important clues in the SA.
• Six basic sneak patterns exist, as indicated:

Software sneak conditions are:
1. Sneak output: the wrong output is generated.
2. Sneak inhibit: undesired inhibit of an input or output.
3. Sneak timing: the wrong output is generated because of its timing or incorrect input timing.
4. Sneak message: a program message incorrectly reports the state of the system.

156
Social Engineering Considerations
The biggest and most under-acknowledged threat to data security is a phenomenon called social engineering.
This use of emotional manipulation has increased cyber criminals’ access to sensitive data. In fact, 12
people every second fall victim to cyber-crime, according to Microsoft. And in 2016, 43% of documented
breaches involved social engineering attacks, as reported in the 2017 Verizon DBIR.
Andersson and Reimers (2014) found that employees often do not
see themselves as part of the organization's Information Security
"effort" and often take actions that ignore the organization's
information security interests. Research shows that an information
security culture needs to be improved continuously. To manage the
information security culture, five steps should be taken and should
be part of the software design and testing process:
• Pre-Evaluation: identify the level of information security awareness among
employees and analyse the current security policy.
• Strategic Planning: devise a better awareness programme and set clear
targets; grouping (clustering) people helps to achieve this.
• Operative Planning: establish a good security culture based on internal
communication, management buy-in, and a security awareness and training
programme.
• Implementation: four stages should be used to implement the information
security culture: commitment of management, communication with
organizational members, courses for all organizational members, and
commitment of the employees.
• Post-Evaluation: assess the success of the planning and implementation,
and identify unresolved areas of concern.

157
Social Engineering Considerations
Organizations can reduce their security risks by:

• Standard Security Framework: Establishing frameworks of trust on an employee/personnel level (i.e., specify
and train personnel when/where/why/how sensitive information should be handled)
• Scrutinizing Information: Identifying which information is sensitive and evaluating its exposure to social
engineering and breakdowns in security systems (building, computer system, etc.)
• Security Protocols: Establishing security protocols, policies, and procedures for handling sensitive information.
• Training to Employees Training employees in security protocols relevant to their position. (e.g., in situations
such as tailgating, if a person's identity cannot be verified, then employees must be trained to politely refuse.)
• Event Test: Performing unannounced, periodic tests of the security framework.
• Inoculation: Preventing social engineering and other fraudulent tricks or traps by instilling a resistance to
persuasion attempts through exposure to similar or related attempts.
• Review: Reviewing the above steps regularly - no solutions to information integrity are perfect.
• Waste Management: Using a waste management service that has dumpsters with locks on them, with keys to
them limited only to the waste management company and the cleaning staff. Locating the dumpster either in
view of employees so that trying to access it carries a risk of being seen or caught, or behind a locked gate or
fence where the person must trespass before they can attempt to access the dumpster.
158
Source: https://en.wikipedia.org/wiki/Social_engineering_(security)
Software Reliability Prediction and Measurement
Software Reliability Challenges
• The reliability of a program depends not only upon whether or not errors exist but upon the probability that an
existing error will affect the output, and the nature of the effect.
• A well-structured modular program is much easier to check and test, and is less prone to error in the first
place, than an unstructured program designed for the same function.
• The biggest challenge with software reliability modelling is the fact that errors can originate in the
specification, the design and the coding.
• The logical limitations inherent in the prediction of reliability are even more severe with software and
programs, since there are no physical or logical connections between past data sets and future expectations,
as there would be with many hardware failure modes.
• Very few reliability calculation methods have been generally accepted or standardised by the software
engineering community.
• Software reliability growth models have been shown to yield accurate estimates of failure rate provided the
following conditions are satisfied:
• a) the environment does not change over time;
• b) a reasonably large number of failures is observed (implying a significant failure rate).
The estimates can be improved by using recalibration to remove bias
160
Software Properties affecting Reliability
Software product properties describe the state of a system at any point in time without regard as to
how it was achieved.
Typical examples of software product properties expected to affect the software reliability are:
• code size;
• degree of conformance to accepted or predefined notions of good structure;
• type of software, such as that constrained to real time operation;
• language characteristics.

• Data from above examples are used by software product property models to estimate the level
of reliability likely to be achieved by a given piece of software.
• Process-based assessment of software reliability aims to use information available at whatever
stage of the software development cycle has been reached, to determine the confidence (or
otherwise) in the reliability of the software being developed.

For application and selection criteria in using the different models, refer to BS 5760: Part 8: 1998 161
Data used for Reliability Assessment
• The data collection to be carried out should be derived from specific assessment objectives established
for each project, and wherever possible in accordance with existing practices and procedures.
• The methods of assessment to be applied should be chosen in advance of setting up any data
collection programme. The measures that are to be evaluated should be defined precisely. These will
usually be indirect measures. The direct measures from which they are derived should be determined
from their definitions. These in turn will determine what raw data is required.
• Good practice dictates that the overall control of data collection is co-ordinated to ensure consistency of
approach and quality management of data. Data should be collected automatically whenever possible.
• Data required will typically include:
• Failure and fault data, including times and circumstances of failure and details of actual or potential consequences;
• Process data, such as methods and techniques used and resources employed;
• Product data, i.e. information about size, structure, languages used and the version which failed, for example.
• Data will have many sources, including personnel employed at various stages of development, existing
management systems and system users (i.e. those who fill in failure and fault reports).

For application and selection criteria in using the different models, refer to BS 5760: Part 8: 1998 162
Software System Reliability Data to be collected
To estimate software reliability using stochastic reliability models the following types of data should
be collected.
• Products: The identity and baseline version of each software item to be assessed
• Installations: A record of each physical set of hardware equipment on which a copy of the software to be
assessed is being executed.
• Failures: A record of every occasion on which the system departed from its required behaviour.
• Faults: A record of every software fault which has been detected.
• Changes: A record of every modification made to each software item either to remove a fault or for other
purposes.
• System use: A record of the amount of use of each software item on each installation.

• Just as with other products and equipment, this information should ideally be stored in a FRACAS
platform for easy data access and retrieval. The same data sets required for general failure
data recording apply here as well (a minimal record-structure sketch follows this slide).

For application and selection criteria in using the different models, refer to BS 5760: Part 8: 1998 163
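A minimal sketch of record types covering the six data categories listed above might look like the following. This is an assumed structure for illustration, not a prescribed FRACAS schema; field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Product:            # software item and its baseline version
    item: str
    baseline: str                      # e.g. "V5.R7"

@dataclass
class Installation:       # physical equipment executing a copy of the item
    site: str
    product: Product

@dataclass
class Fault:              # a detected software fault
    fault_id: str
    description: str

@dataclass
class Failure:            # a departure from required behaviour
    installation: Installation
    occurred_at: datetime
    description: str
    fault_id: Optional[str] = None     # linked once the fault is diagnosed

@dataclass
class Change:             # a modification made to a software item
    product: Product
    new_baseline: str
    reason: str                        # fault removal or other purpose

@dataclass
class UsageRecord:        # amount of use of an item on an installation
    installation: Installation
    execution_hours: float
```

Keeping failures, faults, changes and usage linked back to a specific product baseline is what makes later reliability estimation per version possible.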
Software System Reliability Data – Execution Time
In addition to the typical FRACAS meta-data defined, it is important to capture execution time issues, of which
examples are given below based on the type of system.
• Elapsed time (i.e. real time or wall-clock time): This is meaningful only for software which is in use 24 hours a day, seven days a week. For
software which is in use for a fixed period every day, e.g. during prime shift, elapsed time may be proportional to the amount of use.
• System operating time: for software which is executed the whole time that the installation is in operation, e.g. real-time control software,
embedded software, or operating system software.
• Normalized system operating time: where software is in use on several installations which incorporate hardware processors of different
speeds, the operating time from each installation may be corrected by a factor to take account of the power of the processor.
• Program loaded time: for software which is in use in a single programming environment in which it is intermittently loaded, executed, and
then deleted.
• Processor time: for software in a multi-processing environment, where the operating system has the facility to record the amount of
processor time consumed by each process.
• Hands-on time: for interactive software where idle time is of no interest, or where the class of failures being recorded are those due to
problems experienced by the user as a result of human-computer interface faults.
• Transaction count: for interactive software which sends a response to each query from the user.
• Object instruction count: where a profiling facility is available to record the number of object code instructions executed.
• Source instruction count: where software instrumentation in the form of recording probes compiled into the software at various points has
been implemented, or where the software is run on a test-bed which records the number of source instructions executed.
• Number of demands: for “one-shot” software which is called upon periodically to perform a specific task, e.g. safety protection system
software monitoring an industrial process.

For application and selection criteria in using the different models, refer to BS 5760: Part 8: 1998 164
Software Failure
• Software has no physical parts that can wear out.
• Software failures are therefore simply design failures in which the faults responsible are located in
a software component of the system.
• These faults can remain latent for a very long time until activated by the following specific
conditions:
• the program module containing the fault is being executed;
• some particular subset of all possible inputs is being processed;
• the system is in a particular internal state, i.e. the variables have particular values.
• Knowledge of these precise conditions would allow the failure to be reproduced, and therefore
software failures are systematic failures.
• The conditions are not known in advance, however, and arise randomly with a probability which
depends on the software environment
• It can therefore be argued that software failures in effect constitute a random process, that the
theory of probability can be applied and that the occurrence of failures can be described using the
usual concepts of failure rate, mean time to failure, etc.
165
Software Reliability Techniques
Several techniques are available for the assessment of reliability of software. These are:
• Software development process models
• Inspection statistics
• Qualitative assessment of good practice
• Formal methods
• Software property models
• Software science
• Complexity measures
• Quality factors
• Fault tolerance
• Stochastic reliability models
• General statistical techniques
• Black-box parametric models
• Structural models
• There are two distinct but related problems with very high reliability: how to achieve it and how to know that it
has been achieved. For software neither of these problems is readily resolved at present.

166
Software Reliability Model Selection Criteria
The following criteria are recommended as a basis for evaluating a model for a particular software
development application.
• Predictive accuracy describes the capability of the model to predict future failure behaviour and should be determined
by comparing failure rates and failure intervals predicted by the model with actual values observed.
• Usefulness refers to the ability of the model to estimate quantities needed by managers and engineers in planning and
managing software development projects. The degree of usefulness needs to be assessed from the importance of the
measures provided.
• Quality of assumptions implicit in the model should be checked by determining the degree to which it is supported by
actual data. The clarity and explicitness of an assumption should be judged to determine whether a model applies to
particular circumstances.
• Applicability indicates the potential for use of the model across a range of different systems and different development
environments. If a model gives outstanding results but for only a narrow range of systems or environments the model
should not necessarily be discounted.
• Simplicity of a model has three aspects. The most important is that it should be simple to collect the data that is required
for the model. Secondly, the model should be simple in concept so that personnel without extensive mathematical
backgrounds are able to understand the nature of the model and its assumptions. Finally a model should be easy to
implement as a program that is a practical management and engineering tool.
For application and selection criteria in using the different models, refer to BS 5760: Part 8: 1998 167
Software Reliability Assessment Methods
• Fault Density: Fault density is a product metric usually expressed as faults per thousand lines of code, or
faults/KLOC. Given that the size of the software is known, such a measure is equivalent to an estimate of the
total number of faults in the software (a small worked example follows this slide).
• Complexity Metrics: Various methods have been proposed for measuring the complexity of software and
relating these to reliability (Halstead’s Software Science and McCabe’s cyclomatic complexity measure). Both
methods have generally been found to be disappointing as accurate predictors of software reliability.
• RBDs: Reliability block diagrams are used to assess the reliability of a system from a knowledge of the reliabilities of
its components. It is essential that software components are not omitted from such an analysis.
• FMEA/FMECA: Software components can be treated as black boxes or, in more detailed analysis, modules
within a software subsystem can be dealt with individually. The methods investigate the effect on the system
of certain fault modes of each component.
• FTA: FTA can be used to identify single point failures of the system. Safety-critical systems should generally
be designed in such a way that software does not lead to hazardous conditions.
• Markov Technique: Structural software reliability models are an application of Markov techniques. They
assume that software consists of a series of modules, and that control is passed from module to module in the
course of executing the program.

168
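A worked example of the fault density metric is shown below. The figures are invented, and the helper function is simply an illustration of the faults/KLOC arithmetic.

```python
def fault_density(faults_found: int, lines_of_code: int) -> float:
    """Faults per thousand lines of code (faults/KLOC)."""
    return 1000.0 * faults_found / lines_of_code

# Invented figures: 92 faults recorded against a 46 000-line application.
kloc_density = fault_density(92, 46_000)
print(f"{kloc_density:.1f} faults/KLOC")          # 2.0 faults/KLOC

# If the size is known, the same figure gives a crude estimate of the total
# fault content of a proposed 20 000-line module of similar quality.
print(f"expected faults in new module ~ {kloc_density * 20:.0f}")
```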
Stochastic Reliability Models
Stochastic reliability models can be classified according
to the assumptions made regarding the mechanism of
failure and the ways in which the process of failure is
modelled mathematically.
• General statistical techniques are general purpose methods
of statistical analysis, and can be applied to many types of
data, not solely to failure histories. They make no assumptions
about the mechanism of failure, and so do not really qualify as
`models'.
• Black-box parametric models disregard the internal
structure of the system and represent only its externally
observable failure behaviour. They model the mechanism of
failure using formulae which incorporate parameters, i.e.
unknown quantities which are estimated from the failure data.
• Structural models mathematically represent the internal
interactions between components of the system and combine
their individual levels of reliability to estimate total system
reliability. Since structural models treat the components as
black boxes, they are dependent on black-box models (a Monte Carlo sketch of this idea follows this slide).

For more information on Stochastic Reliability Model methods and procedures, refer to BS 5760: Part 8: 1998. An extensive list of models and methods, with their advantages and disadvantages, is covered in this standard.
169
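As an illustrative Monte Carlo sketch of the structural-model idea (control passed from module to module under a Markov assumption), the snippet below simulates program runs and estimates the probability that a run completes without activating a fault. All transition probabilities and per-visit module reliabilities are invented; this is not a method prescribed by BS 5760.

```python
import random

# Control-flow transition probabilities between modules; "END" is successful
# termination. Per-visit reliability of each module is given separately.
transitions = {
    "A": [("B", 0.7), ("C", 0.3)],
    "B": [("C", 0.6), ("END", 0.4)],
    "C": [("END", 1.0)],
}
module_reliability = {"A": 0.999, "B": 0.995, "C": 0.998}

def run_once(start: str = "A") -> bool:
    """Simulate one program run; return True if it completes without failure."""
    state = start
    while state != "END":
        if random.random() > module_reliability[state]:
            return False                      # module failed during this visit
        r, cum = random.random(), 0.0
        for nxt, p in transitions[state]:
            cum += p
            if r <= cum:
                state = nxt
                break
    return True

runs = 100_000
reliability = sum(run_once() for _ in range(runs)) / runs
print(f"estimated system reliability per run ~ {reliability:.4f}")
```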
Software Reliability Calculation Methods
Some reliability methods have been proposed but have not yet been widely accepted or standardised (a sketch of a simple execution-time model follows this slide):
• The Poisson Model (Time-Related): It is assumed that errors can exist randomly in a code structure and
that their appearance is a function of the time the program is run. Since software errors are not time-related in
the way that physical (hardware) failure processes are, the use of time-related models for software errors is
problematical.
• The Musa Model: The Musa model uses program execution time as the independent variable.
• The Jelinski–Moranda and Schick–Wolverton Models: Exponential-type models focussing on a hazard
function.
• Littlewood Models: Littlewood attempts to take account of the fact that different program errors have
different probabilities of causing failure.
• Point Process Analysis: Since a program can be viewed as a repairable system, with errors being detected
and corrected in a time continuum, the method of point process analysis can be applied to software reliability
measurement and analysis. (This may be the most feasible method to use)
• Musa et al. (1987) is a good reference for more information on software reliability prediction and measurement.
Poisson Model: https://www.youtube.com/watch?v=U-bMvccsn08
https://www.youtube.com/watch?v=5fuJhPutUas 170
https://www.youtube.com/watch?v=XD_NS-tfjRg
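Purely as an illustration of the execution-time approach (and not a method prescribed by the lecture), the sketch below fits an exponential NHPP mean-value function of the form mu(t) = a(1 − e^(−bt)) to an invented failure history and derives the current failure intensity. The data values and starting parameters are made up for the example; scipy's curve_fit performs the fit.

```python
import numpy as np
from scipy.optimize import curve_fit

# Exponential NHPP mean-value function: expected cumulative failures after
# execution time t, with a = total expected faults and b = detection rate.
def mu(t, a, b):
    return a * (1.0 - np.exp(-b * t))

# Invented failure history: cumulative failures observed at execution times (hours).
t_obs = np.array([10, 25, 40, 60, 90, 130, 180, 240], dtype=float)
n_obs = np.array([ 4,  9, 13, 17, 21,  24,  26,  27], dtype=float)

(a_hat, b_hat), _ = curve_fit(mu, t_obs, n_obs, p0=(30.0, 0.01))

intensity_now = a_hat * b_hat * np.exp(-b_hat * t_obs[-1])   # failures per hour
print(f"estimated total faults a = {a_hat:.1f}")
print(f"faults remaining ~ {a_hat - n_obs[-1]:.1f}")
print(f"current failure intensity ~ {intensity_now:.4f} per hour "
      f"(MTTF ~ {1.0 / intensity_now:.0f} h)")
```

Note the caveats from the earlier slides: such estimates are only meaningful if the operating environment stays stable and a reasonably large number of failures has been observed.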
Hardware and Software Interfaces
Challenges due to Hardware/Software Interfaces
• Failures can occur which are difficult to diagnose to hardware or software causes, due to interactions between
the two system elements.
• Hardware meets software most closely within electronic devices such as processors and memories. A failure
of a memory device, regardless of the input, can cause failures which appear to be due to software errors.
• If the program is known to work under the input conditions, electronic fault-finding techniques can be used to
trace the faulty device.
• There are times, particularly during development, when the diagnosis is not clear-cut, and the software and
hardware both need to be checked. Timing errors, either due to device faults or software errors can also lead
to this situation.
• Memory devices (also called “firmware”) of all types, whether optical or magnetic media or semiconductor
memory devices, can cause system failures. Firmware failures can lead to system failures which occur only
under certain operating conditions, thus appearing to be due to software errors.
• Redundancy can be provided to data or logic held in memory by arranging for redundant memory within the
operating store or by providing independent, parallel memory devices. The program logic then has to be
designed to store and access the redundant memory correctly, so the program becomes more complex.

172
IT Hardware and Software Lifecycle Management Considerations
Software/Application Lifecycle Considerations
• Introducing and implementing Integrated ALM in development organisations is not easy for
following reasons:
• Multi-vendor tools use various technologies such as command line interface, desktop application, client-server, or web
based that run on different platforms such as Windows, Linux, and UNIX.
• Software being produced uses a wide range of technologies such as .NET based desktop application, Java based web
application, or a COBOL based mainframe application.
• Tools use various data repositories such as proprietary file structures, XML, Excel, or relational databases of various
flavours.
• Tools are geographically distributed as the development groups and team members from multiple corporate entities
are isolated.
• Using integration middleware does away with complex and costly integrations, overcoming the
limitations of point-to-point and single vendor tools integration.
• It not only increases connectivity and adds flexibility to gain better control of the applications, but also provides a user
with codeless configuration facility.
• It also provides better tool accessibility and re-configurability, future tool enhancements, seamless integration flow,
lightweight integration, open messaging models, and easy plug-in and plug-out integration services in an SOA framework.

174
Software/Application Lifecycle Considerations

• Application lifecycle management (ALM) is the product lifecycle management (governance,


development, and maintenance) of computer programs. It encompasses requirements
management, software architecture, computer programming, software testing, software
maintenance, change management, continuous integration, project management, and release
management.
• Real-time collaboration, access to a centralized data repository, cross-tool and cross-project visibility,
and better project monitoring and reporting are key to developing quality software in less time.
• Integrated application lifecycle management is a totally integrated set of tools and processes that
help organizations manage the entire lifecycle of an application development. It connects different
teams, activities, platforms, tools, and processes involved in a software development project.
• The three aspects of ALM (governance, development and operations), when connected to each
other, maximize the business value of software.

175
Typical IT Hype Cycles

The ICT environment is fast becoming one of the technology spaces where technology redundancy (obsolescence) can happen much faster than
anticipated. The customer base and its level of “tech savvy” also allow very little opportunity to leave known reliability and
product usage issues unaddressed before a product is released to customers.

176
Gartner Data Science Hype Cycle
IT Hardware Lifecycle Considerations
• ITAM business practices are process-driven and matured through
iterative and focused improvements.
• Having Hardware Asset Management processes in place can save
an organisation a fortune, both on hardware and subsequent
software. Actively and correctly managing hardware assets
throughout their lifecycle can reduce the amount of
money spent on the hardware during that lifecycle.
• Reduced maintenance costs
• Reduced hardware budgets
• Reduced software spend
• Potential to save money through disposal processes
• An important aspect is capturing the financial information about
the hardware life cycle, which helps the organisation in making
business decisions based on meaningful and measurable financial
objectives. This also helps organisations to plan the next
year's IT budgets for both software and hardware assets.

177
Industry experts recommend no more than a 5-year refresh cycle for computers, servers, and most IT hardware.
IT Infrastructure End of Life Management

178
IT Infrastructure End of Life Management
• When products, hardware and software reach
end of life, very careful consideration needs to be
given to the impact of this.
• The IT case study shows the consequence of not considering
timeous replacement of IT technology.
• The impact is compounded if there are multiple elements of the
architecture that are at the “end of life” stage.

• Some companies see the benefit of EOL for revenue generation, but this will at some point be
outweighed by the cost of maintaining the support organisation (and keeping spares).

179
Closing Comments

Q&A
