13-ReliabilityAndSafety
λP = λD × πT × πS × πC × πQ × πE = 6.96 × 10⁻⁸
λP = λb × πT × πA × πQ × πE = 1.92 × 10⁻⁷
Summarize conclusions about the reliability of these components and/or the circuit in general. Suggest
design or analysis refinements that would realistically improve the reliability of the design.
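One way to frame such a summary is to combine the part failure rates into a circuit-level figure. The sketch below is illustrative only: it assumes a series model (any single part failure fails the circuit) and assumes the two λP values are in failures per hour. Neither assumption is stated in these notes, and the function names are made up for the example.

```python
# Sketch: combine part failure rates under a series reliability model.
# Assumption: rates are in failures per hour; check the units of your source data.
HOURS_PER_YEAR = 8760  # 365-day year


def series_failure_rate(rates):
    """Total failure rate when any single part failure fails the circuit."""
    return sum(rates)


def mttf_years(rate_per_hour):
    """Mean time to failure, in years, for a constant failure rate."""
    return 1.0 / rate_per_hour / HOURS_PER_YEAR


part_rates = [6.96e-8, 1.92e-7]  # the two part failure rates quoted in these notes
total = series_failure_rate(part_rates)
print(f"total rate = {total:.3e} failures per hour")
print(f"MTTF       = {mttf_years(total):.0f} years")
```

Because the rates add, the part with the largest λP dominates the total, which is usually where a design refinement (derating, a higher quality grade, a cooler operating point) pays off first.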
FMEA
• Failure Mode and Effects Analysis
• Bottom-up review of a system
• Examine components for failure modes
• Note how failures propagate through system
• Study effects on system behavior
• Leads to design review and possibly changes to eliminate weaknesses
FMECA
• Addition of criticality analysis
• Not necessary to examine every component
  - multiple components may have same failure effect
• Rearrange design into functional blocks
  - consider component failures within those blocks that may be critical
• Create chart listing possible failures
  - block, failure mode, possible cause, failure effects, method of detection, criticality, and probability*
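The chart columns listed above map naturally onto a small record type. The following is a minimal sketch, not a prescribed format: the class and field names are hypothetical, the first block name is assumed, and the probability column is left unset because no values are given in these notes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FmecaRow:
    """One line of an FMECA chart (field names are illustrative)."""
    block: str
    number: int
    failure_mode: str
    possible_causes: str
    failure_effects: str
    detection_method: str
    criticality: str
    probability: Optional[float] = None  # unset when no estimate is available


def rows_at_level(rows, level):
    """Return the rows whose criticality matches the given level."""
    return [row for row in rows if row.criticality == level]


worksheet = [
    # Block name "Power supply" is an assumption made for this example.
    FmecaRow("Power supply", 3, "Vout > 4.5 V", "failed regulator",
             "unpredictable effect, potential for component damage",
             "backplate supply voltage > 4.5 V", "HIGH"),
    FmecaRow("Output drive", 3, "partial open", "one MOSFET failed open",
             "unpredictable effect, may not be able to operate HVAC contact",
             "half-wave control when on", "MEDIUM"),
]

for row in rows_at_level(worksheet, "HIGH"):
    print(row.block, "-", row.failure_mode)
```

Keeping the chart as structured data makes it easy to sort by criticality or to spot blocks that have no detection method recorded.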
LTC3631
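| No. | Failure Mode | Possible Causes | Failure Effects | Detection Method | Criticality |
|-----|--------------|-----------------|-----------------|------------------|-------------|
| 3 | Vout > 4.5 V | failed regulator | Unpredictable effect, potential for component damage | backplate supply voltage > 4.5 V | HIGH |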
FMECA ANALYSIS
Output Drive for Connection Between RC and W (or Y / G)
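| No. | Failure Mode | Possible Causes | Failure Effects | Detection Method | Criticality |
|-----|--------------|-----------------|-----------------|------------------|-------------|
| 3 | partial open | one MOSFET failed open | Unpredictable effect, may not be able to operate HVAC contact | “half-wave” control when “on” | MEDIUM |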
CLICKER QUIZ
Question 6
If the zener diode (highlighted in yellow) fails open, possible effects include:
A. nominal effect – may be undetected
B. backplate power supply will be 0 V
C. HVAC control contact stuck closed, resulting in continuous heat/cool
D. excessive current drawn from HVAC control contact
E. none of the above
CLICKER QUIZ
Question 7
If any of the capacitors (highlighted in blue) fails shorted, possible effects include:
A. nominal effect – may be undetected
B. backplate power supply will be 0 V
C. HVAC control contact stuck closed, resulting in continuous heat/cool
D. B and C
E. none of the above
FAILURE REPORTS
Customer Complaints Documenting That Critical Failures Can and Do Occur
• I can't even begin to say how upset I am to have to title the Nest Learning Thermostat as "The
Worst Thermostat EVER." For the "cool" factor and appearance it was in "A" in my book. I
installed it in November 2014 and it worked like a charm... for 4 weeks. Then we came home to
a house that was 80+ degrees in winter (in Buffalo no less) and found "the base unit was
malfunctioning" preventing the nest from shutting off. The "overnight" Fed-Ex replacement
arrived in 2 days which meant I had to manually turn on and off the furnace from the circuit
breaker. The new nest worked great... for 3 weeks before it did the same thing. Another call to
nest with their crazy long wait customer service stated this was a known issue and another unit
would be sent... "overnight." Four (4) days later FedEx showed with my third unit in the same
number of months and it worked again...well. Yesterday, after only 2 1/2 weeks from install, the
Nest again malfunctioned and my phone call to their customer support agent and "senior" agent
finally concluded my energy effecient Heil forced air gas furnace was "incompatable" to the nest.
What?!?!? I have finally had it and went straight to Home Depot and purchased a Honeywell
Smart Thermostat as a replacement. My last Honeywell thermostat lasted over 20 years and I'm
just hopeful this one will last longer then the Nest's.
RELIABILITY & SAFETY ANALYSIS REPORT
• Failure Mode, Effects, and Criticality Analysis (FMECA)
Failure Modes: Divide your schematic into functional blocks (e.g., power circuits, sensor blocks,
microcontroller block) – include this illustration as Appendix A. Break the schematic into small enough blocks
so that details are readable. Determine all possible failure conditions of each functional block. Indicate the
components that could possibly be responsible for such a failure (e.g., a shorted bypass capacitor might
cause a voltage drop, but cannot cause a voltage increase).
Effects: For each failure mode above, determine the possible effects, if any, on any major components in
other parts of the design (e.g., damage the microcontroller or fry a resistor) as well as effects on the overall
operation of the project (e.g., audio volume increases to maximum). For some failure modes, it is
acceptable to declare the effects unpredictable. “Method of detection” of a particular failure mode should be
observable from the operation of the device, unless there is particular circuitry intended to detect such a
failure.
Criticality: Begin by defining at least two criticality levels for types of failures in the output of your design.
Define an acceptable failure rate λ for each level of failure. These are up to you and somewhat arbitrary, but
keep in mind λ < 10⁻⁹ is standard for any failure that could potentially injure the user. Failures not affecting
user safety do not usually require λ < 10⁻⁹.
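As a rough illustration of how criticality levels and acceptable failure rates can be written down and checked, consider the sketch below. Only the 10⁻⁹ figure comes from the text above; the MEDIUM and LOW thresholds, the level names, and the function name are assumptions chosen for the example.

```python
# Hypothetical criticality levels; only the 1e-9 threshold comes from the notes.
ACCEPTABLE_RATE = {
    "HIGH": 1e-9,    # failure could potentially injure the user
    "MEDIUM": 1e-7,  # assumed: loss of function, no safety impact
    "LOW": 1e-6,     # assumed: nuisance or cosmetic failure
}


def meets_target(level, failure_rate):
    """True if the estimated rate is below the acceptable rate for that level."""
    return failure_rate < ACCEPTABLE_RATE[level]


# Sample rates used only to exercise the check.
print(meets_target("HIGH", 6.96e-8))  # rate is above the HIGH threshold
print(meets_target("LOW", 1.92e-7))   # rate is below the LOW threshold
```

A failure mode that misses its target either needs a design change that lowers λ or a justification for assigning it a less severe criticality level.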
FMECA Worksheet: Include your completed FMECA Worksheet as Appendix B. In the body of the report,
explain your choice of criticality levels and any assumptions that affected your analysis of several failure
modes. Assumptions affecting just individual failure modes can be included in the comments in the table.
SOFTWARE RELIABILITY
Revisiting How Nest Learns
SOFTWARE RELIABILITY
Discussion
• Potential non-determinism associated with multithreaded software
• Large set of input variables (sensors) and states
• Effect of sensor malfunction on learning ability and impact on program behavior
  - potential to learn “bad habits”?
  - ability to recognize and “clear” incorrectly learned behavior?
• Standard testing may not reveal latent software bugs
FAILURE REPORTS
Customer Complaints Documenting That Software Failures Can and Do Occur
• “The NEST product was an interesting and fun gadget for a year and a half ... until control of it was
taken away by someone during one of the coldest days of the year. As the house got colder and
colder I worked through the NEST website looking for tech support to no avail. Finally Googling
"NEST help" got me a contact number. During three hours of troubleshooting I found out that this
thermostat was part of an energy savings program. NEST thought the thermostat was controlled by
my local utility. I contacted my local utility and they had no idea what I was talking about. I then went
back to NEST and they still had no idea who was controlling the thermostat or how low the
"Controller" whoever that was would let the temp fall. I worked with them a little longer in an attempt
to opt out of this energy saving program and after three hours I told them thank you very much, but
your time is up. I then replaced this thermostat with a conventional programmable thermostat. The
NEST product is not ready for prime time.”
• WOWWW The coldest day of the year, this is the second time NEST shut down heating system and
said it wanted us to call nest service to come fix heating system. I had to reconnect old thermostat
which corrected the issue. what a scam .;.; im wondering who had control of my house ???
SOFTWARE RELIABILITY
Watchdog Timer
• Role of watchdog timer is to reset processor if “strobe timeout” occurs
• Problem: watchdogs integral to microcontroller are no more reliable
than microcontroller itself
• External watchdogs are “better”, but they must be prevented from being
strobed in the event of failures/bugs
• Possible solution: make watchdog respond to a “key” (that would be
difficult for failed software/bug to generate)
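The keyed-watchdog idea can be simulated in a few lines. This is a sketch under stated assumptions rather than a description of any particular watchdog part: the class name, the 8-bit rolling key sequence, and the choice to ignore a wrong key (so the timer simply runs out) are all invented for illustration.

```python
class KeyedWatchdog:
    """Simulated external watchdog that reloads only on the expected key."""

    def __init__(self, timeout_ticks, seed=0xA5):
        self.timeout_ticks = timeout_ticks
        self.expected = seed            # next key the watchdog will accept
        self.remaining = timeout_ticks  # ticks left before reset
        self.reset_asserted = False

    @staticmethod
    def next_key(key):
        # Hypothetical sequence: rotate left one bit (8-bit), then XOR a constant.
        return (((key << 1) | (key >> 7)) & 0xFF) ^ 0x3C

    def strobe(self, key):
        # A wrong key is ignored, so the countdown keeps running.
        if key == self.expected:
            self.expected = self.next_key(self.expected)
            self.remaining = self.timeout_ticks

    def tick(self):
        if self.remaining > 0:
            self.remaining -= 1
        if self.remaining == 0:
            self.reset_asserted = True


def healthy_keys(count, seed=0xA5):
    """Key sequence that correctly running software would produce."""
    keys, key = [], seed
    for _ in range(count):
        keys.append(key)
        key = KeyedWatchdog.next_key(key)
    return keys


def run(keys, watchdog):
    """One tick per strobe attempt; returns True if a reset was asserted."""
    for key in keys:
        watchdog.tick()
        watchdog.strobe(key)
    return watchdog.reset_asserted


print(run(healthy_keys(10), KeyedWatchdog(timeout_ticks=3)))  # software keeps up
print(run([0xA5] * 10, KeyedWatchdog(timeout_ticks=3)))       # stuck loop repeats one key
```

A runaway loop that keeps writing the same value satisfies a plain strobe-based watchdog, but here it supplies the correct key only once and the reset fires a few ticks later.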
THE REST OF THE STORY…
• Designing a functional product represents about 30% of the design effort
• Making sure a product always fails in a safe, predictable manner takes
the remaining 70%
• Law of diminishing returns: exercise good judgment in adding safety
features
• Keep in balance: safety features and possibility of “nuisance alarms”
(failures resulting from added complexity)
• Utilize built-in self-test (BIST)
MAINTAINABILITY
• Reliability prediction indicates how many problems per day will need to
be serviced after, say, 10,000 units have been shipped
• Keep customers happy with quick repair turn-around time (TAT)
• Repair will most likely be by replacement (“line replaceable units” – LRU)
• Maintainability analysis generates data showing the time needed to
identify the faulty LRU, the time to replace it, and the time to re-test the
system
• Mean-time-to-repair (MTTR)
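The arithmetic behind these bullets can be sketched as follows. All numbers are hypothetical placeholders (the system failure rate, its per-hour units, and the repair times are assumptions, not data from these notes); only the 10,000-unit figure echoes the example above.

```python
def expected_failures_per_day(rate_per_hour, units_in_field):
    """Average failures per day across the fleet, for a constant failure rate."""
    return rate_per_hour * 24 * units_in_field


def mttr_hours(identify, replace, retest):
    """MTTR as time to identify the faulty LRU, replace it, and re-test."""
    return identify + replace + retest


rate = 2.0e-6  # hypothetical system failure rate, failures per hour per unit
per_day = expected_failures_per_day(rate, 10_000)
repair = mttr_hours(identify=0.5, replace=0.25, retest=0.25)  # hours, assumed

print(f"expected failures per day: {per_day:.2f}")
print(f"MTTR: {repair:.2f} h")
print(f"repair labor per day: {per_day * repair:.2f} h")
```

Multiplying the daily failure count by MTTR gives a first estimate of the service workload that the repair turn-around target has to absorb.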
STANDARDS AND COMPLIANCE
Example Category Relevant to ECE 477 Projects