Analytic Fault Detection
Authorized licensed use limited to: MULTIMEDIA UNIVERSITY. Downloaded on March 01,2021 at 09:42:15 UTC from IEEE Xplore. Restrictions apply.
288 IEEE TRANSACTIONS ON RELIABILITY, VOL. 59, NO. 2, JUNE 2010
Authorized licensed use limited to: MULTIMEDIA UNIVERSITY. Downloaded on March 01,2021 at 09:42:15 UTC from IEEE Xplore. Restrictions apply.
SIMON AND SIMON: ANALYTIC CONFUSION MATRIX BOUNDS FOR FDI USING AN SSR APPROACH 289
TABLE I
TYPICAL CONFUSION MATRIX FORMAT, WHERE THE ROWS CORRESPOND TO FAULT CONDITIONS, AND THE COLUMNS CORRESPOND TO FAULT ISOLATION RESULTS
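A confusion matrix of the kind shown in Table I can be estimated by simulation. The sketch below is a toy stand-in, not the paper's C-MAPSS study: the fault signatures, sensor sets, and thresholds are hypothetical, the sensor residuals are assumed to be unit-variance Gaussian, and isolation picks the fault whose SSR exceeds its threshold by the largest margin, which is one plausible reading of an SSR-based isolation rule. The last column counts trials in which no fault was isolated.

```python
import random

def isolate(residuals, sensor_sets, thresholds):
    """Isolate the fault whose SSR exceeds its threshold by the largest
    margin; return len(sensor_sets) if no SSR exceeds its threshold."""
    best, best_margin = len(sensor_sets), 0.0
    for f, sensors in enumerate(sensor_sets):
        ssr = sum(residuals[s] ** 2 for s in sensors)
        if ssr - thresholds[f] > best_margin:
            best, best_margin = f, ssr - thresholds[f]
    return best

def experimental_confusion_matrix(signatures, sensor_sets, thresholds,
                                  trials=2000, seed=4):
    """Row i: actual fault i occurs; columns: isolation result, with the
    last column counting trials in which no fault was isolated."""
    rng = random.Random(seed)
    n = len(signatures)
    matrix = [[0.0] * (n + 1) for _ in range(n)]
    for i in range(n):
        for _ in range(trials):
            # sensor residuals under fault i: mean shift plus unit noise
            resid = [mu + rng.gauss(0.0, 1.0) for mu in signatures[i]]
            matrix[i][isolate(resid, sensor_sets, thresholds)] += 1.0 / trials
    return matrix
```

With well-separated hypothetical signatures the diagonal dominates, which is the "identity matrix" ideal described below.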
The TPR is defined as the probability that a fault is correctly detected given that it occurs. This approach does not take fault isolation into account. The FNR is defined as the probability that a fault is not detected given that it occurs. These probabilities can be written as

TPR = Pr(SSR > ε | fault),   FNR = Pr(SSR ≤ ε | fault) = 1 − TPR   (3)

where ε is the detection threshold. Fig. 2 illustrates TPR and FNR for a chi-squared SSR. The FNR is the area to the left of the user-specified threshold ε, and the TPR is the area to the right of the threshold.

B. Confusion Matrix

A confusion matrix specifies the likelihood of isolating each fault, and can be used to quantify the performance of an FDI algorithm. A typical confusion matrix is shown in Table I. The rows correspond to fault conditions, and the columns correspond to fault isolation results. The element in the ith row and jth column is the probability that fault j is isolated when fault i occurs. Ideally, the confusion matrix would be an identity matrix, which would indicate perfect fault isolation.

III. CONFUSION MATRIX BOUNDS

This section derives analytic confusion matrix bounds for our SSR-based FDI algorithm. Section III-A deals with the no-fault case, and derives bounds for the correct no-fault rate (CNR), which is the probability that no fault is detected given that no fault occurs. It also derives bounds for the FPR, which is the probability that one or more faults are detected given that no fault occurred. Finally, it derives an upper bound for the no-fault misclassification rate, which is the probability that a given fault is isolated given that no fault occurred. Section III-B deals with the fault case, and derives bounds for the correct classification rate (CCR), which is the probability that a given fault is correctly isolated given that it occurred. Section III-C also deals with the fault case, and derives upper bounds for the fault misclassification rate, which is the probability that an incorrect fault is isolated given that some other fault occurred. Section III-D summarizes the bounds and their use in the confusion matrix, and Section III-E discusses the required computational effort.

A. No-Fault Case

1) Correct No-Fault Rate: First, suppose that only two fault detection algorithms are running. Each algorithm attempts to detect its own fault using its own sensor set and detection threshold. We use the notation

(6)

If is empty, and and are not empty, then

(7)
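The FPR-constrained threshold and the resulting TPR and FNR for a chi-squared SSR can be sketched numerically. The sketch below is illustrative only: it assumes unit-variance Gaussian sensor residuals whose means shift by a hypothetical fault signature, and it implements the central chi-squared tail probability directly (via the regularized incomplete gamma series) so that no external libraries are needed. The function and variable names are our own, not the paper's notation.

```python
import math
import random

def chi2_sf(x, k, terms=400):
    """Upper-tail probability of a central chi-squared variable with
    k degrees of freedom, from the regularized incomplete gamma series."""
    a, z = k / 2.0, x / 2.0
    if z <= 0.0:
        return 1.0
    total = term = 1.0 / a
    for n in range(1, terms):
        term *= z / (a + n)
        total += term
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return 1.0 - lower

def detection_threshold(k, max_fpr):
    """Bisect for the SSR threshold whose no-fault tail equals max_fpr
    (the FPR-constrained threshold for a detector with k sensors)."""
    lo, hi = 0.0, 50.0 + 10.0 * k  # upper bracket adequate for small k
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_sf(mid, k) > max_fpr:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def estimate_tpr(signature, threshold, trials=20000, seed=1):
    """Monte Carlo TPR for a faulted SSR; `signature` holds the
    (hypothetical) normalized mean residual shift of each sensor."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        ssr = sum((mu + rng.gauss(0.0, 1.0)) ** 2 for mu in signature)
        hits += ssr > threshold
    return hits / trials
```

For example, detection_threshold(5, 1e-4) is roughly 25.7; the TPR is then the faulted-SSR probability mass above that threshold, and the FNR is one minus the TPR, matching the two areas in Fig. 2.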
B. Correct Fault Classification Rates

Given that some fault occurs, we might isolate the correct fault, or we might isolate an incorrect fault. The probability of isolating the correct fault is called the correct classification rate (CCR). In this section, we derive lower and upper bounds for the CCR.

1) Lower Bounds for the Correct Classification Rate: Suppose we have only two fault detection algorithms, and suppose that one of the two faults occurs. Consider the probability that one SSR is larger than the other relative to their thresholds. We call this probability the marginal detection rate. Note that we are not considering whether or not the SSRs exceed their thresholds; we are only considering how large the SSRs are relative to their thresholds. The marginal detection rate is given as

(10)

(11)

where

(13)

If is empty, and is not empty, then

(14)

If is empty, and is not empty, then

(15)

Proof: Equation (12) can be obtained using Lemmas 5 and 6, which are in the Appendix. Equations (14) and (15) follow from (11).

The preceding lemma leads to the following result for the correct fault isolation rate.

Theorem 3: If we have fault detection algorithms, and a fault occurs, the probability that the fault is correctly isolated is bounded as

Proof: See the Appendix.

2) Upper Bounds for the Correct Classification Rate: Next, we find an upper bound for the CCR. To begin, suppose that we have only two fault detectors. Given that a fault occurs, the probability that it is correctly isolated is called the marginal CCR. This CCR can be written as

Lemma 3: If neither , , nor are empty, then

(16)

where is defined analogously to , shown in (13). If is empty, but and are not empty, then

(17)

If is empty, but and are not empty, then

(18)

(19)

Proof: Equation (16) can be obtained using Lemmas 5, 6, 7, and 10, which are in the Appendix. Equation (17) follows from the s-independence of the two SSRs. Equation (18) can be obtained using Lemmas 5, 7, and 11. Equation (19) can be obtained using Lemmas 5, 6, and 11.

The preceding lemma leads to the following result for the correct fault isolation rate.

Theorem 4: If we have fault detection algorithms, and a fault occurs, the probability that the fault is correctly detected and isolated can be bounded as

Proof: See the Appendix.

C. Fault Misclassification Rates

In this section, we derive upper bounds for the probability that a fault is incorrectly isolated. If a fault occurs, the probability that a different fault is detected and isolated is called the misclassification rate.

First, suppose that we have two fault detection algorithms. The misclassification rate can then be written as

(20)

where the prime symbol denotes that only two detection algorithms are used.
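The marginal detection rate defined in Section III-B can be illustrated with a small Monte Carlo sketch. Everything below is a hypothetical stand-in: the two detectors watch disjoint made-up sensor sets with unit-variance Gaussian residuals, and "larger relative to their thresholds" is taken as the ratio SSR/threshold, which is only one plausible reading of the text (a margin of SSR minus threshold would be another).

```python
import random

def marginal_detection_rate(sig_i, sig_j, eps_i, eps_j,
                            trials=50000, seed=2):
    """Monte Carlo estimate of the probability that detector i's SSR is
    larger than detector j's relative to their thresholds, regardless of
    whether either SSR actually exceeds its threshold. sig_i and sig_j
    are hypothetical mean residual shifts on disjoint sensor sets, and
    eps_i and eps_j are the detection thresholds."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        ssr_i = sum((mu + rng.gauss(0.0, 1.0)) ** 2 for mu in sig_i)
        ssr_j = sum((mu + rng.gauss(0.0, 1.0)) ** 2 for mu in sig_j)
        count += (ssr_i / eps_i) > (ssr_j / eps_j)
    return count / trials
```

With a strong signature on detector i and none on detector j, the rate approaches one; with identical signatures and thresholds it sits near one half by symmetry.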
Lemma 4: If neither , , nor are empty, then

(21)

If is empty, but and are not empty, then

(22)

If is empty, but and are not empty, then

(23)

If is empty, but and are not empty, then

(24)

Proof: Equation (21) can be obtained using Lemmas 5, 6, 7, and 10, which are in the Appendix. Equation (22) can be obtained using Lemmas 5, 7, and 11. Equation (23) follows from (20), and the s-independence of the two SSRs. Equation (24) follows from Lemmas 5, 6, and 11.

The preceding lemma leads to the following results for the fault misclassification rate.

Theorem 5: If we have fault detection algorithms, and a fault occurs, the probability that a different fault will be incorrectly detected and isolated can be bounded as

Proof: See the Appendix.

Theorem 6: The probability that no fault is detected when a fault occurs can be bounded from above as

Proof: See the Appendix.

D. Summary of Confusion Matrix Bounds

Recall the confusion matrix in Table I. The rows correspond to fault conditions, and the columns correspond to fault isolation results. The element in the ith row and jth column is the probability that fault j is isolated when fault i occurs. The previous sections derived the following bounds.
• CNR is the probability that a no-fault condition is correctly indicated given that no fault occurs, and its lower and upper bounds are given in Theorem 1.
• The no-fault misclassification rate is the probability that a given fault is incorrectly isolated given that no fault occurs, and its upper bound is given in Theorem 2.
• The CCR is the probability that a given fault is correctly isolated given that it occurs, and its lower and upper bounds are given in Theorems 3 and 4.
• The fault misclassification rate is the probability that a given fault is incorrectly isolated given that some other fault occurs, and its upper bound is given in Theorem 5.
• The probability that no fault is isolated given that a fault occurs has its upper bound given in Theorem 6.

E. Computational Effort

Usually, confusion matrices are obtained through simulations. To derive an experimental confusion matrix with n faults, the number of matrix elements that need to be calculated is on the order of n². Also, the required number of simulations for each matrix element is on the order of n, because as the number of possible faults increases, the number of simulations required to obtain the same statistical accuracy increases in direct proportion. Therefore, the computational effort required for the experimental determination of a confusion matrix is on the order of n³.

The bounds derived in this paper also require computational effort on the order of n³: each of the bounds summarized in Section III-D requires computational effort on the order of n, and the number of matrix elements is on the order of n². Note that this does not include the sensor selection algorithm shown in Fig. 3, which requires the off-line solution of a discrete minimization problem.

IV. SIMULATION RESULTS

In this section, we use simulation results to verify the theoretical bounds of the preceding sections. We consider the problem of isolating an aircraft turbofan engine fault, which is modeled by the NASA Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) [25]. There are five possible faults that can occur: fan, low pressure compressor (LPC), high pressure compressor (HPC), high pressure turbine (HPT), and low pressure turbine (LPT). These five faults entail shifts of both efficiency and flow capacity from nominal values. The fault magnitudes that we try to detect are 2.5% for the fan, 20% for the LPC, 2% for the HPC, 1.5% for the HPT, and 2% for the LPT. These magnitudes were chosen to give reasonable fault detection ability.

The available sensors and their standard deviations are shown in Table II. Recall that our FDI algorithm assumes that the sensor noises are s-independent. In reality, they may have some correlation. For example, if the aircraft is operating in high humidity, all of the pressure sensors may be slightly biased in a similar fashion. However, sensor noise correlation is a second-order effect, and so we make the simplifying but standard assumption that the correlations are zero. This assumption is conceptually similar to our simplifying assumption of Gaussian noise.

The fault influence coefficient matrix shown in Table III was generated using C-MAPSS, and is based on [26]. The numbers in Table III are the partial derivatives of the sensor outputs with respect to the fault conditions, normalized to the fault percentages discussed above, and normalized to one standard deviation of the sensor noise.

We used the algorithm shown in Fig. 3 to select sensors for each fault with a maximum allowable FPR of 0.0001. As an example, consider the fan fault with the normalized fault signatures
TABLE II
AIRCRAFT ENGINE SENSORS, AND STANDARD DEVIATIONS AS A PERCENTAGE OF THEIR NOMINAL VALUES

TABLE III
FAULT SIGNATURES OF FIVE DIFFERENT FAULT CONDITIONS, WITH MEAN SENSOR VALUE RESIDUALS NORMALIZED TO ONE STANDARD DEVIATION

TABLE IV
POTENTIAL SENSOR SETS FOR DETECTING A FAN FAULT

TABLE V
SENSOR SETS FOR FAULT DETECTION GIVING THE LARGEST TPR FOR EACH FAULT GIVEN THE CONSTRAINT THAT FPR ≤ 0.0001

TABLE VI
LOWER BOUNDS OF DIAGONAL CONFUSION MATRIX ELEMENTS, WHERE ROWS SPECIFY THE ACTUAL FAULT CONDITION, AND COLUMNS SPECIFY THE DIAGNOSIS

TABLE VII
UPPER BOUNDS OF THE CONFUSION MATRIX ELEMENTS, WHERE ROWS SPECIFY THE ACTUAL FAULT CONDITION, AND COLUMNS SPECIFY THE DIAGNOSIS

TABLE VIII
EXPERIMENTAL CONFUSION MATRIX USING SSR-BASED FDI, WHERE ROWS SPECIFY THE ACTUAL FAULT CONDITION, AND COLUMNS SPECIFY THE DIAGNOSIS, BASED ON 100,000 SIMULATIONS OF EACH FAULT
shown in Table III. The sensors with the largest fault signatures, in descending order, are Ps30, Wf, T30, P15, P24, T48, Nc, and T24. This gives eight potential sensor sets for detecting a fan fault: the first potential set uses only sensor Ps30, the second potential set uses Ps30 and Wf, and so on. The potential sensor sets, along with their detection thresholds and TPRs, are shown in Table IV. The thresholds were determined by constraining FPR ≤ 0.0001, and Table IV shows that using five sensors gives the largest TPR subject to this constraint.

This process was repeated for each fault shown in Table III. The resulting sensor sets are shown in Table V. Note that, given an FPR constraint, the detection threshold is a function only of the number of sensors in each sensor set; the detection threshold is not a function of the specific fault signatures. This result is illustrated in Fig. 1, where it is seen that the threshold is a function only of the FPR and the number of sensors.

We used the fault isolation method shown in Fig. 4, along with the theorems in the previous sections, to obtain lower and upper bounds for the confusion matrix as summarized in Section III-D. We also ran 100,000 simulations to obtain an experimental confusion matrix. Table VI shows the theoretical lower bounds of the diagonal elements of the confusion matrix. Lower bounds of the off-diagonal elements were not obtained because we are typically more interested in upper bounds of off-diagonal elements. Table VII shows the theoretical upper bounds of the confusion matrix. Table VIII shows the experimental confusion matrix. These tables show that the theoretical results derived in this paper give reasonably tight bounds on the experimental confusion matrix values.

Recall that we used an FPR of 0.0001 to choose our sensor sets and detection thresholds. Therefore, the first five elements in the last row of Table VII are guaranteed to be no greater than 0.0001, and a corresponding guarantee holds for the element in the lower right corner of Table VI.

Note that it is possible for an element of the experimental confusion matrix in Table VIII to lie outside the bounds shown in Tables VI and VII (for example, see the numbers in the fourth row and first column of Tables VII and VIII). This is because the numbers in Table VIII are experimentally obtained from a finite number of simulations, and are guaranteed to lie within their theoretical bounds only as the number of simulations approaches infinity. In fact, that is one of the strengths of
the analytic method proposed in this paper. The analytic bounds are definite, but simulations are subject to random effects. Also, simulations can give misleading conclusions if the simulation has errors. One common simulation error is the non-randomness of commonly used pseudorandom number generators [27].

To summarize the SSR-based FDI algorithm: the user specifies the maximum FPR for each fault, and then finds the sensor set that has the largest TPR given the FPR constraint. Analytic confusion matrix bounds are then obtained using the theory in this paper. If the results are not satisfactory, the user can iterate by changing the maximum FPR constraints. For example, if a TPR is too small, then the user will have to increase the corresponding FPR constraint. Likewise, if the confusion matrix bounds on the fault isolation probabilities are not satisfactory, the user will have to iterate on the FPR constraints to obtain different confusion matrix bounds.

We also generated FDI results using the parity space approach [20] to explore the relative performance of our new SSR-based FDI approach. The parity space approach uses all sensors for all fault detectors, and we set the detection thresholds to achieve an FPR of 0.0001 to be consistent with the SSR-based approach. Results are shown in Table IX. A comparison of Tables VIII and IX shows that the parity space approach generally performs better than the SSR-based approach, although the results are comparable. The confusion matrix in Table VIII for the SSR-based algorithm has a condition number of 1.83, while the matrix in Table IX for the parity space approach has a condition number of 1.65. By this measure, the confusion matrix for the parity space approach is about 9.8% closer to perfect than the confusion matrix for the SSR-based approach.

TABLE IX
EXPERIMENTAL CONFUSION MATRIX USING THE PARITY-SPACE APPROACH FOR FDI, BASED ON 100,000 SIMULATIONS OF EACH FAULT

V. CONCLUSION

This paper has introduced a new FDI algorithm, and derived analytic confusion matrix bounds. The main contribution of this paper is the generation of analytic confusion matrix bounds, and the possibility that our methodology could be adapted to other FDI algorithms. Usually, confusion matrices are obtained with simulations. Such simulations have several potential drawbacks. First, they can be time consuming. Second, they can give misleading conclusions if not enough simulations are run to give statistically significant results. Third, they can give misleading conclusions if the simulation has errors (for example, if the output of the random number generator does not satisfy statistical tests for randomness). The theoretical confusion matrix bounds derived in this paper do not depend on a random number generator, and can be used in place of simulations.

Further work in this area could follow several directions. First, the tightness of the confusion matrix bounds could be quantified; this paper derives bounds, but does not guarantee how loose or tight those bounds are. Second, the bounds could be modified to be tighter. Third, bounds could be attempted for methods other than the FDI algorithm proposed here. The fault isolation method we used isolates the fault that has the largest SSR relative to its detection threshold. Other fault isolation methods could normalize the SSR to its standard deviation, or could normalize the SSR to its detection threshold. Our FDI method is static, which means that faults are isolated using measurements at a single time instant. Better fault isolation might be achieved if dynamic system information is used.

APPENDIX

We use the following lemmas to derive the results of this paper. We use the notation f_X(x) and F_X(x) to denote the pdf and CDF of the random variable X evaluated at x. If the random variable is clear from the context, we shorten the notation to f(x) and F(x), respectively. These lemmas can be proven using standard definitions and results from probability theory [24].

Lemma 5: The probability that a realization of the random variable X is greater than a realization of the random variable Y is given as

Pr(X > Y) = ∫_{−∞}^{∞} ∫_{−∞}^{x} f(x, y) dy dx

where f(x, y) is the joint pdf of X and Y. If X and Y are s-independent, this result can be written as

Pr(X > Y) = ∫_{−∞}^{∞} f_X(x) F_Y(x) dx

Lemma 6: If , where is a random variable, and is a constant, then

Lemma 7: If , where is a random variable, and is a constant, then

Lemma 8: If , where and are s-independent random variables, then

Lemma 9: If , where is a random variable, and is a constant, then

where δ(·) is the continuous-time impulse function.
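The s-independent form of Lemma 5 can be checked numerically for a concrete case. The sketch below assumes unit-variance Gaussian X and Y, for which Pr(X > Y) has the closed form Φ((μ_X − μ_Y)/√2) because X − Y is Gaussian with variance 2; the integral of f_X(x) F_Y(x) is evaluated with a simple midpoint rule. The truncation limits and step count are arbitrary choices, not part of the lemma.

```python
from statistics import NormalDist

def pr_x_greater_y_integral(mu_x, mu_y, lo=-10.0, hi=11.0, steps=4000):
    """Evaluate Pr(X > Y) as the integral of f_X(x) * F_Y(x) dx (the
    s-independent form of Lemma 5) by the midpoint rule, for
    unit-variance Gaussians X and Y."""
    x_dist = NormalDist(mu_x, 1.0)
    y_dist = NormalDist(mu_y, 1.0)
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += x_dist.pdf(x) * y_dist.cdf(x)
    return total * dx

def pr_x_greater_y_exact(mu_x, mu_y):
    # X - Y is Gaussian with mean mu_x - mu_y and variance 2,
    # so Pr(X > Y) = Pr(X - Y > 0).
    return 1.0 - NormalDist(mu_x - mu_y, 2.0 ** 0.5).cdf(0.0)
```

For μ_X = 1 and μ_Y = 0, both routes give about 0.760, and the symmetric case μ_X = μ_Y gives exactly one half.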
Lemma 10: If , where and are s-independent random variables, then

where the inequality comes from the positive dependence of the variables involved. The probability that a given fault is isolated given that another fault occurred can be written as
[10] W. Fenton, T. McGinnity, and L. Maguire, "Fault diagnosis of electronic systems using intelligent techniques: A review," IEEE Trans. Systems, Man and Cybernetics, Part C: Applications and Reviews, vol. 31, pp. 269–281, Aug. 2001.
[11] H. Schneider and P. Frank, "Observer-based supervision and fault detection in robots using nonlinear and fuzzy-logic residual evaluation," IEEE Trans. Control Systems Technology, vol. 4, pp. 274–282, May 1996.
[12] M. Napolitano, C. Neppach, V. Casdorph, S. Naylor, M. Innocenti, and G. Silvestri, "Neural-network-based scheme for sensor failure detection, identification and accommodation," Journal of Guidance, Control and Dynamics, vol. 18, pp. 1280–1286, Nov. 1995.
[13] Z. Yangping, Z. Bingquan, and W. DongXin, "Application of genetic algorithms to fault diagnosis in nuclear power plants," Reliability Engineering and System Safety, vol. 67, pp. 153–160, Feb. 2000.
[14] W. Gui, C. Yang, J. Teng, and W. Yu, "Intelligent fault diagnosis in lead-zinc smelting process," in IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes, Beijing, Aug. 30–Sep. 1, 2006, pp. 234–239.
[15] S. Lu and B. Huang, "Condition monitoring of model predictive control systems using Markov models," in IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes, Beijing, Aug. 30–Sep. 1, 2006, pp. 264–269.
[16] R. Isermann, "Supervision, fault-detection and fault-diagnosis methods—An introduction," Control Engineering Practice, vol. 5, pp. 639–652, May 1997.
[17] X. Deng and X. Tian, "Multivariate statistical process monitoring using multi-scale kernel principal component analysis," in IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes, Beijing, Aug. 30–Sep. 1, 2006, pp. 108–113.
[18] A. Pernestal, M. Nyberg, and B. Wahlberg, "A Bayesian approach to fault isolation—Structure estimation and inference," in IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes, Beijing, Aug. 30–Sep. 1, 2006, pp. 450–455.
[19] S. Campbell and R. Nikoukhah, Auxiliary Signal Design for Failure Detection. Princeton University Press, 2004.
[20] F. Gustafsson, "Statistical signal processing approaches to fault detection," Annual Reviews in Control, vol. 31, pp. 41–54, Apr. 2007.
[21] J. Gertler, Fault Detection and Diagnosis in Engineering Systems. CRC Press, 1998.
[22] D. Simon and D. L. Simon, "Analytic confusion matrix bounds for fault detection and isolation using a sum-of-squared-residuals approach," NASA Technical Memorandum TM-2009-215655, Jul. 2009.
[23] M. Abramowitz and I. Stegun, Handbook of Mathematical Functions. Dover, 1965.
[24] A. Papoulis and S. Pillai, Probability, Random Variables, and Stochastic Processes. McGraw-Hill, 2002.
[25] D. Frederick, J. DeCastro, and J. Litt, User's Guide for the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS), NASA Technical Memorandum TM-2007-215026.
[26] D. L. Simon, J. Bird, C. Davison, A. Volponi, and R. Iverson, "Benchmarking gas path diagnostic methods: A public approach," presented at the ASME Turbo Expo, Jun. 2008, Paper GT2008-51360, unpublished.
[27] P. Savicky and M. Robnik-Sikonja, "Learning random numbers: A Matlab anomaly," Applied Artificial Intelligence, vol. 22, pp. 254–265, Mar. 2008.
[28] C. Lai and M. Xie, "Concepts of stochastic dependence in reliability analysis," in Handbook of Reliability Engineering, H. Pham, Ed. Springer, 2003, pp. 141–156.

Dan Simon (S'89–M'90–SM'01) received a B.S. degree from Arizona State University (1982), an M.S. degree from the University of Washington (1987), and a Ph.D. degree from Syracuse University (1991), all in electrical engineering. He worked in industry for 14 years at Boeing, TRW, and several small companies. His industrial experience includes work in the aerospace, automotive, agricultural, GPS, biomedical, process control, and software fields. In 1999, he moved from industry to academia, where he is now a professor in the Electrical and Computer Engineering Department at Cleveland State University. His teaching and research involve embedded systems, control systems, and computer intelligence. He has published about 80 refereed conference and journal papers, and is the author of the text Optimal State Estimation (John Wiley & Sons, 2006).

Donald L. Simon received a B.S. degree from Youngstown State University (1987), and an M.S. degree from Cleveland State University (1990), both in electrical engineering. During his career as an employee of the US Army Research Laboratory (1987–2007) and the NASA Glenn Research Center (2007–present), he has focused on the development of advanced control and health management technologies for current and future aerospace propulsion systems. His specific research interests are in aircraft gas turbine engine performance diagnostics and performance estimation. He currently leads the propulsion gas path health management research effort ongoing under the NASA Aviation Safety Program, Integrated Vehicle Health Management Project.