
Identification of Dynamic Systems, Theory and Formulation

This document provides a summary of statistical methods for system identification. It introduces concepts such as parameter identification, different types of system models including explicit function and state space models. It discusses parameter estimation and other approaches to system identification such as maximum likelihood estimation. The document is intended to provide practicing data analysts with the background to effectively use statistical system identification techniques.


NASA Reference Publication 1138

February 1985

Identification of Dynamic Systems

Theory and Formulation

Richard E. Maine and Kenneth W. Iliff

Ames Research Center
Dryden Flight Research Facility
Edwards, California

National Aeronautics and Space Administration

Scientific and Technical Information Branch

PREFACE

The subject of system identification is too broad to be covered completely in one book. This document is restricted to statistical system identification; that is, methods derived from probabilistic mathematical statements of the problem. We will be primarily interested in maximum-likelihood and related estimators.

Statistical methods are becoming increasingly important with the proliferation of high-speed, general-purpose digital computers. Problems that were once solved by hand-plotting the data and drawing a line through them are now done by telling a computer to fit the best line through the data (or by some completely different, formerly impractical method). Statistical approaches to system identification are well-suited to computer application. Automated statistical algorithms can solve more complicated problems more rapidly, and sometimes more accurately, than the older manual methods. There is a danger, however, of the engineer's losing the intuitive feel for the system that arises from long hours of working closely with the data. To use statistical estimation algorithms effectively, the engineer must have not only a good grasp of the system under analysis, but also a thorough understanding of the analytic tools used. The analyst must strive to understand how the system behaves and what characteristics of the data influence the statistical estimators in order to evaluate the validity and meaning of the results.
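The "best line through the data" computation mentioned above is, in modern terms, a short least-squares fit. The following is a minimal sketch; the measurements are invented for illustration and are not taken from this document:

```python
import numpy as np

# Hypothetical measurements of the kind once hand-plotted on graph paper.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
z = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Tell the computer to fit the best line z = a*t + b in the
# least-squares sense.
A = np.column_stack([t, np.ones_like(t)])
(a, b), *_ = np.linalg.lstsq(A, z, rcond=None)

print(f"slope = {a:.3f}, intercept = {b:.3f}")
```

The same normal-equation solution reappears in Chapter 2's discussion of sums of squares; here it simply illustrates the kind of problem that statistical estimation automates.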
Our primary aim in this document is to provide the practicing data analyst with the background necessary to make effective use of statistical system identification techniques, particularly maximum-likelihood and related estimators. The intent is to present the theory in a manner that aids intuitive understanding at a concrete level useful in application. Theoretical rigor has not been sacrificed, but we have tried to avoid "elegant" proofs that may require three lines to write, but 3 years of study to comprehend the underlying theory. In particular, such theoretically intriguing subjects as martingales and measure theory are ignored. Several excellent volumes on these subjects are available, including Balakrishnan (1973), Royden (1968), Rudin (1974), and Kushner (1971). We assume that the reader has a thorough background in linear algebra and calculus (Paige, Swift, and Slobko, 1974; Apostol, 1969; Nering, 1969; and Wilkinson, 1965), including complete familiarity with matrix operations, vector spaces, inner products, norms, gradients, eigenvalues, and related subjects. The reader should be familiar with the concept of function spaces as types of abstract vector spaces (Luenberger, 1969), but does not need expertise in functional analysis. We also assume familiarity with concepts of deterministic dynamic systems (Zadeh and Desoer, 1963; Wiberg, 1971; and Levan, 1983).

Chapter 1 introduces the basic concepts of system identification. Chapter 2 is an introduction to numerical optimization methods, which are important to system identification. Chapter 3 reviews basic concepts from probability theory. The treatment is necessarily abbreviated, and previous familiarity with probability theory is assumed.

Chapters 4-10 present the body of the theory. Chapter 4 defines the concept of an estimator and some of the basic properties of estimators. Chapter 5 discusses estimation as a static problem in which time is not involved. Chapter 6 presents some simple results on stochastic processes. Chapter 7 covers the state estimation problem for dynamic systems with known coefficients. We first pose it as a static estimation problem, drawing on the results from Chapter 5. We then show how a recursive formulation results in a simpler solution process, arriving at the same state estimate. The derivation used for the recursive state estimator (Kalman filter) does not require a background in stochastic processes; only basic probability and the results from Chapter 5 are used.

Chapters 8-10 present the parameter estimation problem for dynamic systems. Each chapter covers one of the basic estimation algorithms. We have considered parameter estimation as a problem in its own right, rather than forcing it into the form of a nonlinear filtering problem. The general nonlinear filtering problem is more difficult than parameter estimation for linear systems, and it requires ad hoc approximations for practical implementation. We feel that our approach is more natural and is easier to understand.

Chapter 11 examines the accuracy of the estimates. The emphasis in this chapter is on evaluating the accuracy and analyzing causes of poor accuracy. The chapter also includes brief discussions about the roles of model structure determination and experiment design.


TABLE OF CONTENTS

PREFACE
NOMENCLATURE
1.0 INTRODUCTION
    1.1 SYSTEM IDENTIFICATION
    1.2 PARAMETER IDENTIFICATION
    1.3 TYPES OF SYSTEM MODELS
        1.3.1 Explicit Function
        1.3.2 State Space
        1.3.3 Others
    1.4 PARAMETER ESTIMATION
    1.5 OTHER APPROACHES
2.0 OPTIMIZATION METHODS
    2.1 ONE-DIMENSIONAL SEARCHES
    2.2 DIRECT METHODS
    2.3 GRADIENT METHODS
    2.4 SECOND ORDER METHODS
        2.4.1 Newton-Raphson
        2.4.2 Invariance
        2.4.3 Singularities
        2.4.4 Quasi-Newton Methods
    2.5 SUMS OF SQUARES
        2.5.1 Linear Case
        2.5.2 Nonlinear Case
    2.6 CONVERGENCE IMPROVEMENT
3.0 BASIC PRINCIPLES FROM PROBABILITY
    3.1 PROBABILITY SPACES
        3.1.1 Probability Triple
        3.1.2 Conditional Probabilities
    3.2 SCALAR RANDOM VARIABLES
        3.2.1 Distribution and Density Functions
        3.2.2 Expectations and Moments
    3.3 JOINT RANDOM VARIABLES
        3.3.1 Distribution and Density Functions
        3.3.2 Expectations and Moments
        3.3.3 Marginal and Conditional Distributions
        3.3.4 Statistical Independence
    3.4 TRANSFORMATION OF VARIABLES
    3.5 GAUSSIAN VARIABLES
        3.5.1 Standard Gaussian Distributions
        3.5.2 General Gaussian Distributions
        3.5.3 Properties
        3.5.4 Central Limit Theorem
4.0 STATISTICAL ESTIMATORS
    4.1 DEFINITION OF AN ESTIMATOR
    4.2 PROPERTIES OF ESTIMATORS
        4.2.1 Unbiased Estimators
        4.2.2 Minimum Variance Estimators
        4.2.3 Cramer-Rao Inequality (Efficient Estimators)
        4.2.4 Bayesian Optimal Estimators
        4.2.5 Asymptotic Properties
    4.3 COMMON ESTIMATORS
        4.3.1 A posteriori Expected Value
        4.3.2 Bayesian Minimum Risk
        4.3.3 Maximum a posteriori Probability
        4.3.4 Maximum Likelihood
5.0 THE STATIC ESTIMATION PROBLEM
    5.1 LINEAR SYSTEMS WITH ADDITIVE GAUSSIAN NOISE
        5.1.1 Joint Distribution of Z and ξ
        5.1.2 A posteriori Estimators
        5.1.3 Maximum Likelihood Estimator
        5.1.4 Comparison of Estimators
    5.2 PARTITIONING IN ESTIMATION PROBLEMS
        5.2.1 Measurement Partitioning
        5.2.2 Application to Linear Gaussian System
        5.2.3 Parameter Partitioning
    5.3 LIMITING CASES AND SINGULARITIES
        5.3.1 Singular P
        5.3.2 Singular GG*
        5.3.3 Singular CPC* + GG*
        5.3.4 Infinite P
        5.3.5 Infinite GG*
        5.3.6 Singular C*(GG*)⁻¹C + P⁻¹
    5.4 NONLINEAR SYSTEMS WITH ADDITIVE GAUSSIAN NOISE
        5.4.1 Joint Distribution of Z and ξ
        5.4.2 Estimators
        5.4.3 Computation of the Estimates
        5.4.4 Singularities
        5.4.5 Partitioning
    5.5 MULTIPLICATIVE GAUSSIAN NOISE (ESTIMATION OF VARIANCE)
    5.6 NON-GAUSSIAN NOISE
6.0 STOCHASTIC PROCESSES
    6.1 DISCRETE TIME
        6.1.1 Linear Systems Forced by Gaussian White Noise
        6.1.2 Nonlinear Systems and Non-Gaussian Noise
    6.2 CONTINUOUS TIME
        6.2.1 Linear Systems Forced by White Noise
        6.2.2 Additive White Measurement Noise
        6.2.3 Nonlinear Systems
7.0 STATE ESTIMATION FOR DYNAMIC SYSTEMS
    7.1 EXPLICIT FORMULATION
    7.2 RECURSIVE FORMULATION
        7.2.1 Prediction Step
        7.2.2 Correction Step
        7.2.3 Kalman Filter
        7.2.4 Alternate Forms
        7.2.5 Innovations
    7.3 STEADY-STATE FORM
    7.4 CONTINUOUS TIME
    7.5 CONTINUOUS/DISCRETE TIME
    7.6 SMOOTHING
    7.7 NONLINEAR SYSTEMS AND NON-GAUSSIAN NOISE
8.0 OUTPUT ERROR METHOD FOR DYNAMIC SYSTEMS
    8.1 DERIVATION
    8.2 INITIAL CONDITIONS
    8.3 COMPUTATIONS
        8.3.1 Gauss-Newton Method
        8.3.2 System Response
        8.3.3 Finite Difference Response Gradient
        8.3.4 Analytic Response Gradient
    8.4 UNKNOWN G
    8.5 CHARACTERISTICS
9.0 FILTER ERROR METHOD FOR DYNAMIC SYSTEMS
    9.1 DERIVATION
        9.1.1 Static Derivation
        9.1.2 Derivation by Recursive Factoring
        9.1.3 Derivation Using the Innovation
        9.1.4 Steady-State Form
        9.1.5 Cost Function Discussion
    9.2 COMPUTATION
    9.3 FORMULATION AS A FILTERING PROBLEM
10.0 EQUATION ERROR METHOD FOR DYNAMIC SYSTEMS
    10.1 PROCESS-NOISE APPROACH
        10.1.1 Derivation
        10.1.2 Special Case of Filter Error
        10.1.3 Discussion
    10.2 GENERAL EQUATION ERROR FORM
        10.2.1 Discrete State-Equation Error
        10.2.2 Continuous/Discrete State-Equation Error
        10.2.3 Observation-Equation Error
    10.3 COMPUTATION
    10.4 DISCUSSION
11.0 ACCURACY OF THE ESTIMATES
    11.1 CONFIDENCE REGIONS
        11.1.1 Random Parameter Vector
        11.1.2 Nonrandom Parameter Vector
        11.1.3 Gaussian Approximation
        11.1.4 Nonstatistical Derivation
    11.2 ANALYSIS OF THE CONFIDENCE ELLIPSOID
        11.2.1 Sensitivity
        11.2.2 Correlation
        11.2.3 Cramer-Rao Bound
    11.3 OTHER MEASURES OF ACCURACY
        11.3.1 Bias
        11.3.2 Scatter
        11.3.3 Engineering Judgment
    11.4 MODEL STRUCTURE DETERMINATION
    11.5 EXPERIMENT DESIGN
A.0 MATRIX RESULTS
    A.1 MATRIX INVERSION LEMMA
    A.2 MATRIX DIFFERENTIATION
REFERENCES

NOMENCLATURE

SYMBOLS

It is impractical to list all of the symbols used in this document. The following are symbols of particular significance and those used consistently in large portions of the document. In several specialized situations, the same symbols are used with different meanings not included in this list.

A        stability matrix
B        control matrix
b        bias
C        state observation matrix
D        control observation matrix
E(.)     expected value
e        error vector
F(.)     system function
FF*      process noise covariance matrix
Fx(.)    probability distribution function of x
f(.)     system state function
GG*      measurement noise covariance matrix
g(.)     system observation function
h(.)     equation error function
J(.)     cost function
M        Fisher information matrix
m_ξ      prior mean of ξ
n        process noise vector
P        prior covariance of ξ, or covariance of filtered x
p(.)     probability density function of x, short notation
p_x(.)   probability density function of x, full notation
Q        covariance of predicted x
R        covariance of innovation
t        time
U        system input
u        dynamic system input vector
V        concatenated innovation vector
v        innovation vector
x        parameter vector in static models, or dynamic system state vector
Z        system response, or concatenated response vector
z        dynamic system response vector
Δ        sample interval
η        measurement noise vector
Φ        state transition matrix
Ψ        input transition matrix
ξ        vector of unknown parameters
Ξ        set of possible parameter values
ω        random noise vector
Ω        probability space
~        predicted estimate (in filtering contexts)
^        optimum (in optimization contexts), or estimate (in estimation contexts), or filtered estimate (in filtering contexts)
¯        smoothed estimate

Subscript

ξ        indicates dependence on ξ

Abbreviations and acronyms

arg max  value of x that maximizes the following function
  x
corr     correlation
cov      covariance
exp      exponential
ln       natural logarithm
MAP      maximum a posteriori probability
MLE      maximum-likelihood estimator
mse      mean-square error
var      variance

Mathematical notation

f(.)     the entire function f, as opposed to the value of the function at a particular point
*        transpose
∇_x      gradient with respect to the vector x (result is a row vector when the operand is a scalar, or a matrix when the operand is a column vector)
∇²_x     second gradient with respect to x
Σ        series summation
Π        series product
π        3.14159...
∪        set union
∩        set intersection
⊂        subset
∈        element of a set
{x: c}   the set of all x such that condition c holds
(.,.)    inner product
|        conditioned on (in probability contexts)
|.|      absolute value or determinant
d|.|     volume element
t+       right-hand limit at t
n-vector vector with n elements
x(i)     ith element of the vector x, or ith row of the matrix x

A lowercase subscript generally indicates an element of a sequence.

CHAPTER 1

1.0 INTRODUCTION

System identification is broadly defined as the deduction of system characteristics from measured data. It is commonly referred to as an inverse problem because it is the opposite of the problem of computing the response of a system with known characteristics. Gauss (1809, p. 85) refers to "the inverse problem, that is, when the true is to be derived from the apparent place." The inverse problem might be phrased as, "Given the answer, what was the question?" Phrased in such general terms, system identification is seen as a simple concept used in everyday life, rather than as an obscure area of mathematics.

Example 1.0-1 Weighing yourself.

The system is your body, and the characteristic of interest is your weight. You perform an experiment by placing the system on a mechanical transducer in the bathroom, which gives as output a position approximately proportional to the system mass and the local gravitational field. Based on previous comparisons with the doctor's scales, you know that your scale consistently reads 2 lb high, so you subtract this figure from the reading. The result is still somewhat higher than expected, so you step off of the scales and then repeat the experiment. The new reading is more "reasonable," and from it you obtain an estimate of the system mass.

This simple example actually includes several important principles of system identification; for instance, the resulting estimates are biased (as defined in Chapter 4).

Example 1.0-2 The "guess your weight" booth at the fair.

The weight guesser's instrumentation and estimation algorithm are more difficult to describe precisely, but they are used to solve the same system identification problem.

Example 1.0-3 Newton's deduction of the theory of gravity.

Newton's problem was much more difficult than the first two examples. He had to deduce not just a single number, but also the form of the equations describing the system. Newton was a true expert in system identification (among other things).

As apparent from the above examples, system identification is as much an art as a science. This point is often forgotten by scientists who prove elegant mathematical theorems about a model that doesn't adequately represent the true system to begin with. On the other hand, engineers who reject what they consider to be "ivory tower theory" are foregoing tools that could give definite answers to some questions, and hints to aid in the understanding of others.

System identification is closely tied to control theory, partially by some common methodology, and partially by the use of identified system models for control design. Before you can design a controller for a system, you must have some notion of the equations describing the system. Another common purpose of system identification is to help gain an understanding of how a system works. Newton's investigations were more along this line. (It is unlikely that he wanted to control the motion of the planets.) The application of system identification techniques is strongly dependent on the purpose for which the results are intended; radically different system models and identification techniques may be appropriate for different purposes related to the same system.

The aircraft control system designer will be unimpressed when given a model based on inputs that cannot be influenced, outputs that cannot be measured, aspects of the system that the designer does not want to control, and a complicated model in a form not amenable to control analysis techniques. The same model might be ideal for the aerodynamicist studying the flow around the vehicle. The first and most important step of any system identification application is to define its purpose.

Following this chapter's overview, this document presents one aspect of the science of system identification: the theory of statistical estimation. The theory's main purpose is to help the engineer understand the system, not to serve as a formula for consistently producing the required results. Therefore, our exposition of the theory, although rigorously defensible, emphasizes intuitive understanding rather than mathematical sophistication. The following comments of Luenberger (1969, p. 2) also apply to the theory of system identification:

Some readers may look with great expectation toward functional analysis, hoping to discover new powerful techniques that will enable them to solve important problems beyond the reach of simpler mathematical analysis. Such hopes are rarely realized in practice. The primary utility of functional analysis is its role as a unifying discipline, gathering a number of apparently diverse, specialized mathematical tricks into one or a few geometric principles.

...

With good intuitive understanding, which arises from such unification, the reader will be better equipped to extend the ideas to other areas where the solutions, although simple, were not formerly obvious.

The literature of the field often uses the terms "system identification," "parameter identification," and "parameter estimation" interchangeably. The following sections define and differentiate these broad terms. The majority of the literature in the field, including most of this document, addresses the field most precisely called parameter estimation.

1.1 SYSTEM IDENTIFICATION

W begin by phrasing the system i d e n t i f i c a t i o n problem i n formal mathematical terms. There are three e elements essential t o a system i d e n t i f i c a t i o n problem: a system, an experiment, and a response. W define e these elements here i n broad, abstract, s e t - t h e o r e t i c terns, before introducing more concrete f o m s i n Section 1.3. L e t U represent some experiment, taken from the s e t @ o f possible experiments on the system. U could represent a d i s c r e t e event, such as stepping on the scales; o r a value, such as a voltage applied. U could a l s o be a vector f u n c t i o n o f time, such as the motions o f the c o n t r o l surfaces w h i l e an a i r p l a n e i f;lown through a maljeuver. I n systems terminology. U i s the i n p u t t o the system. (We w i l l use the term5 input," "control, and "experiment" more o r l e s s interchangeably.) Observe the response Z o f the system,,to the experiment. As w i t h U. Z could be represented i n man:! forms i n c l u d i n g as a d i s c r e t e event (e.9.. the system blew up") o r as a measured time function. I t i s ,-n element o f the set @ o f possible responses. (We a l s o use the terms "output8' o r "measurement" f o r 2.) The abstract system i s a map ( f u n c t i o n ) F from the set o f possible experiments t o the set o f possible responses. F: that i s

F: 𝒰 → 𝒵                        (1.1-1)

The system identification problem is to reconstruct the function F from a collection of experiments U_i and the corresponding system responses Z_i. This is the purest form of the "black box" identification problem. We are asked to identify the system with no information at all about its internal structure, as if the system were in a black box which we could not see into. Our only information is the inputs and outputs. An obvious solution is to perform all of the experiments in 𝒰 and simply tabulate the responses. This is usually impossible because the set 𝒰 is too large (typically, infinite). Also, we may not have complete freedom in selecting the U_i. Furthermore, even if this approach were possible, the tabular format of the result would generally be inconvenient and of little help in understanding the structure of the system.
If we cannot perform all of the experiments in 𝒰, the system identification problem is impossible without further information. Since we have made no assumptions about the form of F, we cannot be sure of its behavior without checking every point.

Example 1.1-1  The input U and output Z of a system are both represented by real-valued scalar variables. When an input of 1.0 is applied, the output is 1.0. When an input of -1.0 is applied, the output is also 1.0. Without further information we cannot tell which of the following representations (or an infinite number of others) of the system is correct:

a) Z = 1, independent of U

[the intermediate candidates b) and c) appeared as sketches in the original figure and are not recoverable]

d) The response depends on the time interval between applying U and measuring Z, which we forgot to consider.

Example 1.1-2  The input and output of a system are scalar time functions on the interval (-∞,∞). When the input is cos(t), the output is sin(t). Without more information we cannot distinguish among:

a) z(t) = sin(t), independent of u

[the remaining candidate representations appeared as equations in the original figure and are not recoverable]

Example 1.1-3  The input and output of a system are integers in a finite range. For every input except U = 37, we measure the output and find it equal to the input. We have no mathematical basis for drawing any conclusion about the response to the input U = 37. We could guess that the output might be Z = 37, but there is no mathematical justification for this guess in the problem as formulated.

Our inability to draw any conclusions in the above examples (particularly Example (1.1-3), which seems so obvious intuitively) points out the inadequacy of the pure black-box statement of the system identification problem. We cannot reconstruct the function F without some guidance on choosing a particular function from the infinite number of functions consistent with the results of the experiments performed.

We have seen that the pure black box system identification problem, where absolutely no information is given about the internal structure of the system, is impossible to solve. The information needed to construct the system function F is thus composed of two parts: information which is assumed, and information which is deduced from the experimental data. These two information sources can closely interact. For instance, the experimental data could contradict the assumptions made, requiring a revision of the assumptions, or the data could be used to select one of a set of candidate assumptions (hypotheses). Such interaction tends to obscure the role of the assumptions, making it seem as though all of the information was obtained from the experimental data, and thus has a purely objective validity. In fact, this is never the case. Realistically, most of the information used for constructing the system function F will be assumptions based on knowledge of the nature of the physical processes of the system. System identification technology based on experimental data is used only to fill in the relatively small gaps in our knowledge of the system.
From this perspective, we recognize system identification as an extremely useful tool for filling in such knowledge gaps, rather than as a panacea which will automatically tell us everything we need to know about a system. The capabilities of some modern techniques may invite the view of system identification as a cure-all, because the underlying assumptions are subtle and seldom explicitly stated.

Example 1.1-4  Return to the problem of Example (1.1-3). Seemingly, not much knowledge of the behavior of the system is required to deduce that Z will be 37 when U = 37; indeed, many common system identification algorithms would make such a deduction. In fact, the assumptions made are numerous. The specification of the set of possible inputs and outputs already implies many assumptions about the system; for instance, that there are no transient effects, or that such effects are unimportant. The problem statement does not allow for an event such as the system output's oscillating through several values. We have also made an assumption of repeatability. Perhaps the same experiment redone tomorrow would produce different results, depending on some factor not considered. Encompassing all of the other assumptions is the assumption of simplicity. We have applied "Occam's Razor" and found the simplest system consistent with the data. (One can easily imagine useful systems that select specific inputs for special treatment. Nothing in the data has eliminated such systems.) We can see that the assumptions play the largest role in solving this problem. Granted the assumption that we want the simplest consistent result, the deduction from the data that Z = U is trivial.
Two general types of assumptions exist. The first consists of restrictions on the allowable forms of the function F. Presumably, such restrictions would reflect the knowledge of what functions are reasonable considering the physics of the system. The second type of assumption is some criterion for selecting a "best" function from those consistent with the experimental results. In the following sections, we will see that these two approaches are combined: restricting the set of functions considered, and then selecting a best choice from this set.

For physical systems, information about the general form of the system function F can often be derived from knowledge of the system. Specific numerical values, however, are sometimes prohibitively difficult to compute theoretically without making unacceptable approximations. Therefore, the most widely used area of system identification is the subfield called parameter identification.

1.2 PARAMETER IDENTIFICATION

In parameter identification, the form of the system function is assumed to be known. This function contains a finite number of parameters, the values of which must be deduced from experimental data. Let ξ be a vector with the unknown parameters as its elements. Then the system response Z is a known function of the input U and the parameter vector ξ. We can restate this in a more convenient, but completely equivalent way. For each value of the parameter vector ξ, the system response Z is a known function of the input U. (The function can be different for different values of ξ.) We say that the function is parameterized by ξ and write

Z = F_ξ(U)                      (1.2-1)

The function F_ξ(U) is referred to as the assumed system model. The subscript notation for ξ is used purely for convenience to indicate the special role of ξ. The function could be equivalently written as F(ξ,U). The parameter identification problem is then to deduce the value of ξ based on measurements of the responses Z_i to a set of inputs U_i. This problem of identifying the parameter vector ξ is much less ambitious than the system identification problem of constructing the entire F function from experimental data; it is more in line with the amount of information that reasonably can be expected to be obtained from experimental data. Deducing the value of ξ amounts to solving the following set of simultaneous and generally nonlinear equations:

Z_i = F_ξ(U_i)       i = 1,2,...,N       (1.2-2)

where N is the number of experiments performed. Note that the only variable in these equations is the parameter vector ξ. The U_i and Z_i represent the specific input used and response measured for the ith experiment. This is quite different from Equation (1.2-1), which expresses a general relationship among the three variables U, Z, and ξ.

Example 1.2-1  In the problem of Example (1.1-1), assume we are given that the response is a linear function of the input:

Z = F_ξ(U) = a_0 + a_1 U

The parameter vector is ξ = (a_0,a_1)*, the values of a_0 and a_1 being unknown. We were given that U = -1 and U = +1 both result in Z = 1; thus Equation (1.2-2) expands to

1 = F_ξ(-1) = a_0 - a_1
1 = F_ξ(+1) = a_0 + a_1

This system is easy to solve and gives a_0 = 1 and a_1 = 0. Thus we have F(U) = 1 (independent of U).
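The two simultaneous equations above are linear in (a_0, a_1) and can be solved numerically. A minimal sketch in Python (numpy assumed; the variable names are illustrative, not from the text):

```python
import numpy as np

# Equation (1.2-2) for the linear model Z = a0 + a1*U with the
# two experiments U = -1 and U = +1, both giving Z = 1:
#   1 = a0 - a1
#   1 = a0 + a1
F = np.array([[1.0, -1.0],   # row i is (1, Ui), so F @ [a0, a1] = Z
              [1.0,  1.0]])
Z = np.array([1.0, 1.0])

a0, a1 = np.linalg.solve(F, Z)
print(a0, a1)   # a0 = 1, a1 = 0, so F(U) = 1 independent of U
```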

Example 1.2-2  In the problem of Example (1.1-2), assume we know that the system can be represented with Z expressed as an explicit function of the input and its time derivative:

z(t) = a u(t) + b u̇(t)

The unknown parameter vector for this system is ξ = (a,b)*. Since u(t) = cos(t) resulted in z(t) = sin(t), Equation (1.2-2) becomes

sin(t) = a cos(t) - b sin(t)     for all t ∈ (-∞,∞)

This equation is uniquely solved by a = 0 and b = -1.

Example 1.2-3  In the problem of Example (1.1-3), assume that the system can be represented by a polynomial of order 10 or less:

Z = a_0 + a_1 U + a_2 U^2 + ... + a_10 U^10

The unknown parameter vector is ξ = (a_0,a_1,...,a_10)*. Using the experimental data described in Example (1.1-3), Equation (1.2-2) becomes

U_i = a_0 + a_1 U_i + a_2 U_i^2 + ... + a_10 U_i^10     for each input U_i ≠ 37

This system of equations is uniquely solved by a_0 = 0, a_1 = 1, and a_2 through a_10 all equaling 0.

As with any set of equations, there are three possible results from Equation (1.2-2). First, there can be a unique solution, as in each of the examples above. Second, there could be multiple solutions, in which case either more experiments must be performed or more assumptions would be necessary to restrict the set of allowable solutions or to pick a best solution in some sense. The third possibility is that there could be no solutions, the experimental data being inconsistent with the assumed equations. This situation will require a basic change in our way of thinking about the problem. There will almost never be an exact solution with real data, so the first two possibilities are somewhat academic. The remainder of the document, and Section 1.4 in particular, will address the general situation where Equation (1.2-2) need not have an exact solution. The possibilities of one or more solutions are part of the general case.

Example 1.2-4  In the problem of Example (1.1-1), assume we are given that the response is a quadratic function of the input:

Z = F_ξ(U) = a_0 + a_1 U + a_2 U^2

The parameter vector is ξ = (a_0,a_1,a_2)*. We were given that U = -1 and U = +1 both result in Z = 1. With these data Equation (1.2-2) expands to

1 = F_ξ(-1) = a_0 - a_1 + a_2
1 = F_ξ(+1) = a_0 + a_1 + a_2

From this information we can deduce that a_1 = 0, but a_0 and a_2 are not uniquely determined. The values might be determined by performing the experiment U = 0. Alternately, we might decide that the lowest order system consistent with the data available is preferred, giving a_2 = 0 and a_0 = 1.

Example 1.2-5  In the problem of Example (1.1-1), assume that the response is a linear function of the input. We were given that U = -1 and U = +1 both result in Z = 1. Suppose that the experiment U = 0 is performed and results in Z = 0.95. There are then no parameter values consistent with the data.
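Example 1.2-5 has no exact solution, but a least-squares fit, anticipating the approach of Section 1.4, is easy to compute. A sketch in Python (numpy assumed; illustrative only):

```python
import numpy as np

# Data from Example 1.2-5: (U, Z) pairs with no exactly consistent linear model.
U = np.array([-1.0, 1.0, 0.0])
Z = np.array([ 1.0, 1.0, 0.95])

# Model Z = a0 + a1*U; regressor matrix with rows (1, Ui).
F = np.column_stack([np.ones_like(U), U])

# The least-squares estimate minimizes the sum of squared residuals.
(a0, a1), *_ = np.linalg.lstsq(F, Z, rcond=None)
print(a0, a1)   # a0 ≈ 0.983, a1 ≈ 0: close to the constant model Z = 1
```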

1.3 TYPES OF SYSTEM MODELS

Although the basic concept of system modeling is quite general, more useful results can be obtained by examining specific types of system models. Clarity of exposition is also improved by using specific models, even when we can obtain the result in a more general context. This section describes some of the broad classes of system model forms which are often used in parameter identification.

1.3.1 Explicit Function

The most basic type of system model is the explicit function. The response Z is written as a known explicit function of the input U and the parameter vector ξ. This type of model corresponds exactly to Equation (1.2-1):

Z = F_ξ(U)                      (1.3-1)

In the simplest subset of the explicit function models, the response is a linear function of the parameter vector:

Z = f(U)ξ                       (1.3-2)

In this equation, f(U) is a matrix which is a known function (nonlinear in general) of the input. This is the type of model used in linear regression. Many systems can be put into this easily analyzed form, even though the systems might appear quite complex at first glance. A common example of a model linear in its parameters is a finite polynomial expansion of Z in terms of U:

Z = ξ_0 + ξ_1 U + ξ_2 U^2 + ... + ξ_n U^n

In this case, f(U) is the row vector (1, U, U^2,...,U^n). Note that Z is linear in the parameters ξ_j, but not in the input U.
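Models in the linear-in-parameters form Z = f(U)ξ are exactly what standard linear least-squares routines handle. A sketch in Python (numpy assumed; the quadratic model and synthetic data are invented for illustration):

```python
import numpy as np

# Model linear in the parameters xi but nonlinear in the input U:
#   Z = xi0 + xi1*U + xi2*U**2  (a finite polynomial expansion)
def f(U, n=2):
    """Row-vector regressor f(U) = (1, U, U**2, ..., U**n) for each input."""
    return np.vander(np.atleast_1d(U), n + 1, increasing=True)

# Synthetic experiments generated from known parameters.
xi_true = np.array([1.0, -2.0, 0.5])
U = np.linspace(-3.0, 3.0, 25)
Z = f(U) @ xi_true

# Recover xi by linear least squares on Z = f(U) xi.
xi_hat, *_ = np.linalg.lstsq(f(U), Z, rcond=None)
print(xi_hat)   # recovers (1, -2, 0.5) to numerical precision
```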

1.3.2 State Space

State-space models are very useful for dynamic systems; that is, systems with responses that are time functions. Wiberg (1971) and Zadeh and Desoer (1963) give general discussions of state-space models. Time can be treated as either a continuous or discretized variable in dynamic models; the theories of discrete- and continuous-time systems are quite different.

The general form for a continuous-time state-space model is

x(t_0) = x_0                         (1.3-3a)
ẋ(t) = f[x(t),u(t),t,ξ]             (1.3-3b)
z(t) = g[x(t),u(t),t,ξ]             (1.3-3c)

where f and g are arbitrary known functions. The initial condition x_0 can be known or can be a function of ξ. The variable x(t) is defined as the state of the system at time t. Equation (1.3-3b) is called the state equation, and (1.3-3c) is called the observation equation. The measured system response is z. The state is not considered to be measured; it is an internal system variable. However, g[x(t),u(t),t,ξ] = x(t) is a legitimate observation function, so the measurement can be equal to the state if so desired.

Discrete-time state-space models are similar to continuous-time models, except that the differential equations are replaced by difference equations. The general form is

x(t_0) = x_0                                  (1.3-4a)
x(t_{i+1}) = f[x(t_i),u(t_i),t_i,ξ]           (1.3-4b)
z(t_i) = g[x(t_i),u(t_i),t_i,ξ]               (1.3-4c)

The system variables are defined only at the discrete times t_i.

This document is largely concerned with continuous-time dynamic systems described by differential equations (1.3-3b). The system response, however, is measured at discrete time points, and the computations are done in a digital computer. Thus, some features of both discrete- and continuous-time systems are pertinent. The system equations are

x(t_0) = x_0                                            (1.3-5a)
ẋ(t) = f[x(t),u(t),t,ξ]                                 (1.3-5b)
z(t_i) = g[x(t_i),u(t_i),t_i,ξ]     i = 1,2,...         (1.3-5c)

The response z(t_i) is considered to be defined only at the discrete time points t_i, although the state x(t) is defined in continuous time.

We will see that the theory of parameter identification for continuous-time systems with discrete observations is virtually identical to the theory for discrete-time systems in spite of the superficial differences in the system equation forms. The theory of continuous-time observations requires much deeper mathematical background and will only be outlined in this document. Since practical application of the algorithms developed generally requires a digital computer, the continuous-time theory is of secondary importance.
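As a concrete illustration of these forms, the discrete-time equations (1.3-4) can be simulated directly. The following Python sketch is illustrative only; the particular f, g, and parameter values are invented for the example, not taken from the text:

```python
import numpy as np

def f(x, u, t, xi):
    """State equation (1.3-4b): an illustrative first-order nonlinear model."""
    return xi[0] * x + xi[1] * np.tanh(u)

def g(x, u, t, xi):
    """Observation equation (1.3-4c): the state plus a direct input term."""
    return x + xi[2] * u

def simulate(x0, u_seq, t_seq, xi):
    """Propagate x(t_{i+1}) = f[x(t_i),u(t_i),t_i,xi] and record z(t_i)."""
    x, z = x0, []
    for u, t in zip(u_seq, t_seq):
        z.append(g(x, u, t, xi))
        x = f(x, u, t, xi)
    return np.array(z)

xi = (0.9, 1.0, 0.1)   # illustrative parameter vector
z = simulate(0.0, u_seq=np.ones(5), t_seq=np.arange(5), xi=xi)
print(z)
```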
An important subset of systems described by state-space equations is the set of linear dynamic systems. Although the equations are sometimes rewritten in forms convenient for different applications, all linear dynamic system models can be written in the following forms. The continuous-time form is

x(t_0) = x_0                    (1.3-6a)
ẋ(t) = Ax(t) + Bu(t)            (1.3-6b)
z(t) = Cx(t) + Du(t)            (1.3-6c)

The matrix A is called the stability matrix, B is called the control matrix, and C and D are called the state and control observation matrices, respectively. The discrete-time form is

x(t_0) = x_0                            (1.3-7a)
x(t_{i+1}) = Φx(t_i) + Ψu(t_i)          (1.3-7b)
z(t_i) = Cx(t_i) + Du(t_i)              (1.3-7c)

The matrices Φ and Ψ are called the system transition matrices. The form for continuous systems with discrete observations is identical to Equation (1.3-6), except that the observation is defined only at the discrete time points. In all three forms, A, B, C, D, Φ, and Ψ are matrix functions of the parameter vector ξ. These matrices are functions of time in general, but for notational simplicity, we will not explicitly indicate the time dependence unless it is important to a discussion.

The continuous-time and discrete-time state-equation forms are closely related. In many applications, the discrete-time form of Equation (1.3-7) is used as a discretized approximation to Equation (1.3-6). In this case, the transition matrices Φ and Ψ are related to the A and B matrices by the equations

Φ = exp(AΔ)                         (1.3-8a)
Ψ = [∫_0^Δ exp(Aτ) dτ]B             (1.3-8b)

where Δ = t_{i+1} - t_i is the sample interval. We discuss this relationship in more detail in Section 7.5. In a similar manner, Equation (1.3-4) is sometimes viewed as an approximation to Equation (1.3-3). Although the principle in the nonlinear case is the same as in the linear case, we cannot write precise expressions for the relationship in such simple closed forms as in the linear case.
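Numerically, Φ and Ψ can be obtained together from a single matrix exponential of an augmented matrix (a standard device, assumed here rather than taken from the text). A sketch in Python with numpy and scipy assumed:

```python
import numpy as np
from scipy.linalg import expm

def discretize(A, B, dt):
    """Compute Phi = exp(A*dt) and Psi = [integral of exp(A*tau) dtau] B
    by exponentiating the augmented matrix [[A, B], [0, 0]]."""
    n, m = B.shape
    M = np.zeros((n + m, n + m))
    M[:n, :n] = A
    M[:n, n:] = B
    E = expm(M * dt)
    return E[:n, :n], E[:n, n:]   # Phi, Psi

# Illustrative scalar system xdot = -x + u, sampled at dt = 0.1.
A = np.array([[-1.0]])
B = np.array([[1.0]])
Phi, Psi = discretize(A, B, 0.1)
print(Phi, Psi)   # Phi = exp(-0.1); Psi = 1 - exp(-0.1)
```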

Standardized canonical forms of the state-space equations (Wiberg, 1971) play an important role in some approaches to parameter estimation. We will not emphasize canonical forms in this document. The basic theory of parameter identification is the same, whether canonical forms are used or not. In some applications, canonical forms are useful, or even necessary. Such forms, however, destroy any internal relationship between the model structure and the system, retaining only the external response characteristics. Fidelity to the internal as well as to the external system characteristics is a significant aid to engineering judgment and to the incorporation of known facts about the system, both of which play crucial roles in system identification. For instance, we might know the values of many locations of the A matrix in its "natural" form. When the A matrix is transformed to a canonical form, these simple facts generally become unwieldy equations which cannot reasonably be used. When there is little useful knowledge of the internal system structure, the use of canonical forms becomes more appropriate.

1.3.3 Others

Other types of system models are used in various applications. This document will not cover them explicitly, but many of the ideas and results from explicit function and state-space models can be applied to other model types. One of these alternate model classes deserves special mention because of its wide use. This is the class of auto-regressive moving average (ARMA) models and related variants (Hajdasinski, Eykhoff, Damen, and van den Boom, 1982). Discrete-time ARMA models are in the general form of a difference equation relating the current response to a finite number of past responses and of current and past inputs,

z(t_i) = A_1 z(t_{i-1}) + ... + A_p z(t_{i-p}) + B_0 u(t_i) + B_1 u(t_{i-1}) + ... + B_q u(t_{i-q})

Discrete-time ARMA models can be readily rewritten as linear state-space models (Schweppe, 1973), so all of the theory which we will develop for state-space models is directly applicable.
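As an illustration of this rewriting, the following Python sketch (numpy assumed) converts a scalar second-order ARMA-type difference model to a companion state-space form of Equation (1.3-7) and verifies that the two simulations agree. The particular construction is one standard choice, not necessarily the one Schweppe gives:

```python
import numpy as np

# Illustrative scalar difference model:
#   z(i) = a1*z(i-1) + a2*z(i-2) + b0*u(i)
a1, a2, b0 = 0.5, -0.3, 1.0

# Companion-form state x(i) = [z(i-1), z(i-2)]^T gives the linear
# state-space form x(i+1) = Phi x(i) + Psi u(i), z(i) = C x(i) + D u(i).
Phi = np.array([[a1, a2],
                [1.0, 0.0]])
Psi = np.array([[b0],
                [0.0]])
C = np.array([[a1, a2]])
D = np.array([[b0]])

# Simulate both forms on the same input sequence and compare.
rng = np.random.default_rng(0)
u = rng.standard_normal(50)

z_arma = np.zeros(52)                 # z_arma[0], z_arma[1] are zero initial lags
for i in range(2, 52):
    z_arma[i] = a1 * z_arma[i - 1] + a2 * z_arma[i - 2] + b0 * u[i - 2]

x = np.zeros((2, 1))                  # zero initial state matches zero lags
z_ss = []
for ui in u:
    z_ss.append((C @ x + D * ui).item())
    x = Phi @ x + Psi * ui

print(np.allclose(z_arma[2:], z_ss))  # the two forms agree (prints True)
```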

1.4 PARAMETER ESTIMATION

The examples in Section 1.2 were carefully chosen to have exact solutions. Real data is seldom so obliging. No matter how careful we have been in selecting the form of the assumed system model, it will not be an exact representation of the system. The experimental data will not be consistent with the assumed model form for any value of the parameter vector ξ. The model may be close, but it will not be exact, if for no other reason than that the measurements of the response will be made with real, and thus imperfect, instruments.

The theoretical development seems to have arrived at a cul-de-sac. The black box system identification problem was not feasible because there were too many solutions consistent with the data. To remove this difficulty, it was necessary to assume a model form and define the problem as parameter identification. With the assumed model, however, there are no solutions consistent with the data.
We need to retain the concept of an assumed model structure in order to reduce the scope of the problem, yet avoid the inflexibility of requiring that the model exactly reproduce the experimental data. We do this by using the assumed model structure, but acknowledging that it is imperfect. The assumed model structure should include the essential characteristics of the true system. The selection of these essential characteristics is the most significant engineering judgment in system analysis. A good example is Gauss' (1809, p. xi) justification that the major axis of a cometary ellipse is not an essential parameter, and that a simplified parabolic model is therefore appropriate:

There existed, in point of fact, no sufficient reason why it should be taken for granted that the paths of comets are exactly parabolic: on the contrary, it must be regarded as in the highest degree improbable that nature should ever have favored such an hypothesis. Since, nevertheless, it was known, that the phenomena of a heavenly body moving in an ellipse or hyperbola, the major axis of which is very great relatively to the parameter, differs very little near the perihelion from the motion in a parabola of which the vertex is at the same distance from the focus; and that this difference becomes the more inconsiderable the greater the ratio of the axis to the parameter: and since, moreover, experience has shown that between the observed motion and the motion computed in the parabolic orbit, there remained differences scarcely ever greater than those which might safely be attributed to errors of observation (errors quite considerable in most cases): astronomers have thought proper to retain the parabola, and very properly, because there are no means whatever of ascertaining satisfactorily what, if any, are the differences from a parabola.

Chapter 11 discusses some aspects of this selection, including theoretical aids to making such judgments.

Given the assumed model structure, the primary question is how to treat imperfections in the model. We need to determine how to select the value of ξ which makes the mathematical model the "best" representation of the essential characteristics of the system. We also need to evaluate the error in the determination of ξ due to the unmodeled effects present in the experimental data. These needs introduce several new concepts.

One concept is that of a "best" representation as opposed to the correct representation. It is often impossible to define a single correct representation, even in principle, because we have acknowledged the assumed model structure to be imperfect and we have constrained ourselves to work within this structure. Thus ξ does not have a correct value. As Acton (1970) says on this subject,

A favorite form of lunacy among aeronautical engineers produces countless attempts to decide what differential equation governs the motion of some physical object, such as a helicopter rotor.... But arguments about which differential equation represents truth, together with their fitting calculations, are wasted time.

Example 1.4-1  Estimating the radius of the Earth. The Earth is not a perfect sphere and, thus, does not have a radius. Therefore, the problem of estimating the radius of the Earth has no correct answer. Nonetheless, a representation of the Earth as a sphere is a useful simplification for many purposes.

Even the concept of the "best" representation overstates the meaning of our estimates because there is no universal criterion for defining a single best representation (thus our quotes around "best"). Many system identification methods establish an optimality criterion and use numerical optimization methods to compute the optimal estimates as defined by the criterion; indeed most of this document is devoted to such optimal estimators or approximations to them. To be avoided, however, is the common attitude that optimal (by some criterion) is synonymous with correct, and that any nonoptimal estimator is therefore wrong. Klein (1975) uses the term "adequate model" to suggest that the appropriate judgment on an identified model is whether the model is adequate for its intended purpose.

In addition to these concepts of the correct, best, or adequate values of ξ, we have the somewhat related issue of errors in the determination of ξ caused by the presence of unmodeled effects in the experimental data. Even if a correct value of ξ is defined in principle, it may not be possible to determine this value exactly from the experimental data due to contamination of the data by unmodeled effects.

We can now define the task as to determine the best estimate of ξ obtainable from the data, or perhaps an adequate estimate of ξ, rather than to determine the correct value of ξ. This revised problem is more properly called parameter estimation than parameter identification. (Both terms are often used interchangeably.) Implied subproblems of parameter estimation include the definition of the criteria for best or adequate, and the characterization of potential errors in the estimates.

Example 1.4-2  Reconsider the problem of Example (1.2-5). Although there is no linear model exactly consistent with the data, modeling the output as a constant value of 1 appears a reasonable approximation and agrees exactly with two of the three data points.

One approach to parameter estimation is to minimize the error between the model response and the actual measured response, using a least squares or some similar ad hoc criterion. The values of the parameter vector ξ which result in the minimum error are called the best estimates. Gauss (1809, p. 162) introduced this idea:

Finally, as all our observations, on account of the imperfection of the instruments and of the senses, are only approximations to the truth, an orbit based only on the six absolutely necessary data may still be liable to considerable errors. In order to diminish these as much as possible, and thus to reach the greatest precision attainable, no other method will be given except to accumulate the greatest number of the most perfect observations, and to adjust the elements, not so as to satisfy this or that set of observations with absolute exactness, but so as to agree with all in the best possible manner.

This approach is easy to understand without extensive mathematical background, and it can produce excellent results. It is restricted to deterministic models so that the model response can be calculated.
An alternate approach to parameter estimation introduces probabilistic concepts in order to take advantage of the extensive theory of statistical estimation. We should note that, from Gauss's time, these two approaches have been intimately linked. The sentence immediately following the above exposition in Theoria Motus (Gauss, 1809, p. 162) is

For which purpose, we will show in the third section how, according to the principles of the calculus of probabilities, such an agreement may be obtained, as will be, if in no one place perfect, yet in all places the strictest possible.

In the statistical approach, all of the effects not included in the deterministic system model are modeled as random noise; the characteristics of the noise and its position in the system equations vary for different applications. The probabilistic treatment solves the perplexing problem of how to examine the effect of the unmodeled portion of the system without first modeling it. The formerly unmodeled portion is modeled probabilistically, which allows description of its general characteristics, such as magnitude and frequency content, without requiring a detailed model. Systems such as this, which involve both time and randomness, are referred to as stochastic systems. This document will examine a small part of the extensive theory of stochastic systems, which can be used to define estimates of the unknown parameters and to characterize the properties of these estimates.

Although this document will devote significant time to the treatment of the probabilistic approach, this approach should not be oversold. It is currently popular to disparage model-fitting approaches as nonrigorous and without theoretical basis. Such attitudes ignore two important facts. First, in many of the most common situations, the "sophisticated" probabilistic approach arrives at the same estimation algorithm as the model-fitting approaches. This fact is often obscured by the use of buzz words and unenlightening notation, apparently for fear that the theoretical effort will be considered as wasted. Our view is that such relationships should be emphasized and clearly explained. The two approaches complement each other, and the engineer who understands both is best equipped to handle real world problems. The model-fitting approach gives good intuitive understanding of such problems as modeling error, algorithm convergence, and identifiability, among others. The probabilistic approach contributes quantitative characterization of the properties of the estimates (the accuracy), and an understanding of how these characteristics are affected by various factors.

The second fact ignored by those who disparage model fitting is that the probabilistic approach involves just as many (or more) unjustified ad hoc assumptions. Behind the smug front of mathematical rigor and sophistication lie patently ridiculous assumptions about the system. The contaminating noise seldom has any of the characteristics (Gaussian, white, etc.) assumed simply in order to get results in a usable form. More basic is the fact that the contaminating noise is not necessarily random noise at all.
It is a composite of all of the otherwise unmodeled portions of the system output, some of which might be "truly" random (deferring the philosophical question of whether truly random events exist), but some of which are certainly deterministic, even at the macroscopic level. In light of this consideration, the "rigor" of the probabilistic approach is tarnished from the start, no matter how precise the inner mathematics. Contrary to the impressions often given, the probabilistic approach is not the single correct answer, but is one of the possible avenues that can give useful results, making on the average as many unjustified or blatantly false assumptions as the alternatives. Bayes (1736, p. 9), in an essay reprinted by Barnard (1958), made a classical statement on the role of assumptions in mathematics:
It is not the business of the Mathematician to dispute whether quantities do in fact ever vary in the manner that is supposed, but only whether the notion of their doing so be intelligible; which being allowed, he has a right to take it for granted, and then see what deductions he can make from that supposition...He is not inquiring how things are in matter of fact, but supposing things to be in a certain way, what are the consequences to be deduced from them; and all that is to be demanded of him is, that his suppositions be intelligible, and his inferences just from the suppositions he makes.

The demands on the applications engineer are somewhat different, and more in line with Bayes' (1763, p. 50) later statement in the same document: So far as Mathematics do not tend to make men more sober and rational thinkers, wiser and better men, they are only to be considered as an amusement, which ought not to take us off from serious business. A few words are necessary in defense of the probabilistic approach, lest the reader decide that it is not worthwhile to pursue. The main issue is the description of deterministic phenomena as random. This disagrees with common modern perceptions of the meaning and use of randomness for physical situations, in which random and deterministic phenomena are considered as quite distinct and well delineated. Our viewpoint owes more to the earlier philosophy of probability theory: that it is a useful tool for studying complicated phenomena, which need not be inherently random (if anything is inherently random). Cramer (1946, p. 141) gives a classic exposition of this philosophy: [The following is descriptive of] large and important groups of random experiments. Small variations in the initial state of the observed units, which cannot be detected by our instruments, may produce considerable changes in the final result. The complicated character of the laws of the observed phenomena may render exact calculation practically, if not theoretically, impossible. Uncontrollable action by small disturbing factors may lead to irregular deviations from a presumed "true value".

...

It is, of course, clear that there is no sharp distinction between these various modes of randomness. Whether we ascribe e.g. the fluctuations observed in the results of a series of shots at a target mainly to small variations in the initial state of the projectile, to the complicated nature of the ballistic laws, or to the action of small disturbing factors, is largely a matter of taste. The essential thing is that, in all cases where one or more of these circumstances are present, an exact prediction of the results of individual experiments becomes impossible, and the irregular fluctuations characteristic of random experiments will appear.
We shall now see that, in cases of this character, there appears amidst all irregularity of fluctuations a certain typical form of regularity that will serve as the basis of the mathematical theory of statistics.

The probabilistic methods allow quantitative analysis of the general behavior of these complicated phenomena, even though we are unable to model the exact behavior.

1.5 OTHER APPROACHES

Our aim in this document is to present a unified viewpoint of the system identification ideas leading to maximum-likelihood estimation of the parameters of dynamic systems, and of the application of these ideas. There are many completely different approaches to identification of dynamic systems. There are innumerable books and papers in the system identification literature. Eykhoff (1974) and Astrom and Eykhoff (1970) give surveys of the field. However, much of the work in system identification is published outside of the general body of system identification literature. Many techniques have been developed for specific areas of application by researchers oriented more toward the application area than toward general system identification problems. These specialized techniques are part of the larger field of system identification, although they are usually not labeled as such. (Sometimes they are recognizable as special cases or applications of more general results.) In the area most familiar to us, aircraft stability and control derivatives were estimated from flight data long before such estimation was classified as a system identification problem (Doetsch, 1953; Etkin, 1958; Flack, 1959; Greenberg, 1951; Rampy and Berry, 1964; Wolowicz, 1966; and Wolowicz and Holleman, 1958).
We do not even attempt here the monumental task of surveying the large body of system identification techniques. Suffice it to say that other approaches exist, some explicitly labeled as system identification techniques, and some not so labeled. We feel that we are better equipped to make a useful contribution by presenting, in an organized and comprehensible manner, the viewpoint with which we are most familiar. This orientation does not constitute a dismissal of other viewpoints.
We have sometimes been asked to refute claims that, in some specific application, a simple technique such as regression obtained superior results to a "sophisticated" technique bearing impressive-sounding credentials as an optimal nonlinear maximum-likelihood estimator. The implication is that simple is somehow synonymous with poor, and sophisticated is synonymous with good, associations that we completely disavow. Indeed, the opposite association seems more often appropriate, and we try to present the maximum-likelihood estimator in a simple light. We believe that these methods are all tools to be used when they help do the job. We have used quotations from Gauss several times in this chapter to illustrate his insight into what are still some of the important issues of the day, and we will close the chapter with yet another (Gauss, 1809, p. 108):

...we hope, therefore, it will not be disagreeable to the reader, that, besides the solution to be given hereafter, which seems to leave nothing further to be desired, we have thought proper to preserve also the one of which we have made frequent use before the former suggested itself to me. It is always profitable to approach the more difficult problems in several ways, and not to despise the good although preferring the better.

CHAPTER 2

2.0 OPTIMIZATION METHODS

Most of the estimators in this book require the minimization or maximization of a nonlinear function. Sometimes we can write an explicit expression for the minimum or maximum point. In many cases, however, we must use an iterative numerical algorithm to find the solution. Therefore a background in optimization methods is mandatory for appreciation of the various estimators. Optimization is a major field in its own right and we do not attempt a thorough treatment or even a survey of the field in this chapter. Our purpose is to briefly introduce a few of the optimization techniques most pertinent to parameter estimation. Several of the conclusions we draw about the relative merits of various algorithms are influenced by the general structure of parameter estimation problems and, thus, might not be supportable in a broader context of optimizing arbitrary functions. Numerous books such as Rao (1979), Luenberger (1969), Luenberger (1972), Dixon (1972), and Polak (1971) cover the detailed derivation and analysis of the techniques discussed here and others. These books give more thorough treatments of the optimization methods than we have room for here, but are not oriented specifically to parameter estimation problems. For those involved in the application of estimation theory, and particularly for those who will be writing computer programs for parameter estimation, we strongly recommend reading several of these books. The utility and efficiency of a parameter estimation program depend strongly on its optimization algorithms. The material in this chapter should be sufficient for a general understanding of the problems and the kinds of algorithms used, but not for the details of efficient application.
The basic optimization problem is to find the value of the vector x that gives the smallest or largest value of the scalar-valued function J(x). By convention we will talk about minimization problems; any maximization problem can be made into an equivalent minimization problem by changing the sign of the function. We will follow the widespread practice of calling the function to be minimized a cost function, regardless of whether or not it really has anything to do with monetary cost. To formalize the definition of the problem, a function J(x) is said to have a minimum at x̂ if

    J(x̂) ≤ J(x)                                                    (2.0-1)

for all x. This is sometimes called an unconstrained global minimum to distinguish it from local and constrained minima, which are defined below. Two kinds of side constraints are sometimes placed on the problem. Equality constraints are in the form

    gᵢ(x) = 0                                                      (2.0-2)

Inequality constraints are in the form

    hᵢ(x) ≤ 0                                                      (2.0-3)

The gᵢ and hᵢ are scalar-valued functions of x. There can be any number of constraints on a problem. A value of x is called admissible if it satisfies all of the constraints; if a value violates any of the constraints it is inadmissible. The constraints modify the problem statement as follows: x̂ is the constrained minimum of J(x) if x̂ is admissible and if Equation (2.0-1) holds for all admissible x.

Two crucial questions about any optimization problem are whether a solution exists and whether it is unique. These questions are important in application as well as in theory. A computer program can spend a long time searching for a solution that does not exist. A simple example of an optimization problem with no solution is the unconstrained minimization of J(x) = x. A problem can also fail to have a solution because there is no x satisfying the constraints. We will say that a problem that has no solution is ill-posed. A simple problem with a nonunique solution is the unconstrained minimization of J(x) = (x₁ - x₂)², where x is a 2-vector; J is minimized at every point of the line x₁ = x₂.

All of the algorithms that we discuss (and most other algorithms) search for a local minimum of the function, rather than the global minimum. A local minimum (also called a relative minimum) is defined as follows: x̂ is a local minimum of J(x) if a scalar ε > 0 exists such that

    J(x̂) ≤ J(x̂ + h)                                                (2.0-4)

for all h with |h| < ε. To define a constrained local minimum, we must add the qualifications that x̂ and x̂ + h satisfy the constraints. The term "extremum" refers to either a local minimum or a local maximum. Figure (2.0-1) illustrates a problem with three local minima, one of which is the global minimum. Note that if a global minimum exists, even if it is not unique, it is also a local minimum. The converse to this statement is false; the existence of a local minimum does not even imply that a global minimum exists. We can sometimes prove that a function has only one local minimum point, and that this point is also the global minimum. When we lack such proofs, there is no universal way to guarantee that the local minimum found by an algorithm is the global minimum. A reasonable check for iterative algorithms is to try the algorithm with many different starting values widely distributed within the realm of possible values. If the algorithm consistently converges to the same point, that point is probably the global minimum. The cost of such a test, however, is often prohibitively high. The likelihood of local minima difficulties varies widely depending on the application. In some applications we can prove that there are no local minima except at the unique global minimum. At the other extreme, some applications are plagued by numerous local minima to the extent that most minimization algorithms are

worthless. Most applications lie between these extremes. We can often argue convincingly that a particular answer must be the global minimum, even when rigorous proof is impractical. The algorithms in this chapter are, with a few exceptions, iterative. Given some starting value x₀, the algorithms give a procedure for computing a new value x₁; then x₂ is computed from x₁, etc. The intent of the iterative algorithms is to create a sequence xᵢ that converges to the minimum. The starting value can be from an independent estimate of a reasonable answer, or it can come from a special start-up algorithm. The final step of any iterative algorithm is testing convergence. After the algorithm has proceeded for some time, we need to choose among the following alternatives: 1) the algorithm has converged to a value sufficiently close to the true minimum and should therefore be terminated; 2) the algorithm is making acceptable progress toward the solution and should be continued; 3) the algorithm is failing to converge or is converging too slowly to obtain a solution in an acceptable time, and it should therefore be abandoned; or 4) the algorithm is exhibiting behavior that suggests that switching to a different algorithm (or modifying the current one) might be productive. This decision is far from trivial because some algorithms can essentially stall at a point far from any local minimum, making such small changes in xᵢ that they appear to have converged. We have briefly mentioned the problems of existence and uniqueness of solutions, local minima, starting values, and convergence tests. These are major issues in practical application, but we will not examine them further here.
The references contain considerable discussion of these issues.
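The multiple-starting-value check described above is easy to sketch in a few lines. In the following sketch the cost function, the crude fixed-step descent used as a local minimizer, and the sampling range are all illustrative assumptions, not taken from the text:

```python
# Sketch of the multiple-starting-value check for a global minimum.
# The cost function, the crude fixed-step descent used as a local
# minimizer, and the sampling range are illustrative assumptions.
import random

def J(x):
    # One-dimensional cost with two local minima; the global one is near x = -2.03.
    return (x**2 - 4.0)**2 + x

def local_minimize(x, step=1e-3, iters=20000):
    # Crude gradient descent with a central finite-difference slope.
    eps = 1e-6
    for _ in range(iters):
        slope = (J(x + eps) - J(x - eps)) / (2.0 * eps)
        x -= step * slope
    return x

random.seed(0)
starts = [random.uniform(-4.0, 4.0) for _ in range(20)]
minima = sorted(round(local_minimize(x0), 3) for x0 in starts)

# The starts split between the two local minima; the smallest cost value
# among the distinct convergence points identifies the probable global minimum.
candidates = sorted(set(minima), key=J)
print(candidates)
```

Here the starts do not all converge to the same point, so the distinct convergence points must be compared by cost value; only when every start converges to one point does the check suggest a unique global minimum.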

A cost function of an N-dimensional x vector can be visualized as a hypersurface in (N + 1)-dimensional space. For illustrating the behavior of the various algorithms, we will use isocline plots of cost functions of two variables. An isocline is the locus of all points in the x-space corresponding to some specified cost function value. The isoclines of positive definite quadratic functions are always ellipses. Furthermore, a quadratic function is completely specified by one of its isoclines and the fact that it is quadratic. Two-dimensional examples are sufficient to illustrate most of the pertinent points of the algorithms.
We will consider unconstrained minimization problems, which illustrate the basic points necessary for our purposes. The references address problems with equality and inequality constraints.

2.1 ONE-DIMENSIONAL SEARCHES

Optimization methodology is strongly influenced by whether or not x is a scalar. Because the optimization problems in this book are generally multi-dimensional, the methods applicable only to scalar x are not directly relevant. Many of the multi-dimensional optimization algorithms, however, require the solution of one-dimensional subproblems as part of the larger algorithm. Most such subproblems are in the form of minimizing the multi-dimensional cost function with x constrained to a line in the multi-dimensional space. This has the superficial appearance of a multi-dimensional problem, and furthermore one with the added complications of constraints. To clarify the one-dimensional nature of these subproblems, express them as follows: the vector x is restricted to a line defined by

    x = x₀ + λx₁                                                   (2.1-1)

where x₀ and x₁ are fixed vectors, and λ is a scalar variable representing position along the line. Restricted to this line, the cost can be written as a function of λ:

    g(λ) ≡ J(x₀ + λx₁)                                             (2.1-2)
The function g(λ) is a scalar function of a scalar variable, and one-dimensional minimization algorithms apply directly. Substituting the minimizing value of λ into Equation (2.1-1) then gives the minimizing point along the line in the space of x. We will not examine the one-dimensional search algorithms in this book. Several of the references have good treatments of the subject. We will note that most of the relevant one-dimensional algorithms involve approximating the function by a low-order polynomial based on the values of the function and its first and second derivatives at one or more points. The minimum point of the polynomial, which can be explicitly evaluated, replaces one of the original points, and the process repeats. The distinguishing features of the algorithms are the order of the polynomial, the number of points, and the order of the derivatives of J(x) evaluated. Variants of the algorithms depend on start-up procedures and methods for selecting the point to be replaced. In some special cases we can solve the one-dimensional minimization problems explicitly by setting the derivative to zero, or by other means, even when we cannot explicitly solve the encompassing multi-dimensional problem. Several of our examples of multi-dimensional algorithms will use explicit solutions of the one-dimensional subproblems to avoid getting bogged down in detail. Real problems seldom will be so conveniently amenable to exact solution of the one-dimensional subproblems, except where the multi-dimensional problem could be directly solved without resort to iterative methods. Iterative one-dimensional searches are usually necessary with any method that involves one-dimensional subproblems. We will encounter one of the rare exceptions in the estimation of variance.
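A minimal sketch of such a polynomial-fit line search follows, minimizing g(λ) = J(x₀ + λx₁) by repeated quadratic interpolation through three points. The cost function, the vectors x₀ and x₁, and the starting bracket are illustrative assumptions:

```python
# Sketch of a polynomial-fit one-dimensional search: minimize
# g(lam) = J(x0 + lam*x1) by repeated quadratic interpolation through
# three points. The cost function, the vectors x0 and x1, and the
# starting bracket are illustrative assumptions.
import math

def J(x):
    # A smooth convex example cost on R^2.
    return math.exp(0.5 * x[0]) + x[0]**2 + 2.0 * x[1]**2

x0 = [1.0, 2.0]    # fixed point on the line
x1 = [-1.0, -2.0]  # fixed direction vector

def g(lam):
    return J([x0[0] + lam * x1[0], x0[1] + lam * x1[1]])

a, b, c = 0.0, 0.5, 2.0   # three values of lambda bracketing the minimum
for _ in range(50):
    ga, gb, gc = g(a), g(b), g(c)
    # Vertex of the parabola through (a,ga), (b,gb), (c,gc).
    num = (b - a)**2 * (gb - gc) - (b - c)**2 * (gb - ga)
    den = (b - a) * (gb - gc) - (b - c) * (gb - ga)
    if den == 0.0:
        break
    lam = b - 0.5 * num / den
    # Replace the worst of the four points; keep the rest ordered by position.
    a, b, c = sorted(sorted([a, b, c, lam], key=g)[:3])

lam_best = min((a, b, c), key=g)
print(round(lam_best, 4))
```

This is the quadratic (three-point, zeroth-derivative) member of the family described above; variants using derivative values or cubic fits follow the same replace-and-repeat pattern.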

2.2 DIRECT METHODS

Optimization methods that do not require the evaluation of derivatives of the cost function are called direct methods or zero-order methods (because they use up to zeroth order derivatives). These methods use only the cost function values. Axial iteration, also called the univariate method or coordinate descent, is the basis for many of the direct methods. In this method we search along each of the coordinate directions of the x-space, one at a time. Starting with the point x₀, fix the values of all but the first coordinate, reducing the problem to one-dimensional minimization. Solve this problem using any one-dimensional algorithm. Call the resulting point x₁. Then fix the first coordinate at the value so determined and do a similar search along the direction of the second coordinate, giving the point x₂. Continue these one-dimensional searches until each of the N coordinate directions has been searched; the final point of this process is x_N.

The point x_N completes the first cycle of minimization. Repeat this cycle starting from the point x_N instead of x₀. Continue repeating the minimization cycle until the process converges (or until you give up, which may well come first).

The performance of the axial iteration algorithm on most problems is unacceptably poor. The algorithm performs well only when the minimum point along each axis is nearly independent of the values of the other coordinates.

Example 2.2-1  Use axial iteration to minimize J(x,y) = A(x - y)² + B(x + y)². The solution is the trivially obvious (0,0), but the problem is good for illustrating the behavior of algorithms in a simple case. Instead of using a one-dimensional search procedure, we will explicitly solve the one-dimensional subproblems. For any fixed y, obtain the minimizing x coordinate value by setting the derivative to zero

    0 = ∂J/∂x = 2A(x - y) + 2B(x + y)

giving

    x = ((A - B)/(A + B)) y

Similarly, for fixed x, the minimizing y value is

    y = ((A - B)/(A + B)) x
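The cycle is easy to simulate with these explicit one-dimensional minimizers. In the sketch below, the values of A and B and the starting point are illustrative assumptions:

```python
# Sketch of axial iteration for Example (2.2-1), J(x,y) = A(x-y)^2 + B(x+y)^2,
# using the explicit one-dimensional minimizers derived above. The values of
# A and B and the starting point are illustrative assumptions.
A, B = 10.0, 1.0
r = (A - B) / (A + B)   # each axis search multiplies the other coordinate by r

x, y = 1.0, 0.5
for cycle in range(20):
    x = r * y   # minimize over x with y fixed
    y = r * x   # minimize over y with x fixed

# Each full cycle contracts the point only by a factor r^2 = ((A-B)/(A+B))^2,
# which approaches 1 as A/B grows; for A = B, r = 0 and one cycle suffices.
print(round(x, 6), round(y, 6))
```

With A/B = 10 the contraction factor per cycle is about 0.67, so twenty cycles are needed to reduce the coordinates by roughly three orders of magnitude.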
We see that for A >> B, the values of x and y descend slowly toward the true minimum at (0,0). Figure (2.2-1) illustrates this behavior on an isocline plot. Note that if A = B (the cost function isocline is circular) the exact minimum is obtained in one cycle, but as A/B increases the performance worsens. Several modifications to the basic axial iteration method are available to improve its performance. Some of these modifications exploit the notion of the pattern direction, the direction from the beginning point x_iN of a cycle to the end point x_(i+1)N of the same cycle. Figure (2.2-2) illustrates the pattern direction, which tends to point in the general direction of the minimum. Powell's method is the most powerful of the direct methods that search along pattern directions. See the references for details.

2.3 GRADIENT METHODS

Optimization methods that use the first derivative (gradient) of the cost function are called gradient methods or first-order methods. Gradient methods require that the cost function be differentiable; most of the cost functions considered in this book meet this requirement. The gradient methods generally converge in fewer iterations than many of the direct methods because the gradient methods use more information in each iteration. (There are exceptions, particularly when comparing simple-minded gradient methods with the most powerful of the direct methods.) The penalty paid for the generally improved performance of the gradient methods compared with the direct methods is the requirement to evaluate the gradient. We define the gradient of the function J(x) with respect to x to be the row vector

    ∇ₓJ(x) = (∂J(x)/∂x₁, ∂J(x)/∂x₂, ..., ∂J(x)/∂x_N)               (2.3-1)

(Some texts define it as a column vector; the difference is inconsequential as long as one is consistent.)

A reasonable estimate of the computational cost of evaluating the gradient is N times the cost of evaluating the function. This estimate follows from the fact that the gradient can be approximately evaluated by N finite differences

    [∇ₓJ(x)]ᵢ ≈ [J(x + εeᵢ) - J(x)]/ε                              (2.3-2)

where eᵢ is the unit vector along the xᵢ axis and ε is a small number. In special cases, there can be expressions for the gradient that cost significantly less than N function evaluations. Equation (2.3-2) somewhat obscures the distinction between the gradient methods and the direct methods. We can rewrite any gradient method in a finite-difference form that does not explicitly involve gradients. There is, nonetheless, a fairly clear distinction between methods derived from gradient ideas and methods derived from direct search ideas. We will retain this philosophical distinction regardless of whether the gradients are evaluated explicitly or by finite differences. The method of steepest descent (also called the gradient method) involves a series of one-dimensional searches, as did the axial-iteration method and its variants. In the steepest-descent method, these searches

are along the direction of the negative of the gradient vector, evaluated at the current point. The one-dimensional problem is to find the value of λ that minimizes

    g(λ) = J(xᵢ + λsᵢ)                                             (2.3-3)

where sᵢ is the search direction given by

    sᵢ = -[∇ₓJ(xᵢ)]*                                               (2.3-4)

The negative of the gradient is the direction of steepest local descent of the cost function (thus the name of the method). To prove this property, first note that for any vector s we have

    (d/dλ) J(x + λs)|λ=0 = (s, [∇ₓJ(x)]*)                          (2.3-5)

We are using the (.,.) notation for the inner product

    (x,y) = x*y                                                    (2.3-6)

Equation (2.3-5) is a generalization of the definition of the gradient; it applies in spaces where Equation (2.3-1) is not meaningful. We then need only show that, if s is restricted to be a unit vector, Equation (2.3-5) is minimized by choosing s in the direction of -[∇ₓJ(x)]*. This follows immediately from the Cauchy-Schwartz inequality (Luenberger, 1969) of linear algebra.

Theorem 2.3-1 (Cauchy-Schwartz)  (x,y)² ≤ |x|²|y|² with equality if and only if x = αy for some scalar α.

Proof  The theorem is trivial if y = 0. For y ≠ 0 examine

    0 ≤ |x + λy|² = |x|² + 2λ(x,y) + λ²|y|²                        (2.3-7)

Choose

    λ = -(x,y)/|y|²

Substitute into Equation (2.3-7) and rearrange to give

    (x,y)² ≤ |x|²|y|²

Equality holds if and only if x + λy = 0 in Equation (2.3-7), which will be true if and only if x = αy (λ will then be -α).
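The conclusion can be checked numerically: among unit vectors s, the inner product of s with the gradient is smallest when s points opposite the gradient. In the sketch below, the particular gradient value and the discrete angular sampling are illustrative assumptions:

```python
# Numerical check of the steepest-descent property: over unit vectors s,
# the inner product (s, gradient) is smallest when s points opposite the
# gradient. The gradient value and the angular sampling are illustrative.
import math

grad = (3.0, -4.0)           # some gradient value; |grad| = 5
best_s, best_val = None, float("inf")
for k in range(3600):        # unit vectors around the circle, 0.1 deg apart
    th = 2.0 * math.pi * k / 3600.0
    s = (math.cos(th), math.sin(th))
    val = s[0] * grad[0] + s[1] * grad[1]
    if val < best_val:
        best_s, best_val = s, val

# Cauchy-Schwartz bounds (s, grad) below by -|s||grad| = -5, attained
# (to sampling resolution) at s = -grad/|grad| = (-0.6, 0.8).
print(round(best_val, 3), round(best_s[0], 3), round(best_s[1], 3))
```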
On the surface, the steepest descent property of the method seems to imply excellent performance in minimizing the cost function value. The direction of steepest descent, however, is a local property which might point far from the direction of the global minimum. It is thus often a poor choice of search direction. Direct methods such as Powell's often converge more rapidly than steepest descent. The steepest descent method performs worst in long narrow valleys of the cost function. It is also sensitive to scaling. These two difficulties are closely related; rescaling a problem can easily create long narrow valleys. The following examples illustrate the scaling and valley difficulties:

Example 2.3-1  Let the cost function be

    J(x) = ½(x₁² + x₂²)

The steepest descent method works excellently for this cost function (so does almost every optimization method). The gradient of J(x) is

    ∇ₓJ(x) = (x₁, x₂)

Therefore, from any starting point, the negative of the gradient points exactly at the origin, which is the global minimum. The minimum will be attained exactly (or to the accuracy of the one-dimensional search methods used) in one iteration. Figure (2.3-1) illustrates the algorithm starting from the point (1,1)*.

Example 2.3-2  Rescale the preceding example by replacing x₁ by 0.1x₁. (Perhaps we just redefined the units of x₁ to be millimeters instead of centimeters.) The cost function is then

    J(x) = ½(0.01x₁² + x₂²)

and the gradient is

    ∇ₓJ(x) = (0.01x₁, x₂)

Figure (2.3-2) shows the search direction used by the algorithm starting from the point (10,1)*, which corresponds to the point (1,1)* in the previous

example. The search direction points almost 90° from the origin. A careless glance at Figure (2.3-2) invites the conclusion that the minimum in the search direction will be on the x₁ axis and thus that the second iteration of the steepest descent algorithm will attain the minimum. It is true that the minimum is close to the x₁ axis, but it is not exactly on the axis; the distinction makes an important difference in the algorithm's performance. For points x - λ[∇ₓJ(x)]* along the search direction from any point (x₁,x₂)*, the cost function is

    g(λ) = J(x - λ[∇ₓJ(x)]*) = ½[0.01x₁²(1 - 0.01λ)² + x₂²(1 - λ)²]

The minimum of g(λ) is at

    λ̂ = (0.0001x₁² + x₂²)/(0.000001x₁² + x₂²)

and thus the minimum point along the search direction is

    (x₁ - 0.01x₁λ̂, x₂ - x₂λ̂)*

with λ̂ defined as above. Figure (2.3-3) shows several iterations of this process starting from the point (10,1)*. The trend of the algorithm is clear; every two iterations it moves essentially halfway to the solution. Consider the behavior starting from the point (10,0.1)* instead of (10,1)*:
This behavior, plotted in Figure (2.3-4), is abysmal. The algorithm is bouncing back and forth across the valley, making little progress toward the minimum. Several modifications to the steepest descent method are available to improve its performance. A rescaling step to eliminate valleys caused by scaling yields major improvements for some problems. The method of parallel tangents (PARTAN method) exploits pattern directions similar to those discussed in Section 2.2; searches in such pattern directions are often called acceleration steps. The conjugate gradient method is the most powerful of the modifications to steepest descent. The references discuss these and other gradient algorithms in detail.

2.4 SECOND ORDER METHODS

Optimization methods that use the second derivative (or an approximation to it) of the cost function are called second order methods. These methods require that the first and second derivatives of the cost function exist.

2.4.1 Newton-Raphson

The Newton-Raphson optimization algorithm (also called Newton's method) is the basis for all of the second order methods. The idea of this algorithm is to approximate the cost function by the first three terms of its Taylor series expansion about the current point xᵢ:

    J₁(x) = J(xᵢ) + ∇ₓJ(xᵢ)(x - xᵢ) + ½(x - xᵢ)*[∇ₓ²J(xᵢ)](x - xᵢ)  (2.4-1)

From a geometric viewpoint, this equation describes the paraboloid that best approximates the function near xᵢ. Equating the gradient of J₁(x) to zero gives an equation for the minimum point of the approximating function. Taking this gradient, note that ∇ₓJ(xᵢ) and ∇ₓ²J(xᵢ) are evaluated at the fixed point xᵢ and thus are not functions of x.

    ∇ₓJ₁(x) = ∇ₓJ(xᵢ) + (x - xᵢ)*[∇ₓ²J(xᵢ)]                         (2.4-2)

The solution is

    x̂ = xᵢ - [∇ₓ²J(xᵢ)]⁻¹[∇ₓJ(xᵢ)]*                                 (2.4-3)
If the second gradlent o f J l s p o s l t i v e d e f l n l t e , then Equation (2.4-3) gives the exact un4. .,. mlnlrmm o f t h e approxlmatlng functlon; i t :s a reasonable guess a t an approximate mlnlmum o f the o r i g l n a junction. I f the second gradlent i s not p o s i t l v e d e f l n l t e , then the approxlmatlng function does n o t have a unlque mlnlmum and the algorlthm l s l l k e l y t o perform poorly. The Newton-Raphson a l g o r l t h n uses Equatlon (2.4-3) l t e r a t l v e l y ; t h e x from t h i s equatlon I s the s t a r t l n g p o i n t f a r the next I t e r a t i o n . The algorithm I s xXtl
I

xl

- [V:J(X~)I-'V~J(X~)

(2.4-4)
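As a concrete sketch of the iteration in Equation (2.4-4), the following uses a hypothetical two-variable cost with analytic first and second gradients; the function, starting point, and tolerances are illustrative assumptions, not an example from the text.

```python
import numpy as np

# Hypothetical smooth cost with a unique minimum at x = (0, 2):
# J(x) = x0^4 + x0^2 + (x1 - 2)^2, with analytic gradients, as
# Equation (2.4-4) requires.
def grad(x):
    return np.array([4.0 * x[0]**3 + 2.0 * x[0], 2.0 * (x[1] - 2.0)])

def hess(x):
    return np.array([[12.0 * x[0]**2 + 2.0, 0.0],
                     [0.0, 2.0]])

def newton_raphson(x, n_iter=20, tol=1e-12):
    for _ in range(n_iter):
        # x_{i+1} = x_i - [second gradient]^-1 (first gradient);
        # solve the linear system rather than forming the inverse.
        step = np.linalg.solve(hess(x), grad(x))
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

x_min = newton_raphson(np.array([3.0, -1.0]))
```

Because this particular second gradient is positive definite everywhere, the iteration settles near (0, 2) in a handful of steps, illustrating the rapid convergence discussed below.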

The performance of this algorithm in the close neighborhood of a strict local minimum is unexcelled; this performance represents an ideal toward which other algorithms strive. The Newton-Raphson algorithm attains the exact (except for numerical round-off errors) minimum of any positive-definite quadratic function in a single iteration. Convergence within 5 to 10 iterations is common on some practical nonquadratic problems with several dozen dimensions; direct and gradient methods typically count iterations in hundreds and thousands for such problems and settle for less accurate answers. See the references for analyses of convergence characteristics.

Three negative features of the Newton-Raphson algorithm balance its excellent convergence near the minimum. First is the behavior of the algorithm far from the minimum. If the initial estimate is far from the minimum, the algorithm often converges erratically or even diverges. Such problems are often associated with second gradient matrices that are not positive definite. Because of this problem, it is common to use special start-up procedures to get within the area where Newton-Raphson performs well. One such procedure is to start with a gradient method, switching to Newton-Raphson near the minimum. There are many other start-up procedures, and they play a key role in successful applications of the Newton-Raphson algorithm.

The second negative feature of the Newton-Raphson method is the computational cost and complexity of evaluating the second gradient matrix. The magnitude of this difficulty varies widely among applications. In some special cases the second gradient is little harder to compute than the first gradient; Newton-Raphson, perhaps with a start-up procedure, is a good choice for such applications. If, at the other extreme, you are reduced to finite-difference computation of the second gradient, Davidon-Fletcher-Powell (Section 2.4.4) is probably a more appropriate algorithm. In evaluating the computational burden of Newton-Raphson and other methods, remember that Newton-Raphson requires no one-dimensional searches. Equation (2.4-4) constitutes the entire algorithm. The one-dimensional searches required by most other algorithms can account for a majority of their computational cost.

The third negative feature of the Newton-Raphson algorithm is the necessity to invert the second gradient matrix (or at least to solve the set of linear equations involving the matrix). The computer time required for the inversion is seldom an issue; this time is typically small compared to the time required to evaluate the second gradient. Furthermore, the algorithm converges quickly enough that if one linear system solution per iteration is a large fraction of the total cost, then the total cost must be low, even if the linear system is on the order of 100-by-100. The crucial issue concerning the inversion of the second gradient is the possibility that the matrix could be singular or ill-conditioned. We will discuss singularities in Section 2.4.3.

2.4.2 Invariance

The Newton-Raphson algorithm has far less difficulty with long narrow valleys of the cost function than does the steepest-descent method. This difference is related to an invariance property of the Newton-Raphson algorithm. Invariance of minimization algorithms is a useful concept which many texts mention briefly, if at all. We will therefore elaborate somewhat on the subject.

The examples in the section on steepest descent illustrate a strong link between scaling and narrow valleys. Scaling changes can easily create such valleys. Therefore we can generally state that minimization methods that are sensitive to scaling changes are likely to behave poorly in narrow valleys. This reasoning suggests a simple criterion for evaluating optimization algorithms: a good optimization algorithm should be invariant under scaling changes. This principle is almost so self-evident as to be unworthy of mention. The user of a program would be justifiably disgruntled if an algorithm that worked in the English Gravitational System (Imperial System) of units failed when applied to the same problem expressed in metric units (or vice versa). Someone trying to duplicate reported results would be perplexed by data published in metric units which could be duplicated only by converting to English Gravitational System units, in which the computation was really done. Nonetheless, many common algorithms, including the steepest descent method, fail to exhibit invariance under scaling.

The criterion is neither necessary nor sufficient. It is easy to construct ridiculous algorithms that are invariant to scale changes (such as the algorithm that always returns the value zero), and scale-sensitive algorithms like the steepest descent method have achieved excellent results in some applications. It is safe to state, however, that you can usually improve a good scale-sensitive algorithm by making it scale-invariant. An initial step that rescales the problem can effectively make the steepest-descent method scale-invariant (although such a step destroys a different invariance property of the steepest-descent method: invariance under rotation of coordinates). Rescaling a problem can be done manually by the user, or it can be an automatic part of an algorithm; automatic rescaling has the obvious advantage of being easier for the user, and a secondary advantage of allowing dynamic scaling changes as the algorithm proceeds.

We can extend the idea of invariance beyond scale changes. In general, we would like an algorithm to be invariant under the largest possible set of transformations. A justification for this criterion is that almost any complicated minimization problem can be expressed as some transformation (possibly quite complicated) of a simpler problem. We can sometimes use such transformations to simplify the solution of the original problems. Often it is more difficult to do the transformation than to solve the original optimization problem. Even if we cannot do the transformations, we can use the concept to conclude that an optimization algorithm invariant over a large class of transformations is likely to work on a large class of problems. The Newton-Raphson algorithm is invariant under all invertible linear transformations. This is the widest invariance property that we can usually achieve.
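The contrast between a scale-invariant and a scale-sensitive method can be demonstrated numerically. In this sketch (the quadratic cost and the rescaling matrix are hypothetical illustrations), the Newton-Raphson step maps consistently through a change of units, while the steepest-descent direction does not.

```python
import numpy as np

# Quadratic cost J(x) = 1/2 x* Q x with badly scaled axes (a narrow
# valley); this Q is an illustrative choice.
Q = np.diag([100.0, 1.0])
x = np.array([1.0, 1.0])

grad_x = Q @ x                             # first gradient in x units
newton_x = -np.linalg.solve(Q, grad_x)     # Newton-Raphson step
descent_x = -grad_x                        # steepest-descent step

# Change of units: x = T z (rescale the first variable by a factor 10).
T = np.diag([0.1, 1.0])
z = np.linalg.solve(T, x)
Qz = T.T @ Q @ T                           # same cost in z units
grad_z = Qz @ z
newton_z = -np.linalg.solve(Qz, grad_z)
descent_z = -grad_z

# Map the z-unit steps back to x units for comparison.
newton_back = T @ newton_z
descent_back = T @ descent_z
```

Here newton_back equals newton_x (the same step regardless of units), whereas descent_back is not even parallel to descent_x: the steepest-descent direction depends on the choice of units.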

The scale-invariance of the Newton-Raphson algorithm can be partially nullified by poor choice of matrix inversion (or linear system solution) algorithms. We have assumed exact arithmetic in the preceding discussion of scale-invariance. Some matrix inversion routines are sensitive to scaling effects. Inversion based on Cholesky factorization (Wilkinson, 1965, and Acton, 1970) is a good, easily implemented method for symmetric matrices (the second gradient is always symmetric), and is insensitive to scaling. Alternatively, prescale the matrix by using its diagonal elements.

2.4.3 Singularities

The second gradient matrix used in the Newton-Raphson algorithm is positive definite in a region near a strict local minimum. Ideally, the start-up procedure will reach such a region, and the Newton-Raphson algorithm will then converge without needing to contend with singularities. This viewpoint is overly optimistic; singular or ill-conditioned matrices (the difference is largely academic) arise in many situations. In the following discussion, we discount the effects of scaling. Matrices that have large condition numbers because of scaling do not represent intrinsically ill-conditioned problems, and do not require the techniques discussed in this section.

In some situations, the second gradient matrix is exactly singular for all values of x; two columns (and rows) are identical or a column (and corresponding row) is zero. These simple singularities occur regularly even in complex nonlinear problems. They often result from errors in the problem formulation, such as minimizing with respect to a parameter that is irrelevant to the cost function.

In the more general case, the second gradient is singular (or ill-conditioned) at some points but not at others. Whenever we use the term singular in the following discussion, we implicitly mean singular or ill-conditioned. Because of this definition, there will be vaguely defined regions of singularity rather than isolated points. The consequences of singularities are different depending on whether or not they are near the minimum.
Singularities far from the minimum pose no basic theoretical difficulties. There are several practical methods for handling such singularities. One method is to use a gradient algorithm (or any other algorithm unaffected by such singularities) until x is out of the region of singularity. We can also use this method if the second gradient matrix has negative eigenvalues, whether the matrix is ill-conditioned or not. If the matrix has negative eigenvalues, the Newton-Raphson algorithm is likely to behave poorly. (It could even converge to a local maximum.) The second gradient is always positive semi-definite in a region around a local minimum, so negative eigenvalues are only a consideration away from the minimum.

Another method of handling singularities is to add a small positive definite matrix to the second gradient before inversion. We can also use this method to handle negative eigenvalues, if the added matrix is large enough. This method is closely related to the previous suggestion of using a gradient algorithm. If the added matrix is a large constant times an identity matrix, the Newton-Raphson algorithm, so modified, gives a small step in the negative gradient direction. For small constants, the algorithm has characteristics between those of steepest descent and Newton-Raphson. The computational cost of this method is high; in essence, we are getting performance like steepest descent while paying the computational cost of Newton-Raphson. Even small additions to the second derivative matrix can dramatically change the convergence behavior of the Newton-Raphson algorithm. We should therefore discontinue this modification when out of the region of singularity. The advantage of this method is its simplicity; excluding the tests for when the matrix is ill-conditioned, this modification can be done in two short lines of FORTRAN code.

The last method is to use a pseudo-inverse (rank-deficient solution). Penrose (1955), Aoki (1967), Luenberger (1969), Wilkinson and Reinsch (1971), Moler and Stewart (1973), and Garbow, Boyle, Dongarra, and Moler (1977) discuss pseudo-inverses in detail. The basic idea of the pseudo-inverse method is to ignore the directions in the x-space corresponding to zero eigenvalues (within some tolerance) of the second gradient. In the parameter estimation context, such directions represent parameters, or combinations of parameters, about which the data give little information. Lacking any information to the contrary, the method leaves such parameter combinations unchanged from their initial values. The pseudo-inverse method does not address the problem of negative eigenvalues, but it is popular in a large class of applications where negative eigenvalues are impossible. The method is easy to implement, being only a rewrite of the matrix-inversion or linear-system-solution subroutine. It also has a useful property absent from the other proposed methods: it does not affect the Newton-Raphson algorithm when the matrix is well-conditioned. Therefore one can freely apply this method without testing whether it is needed. (It is true that condition tests in some form are part of a pseudo-inverse algorithm, but such tests are at a lower level, contained within the pseudo-inverse subroutine.)
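A rank-deficient (pseudo-inverse) step for a symmetric second gradient can be sketched as follows; the eigenvalue tolerance and the singular example are illustrative assumptions, not values from the text.

```python
import numpy as np

def pseudo_inverse_step(hess, grad, tol=1e-10):
    """Rank-deficient solution of (hess)(step) = grad for a symmetric
    matrix: directions whose eigenvalues fall below the tolerance are
    ignored, leaving those parameter combinations unchanged."""
    w, v = np.linalg.eigh(hess)          # hess = v @ diag(w) @ v.T
    keep = w > tol * w.max()
    inv_w = np.where(keep, 1.0 / np.where(keep, w, 1.0), 0.0)
    return v @ (inv_w * (v.T @ grad))

# Hypothetical singular case: the cost ignores the second parameter,
# so the corresponding row and column of the second gradient are zero.
H = np.array([[2.0, 0.0], [0.0, 0.0]])
g = np.array([4.0, 0.0])
step = pseudo_inverse_step(H, g)
```

The step moves only the well-determined parameter; and when the matrix is well-conditioned, the routine reduces to the ordinary linear-system solution, so it can be applied without testing whether it is needed.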

Singularities near the minimum require special consideration. The excellent convergence of Newton-Raphson near the minimum is the primary reason for using the algorithm. If we significantly slow the convergence near the minimum, there is little argument for using Newton-Raphson. The use of a pseudo-inverse can handle singularities while maintaining the excellent convergence; the pseudo-inverse is thus an appropriate tool for this purpose.

Although pseudo-inverses handle the computational problems, singularities near the minimum also raise theoretical and application issues. Such a singularity indicates that the minimum point is poorly defined. The cost function is essentially flat in at least one direction from the minimum, and the minimum value of the cost function might be attained to machine accuracy by widely separated points. Although the algorithm converges to a minimum point, it might be the wrong minimum point if the minimum is flat. If the only goal is to minimize the cost function, any minimizing point might be acceptable. In the applications of this book, minimizing the cost function is only a means to an end; the desired output is the value of x. If multiple solutions exist, the problem statement is incomplete or faulty. We strongly advise avoiding the routine use of pseudo-inverses or other computational machinations to "solve" uniqueness problems. If the basic problem statement is faulty, no numerical trick will solve it.

The pseudo-inverse works by changing the problem statement of the inversion, adding the stipulation that the inverse have minimum norm. The interpretation of this stipulation is vague in the context of the optimization problem (unless the cost function is quadratic, in which case it specifies the solution nearest the starting point). If this stipulation is a reasonable addition to the problem statement, then the pseudo-inverse is an appropriate tool. This decision can have significant effects. For a nonquadratic cost function, for example, there might be large differences in the solution point, depending on small changes in the starting point, the data, or the algorithm. The pseudo-inverse can be a good diagnostic tool for getting the information needed to revise the problem statement, but one should not depend upon it to solve the problem autonomously. The analyst's strong point is in formulating the problem; the computer's strength is in crunching numbers to arrive at the solution. A failure in either role will compromise the validity of the solution. This statement is but a rephrasing of the computer cliche "garbage in, garbage out," which has been said many more times than it has been heard.

2.4.4 Quasi-Newton Methods

Quasi-Newton methods are intended for problems where explicit evaluation of the second gradient of the cost function is complicated or costly, but the performance of the Newton-Raphson algorithm is desired. These methods form approximations to the second-gradient matrix using the first-gradient values from several iterations. The approximation to the second gradient then substitutes for the exact second gradient in Equation (2.4-4). Some of the methods directly form approximations of the inverse of the second-gradient matrix, avoiding the cost and some of the problems of matrix inversion.

Note that as long as the approximation to the second-gradient matrix is positive definite, Equation (2.4-4) can never converge to any point with a nonzero first gradient. Therefore approximations to the second gradient, no matter how poor, cannot affect the solution point. The approximations can greatly change the speed of convergence and the area of acceptable starting values. Approximations to the first gradient would affect the solution point as well.

The steepest descent method can be considered as the crudest of the quasi-Newton methods, using a constant times the identity matrix as the approximation to the second gradient. The performance of the quasi-Newton methods approaches that of Newton-Raphson as the approximation to the second gradient improves. The Davidon-Fletcher-Powell method (variable metric method) is the most popular quasi-Newton method. See the references for discussions of these methods.

2.5 SUMS OF SQUARES

The algorithms discussed in the previous sections are generally applicable to any minimization problem. By tailoring algorithms to special characteristics of specific problem classes, we can often achieve far better performance than by using the general purpose algorithms. Many of the cost functions arising in estimation problems have the form of sums of squares. The general sums-of-squares form is

    J(x) = ½ Σᵢ [fᵢ(x)]* Wᵢ [fᵢ(x)]    (2.5-1)

The fᵢ are vector-valued functions of x, and the Wᵢ are weightings. To simplify some of the formulae, we assume that the Wᵢ are symmetric. This assumption does not really restrict the application because we can always substitute ½(Wᵢ + Wᵢ*) for nonsymmetric Wᵢ without changing the function values. In most applications, the Wᵢ are positive semi-definite; this is not a requirement, but we will see that it helps ensure that the stationary points encountered are local minima. The form of Equation (2.5-1) is common enough to merit special study.

The summation sign in Equation (2.5-1) is somewhat superfluous in that any function in the form of Equation (2.5-1) can be rewritten in an equivalent form without the summation sign. This can be done by concatenating the individual fᵢ(x) vectors into a single, longer f(x) vector and making a corresponding large W matrix with the Wᵢ matrices on diagonal blocks. The only difference is in the notation. We choose the longer notation with the summation sign because it more directly corresponds with the way many parameter estimation problems are naturally phrased.
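A small numeric check of this equivalence can be sketched as follows; the residual vectors and weightings are arbitrary illustrations, and the ½ factor is immaterial to the equivalence itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical residual vectors f_i and weightings W_i, standing
# in for the terms of the sum evaluated at some point x.
f1, f2 = rng.standard_normal(2), rng.standard_normal(3)
W1 = np.eye(2)
W2 = np.diag([1.0, 2.0, 0.5])

# Summation form: J = 1/2 * sum_i f_i* W_i f_i.
J_sum = 0.5 * (f1 @ W1 @ f1 + f2 @ W2 @ f2)

# Equivalent concatenated form: one long f and a block-diagonal W.
f = np.concatenate([f1, f2])
W = np.block([[W1, np.zeros((2, 3))],
              [np.zeros((3, 2)), W2]])
J_cat = 0.5 * (f @ W @ f)
```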

Several of the algorithms discussed in the previous two sections work well with the form of Equation (2.5-1). For any reasonable fᵢ functions, Equation (2.5-1) defines a cost function that is well approximated by quadratics over fairly large regions. Since many of the general minimization schemes are based on quadratic approximations, application of these schemes to Equation (2.5-1) is natural. This statement does not imply that there are never problems minimizing Equation (2.5-1); the problems are sometimes severe, but the odds of success with reasonable effort are much better than they are for arbitrary cost function forms. Although the general methods are usable, we can exploit the problem structure to do better.

2.5.1 Linear Case

If the fᵢ functions in Equation (2.5-1) are linear, then the cost function is exactly quadratic and we can express the minimum point in closed form. In particular, let the fᵢ be the arbitrary linear functions

    fᵢ(x) = Aᵢx + bᵢ    (2.5-2)

Equation (2.5-1) then becomes

    J(x) = ½ Σᵢ (Aᵢx + bᵢ)* Wᵢ (Aᵢx + bᵢ)    (2.5-3)

Equating the gradient of Equation (2.5-3) to zero gives

    Σᵢ Aᵢ*Wᵢ(Aᵢx + bᵢ) = 0    (2.5-4)

Solving for x gives

    x̂ = -[Σᵢ Aᵢ*WᵢAᵢ]⁻¹ Σᵢ Aᵢ*Wᵢbᵢ    (2.5-5)

assuming that the inverse exists. If the inverse exists, then Equation (2.5-5) gives the only stationary point of Equation (2.5-3). This stationary point must be a minimum if all the Wᵢ are positive semi-definite, and it must be a maximum if all the Wᵢ are negative semi-definite. (We leave the straightforward proofs as an exercise.) If the Wᵢ meet neither of these conditions, the stationary point can be a minimum, a maximum, or a saddle point.
If the inverse in Equation (2.5-5) does not exist, then there is a line (at least) of solutions to Equation (2.5-4). All of these points are stationary points of the cost function. Use of a pseudo-inverse will produce the solution with minimum norm, but this is usually a poor idea (see Section 2.4.3).
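A minimal numeric sketch of the closed-form solution of Equation (2.5-5), assuming the linear form fᵢ(x) = Aᵢx + bᵢ; the particular Aᵢ, bᵢ, and Wᵢ below are illustrative choices.

```python
import numpy as np

# Hypothetical linear terms f_i(x) = A_i x + b_i with positive
# semi-definite weightings W_i.
A1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b1 = np.array([-1.0, -2.0, 0.5])
W1 = np.eye(3)
A2 = np.array([[2.0, -1.0]])
b2 = np.array([0.3])
W2 = np.array([[4.0]])

# Stationary point: x = -[sum A_i* W_i A_i]^-1 sum A_i* W_i b_i
M = A1.T @ W1 @ A1 + A2.T @ W2 @ A2
v = A1.T @ W1 @ b1 + A2.T @ W2 @ b2
x_hat = -np.linalg.solve(M, v)

# The gradient of the quadratic cost vanishes at x_hat.
grad = A1.T @ W1 @ (A1 @ x_hat + b1) + A2.T @ W2 @ (A2 @ x_hat + b2)
```

By construction the gradient at x_hat is zero, and because both weightings are positive semi-definite the stationary point is the minimum.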

2.5.2 Nonlinear Case

If the fᵢ are nonlinear, there is no simple, closed-form solution like Equation (2.5-5). A natural question in such situations, in which there is an easy method to handle linear equations, is whether we can merely linearize the nonlinear equations and use the linear methodology. Such linearization does not give an acceptable closed-form solution to the current problem, but it does form the basis for an iterative method. Define the linearization of fᵢ about any point xⱼ as

    fᵢ⁽ʲ⁾(x) = fᵢ(xⱼ) + [∇ₓfᵢ(xⱼ)](x - xⱼ) = Aᵢ⁽ʲ⁾x + bᵢ⁽ʲ⁾    (2.5-6)

where

    Aᵢ⁽ʲ⁾ = ∇ₓfᵢ(xⱼ)    and    bᵢ⁽ʲ⁾ = fᵢ(xⱼ) - [∇ₓfᵢ(xⱼ)]xⱼ    (2.5-7)

Equation (2.5-5), with the Aᵢ⁽ʲ⁾ and bᵢ⁽ʲ⁾ substituted for Aᵢ and bᵢ, gives the stationary point of the cost with the linearized fᵢ functions. This point is not, in general, a solution to the nonlinear problem. If, however, xⱼ is close to the solution, then Equation (2.5-5) should give a point closer to the solution, because the linearization will give a good representation of the cost function in the region around xⱼ. The iterative algorithm resulting from this concept is as follows: First, choose a starting value x₁. The closer x₁ is to the correct solution, the better the algorithm is likely to work. Then define revised xⱼ values by

    xⱼ₊₁ = xⱼ - [Σᵢ Aᵢ⁽ʲ⁾*WᵢAᵢ⁽ʲ⁾]⁻¹ Σᵢ Aᵢ⁽ʲ⁾*Wᵢfᵢ(xⱼ)    (2.5-8)

This equation comes from s u b s t i t u t i n g Equation (2.5-7) i n t o Equation (2.5-5) and simplifying. I t e r a t e Equa t i o n (2.5-8) u n t i l i t converges by some c r i t e r i o n . o r u n t i l you give up. This method i s o f t e n c a l l e d quasil i n e a r i z a t i o n because i t i s based on l i n e a r i z a t i o n not o f the cost function i t s e l f , but c f f a c t o r s i n the cost function. Ye made several vague, unsupported statements i n the process o f d e r i v i n g t h i s a l g o r i t h . Ye now need t o analyze the algorithm's performance and compare i t w i t h the performance of the algorithms discussed i n the frevious sections. This task i s g r e a t l y s i n p l i f i e d by n o t i n g t h a t Equation (2.5-8) defines a quasi-Newton alrorithm. To show thi:, we can w r i t e the f i r s t and second gradients o f Equation (2.5-1):

(We have not previously introduced the d e f i n i t i o n o f t h e second gradient o f a vector, as ir the v i f i ( x ) above. The r e s u l t i s t e c h n i c a l l y a tenser, b u t we w i l l not need t o consider i t i n d e t a i l here.) Comparing w see t h a t the only difference between quasie Equation (2.5-8) w i t h Equations (2.4-4). (2.5-9). and (2.5-!0). l i n e a r i z a t i o n and Newton-Raphson i s t h a t q u a s i - l i n e a r i z a t i o n has dropped the second term In Equation (2.5-10). Quasi-!lnearization i s thus a quasi-Newton method using

as an approximation f o r the second gradient. term we w i l l adopt i n t h i s book.

The algorithm i n t h i s fcrm i s also known as Gauss-Newton, the

Near the solution, the neglected term of the second gradient is generally small. Section 5.4.3 outlines this argument as it applies to the parameter estimation problem. Therefore, Gauss-Newton approaches the excellent performance of Newton-Raphson near the solution. Such approximation is the main goal of quasi-Newton methods. Accurately approximating the performance of Newton-Raphson far from the minimum is not of great concern because Newton-Raphson does not generally perform well in regions far from the minimum.

We can even argue that Gauss-Newton sometimes performs better than Newton-Raphson far from the minimum. The worst problems with Newton-Raphson occur when the second gradient matrix has negative eigenvalues; Newton-Raphson can then go in the wrong direction, possibly converging to a local maximum or diverging. If all of the Wᵢ are positive semi-definite (which is usually the case), then the second gradient approximation given by Equation (2.5-11) is positive semi-definite for all x. A positive semi-definite second gradient approximation does not guarantee good behavior, but it surely helps; negative eigenvalues virtually guarantee problems. Thus we can heuristically argue that Gauss-Newton should perform better than Newton-Raphson. We will not attempt a detailed support of this general argument in this book. In several specific cases the improvement of Gauss-Newton over Newton-Raphson is easily demonstrable.

Although Gauss-Newton sometimes performs better than Newton-Raphson far from the solution, it has many of the same basic start-up problems. Both algorithms exhibit their best performance near the minimum. Therefore, we will often need to begin with some other, more stable algorithm, changing to Gauss-Newton as we near the minimum.

The real argument in favor of Gauss-Newton over Newton-Raphson is the lower computational effort and complexity of Gauss-Newton. Any performance improvement is a coincidental side benefit. Equation (2.5-11) involves only first derivatives of fᵢ(x). These first derivatives are also used in Equation (2.5-9) for the first gradient of the cost. Therefore, after computing the first gradient of J, the only significant computation remaining for the Gauss-Newton approximation is the matrix multiplication in Equation (2.5-11). The computation of the Gauss-Newton approximation for the second gradient can sometimes take less time than the computation of the first gradient, depending on the system dimensions. For complicated fᵢ functions, evaluation of the ∇ₓ²fᵢ(x) in Equation (2.5-10) is a major portion of the computation effort of the full Newton-Raphson algorithm. Gauss-Newton avoids this extra effort, obtaining the performance per iteration of Newton-Raphson (if not better in some areas) with computational effort per iteration comparable to gradient methods.

Considering the cost of the one-dimensional searches required by gradient methods, Gauss-Newton can even be cheaper per iteration than gradient methods. The exact trade-off depends on the relative costs of evaluating the fᵢ and their gradients, and on the typical number of evaluations required in the one-dimensional searches. Gauss-Newton is at its best when the cost of evaluating the fᵢ is nearly as much as the cost of evaluating both the fᵢ and their gradients, due to high overhead costs common to both evaluations. This is exactly the case in some aircraft applications, where the overhead consists largely of dimensionalizing the derivatives and building new system matrices at each time point.

The other quasi-Newton methods, such as Davidon-Fletcher-Powell, also approach Newton-Raphson performance without evaluating the second derivatives of the fᵢ. These methods, however, do require one-dimensional searches. Gauss-Newton stands almost alone in avoiding both second derivative evaluations and one-dimensional searches. This performance is difficult to match in general algorithms that do not take advantage of the special structure of the cost function.

Some analysts (Foster, 1983) introduce one-dimensional line searches into the Gauss-Newton algorithm to improve its performance. The utility of this idea depends on how well the Gauss-Newton method is performing. In most of our experience, Gauss-Newton works well enough that the one-dimensional line searches cannot measurably improve performance; the total computation time can well be larger with the line searches. When the Gauss-Newton algorithm is performing poorly, however, such line searches could help stabilize it.
For cost functions in the form of Equation (2.5-1), the cost/performance ratio of Gauss-Newton is so much better than that of most other algorithms that Gauss-Newton is the clearly preferred algorithm. You may want to modify Gauss-Newton for specific problems, and you will almost surely need to use some special start-up algorithm, but the best methods will be based on Gauss-Newton.
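As an illustration of the Gauss-Newton iteration, the following sketch fits a hypothetical exponential model to noise-free data; the model, data, and starting values are assumptions for demonstration only, not an example from the text.

```python
import numpy as np

# Hypothetical residuals f_i(x) = z_i - m(t_i, x) for a simple
# exponential model m(t, x) = x0 * exp(-x1 * t).
t = np.array([0.0, 0.5, 1.0, 2.0, 4.0])
z = 2.0 * np.exp(-0.7 * t)        # noise-free data; truth is (2.0, 0.7)
W = np.eye(len(t))                # scalar-output weighting

def resid(x):
    return z - x[0] * np.exp(-x[1] * t)

def jac(x):
    # Stacked first gradients of the residuals f_i.
    e = np.exp(-x[1] * t)
    return np.column_stack([-e, x[0] * t * e])

def gauss_newton(x, n_iter=50, tol=1e-12):
    for _ in range(n_iter):
        A = jac(x)
        # Gauss-Newton step: only first derivatives of the f_i are
        # needed, and no one-dimensional search is performed.
        step = np.linalg.solve(A.T @ W @ A, A.T @ W @ resid(x))
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

x_hat = gauss_newton(np.array([1.5, 0.5]))
```

Because the residuals vanish at the solution, the neglected term of the second gradient is zero there, and the iteration recovers the true parameters with the rapid convergence described above.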

2.6 CONVERGENCE IMPROVEMENT

Second-order methods do converge quite rapidly in regions where they work well. There is usually such a region around the minimum point; the size of the region is problem-dependent. The price paid for this region of excellent convergence is that the second-order methods often converge poorly or diverge in regions far from the minimum. Techniques to detect and remedy such convergence problems are an important part of the practical implementation of second-order methods. In this section, we briefly list a few of the many convergence improvement techniques.

Modifications to improve the behavior of second-order methods in regions far from the minimum almost inevitably slow the convergence in the region near the minimum. This reflects a natural trade-off between speed and reliability of convergence. Therefore, effective implementation of convergence-improvement techniques usually includes different treatment of regions far from the minimum and near the minimum. In regions far from the minimum, the second-order methods are modified or abandoned in favor of more conservative algorithms. In regions near the minimum, there is a transition to the fast second order methods. The means of determining when to make such transitions vary widely. Transitions can be based on a simple iteration count, on adaptive criteria which examine the observed convergence behavior, or on other principles. Transitions can be either gradual or step changes.

Some convergence improvement techniques abandon second-order methods in the regions far from the minimum, adopting gradient methods instead. In our experience, the pure gradient method is too slow for practical use on most parameter estimation problems. Accelerated gradient methods such as PARTAN and conjugate gradient are reasonable possibilities.

Other convergence improvement techniques are modifications of the second-order methods. Many convergence problems relate to ill-conditioned or nonpositive second gradient matrices. This suggests such modifications as adding positive definite matrices to the second gradient or using rank-deficient solutions. Constraints on the allowable range of estimates or on the change per iteration can also have stabilizing effects. A particularly popular constraint is to fix some of the ordinates at constant values, thus reducing the dimension of the optimization problem; this is a form of axial iteration, and its effectiveness depends on a wise (or lucky) choice of the ordinates to be constrained. Relaxation methods, which reduce the indicated parameter changes by some fixed percentage, can sometimes stabilize oscillating behavior of the algorithm. Line searches in the indicated direction extend this concept, and should be capable of stabilizing almost any problem, at the cost of additional function evaluations. The above list of convergence improvement techniques is far from complete. It also omits mention of numerous important implementation details. This list serves only to call attention to the area of convergence improvement. See the references for more thorough treatments.
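The relaxation and line-search ideas above can be sketched concretely. The scalar cost below is a hypothetical illustration, not from the text: it is convex, but its Newton step overshoots wildly far from the minimum, and halving the step until the cost actually decreases restores convergence.

```python
import math

# Hypothetical cost: J(t) = sqrt(1 + t^2) is convex with minimum at t = 0,
# but its Newton step -J'(t)/J''(t) = -t(1 + t^2) overshoots badly for
# large |t| (from t = 2 the full step jumps to t = -8).
def J(t):
    return math.sqrt(1.0 + t * t)

def newton_step(t):
    return -t * (1.0 + t * t)   # analytic -J'/J'' for this cost

def relaxed_step(t, step, shrink=0.5, max_tries=30):
    # step-halving line search along the indicated direction:
    # accept the step only if it decreases the cost; otherwise
    # reduce it by a fixed percentage and try again
    j0 = J(t)
    for _ in range(max_tries):
        if J(t + step) < j0:
            return t + step
        step *= shrink
    return t   # no improving step found (already at/near the minimum)

t = 2.0
for _ in range(30):
    t = relaxed_step(t, newton_step(t))
```

The extra function evaluations in the inner loop are the price of the added reliability, as noted above.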

Figure (2.0-1). Illustration of local and global minima.

Figure (2.2-1). Behavior of axial iteration.

Figure (2.2-2). The pattern direction.

Figure (2.3-1). The gradient direction for a circular isocline.

Figure (2.3-2). The gradient direction near a narrow valley.

Figure (2.3-3). Behavior of the gradient algorithm in a narrow valley.

Figure (2.3-4). Worse behavior of the gradient algorithm.

CHAPTER 3

3.0 BASIC PRINCIPLES FROM PROBABILITY

In this chapter we will review some basic definitions and results from probability theory. We presume that the reader has had previous exposure to this material. Our aim here is to review and serve as a reference for those concepts that are used extensively in the following chapters. The treatment, therefore, is quite abbreviated, and devotes little time to motivating the field of study or philosophizing about the results. Proofs of several of the statements are omitted. Some of the other proofs are merely outlined, with some of the more tedious steps omitted. Apostol (1969), Ash (1970), and Papoulis (1965) give more detailed treatment.

3.1 PROBABILITY SPACES

A probability space is formally defined by three items (Ω, β, P), sometimes called the probability triple. Ω is called the sample space, and the elements ω of Ω are called outcomes or realizations. β is a set of sets defined on Ω, closed under countable set operations (union, intersection, and complement). Each set B ∈ β is called an event. In the current discussion, we will not be concerned with the fine details of the definition of β. β is referred to as the class of measurable sets and is studied in measure theory (Royden, 1968; Rudin, 1974). P is a scalar-valued function defined on β, and is called the probability function or probability measure. For each set B in β, the function P(B) defines the probability that ω will be in B. P must satisfy the following axioms:

1) 0 ≤ P(B) ≤ 1 for all B ∈ β

2) P(Ω) = 1

3) P(∪ᵢ Bᵢ) = Σᵢ P(Bᵢ) for all countable sequences of disjoint Bᵢ ∈ β

3.1.2 Conditional Probability

If A and B are two events and P(B) ≠ 0, the conditional probability of A given B is defined as

   P(A|B) = P(A ∩ B)/P(B)                    (3.1-1)

where A ∩ B is the set intersection of the events A and B.

The events A and B are statistically independent if P(A|B) = P(A). Note that this condition is symmetric; that is, if P(A|B) = P(A), then P(B|A) = P(B), provided that P(A|B) and P(B|A) are both defined.

3.2 SCALAR RANDOM VARIABLES

A scalar real-valued function X(ω) defined on Ω is called a random variable if the set {ω: X(ω) ≤ x} is in β for all real x.

3.2.1 Distribution and Density Functions

Every random variable has a distribution function defined as follows:

   F_X(x) = P({ω: X(ω) ≤ x})                    (3.2-1)

It follows directly from the properties of a probability measure that F_X(x) must be a nondecreasing function of x, with F_X(−∞) = 0 and F_X(∞) = 1. By the Lebesgue decomposition lemma (Royden, 1968, p. 240; Rudin, 1974, p. 129), any distribution function can always be written as the sum of a differentiable component and a component which is piecewise constant with a countable number of discontinuities. In many cases, we will be concerned with variables with differentiable distribution functions. For such random variables, we define a function, p_X(x), called the probability density function, to be the derivative of the distribution function:

   p_X(x) = (d/dx) F_X(x)                    (3.2-2)

We have also the inverse relationship

   F_X(x) = ∫_{−∞}^{x} p_X(s) ds                    (3.2-3)

A probability density function must be nonnegative, and its integral over the real line must equal 1. For simplicity of notation, we will often shorten p_X(x) to p(x) where the meaning is clear. Where confusion is possible, we will retain the longer notation. A probability distribution can be defined completely by giving either the distribution function or the density function. We will work mainly with density functions, except when they are not defined.

3.2.2 Expectations and Moments

The expected value of a random variable X is defined by

   E{X} = ∫_{−∞}^{∞} x p_X(x) dx                    (3.2-4)

If X does not have a density function, the precise definition of the expectation is somewhat more technical, involving a Stieltjes integral; Equation (3.2-4) is adequate for the needs of this document. The expected value is also called the expectation or the mean. Any (measurable) function of a random variable is also a random variable and

   E{f(X)} = ∫_{−∞}^{∞} f(x) p_X(x) dx                    (3.2-5)

The expected value of Xⁿ for positive n is called the nth moment of X. Under mild conditions, knowledge of all of the moments of a distribution is sufficient to define the distribution (Papoulis, 1965, p. 158).
The variance of X is defined as

   var(X) ≡ E{(X − E{X})²}
          = E{X²} − 2E{X}E{X} + E{X}²
          = E{X²} − E{X}²

The standard deviation is the square root of the variance.
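The two expressions for the variance above can be checked numerically; the four-point discrete distribution here is an arbitrary illustration, not from the text.

```python
# discrete example: X uniform on {1, 2, 3, 4}
xs = [1.0, 2.0, 3.0, 4.0]
mean = sum(xs) / len(xs)                                   # E{X}
var_def = sum((x - mean) ** 2 for x in xs) / len(xs)       # E{(X - E{X})^2}
var_mom = sum(x * x for x in xs) / len(xs) - mean ** 2     # E{X^2} - E{X}^2
```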


3.3 JOINT RANDOM VARIABLES

Two random variables defined on the same sample space are called joint random variables.

3.3.1 Distribution and Density Functions

If two random variables, X and Y, are defined on the same sample space, we define a joint distribution function of these variables as

   F_{X,Y}(x,y) = P({ω: X(ω) ≤ x} ∩ {ω: Y(ω) ≤ y})                    (3.3-1)

For absolutely continuous distribution functions, a joint probability density function p_{X,Y}(x,y) is defined by the partial derivative

   p_{X,Y}(x,y) = (∂²/∂x ∂y) F_{X,Y}(x,y)                    (3.3-2)

We then have also

   F_{X,Y}(x,y) = ∫_{−∞}^{x} ∫_{−∞}^{y} p_{X,Y}(s,t) dt ds                    (3.3-3)

In a similar manner, joint distributions and densities of N random variables can be defined. As in the scalar case, the joint density function of N random variables must be nonnegative and its integral over the entire space must equal 1. A random N-vector is the same as N jointly random scalar variables, the only difference being in the terminology.

3.3.2 Expectations and Moments

The expected value of a random vector X is defined as in the scalar case:

   E{X} = ∫ x p_X(x) dx                    (3.3-4)

The covariance of X is a matrix defined by

   cov(X) = E{[X − E(X)][X − E(X)]*}                    (3.3-5)

The covariance matrix is always symmetric and positive semi-definite. It is positive definite if X has a density function. Higher order moments of random vectors can be defined, but are notationally clumsy and seldom used. Consider a random vector Y given by

   Y = AX + b                    (3.3-6)

where A is any deterministic matrix (not necessarily square), and b is an appropriate-length deterministic vector. Then the mean and covariance of Y are

   E{Y} = A E{X} + b                    (3.3-7)

   cov(Y) = A cov(X) A*                    (3.3-8)
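Equations (3.3-7) and (3.3-8) also hold exactly for sample means and sample covariances, which gives a quick numeric check; the sample and the transformation below are arbitrary illustrations.

```python
# verify cov(AX + b) = A cov(X) A^T on a small deterministic "sample"
def mean_vec(vs):
    n = len(vs)
    return [sum(v[i] for v in vs) / n for i in range(len(vs[0]))]

def cov_mat(vs):
    m = mean_vec(vs)
    n, d = len(vs), len(m)
    return [[sum((v[i] - m[i]) * (v[j] - m[j]) for v in vs) / n
             for j in range(d)] for i in range(d)]

X = [(1.0, 2.0), (3.0, 0.0), (0.0, 1.0), (2.0, 3.0)]
A = [(2.0, 1.0), (0.0, -1.0)]
b = (5.0, -2.0)
Y = [(A[0][0] * x0 + A[0][1] * x1 + b[0],
      A[1][0] * x0 + A[1][1] * x1 + b[1]) for x0, x1 in X]

CX = cov_mat(X)
# A * cov(X) * A^T, written out element by element
ACA = [[sum(A[i][k] * CX[k][l] * A[j][l] for k in range(2) for l in range(2))
        for j in range(2)] for i in range(2)]
CY = cov_mat(Y)
```

Note that the shift b drops out of the covariance entirely, as Equation (3.3-8) says it must.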

3.3.3 Marginal and Conditional Distributions

If X and Y are jointly random variables with a joint distribution function given by Equation (3.3-1), then X and Y are also individually random variables, with distribution functions defined as in Equation (3.2-1). The individual distributions of X and Y are called the marginal distributions, and the corresponding density functions are called marginal density functions.

The marginal distributions of X and Y can be derived from the joint distribution. (Note that the converse is false without additional assumptions.) By comparing Equations (3.2-1) and (3.3-1), we obtain

   F_X(x) = F_{X,Y}(x,∞)

and correspondingly F_Y(y) = F_{X,Y}(∞,y). In terms of the density functions, using Equations (3.2-2) and (3.3-3), we obtain

   p_X(x) = ∫_{−∞}^{∞} p_{X,Y}(x,y) dy                    (3.3-10a)

   p_Y(y) = ∫_{−∞}^{∞} p_{X,Y}(x,y) dx                    (3.3-10b)

The conditional distribution function of X given Y is defined as (see Equation (3.1-1))

   F_{X|Y}(x|y) = P({ω: X(ω) ≤ x}|{ω: Y(ω) ≤ y})                    (3.3-11)

and correspondingly for F_{Y|X}. The conditional density function, when it exists, can be expressed as

   p_{X|Y}(x|y) = p_{X,Y}(x,y)/p_Y(y)                    (3.3-12)

Equation (3.3-12) is known as Bayes' rule.

The conditional expectation is defined as

   E{X|Y} = ∫_{−∞}^{∞} x p_{X|Y}(x|y) dx                    (3.3-13)

assuming that the density function exists. Using Equation (3.3-13), we obtain the useful decomposition

   E{f(X,Y)} = E{E{f(X,Y)|Y}}                    (3.3-14)
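The decomposition (3.3-14) is easy to verify on a small discrete joint distribution; the probabilities and the function f below are arbitrary illustrations.

```python
# discrete joint distribution p(x, y) on a small grid
pxy = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}
f = lambda x, y: x * y + x + 2 * y

# direct expectation E{f(X,Y)}
direct = sum(p * f(x, y) for (x, y), p in pxy.items())

# inner expectation E{f(X,Y) | Y = y}, then outer average over
# the marginal distribution of Y
py = {}
for (x, y), p in pxy.items():
    py[y] = py.get(y, 0.0) + p
inner = {y: sum(p * f(x, yy) for (x, yy), p in pxy.items() if yy == y) / py[y]
         for y in py}
iterated = sum(py[y] * inner[y] for y in py)
```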

3.3.4 Statistical Independence

Two random vectors X and Y defined on the same probability space are defined to be independent if

   F_{X,Y}(x,y) = F_X(x)F_Y(y)                    (3.3-15)

If the joint probability density function exists, we can write this condition as

   p_{X,Y}(x,y) = p_X(x)p_Y(y)                    (3.3-16)
An immediate corollary, using Equation (3.3-12), is that p_{X|Y}(x|y) = p_X(x) does not depend on y, and p_{Y|X}(y|x) = p_Y(y) does not depend on x. If X and Y are independent, then f(X) and g(Y) are independent for any functions f and g. Two vectors are uncorrelated if

   E{XY*} = E{X}E{Y*}                    (3.3-17)

or equivalently if

   E{(X − E{X})(Y − E{Y})*} = 0

If X and Y are uncorrelated, then the covariance of their sum equals the sum of their covariances.

If two vectors are independent, then they are uncorrelated, but the converse of this statement is false.
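A standard illustration of the last statement (not taken from the text): with X taking the values −1, 0, 1 with equal probability and Y = X², the pair is uncorrelated but clearly dependent, since Y is a function of X.

```python
# X takes values -1, 0, 1 with probability 1/3 each; Y = X**2
support = [(-1.0, 1.0), (0.0, 0.0), (1.0, 1.0)]
p = 1.0 / 3.0
ex = sum(p * x for x, _ in support)        # E{X}  = 0
ey = sum(p * y for _, y in support)        # E{Y}  = 2/3
exy = sum(p * x * y for x, y in support)   # E{XY} = 0
# uncorrelated: E{XY} = E{X} E{Y}, yet not independent:
# P(Y = 0 | X = 0) = 1 while the marginal P(Y = 0) = 1/3
```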
3.4 TRANSFORMATION OF VARIABLES

A large part of probability theory is concerned in some manner with the transformation of variables; i.e., characterizing random variables defined as functions of other random variables. We have previously cited limited results on the means and covariances of some transformed variables (Equations (3.2-5), (3.3-7), and (3.3-8)). In this section we seek the entire density function. Our consideration is restricted to variables that have density functions. Let X be a random vector with density function p_X(x) defined on Rⁿ, the Euclidean space of real n-vectors. Then define Y ∈ Rᵐ by Y = f(X). We seek to derive the density function of Y. There are three cases to consider, depending on whether m = n, m > n, or m < n.

The primary case of interest is when m = n. Assume that f(·) is invertible and has continuous partial derivatives. (Technically, this is only required almost everywhere.) Define g(Y) = f⁻¹(Y). Then

   p_Y(y) = p_X(g(y))|J|                    (3.4-1)

where J is the Jacobian of the transformation g

   J_{ij} = ∂g_i(y)/∂y_j                    (3.4-2)

See Rudin (1974, p. 186) and Apostol (1969, p. 394) for the proof.

   Example 3.4-1  Let Y = CX, with C square and nonsingular. Then g(y) = C⁻¹y and J = C⁻¹, giving

      p_Y(y) = p_X(C⁻¹y)|C⁻¹|

   as the transformation equation.

If f is not invertible, the distribution of Y is given by a sum of terms similar to Equation (3.4-1).
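A scalar sketch of Equation (3.4-1), with a hypothetical density for X (not from the text): integrating the transformed density must reproduce the distribution function computed directly from the definition.

```python
# X has density p_X(x) = 2x on [0,1]; let Y = f(X) = 3X.
# Then g(y) = y/3, J = dg/dy = 1/3, and Eq. (3.4-1) gives
# p_Y(y) = p_X(y/3) * (1/3) = 2y/9 on [0, 3].
def p_Y(y):
    return (2.0 * (y / 3.0)) / 3.0 if 0.0 <= y <= 3.0 else 0.0

def cdf_numeric(y, n=20000):
    # midpoint-rule integral of p_Y from 0 to y
    h = y / n
    return h * sum(p_Y((k + 0.5) * h) for k in range(n))

# must reproduce P(Y <= y) = P(X <= y/3) = (y/3)**2 computed directly
y = 1.8
exact = (y / 3.0) ** 2
approx = cdf_numeric(y)
```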

For the case with m > n, the distribution of Y will be concentrated on, at most, an n-dimensional hypersurface in Rᵐ, and will not have a density function in Rᵐ. The simplest nontrivial case of m < n is when Y consists of a subset of the elements of X. In this case, the density function sought is the density function of the marginal distribution of the pertinent subset of the elements of X. Marginal distributions were discussed in Section 3.3.3. In general, when m < n, X can be transformed into a random vector Z ∈ Rⁿ, such that Y is a subset of the elements of Z.

   Example 3.4-2  Let X ∈ R², and Y = X₁ + X₂. Define Z = CX, where

      C = [1  1]
          [0  1]

   Then, using Example 3.4-1,

      p_Z(z) = p_X(C⁻¹z)|C⁻¹|

   Then Y = Z₁, so the distribution of Y is the marginal distribution of Z₁, which can be computed from Equation (3.3-10).

3.5 GAUSSIAN VARIABLES

Random variables with Gaussian distributions play a major role in this document and in much of probability theory. We will, therefore, briefly review the definition and some of the salient properties of Gaussian distributions. These distributions are often called normal distributions in the literature.

3.5.1 Standard Gaussian Distributions

All Gaussian distributions derive from the distribution of a standard Gaussian variable with mean 0 and covariance 1. The density function of the standard Gaussian distribution is defined to be

   p_X(x) = (2π)^{−1/2} exp(−x²/2)                    (3.5-1)

The distribution function does not have a simple closed-form expression. We will first show that Equation (3.5-1) is a valid density function with mean 0 and covariance 1. The most difficult part is showing that its integral over the real line is 1.

   Theorem 3.5-1  Equation (3.5-1) defines a valid probability density function.

   Proof  The function is obviously nonnegative. There remains only to show that its integral over the real line is 1. Taking advantage of the symmetry about 0, we can reduce this problem to proving that

      ∫₀^∞ (2π)^{−1/2} exp(−x²/2) dx = 1/2                    (3.5-2)

   There is no closed-form expression for this integral over any finite range, but for the semi-infinite range of Equation (3.5-2) the following "trick" works. Form the square of the integral:

      [∫₀^∞ (2π)^{−1/2} exp(−x²/2) dx]² = (2π)^{−1} ∫₀^∞ ∫₀^∞ exp[−(x² + y²)/2] dx dy                    (3.5-3)

   Then change variables to polar coordinates, substituting r² for x² + y² and r dr dθ for dx dy, to get

      (2π)^{−1} ∫₀^{π/2} ∫₀^∞ exp(−r²/2) r dr dθ                    (3.5-4)

   The integral in Equation (3.5-4) has a closed-form solution:

      ∫₀^∞ exp(−r²/2) r dr = [−exp(−r²/2)]₀^∞ = 1                    (3.5-5)

   Thus,

      [∫₀^∞ (2π)^{−1/2} exp(−x²/2) dx]² = (2π)^{−1}(π/2) = 1/4                    (3.5-6)

   Taking the square root gives Equation (3.5-2), completing the proof.

The mean of the distribution is trivially zero by symmetry. To derive the covariance, note that the integrand below is the derivative of x(2π)^{−1/2} exp(−x²/2), so

   E{1 − X²} = ∫_{−∞}^{∞} (1 − x²)(2π)^{−1/2} exp(−x²/2) dx = (2π)^{−1/2} x exp(−x²/2) |_{−∞}^{∞} = 0                    (3.5-9)

Thus,

   cov(X) = E{X²} − E{X}² = E{X²} = 1 − E{1 − X²} = 1 − 0 = 1
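The integral in Equation (3.5-2) can also be confirmed by brute-force quadrature, as a check on the derivation; the truncation range and step count below are arbitrary choices.

```python
import math

# midpoint-rule check that the standard Gaussian density of Eq. (3.5-1)
# integrates to 1; the range is truncated at +/-10, where the neglected
# tails are far below the quadrature error
def phi(x):
    return (2.0 * math.pi) ** -0.5 * math.exp(-0.5 * x * x)

a, b, n = -10.0, 10.0, 100000
h = (b - a) / n
total = h * sum(phi(a + (k + 0.5) * h) for k in range(n))
```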

This completes our discussion of the scalar standard Gaussian. We define a standard multivariate Gaussian vector to be the concatenation of n independent standard Gaussian variables. The standard multivariate Gaussian density function is therefore the product of n marginal density functions in the form of Equation (3.5-1). The mean of this distribution is 0 and the covariance is an identity matrix.

3.5.2 General Gaussian Distributions

We will define the class of all Gaussian distributions by reference to the standard Gaussian distributions of the previous section. We define a random vector Y to have a Gaussian distribution if Y can be represented in the form

   Y = AX + m                    (3.5-12)

where X is a standard Gaussian vector, A is a deterministic matrix, and m is a deterministic vector. The A matrix need not be square. Note that any deterministic vector is a special case of a Gaussian vector with a zero A matrix. We have defined the class of Gaussian random variables by a set of operations that can produce such variables. It now remains to determine the forms and properties of these distributions. (This is somewhat backwards from the most common approach, where the forms of the distributions are first defined and Equation (3.5-12) is proven as a result. We find that our approach makes it somewhat easier to handle singular and nonsingular cases consistently without introducing characteristic functions (Papoulis, 1965).) By Equations (3.3-7) and (3.3-8), the Y defined by Equation (3.5-12) has mean m and covariance AA*. Our first major result will be to show that a Gaussian distribution is uniquely specified by its mean and covariance; that is, if two distributions are both Gaussian and have equal means and covariances, then the two distributions are identical. Note that this does not mean that the A matrices need to be identical; the reason the result is nontrivial is that an infinite number of different A matrices give the same covariance AA*.

   Example 3.5-1  Consider three Gaussian vectors Y₁ = A₁X₁, Y₂ = A₂X₂, and Y₃ = A₃X₃, where X₁ and X₂ are standard Gaussian 2-vectors and X₃ is a standard Gaussian 3-vector. With the A₁, A₂, and A₃ of this example we have A₁A₁* = A₂A₂* = A₃A₃*. Thus all three Yᵢ have equal covariance.

The rest of this section is devoted to proving this result in three steps. First, we will consider square, nonsingular A matrices. Second, we will consider general square A matrices. Finally, we will consider nonsquare A matrices. Each of these steps uses the results of the previous step.

   Theorem 3.5-2  If Y is a Gaussian n-vector defined by Equation (3.5-12) with a nonsingular A matrix, then the probability density function of Y exists and is given by

      p_Y(y) = |2πΛ|^{−1/2} exp[−(1/2)(y − m)*Λ⁻¹(y − m)]                    (3.5-13)

   where Λ is the covariance AA*.

   Proof  This is a direct application of the transformation of variables, Equation (3.4-1):

      p_Y(y) = p_X[A⁻¹(y − m)]|A⁻¹|

   Substituting Λ for AA* then gives the desired result.

Note that the density function, Equation (3.5-13), depends only on the mean and covariance, thus proving the uniqueness result for the case restricted to nonsingular matrices. A particular case of interest is where m is 0 and A is unitary. (A unitary matrix is a square one with AA* = I.) In this case, Y has a standard Gaussian distribution.

   Theorem 3.5-3  If Y is a Gaussian n-vector defined by Equation (3.5-12) with any square A matrix, then Y can be represented as

      Y = SX̃ + m                    (3.5-14)

   where X̃ is a standard Gaussian n-vector and S is positive semi-definite. Furthermore, the S in this representation is unique and depends only on the covariance of Y.

   Proof  The uniqueness is easy to prove, and we will do it first. The covariance of the Y given by Equation (3.5-12) is AA*. The covariance of a Y expressed as in Equation (3.5-14) is SS*. A necessary (but not sufficient) condition for Equation (3.5-14) to be a valid representation of Y is, therefore, that SS* equal AA*. It is an elementary result of linear algebra (Wilkinson, 1965; Dongarra, Moler, Bunch, and Stewart, 1979; and Strang, 1980) that AA* is always positive semi-definite and that there is one and only one positive semi-definite matrix S satisfying SS* = AA*. S is called the matrix square root of AA*. This proves the uniqueness. The existence proof relies on another result from linear algebra: any square matrix A can be factored as SQ, where S is positive semi-definite and Q is unitary. For nonsingular A, this factorization is easy: S is the matrix square root of AA* and Q is S⁻¹A. A formal proof for general A matrices would be too long a diversion into linear algebra for our current purposes, so we will omit it. This factorization is closely related to, and can be formally derived from, the well-known QR factorization, where Q is unitary and R is upper triangular (Wilkinson, 1965; Dongarra, Moler, Bunch, and Stewart, 1979; and Strang, 1980). Given the SQ factorization of A, define

      X̃ = QX

   By Theorem (3.5-2), X̃ is a standard Gaussian n-vector. Substituting into Equation (3.5-12) gives Equation (3.5-14), completing the proof.

Because the S in the above theorem depends only on the covariance of Y, it immediately follows that the distribution of any Gaussian variable generated by a square A matrix is uniquely specified by the mean and covariance. It remains only to extend this result to rectangular A matrices.
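The matrix square root invoked in the proof can be made concrete. The sketch below (a hypothetical 2-by-2 case, not from the text) builds S from the eigendecomposition of M = AA* and confirms SS* = AA*; in practice a library routine would be used instead.

```python
import math

# positive semi-definite square root of a symmetric 2x2 matrix M = A A^T:
# M = V diag(l1, l2) V^T  =>  S = V diag(sqrt(l1), sqrt(l2)) V^T
def sym_sqrt_2x2(m00, m01, m11):
    tr, det = m00 + m11, m00 * m11 - m01 * m01
    d = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    l1, l2 = tr / 2.0 + d, tr / 2.0 - d
    if abs(m01) < 1e-15:                     # already diagonal
        return [[math.sqrt(m00), 0.0], [0.0, math.sqrt(m11)]]
    v = (m01, l1 - m00)                      # eigenvector for l1
    nv = math.hypot(v[0], v[1])
    c, s = v[0] / nv, v[1] / nv
    r1, r2 = math.sqrt(l1), math.sqrt(l2)
    return [[r1 * c * c + r2 * s * s, (r1 - r2) * c * s],
            [(r1 - r2) * c * s, r1 * s * s + r2 * c * c]]

A = [[1.0, 2.0], [0.0, 3.0]]                 # arbitrary square A
M = [[A[0][0] ** 2 + A[0][1] ** 2, A[0][0] * A[1][0] + A[0][1] * A[1][1]],
     [A[0][0] * A[1][0] + A[0][1] * A[1][1], A[1][0] ** 2 + A[1][1] ** 2]]
S = sym_sqrt_2x2(M[0][0], M[0][1], M[1][1])
```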

   Theorem 3.5-4  The distribution of any Gaussian vector is uniquely defined by its mean and covariance.

   Proof  We have already shown the result for Gaussian vectors generated by square A matrices. We need only show that a Gaussian vector generated by a rectangular A matrix can be rewritten in terms of a square A matrix. Let A be n-by-m, and consider the two cases, n > m and n < m. If n > m, define a standard Gaussian n-vector X̃ by augmenting the X vector with n − m independent standard Gaussians, and define an n-by-n matrix Ã by augmenting A with n − m columns of zeros. We then have

      Y = ÃX̃ + m

   as desired. For the case n < m, define a random m-vector Ỹ by augmenting Y with m − n zeros. Then

      Ỹ = ÃX + m̃

   where m̃ and Ã are obtained by augmenting zeros to m and A. Use Theorem (3.5-3) to rewrite Ỹ as

      Ỹ = S̃X̃ + m̃                    (3.5-16)

   Since the last m − n elements of Ỹ are zero, Equation (3.5-16) must be in the form

      Y = S̄X̃ + m

   where S̄ consists of the first n rows of S̃. This is in the required form.

Theorem (3.5-4) is the central result of this approach to Gaussian variables. It makes the practical manipulation of Gaussian variables much easier. Once you have demonstrated that some result is Gaussian, you need only derive the mean and covariance to specify the distribution completely. This is far easier than manipulating the full density function or distribution function, a process which often requires partial differential equations. If the covariance matrix is nonsingular, then the density function exists and is given by Equation (3.5-13). If the covariance is singular, a density function does not exist (unless you extend the definition of density functions to include components like impulse functions).

Two properties of the Gaussian density function often provide useful computational shortcuts to evaluating the mean and covariance of nonsingular Gaussians. The first property is that the mean of the density function occurs at its maximum. The mean is thus the unique solution of

   0 = ∇_y ln p(y)                    (3.5-17)

The logarithm in this equation can be removed, but the equation is usually most useful as written. The second property is that the covariance can be expressed as

   cov(Y) = −[∇²_y ln p(y)]⁻¹                    (3.5-18)
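A scalar numeric illustration of these two shortcuts, using a hypothetical mean and variance: finite differences of ln p(y) recover the mean (zero gradient at the maximum) and the variance (negative inverse curvature).

```python
import math

# for a scalar Gaussian with mean m and variance v, check that
# the gradient of ln p vanishes at y = m and that
# cov = -[d^2/dy^2 ln p(y)]^{-1} = v
m, v = 1.5, 0.49
def lnp(y):
    return -0.5 * math.log(2.0 * math.pi * v) - 0.5 * (y - m) ** 2 / v

h = 1e-4
grad_at_mean = (lnp(m + h) - lnp(m - h)) / (2.0 * h)
second = (lnp(m + h) - 2.0 * lnp(m) + lnp(m - h)) / (h * h)
cov_from_curvature = -1.0 / second
```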

Both of these properties are easy to verify by direct substitution into Equation (3.5-13).

3.5.3 Properties

In this section we derive several useful properties of Gaussian vectors. Most of these properties relate to operations on Gaussian vectors that give Gaussian results. A major reason for the wide use of Gaussian distributions is that many basic operations on Gaussian vectors give Gaussian results, which can be characterized completely by the mean and covariance.

   Theorem 3.5-5  If Y is a Gaussian vector with mean m and covariance Λ, and if Z is given by

      Z = BY + b

   then Z is Gaussian with mean Bm + b and covariance BΛB*.

   Proof  By definition, Y can be expressed as

      Y = AX + m

   where X is a standard Gaussian. Substituting Y into the expression for Z gives

      Z = (BA)X + (Bm + b)

   proving that Z is Gaussian. The mean and covariance expressions for linear operations on any random vector were previously derived in Equations (3.3-7) and (3.3-8).

Several of the properties discussed in this section involve the concept of jointly Gaussian variables. Two or more random vectors are said to be jointly Gaussian if their joint distribution is Gaussian. Note that two vectors can both be Gaussian and yet not be jointly Gaussian.

   Example 3.5-2  Let Y be a Gaussian random variable with mean 0 and variance 1. Define Z by reversing the sign of Y on part of its range; for instance

      Z = Y     if |Y| ≤ 1
      Z = −Y    if |Y| > 1

   The random variable Z is Gaussian with mean 0 and variance 1 (apply Equation (3.4-1) to show this), but Y and Z are not jointly Gaussian.

   Theorem 3.5-6  Let Y₁ and Y₂ be jointly Gaussian vectors, and let the mean m and covariance Λ of the joint distribution be partitioned as

      m = [m₁]        Λ = [Λ₁₁  Λ₁₂]
          [m₂]            [Λ₂₁  Λ₂₂]

   Then the marginal distributions of Y₁ and Y₂ are Gaussian with

      E{Y₁} = m₁        cov(Y₁) = Λ₁₁
      E{Y₂} = m₂        cov(Y₂) = Λ₂₂

   Proof  Apply Theorem (3.5-5) with B = [I  0] and b = 0, and with B = [0  I] and b = 0.

The following two theorems relate to independent Gaussian variables:

   Theorem 3.5-7  If Y and Z are two independent Gaussian variables, then Y and Z are jointly Gaussian.

   Proof  For nonsingular distributions, this proof is easy to do by writing out the product of the density functions. For a more general proof, we can proceed as follows: write Y and Z as

      Y = A₁X₁ + m₁
      Z = A₂X₂ + m₂

   where X₁ and X₂ are standard Gaussian vectors. We can always construct the X₁ and X₂ in these equations to be independent, but the following argument avoids the necessity to prove that statement. Define two independent standard Gaussians, X̃₁ and X̃₂, and further define

      Ỹ = A₁X̃₁ + m₁
      Z̃ = A₂X̃₂ + m₂

   Then Ỹ and Z̃ have the same joint distribution as Y and Z. The concatenation of X̃₁ and X̃₂ is a standard Gaussian vector. Therefore, Ỹ and Z̃ are jointly Gaussian because they can be expressed as

      [Ỹ]   [A₁  0 ][X̃₁]   [m₁]
      [Z̃] = [0   A₂][X̃₂] + [m₂]

   Since Ỹ and Z̃ have the same joint distribution as Y and Z, then Y and Z are also jointly Gaussian.

   Theorem 3.5-8  If Y and Z are two uncorrelated jointly Gaussian variables, then Y and Z are independent and Gaussian.

   Proof  By Theorem (3.5-3), we can express

      [Y]     [X₁]   [m₁]
      [Z] = S [X₂] + [m₂]

   where X is a standard Gaussian vector and S is positive semi-definite. Partition S as

      S = [S₁₁  S₁₂]
          [S₂₁  S₂₂]

   By the definition of "uncorrelated," we must have S₁₂ = S₂₁ = 0. Therefore, partitioning X into X₁ and X₂, and partitioning m into m₁ and m₂, we can write

      Y = S₁₁X₁ + m₁
      Z = S₂₂X₂ + m₂

   Since Y and Z are functions of the independent vectors X₁ and X₂, Y and Z are independent and Gaussian.

Since any two independent vectors are uncorrelated, Theorem (3.5-8) proves that independence and lack of correlation are equivalent for Gaussians.

We previously covered marginal distributions of Gaussian vectors. The following theorem considers conditional distributions. We will directly consider only conditional distributions of nonsingular Gaussians. Since the results of the theorem involve inverses, there are obvious difficulties that cannot be circumvented by avoiding the use of probability density functions in the proof.

   Theorem 3.5-9  Let Y₁ and Y₂ be jointly Gaussian variables with a nonsingular joint distribution. Partition the mean, covariance, and inverse covariance of the joint distribution as

      m = [m₁]    Λ = [Λ₁₁  Λ₁₂]    Γ = Λ⁻¹ = [Γ₁₁  Γ₁₂]
          [m₂]        [Λ₂₁  Λ₂₂]              [Γ₂₁  Γ₂₂]

   Then the conditional distributions of Y₁ given Y₂, and of Y₂ given Y₁, are Gaussian with means and covariances

      E{Y₁|Y₂} = m₁ + Λ₁₂Λ₂₂⁻¹(y₂ − m₂)                    (3.5-18a)

      cov(Y₁|Y₂) = Λ₁₁ − Λ₁₂Λ₂₂⁻¹Λ₂₁ = (Γ₁₁)⁻¹                    (3.5-18b)

      E{Y₂|Y₁} = m₂ + Λ₂₁Λ₁₁⁻¹(y₁ − m₁)                    (3.5-19a)

      cov(Y₂|Y₁) = Λ₂₂ − Λ₂₁Λ₁₁⁻¹Λ₁₂ = (Γ₂₂)⁻¹                    (3.5-19b)

   Proof  The joint probability density function of Y₁ and Y₂ is

      p(y₁,y₂) = c₁ exp[−(1/2)(y − m)*Γ(y − m)]

   where c₁ is a scalar constant, the magnitude of which we will not need to compute. Expanding the exponent, and recognizing that Γ₁₂ = Γ₂₁*, gives

      p(y₁,y₂) = c₁ exp{−(1/2)[(y₁ − m₁)*Γ₁₁(y₁ − m₁) + 2(y₁ − m₁)*Γ₁₂(y₂ − m₂) + (y₂ − m₂)*Γ₂₂(y₂ − m₂)]}

   Completing squares results in

      p(y₁,y₂) = c₁ exp{−(1/2)[y₁ − m₁ + Γ₁₁⁻¹Γ₁₂(y₂ − m₂)]*Γ₁₁[y₁ − m₁ + Γ₁₁⁻¹Γ₁₂(y₂ − m₂)] − (1/2)(y₂ − m₂)*(Γ₂₂ − Γ₂₁Γ₁₁⁻¹Γ₁₂)(y₂ − m₂)}                    (3.5-21)

   Integrating this expression with respect to y₁ gives the marginal density function of Y₂. The second term in the exponent does not involve y₁, and we recognize the first term as the exponent in a Gaussian density function with mean m₁ − Γ₁₁⁻¹Γ₁₂(y₂ − m₂) and covariance Γ₁₁⁻¹. Its integral with respect to y₁ is therefore a constant independent of y₂. The marginal density of Y₂ is therefore

      p(y₂) = c₂ exp[−(1/2)(y₂ − m₂)*(Γ₂₂ − Γ₂₁Γ₁₁⁻¹Γ₁₂)(y₂ − m₂)]                    (3.5-22)

   where c₂ is a constant. Note that because we know that Equation (3.5-22) must be a probability density function, we need not compute the value of c₂; this saves us a lot of work. Equation (3.5-22) is an expression for a Gaussian density function with mean m₂ and covariance (Γ₂₂ − Γ₂₁Γ₁₁⁻¹Γ₁₂)⁻¹. The partitioned matrix inversion lemma (Appendix A) gives us

      (Γ₂₂ − Γ₂₁Γ₁₁⁻¹Γ₁₂)⁻¹ = Λ₂₂

   thus independently verifying the result of Theorem (3.5-6) on the marginal distribution. The conditional density of Y₁ given Y₂ is obtained using Bayes' rule, by dividing Equation (3.5-21) by Equation (3.5-22):

      p(y₁|y₂) = c₃ exp{−(1/2)[y₁ − m₁ + Γ₁₁⁻¹Γ₁₂(y₂ − m₂)]*Γ₁₁[y₁ − m₁ + Γ₁₁⁻¹Γ₁₂(y₂ − m₂)]}

   where c₃ is a constant. This is an expression for a Gaussian density function with mean m₁ − Γ₁₁⁻¹Γ₁₂(y₂ − m₂) and covariance Γ₁₁⁻¹. The partitioned matrix inversion lemma (Appendix A) then gives

      Γ₁₁⁻¹ = Λ₁₁ − Λ₁₂Λ₂₂⁻¹Λ₂₁        −Γ₁₁⁻¹Γ₁₂ = Λ₁₂Λ₂₂⁻¹

   Thus the conditional distribution of Y₁ given Y₂ is Gaussian with mean m₁ + Λ₁₂Λ₂₂⁻¹(y₂ − m₂) and covariance Λ₁₁ − Λ₁₂Λ₂₂⁻¹Λ₂₁, as we desired to prove. The conditional distribution of Y₂ given Y₁ follows by symmetry.
The f i n a l r e s u l t o f t h i s section concerns sums o f Gnussian variables. Theorem 3.5-10 I f Y and Y are j o i n t l y Gaussian random vectors a f q u a l l e n g t h and t h e i r j o i n t d:strfbution has mean and covarlance p a r t i t i o n e d as

Then Y, + Y,
"1
+

Al,

A12

I s 6russian w i t h man m, + At,.

+ m,

and covariance

Proof - Apply T h e o m (3.5-5)

w l t h 5 = [I

I] and b = 0.

A s l l p l e s u R I r y of t h l s section i s t h a t l i n e a r oper&tions on b u s s f a n variables g l v e Gaussian r e s u l t s . This p r l n - i p l e i s not generally t r u e f o r non1l;rrar oporatfons. Therefore. b u s s l s n distributions arc s t w n g l y associated w l t h the analysis of l l n e r r system.

3.5.4
3.5.4
Central L i m t t Theorem

Tne Central L i m i t Theorem i s o f t e n used as a basis f o r j u s t i f y i n g the assunrption t h a t the d i s t r i b u t i o n o f some physical quantity i s approximately Gaussian.

Theorem 3.5-11 Let Y₁, Y₂, ... be a sequence of independent, identically distributed random vectors with finite mean m and covariance Λ. Then the vectors

$$Z_N = N^{-1/2} \sum_{i=1}^{N} (Y_i - m)$$

converge in distribution to a Gaussian vector with mean zero and covariance Λ.

Proof See Ash (1970, p. 171) and Apostol (1969, p. 567).

Cramér (1946) discusses several variants on this theorem, where the Yᵢ need not be independent and identically distributed, but other requirements are placed on the distributions. The general result is that sums of random variables tend to Gaussian limits under fairly broad conditions. The precise conditions will not concern us here. An implication of this theorem is that macroscopic behavior which is the result of the summation of a large number of microscopic events often has a Gaussian distribution. The classic example is Brownian motion. We will illustrate the Central Limit Theorem with a simple example.

Example 3.5-3 Let the distribution of the Yᵢ in Theorem (3.5-11) be uniform on the interval (-1,1). Then the mean is zero and the covariance is 1/3. Examine the density functions of the first few Z_N. The first function, Z₁, is equal to Y₁, and thus is uniform on (-1,1). Figure (3.5-1) compares the densities of Z₁ and the Gaussian limit. The Gaussian limit distribution has mean zero and variance 1/3. For the second function we have

$$Z_2 = \frac{1}{\sqrt{2}}(Y_1 + Y_2)$$

and the density function of Z₂ is given by

$$p(z) = \frac{\sqrt{2} - |z|}{2} \quad \text{for } |z| \le \sqrt{2}$$

and zero elsewhere. Figure (3.5-2) compares the density of Z₂ with the Gaussian limit. The density function of Z₃ is given by

$$p(z) = \begin{cases} \dfrac{\sqrt{3}}{8}\,(3 - 3z^2) & |z| \le \dfrac{1}{\sqrt{3}} \\[1ex] \dfrac{\sqrt{3}}{16}\,(3 - \sqrt{3}\,|z|)^2 & \dfrac{1}{\sqrt{3}} < |z| \le \sqrt{3} \end{cases}$$

Figure (3.5-3) compares the density of Z₃ with the Gaussian limit. By the time N is 3, Z_N is already becoming reasonably close to Gaussian.

Figure (3.5-1). Density functions of Z₁ and the limit Gaussian.

Figure (3.5-2). Density functions of Z₂ and the limit Gaussian.

Figure (3.5-3). Density functions of Z₃ and the limit Gaussian.
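The convergence shown in the figures can also be watched numerically. A minimal Monte Carlo sketch (NumPy; the replication count is arbitrary) forms Z_N for uniform Yᵢ and tracks the normalized fourth moment, which equals 3 for a Gaussian:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 400_000  # Monte Carlo replications (arbitrary)

def z_n(N):
    """Z_N = N**-0.5 times the sum of N iid uniform(-1, 1) variables."""
    Y = rng.uniform(-1.0, 1.0, size=(M, N))
    return Y.sum(axis=1) / np.sqrt(N)

kurt = {}
for N in (1, 3, 30):
    z = z_n(N)
    assert abs(z.var() - 1 / 3) < 0.01      # variance stays at 1/3 for every N
    kurt[N] = (z**4).mean() / z.var() ** 2  # equals 3 for a Gaussian

print(kurt)  # climbs toward the Gaussian value 3 as N grows
```

For these uniform variables the normalized fourth moment works out to 3 − 6/(5N), so N = 3 is already within about 13 percent of the Gaussian value, matching the visual closeness of Figure (3.5-3).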

CHAPTER 4

4.0 STATISTICAL ESTIMATORS

In this chapter, we introduce the concept of an estimator. We then define some basic measures of estimator performance. We use these measures of performance to introduce several common statistical estimators. The definitions in this chapter are general. Subsequent chapters will treat specific forms. For other treatments of this and related material, see Sorenson (1980), Schweppe (1973), Goodwin and Payne (1977), and Eykhoff (1974). These books also cover other estimators that we do not mention here.
4.1 DEFINITION OF AN ESTIMATOR

The concept of estimation is central to our study. The statistical definition of an estimator is as follows: Perform an experiment (input) U, taken from the set 𝕌 of possible experiments on the system. The system response is a random variable:

$$Z = Z(\xi, U, \omega) \qquad (4.1\text{-}1)$$

where ξ ∈ Ξ is the true value of the parameter vector and ω ∈ Ω is the random component of the system. An estimator is any function of Z and U with range in Ξ. The value of the function is called the estimate:

$$\hat{\xi} = \hat{\xi}(Z, U) = \hat{\xi}(Z(\xi, U, \omega), U) \qquad (4.1\text{-}2)$$

This definition is readily generalized to multiple performances of the same experiment or to the performance of more than one experiment. If N experiments Uᵢ are performed, with responses Zᵢ, then an estimate would be of the form

$$\hat{\xi} = \hat{\xi}(Z_1, \ldots, Z_N, U_1, \ldots, U_N) \qquad (4.1\text{-}3)$$

where the Uᵢ are independent. The N experiments can be regarded as a single "super-experiment," the response to which is the concatenated vector (Z₁, ..., Z_N) ∈ 𝒵 × 𝒵 × ... × 𝒵. The input is (U₁, ..., U_N) ∈ 𝕌 × 𝕌 × ... × 𝕌, and the random element is (ω₁, ..., ω_N) ∈ Ω × Ω × ... × Ω. Equation (4.1-3) is then simply a restatement of Equation (4.1-2) on the larger space.

For simplicity of notation, we will generally omit the dependence on U from Equations (4.1-1) and (4.1-2). For the most part, we will be discussing parameter estimation based on responses to specific, known inputs; therefore, the dependence of the response and the estimate on the input is irrelevant, and merely clutters up the notation. Formally, all of the distributions and expectations may be considered to be implicitly conditioned on U.

Note that the estimate ξ̂ is a random variable because it is a function of Z, which is a random variable. When the experiment is actually performed, specific realizations of these random variables will be obtained. The true parameter value ξ is not usually considered to be random, simply unknown. In some situations, however, it is convenient to define ξ as a random variable instead of as an unknown parameter. The significant difference between these approaches is that a random variable has a probability distribution, which constitutes additional information that can be used in the random-variable approach. Several popular estimators can only be defined using the random-variable approach. These advantages of the random-variable approach are balanced by the necessity to know the probability distribution of ξ. If this distribution is not known, there are no differences, except in terminology, between the random-variable and unknown-parameter approaches.

A third view of ξ involves ideas from information theory. In this context, ξ is considered to be an unknown parameter as above. Even though ξ is not random, it is defined to have a "probability distribution." This probability distribution does not relate to any randomness of ξ, but reflects our knowledge or information about the value of ξ. Distributions with low variance correspond to a high degree of certainty about the value of ξ, and vice versa. The term "probability distribution" is a misnomer in this context. The terms "information distribution" or "information function" more accurately reflect this interpretation.

In the context of information theory, the marginal or prior distribution p(ξ) reflects the information about ξ prior to performing the experiment. A case where there is no prior information can be handled as a limit of prior distributions with less and less information (variance going to infinity). The distribution of the response Z is a function of the value of ξ. When ξ is a random variable, this is called p(Z|ξ), the conditional distribution of Z given ξ. We will use the same notation when ξ is not random, in order to emphasize the dependence of the distribution on ξ, and for consistency of notation. When p(ξ) is defined, the joint probability density is then

$$p(Z, \xi) = p(Z|\xi)\,p(\xi)$$

The marginal probability density of Z is

$$p(Z) = \int p(Z, \xi)\,d\xi$$

The conditional density of ξ given Z (also called the posterior density) is

$$p(\xi|Z) = \frac{p(Z|\xi)\,p(\xi)}{p(Z)}$$
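For a scalar Gaussian case these densities can be evaluated directly. The sketch below (all numerical values are hypothetical) builds the posterior on a grid from the likelihood and prior, and checks the result against the well-known conjugate closed form:

```python
import numpy as np

# Illustrative scalar problem: prior xi ~ N(m0, s0^2), response Z | xi ~ N(xi, s^2)
m0, s0 = 2.0, 1.5   # prior mean and standard deviation (hypothetical values)
s = 0.8             # measurement noise standard deviation
Z = 3.1             # observed response

grid = np.linspace(-5.0, 10.0, 20001)
dxi = grid[1] - grid[0]

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

prior = normal_pdf(grid, m0, s0)   # p(xi)
like = normal_pdf(Z, grid, s)      # p(Z|xi), viewed as a function of xi
joint = like * prior               # p(Z, xi) = p(Z|xi) p(xi)
pZ = joint.sum() * dxi             # marginal p(Z): integrate the joint over xi
post = joint / pZ                  # posterior p(xi|Z) by Bayes' rule

post_mean = (grid * post).sum() * dxi

# Conjugate closed form for this Gaussian case, for comparison
w = (1 / s0**2) / (1 / s0**2 + 1 / s**2)
closed_mean = w * m0 + (1 - w) * Z
print(post_mean, closed_mean)  # agree to numerical precision
```

The posterior mean lies between the prior mean and the observation, weighted by the relative precisions, which previews the a posteriori expected value estimator of Section 4.3.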

In the information theory context, the posterior distribution reflects information about the value of ξ after the experiment is performed. It accounts for the information known prior to the experiment, and the information gained by the experiment.

The distinctions among the random variable, unknown parameter, and information theory points of view are largely academic. Although the conventional notations differ, the equations used are equivalent in all three cases. Our presentation uses the probability density notation throughout. We see little benefit in repeating identical derivations, substituting the term "information function" for "likelihood function" and changing notation. We derive the basic equations only once, restricting the distinctions among the three points of view to discussions of applicability and interpretation.

4.2 PROPERTIES OF ESTIMATORS

We can define an infinite number of estimators for a given problem. The definition of an estimator provides no means of evaluating these estimators, some of which can be ridiculously poor. This section will describe some of the properties used to evaluate estimators and to select a good estimator for a particular problem. The properties are all expressed in terms of optimality criteria.

4.2.1 Unbiased Estimators

A bias is a consistent or repeatable error. The parameter estimates from any specific data set will always be imperfect. It is reasonable to hope, however, that the estimates obtained from a large set of maneuvers would be centered around the true value. The errors in the estimates might be thought of as consisting of two components: consistent errors and random errors. Random errors are generally unavoidable. Consistent or average errors might be removable.

Let us restate the above ideas more precisely. The bias b of an estimator ξ̂(·) is defined as

$$b(\xi) = E\{\hat{\xi}|\xi\} - \xi = E\{\hat{\xi}(Z(\xi,\omega))|\xi\} - \xi \qquad (4.2\text{-}1)$$

The ξ̂ in these equations is a random variable, not a specific realization. Note that the bias is a function of the true value. It averages out (by the E{·}) the random noise effects, but there is no averaging among the different true values. The bias is also a function of the input U, but this dependence is not usually made explicit. All discussions of bias are implicitly referring to some given input.

An unbiased estimator is defined as an estimator for which the bias is identically zero:

$$b(\xi) = 0 \quad \text{for all } \xi \in \Xi$$

This requirement is quite stringent because it must be met for every value of ξ. Unbiased estimators may not exist for some problems. For other problems, unbiased estimators may exist, but may be too complicated for practical computation. Any estimator that is not unbiased is called biased.

Generally, it is considered desirable for an estimator to be unbiased. This judgment, however, does not apply to all situations. The bias of an estimator measures only the average of its behavior. It is possible for the individual estimates to be so poor that they are ludicrous, yet average out so that the estimator is unbiased. The following example is taken from Ferguson (1967, p. 126).

Example 4.2-1 A telephone operator has been working for 10 minutes and wonders if he would be missed if he took a 20-minute coffee break. Assume that calls are coming in as a Poisson process with the average rate of λ calls per 10 minutes, λ being unknown. The number Z of calls received in the first 10 minutes has a Poisson distribution with parameter λ.

On the basis of Z, the operator desires to estimate θ, the probability of receiving no calls in the next 20 minutes. For a Poisson process, θ = e⁻²λ. If the estimator θ̂(Z) is to be unbiased, we must have

$$E\{\hat{\theta}(Z(\theta,\omega))|\theta\} = \theta \quad \text{for all } \theta \in [0,1]$$

Thus

$$\sum_{z=0}^{\infty} \hat{\theta}(z)\,\frac{e^{-\lambda}\lambda^z}{z!} = e^{-2\lambda}$$

Multiply by e^λ, giving

$$\sum_{z=0}^{\infty} \hat{\theta}(z)\,\frac{\lambda^z}{z!} = e^{-\lambda}$$

Expand the right-hand side as a power series to get

$$\sum_{z=0}^{\infty} \hat{\theta}(z)\,\frac{\lambda^z}{z!} = \sum_{z=0}^{\infty} (-1)^z\,\frac{\lambda^z}{z!}$$

The convergent power series are equal for all λ ∈ [0,∞) if the coefficients are identical. Thus θ̂(Z) = (−1)^Z is the only unbiased estimator of θ for this problem. The operator would estimate the probability of missing no calls as +1 if he had received an even number of calls and −1 if he had received an odd number of calls. This estimator is the only unbiased estimator for the problem, but it is a ridiculously poor one. If the estimates are required to lie in the meaningful range of [0,1], then there is no unbiased estimator, but some quite reasonable biased estimators can be easily constructed.
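A quick simulation makes the pathology concrete. Taking λ = 1 (an arbitrary choice), the estimator θ̂(Z) = (−1)^Z does average to e⁻²λ as required, even though no individual estimate is remotely sensible:

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 1.0                      # arbitrary true rate (calls per 10 minutes)
theta = np.exp(-2 * lam)       # true probability of no calls in the next 20 minutes

Z = rng.poisson(lam, size=1_000_000)
est = (-1.0) ** Z              # the unique unbiased estimator, theta_hat(Z) = (-1)**Z

# Every individual estimate is +1 or -1, yet the average matches theta
print(est.mean(), theta)
```

Every realized estimate is ±1, far outside any plausible probability, while the sample average sits on the true θ: unbiasedness alone says nothing about the quality of individual estimates.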

The bias is a useful tool for studying estimators. In general, it is desirable for the bias to be zero, or at least small. However, because the bias measures only the average properties of the estimates, it cannot be used as the sole criterion for evaluating estimators. It is possible for a biased estimator to be clearly superior to all of the unbiased estimators for a problem.

4.2.2 Minimum Variance Estimators

The variance of an estimator is defined as

$$E\{[\hat{\xi} - E(\hat{\xi}|\xi)][\hat{\xi} - E(\hat{\xi}|\xi)]^*|\xi\}$$

Note that the variance, like the bias, is a function of the input and the true value. The variance alone is not a reasonable measure for evaluating an estimator. For instance, any constant estimator (one that always returns a constant value, ignoring the data) has zero variance. These are obviously poor estimators in most situations.
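Constant estimators are an extreme case of trading variance for bias. Partially shrinking an estimate toward a constant shows the same trade in a less extreme form: a biased estimator can beat the unbiased sample mean in average squared error. A minimal sketch (NumPy; all values are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
xi, sigma, N, M = 1.0, 2.0, 10, 200_000   # illustrative true value, noise, sample sizes

Z = rng.normal(xi, sigma, size=(M, N))
xbar = Z.mean(axis=1)     # unbiased sample mean; average squared error = sigma^2/N

a = 0.7                   # hypothetical shrinkage factor toward the constant 0
shrunk = a * xbar         # biased: its bias is (a - 1) * xi

err_unbiased = ((xbar - xi) ** 2).mean()
err_shrunk = ((shrunk - xi) ** 2).mean()
print(err_unbiased, err_shrunk)  # the biased estimator has the smaller average
```

With these numbers the shrunk estimator wins because the true ξ is small relative to the noise level; for large ξ the comparison reverses, which is why such comparisons must be stated for each value of ξ.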

A more useful measure is the mean square error:

$$E\{(\hat{\xi} - \xi)(\hat{\xi} - \xi)^*|\xi\}$$

The mean square error and variance are obviously identical for unbiased estimators (E{ξ̂|ξ} = ξ). An estimator is uniformly minimum mean-square error if, for every value of ξ, its mean square error is less than or equal to the mean square error of any other estimator. Note that the mean-square error is a symmetric matrix. One symmetric matrix is less than or equal to another if their difference is positive semi-definite.

This definition is somewhat academic at this point because such estimators do not exist except in trivial cases. A constant estimator has zero mean-square error when ξ is equal to the constant. (The performance is poor at other values of ξ.) Therefore, in order to be uniformly minimum mean-square error, an estimator would have to have zero mean-square error for every ξ; otherwise, a constant estimator would be better for that ξ.

The concept of minimum mean-square error becomes more useful if the class of estimators allowed is restricted. An estimator is uniformly minimum mean-square error unbiased if it is unbiased and, for every value of ξ, its mean-square error is less than or equal to that of any other unbiased estimator. Such estimators do not exist for every problem, because the requirement must hold for every value of ξ. Estimators optimum in this sense exist for many problems of interest. The mean-square error and the variance are identical for unbiased estimators, so such optimal estimators are also called uniformly minimum variance unbiased estimators. They are also often called simply minimum variance estimators. This term should be regarded as an abbreviation, because it is not meaningful in itself.

4.2.3 Cramér-Rao Inequality (Efficient Estimators)

The Cramér-Rao inequality is one of the central results used to evaluate the performance of estimators. The inequality gives a theoretical limit to the accuracy that is possible, regardless of the estimator used. In a sense, the Cramér-Rao inequality gives a measure of the information content of the data. Before deriving the Cramér-Rao inequality, let us prove a brief lemma.

Lemma 4.2-1 Let X and Y be two random N-vectors. Then

$$E\{XX^*\} \ge E\{XY^*\}[E\{YY^*\}]^{-1}E\{YX^*\} \qquad (4.2\text{-}6)$$

assuming that the inverse exists.

Proof The proof is done by completing the square. Let A be any nonrandom N-by-N matrix. Then

$$E\{(X - AY)(X - AY)^*\} \ge 0$$

because it is a covariance matrix. Expanding,

$$E\{XX^*\} - E\{XY^*\}A^* - A\,E\{YX^*\} + A\,E\{YY^*\}A^* \ge 0$$

Choose

$$A = E\{XY^*\}[E\{YY^*\}]^{-1} \qquad (4.2\text{-}8)$$

Then

$$E\{XX^*\} \ge E\{XY^*\}[E\{YY^*\}]^{-1}E\{YX^*\} + E\{XY^*\}[E\{YY^*\}]^{-1}E\{YX^*\} - E\{XY^*\}[E\{YY^*\}]^{-1}E\{YY^*\}[E\{YY^*\}]^{-1}E\{YX^*\}$$

or

$$E\{XX^*\} \ge E\{XY^*\}[E\{YY^*\}]^{-1}E\{YX^*\}$$

completing the lemma.

We now seek to find a bound on E{(ξ̂ − ξ)(ξ̂ − ξ)*|ξ}, the mean square error of the estimate.
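As an aside, Lemma (4.2-1) can be spot-checked numerically: estimate the three moment matrices from samples of any correlated pair of random vectors and verify that the difference of the two sides is positive semi-definite. A sketch (dimensions and mixing matrices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, M = 3, 500_000   # vector length and sample count (arbitrary)

# Correlated zero-mean X and Y built from a common latent source W
W = rng.normal(size=(M, 2 * n))
X = W @ rng.normal(size=(2 * n, n))
Y = W @ rng.normal(size=(2 * n, n)) + 0.1 * rng.normal(size=(M, n))

Exx = X.T @ X / M
Exy = X.T @ Y / M
Eyy = Y.T @ Y / M

# E{XX*} - E{XY*}[E{YY*}]^-1 E{YX*} should be positive semi-definite
gap = Exx - Exy @ np.linalg.solve(Eyy, Exy.T)
gap = 0.5 * (gap + gap.T)            # symmetrize against roundoff
eigmin = np.linalg.eigvalsh(gap).min()
print(eigmin)  # nonnegative (up to roundoff), as the lemma requires
```

The subtracted term is exactly the second moment of the best linear prediction of X from Y, which is the completing-the-square argument of the proof in sample form.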

Theorem 4.2-2 (Cramér-Rao) Assume that the density p(Z|ξ) exists and is smooth enough to allow the operations below. (See Cramér (1946) for details.) This assumption proves adequate for most cases of interest to us. Pitman (1979) discusses some of the cases where p(Z|ξ) is not as smooth as required here. Then

$$E\{(\hat{\xi}(Z) - \xi)(\hat{\xi}(Z) - \xi)^*|\xi\} \ge [I + \nabla_\xi b(\xi)]\,M(\xi)^{-1}\,[I + \nabla_\xi b(\xi)]^* \qquad (4.2\text{-}10)$$

where

$$M(\xi) = E\{[\nabla_\xi^* \ln p(Z|\xi)][\nabla_\xi \ln p(Z|\xi)]|\xi\}$$

Proof Let X and Y of Lemma (4.2-1) be ξ̂(Z) − ξ and ∇ξ* ln p(Z|ξ), respectively, and let all the expectations in the lemma be conditioned on ξ. Concentrate first on the term

$$E\{XY^*|\xi\} = \int (\hat{\xi}(Z) - \xi)[\nabla_\xi \ln p(Z|\xi)]\,p(Z|\xi)\,d|Z| \qquad (4.2\text{-}12)$$

where d|Z| is the volume element in the space of Z. Substituting the relation

$$[\nabla_\xi \ln p(Z|\xi)]\,p(Z|\xi) = \nabla_\xi p(Z|\xi) \qquad (4.2\text{-}13)$$

gives

$$E\{XY^*|\xi\} = \int \hat{\xi}(Z)[\nabla_\xi p(Z|\xi)]\,d|Z| - \int \xi[\nabla_\xi p(Z|\xi)]\,d|Z| \qquad (4.2\text{-}14)$$

Now ξ̂(Z) is not a function of ξ. Therefore, assuming sufficient smoothness of p(Z|ξ) as a function of ξ, the first term becomes

$$\int \hat{\xi}(Z)\,\nabla_\xi p(Z|\xi)\,d|Z| = \nabla_\xi \int \hat{\xi}(Z)\,p(Z|\xi)\,d|Z| = \nabla_\xi E\{\hat{\xi}(Z)|\xi\} \qquad (4.2\text{-}15)$$

Using the definition (Equation (4.2-1)) of the bias, obtain

$$\nabla_\xi E\{\hat{\xi}(Z)|\xi\} = \nabla_\xi[\xi + b(\xi)] = I + \nabla_\xi b(\xi) \qquad (4.2\text{-}16)$$

In the second term of Equation (4.2-14), ξ is not a function of Z, so

$$\int \xi\,\nabla_\xi p(Z|\xi)\,d|Z| = \xi\,\nabla_\xi \int p(Z|\xi)\,d|Z| = \xi\,\nabla_\xi 1 = 0 \qquad (4.2\text{-}17)$$

Using Equations (4.2-16) and (4.2-17) in Equation (4.2-14) gives

$$E\{XY^*|\xi\} = I + \nabla_\xi b(\xi) \qquad (4.2\text{-}18)$$

Define the Fisher information matrix

$$M(\xi) \equiv E\{YY^*|\xi\} = E\{[\nabla_\xi^* \ln p(Z|\xi)][\nabla_\xi \ln p(Z|\xi)]|\xi\} \qquad (4.2\text{-}19)$$

Then by Lemma (4.2-1),

$$E\{(\hat{\xi}(Z) - \xi)(\hat{\xi}(Z) - \xi)^*|\xi\} \ge [I + \nabla_\xi b(\xi)]\,M(\xi)^{-1}\,[I + \nabla_\xi b(\xi)]^* \qquad (4.2\text{-}10)$$

which is the desired result.

Equation (4.2-10) is the Cramér-Rao inequality. Its specialization to unbiased estimators is of particular interest. For an unbiased estimator, b(ξ) is zero, so

$$E\{(\hat{\xi}(Z) - \xi)(\hat{\xi}(Z) - \xi)^*|\xi\} \ge M(\xi)^{-1} \qquad (4.2\text{-}20)$$

This gives us a lower bound, as a function of ξ, on the achievable variance of any unbiased estimator. An unbiased estimator which attains the equality in Equation (4.2-20) is called an efficient estimator. No estimator can achieve a lower variance than an efficient estimator except by introducing a bias in the estimates. In this sense, an efficient estimator makes the most use of the information available in the data. The above development gives no guarantee that an efficient estimator exists for every problem. When an efficient estimator does exist, it is also a uniformly minimum variance unbiased estimator. It is much easier to check for equality in Equation (4.2-20) than to directly prove that no other unbiased estimator has a smaller variance than a given estimator. The Cramér-Rao inequality is therefore useful as a sufficient (but not necessary) check that an estimator is uniformly minimum variance unbiased.

A useful alternative expression for the information matrix M can be obtained if p(Z|ξ) is sufficiently smooth. Applying Equation (4.2-13) to the definition of M (Equation (4.2-19)) gives

$$M(\xi) = E\left\{\frac{[\nabla_\xi^* p(Z|\xi)][\nabla_\xi p(Z|\xi)]}{p(Z|\xi)^2}\,\bigg|\,\xi\right\} \qquad (4.2\text{-}21)$$

Then examine

$$E\{\nabla_\xi^2 \ln p(Z|\xi)|\xi\} = E\left\{\frac{\nabla_\xi^2 p(Z|\xi)}{p(Z|\xi)} - \frac{[\nabla_\xi^* p(Z|\xi)][\nabla_\xi p(Z|\xi)]}{p(Z|\xi)^2}\,\bigg|\,\xi\right\}$$

The second term is equal to M(ξ), as shown in Equation (4.2-21). Evaluate the first term as

$$E\left\{\frac{\nabla_\xi^2 p(Z|\xi)}{p(Z|\xi)}\,\bigg|\,\xi\right\} = \int \nabla_\xi^2 p(Z|\xi)\,d|Z| = \nabla_\xi^2 \int p(Z|\xi)\,d|Z| = 0$$

Thus an alternate expression for the information matrix is

$$M(\xi) = -E\{\nabla_\xi^2 \ln p(Z|\xi)|\xi\} \qquad (4.2\text{-}24)$$
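Both expressions for the information matrix, the outer-product form (4.2-19) and the second-gradient form (4.2-24), can be checked on the simplest case: N independent observations Zᵢ ~ N(ξ, σ²) with σ known, for which both give M = N/σ², and the sample mean attains the bound M⁻¹ = σ²/N. A simulation sketch (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
xi, sigma, N, reps = 2.0, 1.5, 20, 200_000   # illustrative values, sigma known

# For N independent Z_i ~ N(xi, sigma^2):
#   score:           d/dxi   ln p(Z|xi) = sum(Z_i - xi) / sigma^2
#   second gradient: d2/dxi2 ln p(Z|xi) = -N / sigma^2   (nonrandom here)
Z = rng.normal(xi, sigma, size=(reps, N))
score = (Z - xi).sum(axis=1) / sigma**2

M_outer = (score**2).mean()      # outer-product form of M, by Monte Carlo
M_hessian = N / sigma**2         # minus the expected second gradient

var_xbar = Z.mean(axis=1).var()  # variance of the unbiased sample-mean estimator
print(M_outer, M_hessian, var_xbar, 1 / M_hessian)
```

The two forms of M agree, and the sample-mean variance matches M⁻¹: the sample mean is an efficient estimator for this problem.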

4.2.4 Bayesian Optimal Estimators

The optimality conditions of the previous sections have been quite restrictive in that they must hold simultaneously for every possible value of ξ. Thus for some problems, no estimators exist that are optimal by these criteria. The Bayesian approach avoids this difficulty by using a single, overall optimality criterion which averages the errors made for different values of ξ. With this approach, an optimal estimator may be worse than a nonoptimal one for specific values of ξ, but the overall averaged performance of the Bayesian optimal estimator will be better.

The Bayesian approach requires that a loss function (risk function, optimality criterion) be defined as a function of the true value ξ and the estimate ξ̂. The most common loss function is a weighted square error

$$J(\xi, \hat{\xi}) = (\xi - \hat{\xi})^* R\,(\xi - \hat{\xi}) \qquad (4.2\text{-}25)$$

where R is a weighting matrix. An estimator is considered optimal in the Bayesian sense if it minimizes the a posteriori expected value of the loss function:

$$E\{J(\xi, \hat{\xi}(Z))|Z\} = \int J(\xi, \hat{\xi}(Z))\,p(\xi|Z)\,d\xi \qquad (4.2\text{-}26)$$

An optimal estimator must minimize this expected value for each Z. Since p(Z) is not a function of ξ̂, it does not affect the minimization of Equation (4.2-26) with respect to ξ̂. Thus a Bayesian optimal estimator also minimizes the expression

$$\int J(\xi, \hat{\xi}(Z))\,p(Z|\xi)\,p(\xi)\,d\xi \qquad (4.2\text{-}27)$$

Note that p(ξ), the probability density of ξ, is required in order to define Bayesian optimality. For this purpose, p(ξ) can be considered simply as a weighting that is part of the loss function, if it cannot appropriately be interpreted as a true probability density or an information function (Section 4.1).

4.2.5 Asymptotic Properties

Asymptotic properties concern the characteristics of the estimates as the amount of data used increases toward infinity. The amount of data used can increase either by repeating experiments or by increasing the time slice analyzed in a single experiment. (The latter is pertinent only for dynamic systems.) Since only a finite amount of data can be used in practice, it is not immediately obvious why there is any interest in asymptotic properties.

This interest arises primarily from considerations of simplicity. It is often simpler to compute asymptotic properties and to construct asymptotically optimal estimators than to do so for finite amounts of data. We can then use the asymptotic results as good approximations to the more difficult finite data results if the amount of data used is large enough. The finite data definitions of unbiased estimators and efficient estimators have direct asymptotic analogues of interest. An estimator is asymptotically unbiased if the bias goes to zero for all ξ as the amount of data goes to infinity. An estimator is asymptotically efficient if it is asymptotically unbiased and if

$$E\{(\hat{\xi} - \xi)(\hat{\xi} - \xi)^*|\xi\}\,M(\xi) \to I \qquad (4.2\text{-}28)$$

as the amount of data approaches infinity. Equation (4.2-28) is an asymptotic expression for equality in Equation (4.2-20).

One important asymptotic property has no finite data analogue. This is the notion of consistency. An estimator is consistent if ξ̂ → ξ as the amount of data goes to infinity. For strong consistency, the convergence is required to be with probability one. Note that strong consistency is defined in terms of the convergence of individual realizations of the estimates, unlike the bias, variance, and other properties, which are defined in terms of average properties (expected values). Consistency is a stronger property than asymptotic unbiasedness; that is, all consistent estimators are asymptotically unbiased. This is a basic convergence result: convergence with probability one implies convergence in distribution (and thus, specifically, convergence in mean). We refer the reader to Lipster and Shiryayev (1977), Cramér (1946), Goodwin and Payne (1977), Zacks (1971), and Mehra and Lainiotis (1976) for this and other results on consistency. Results on consistency tend to involve careful mathematical arguments relating to different types of convergence. We will not delve deeply into asymptotic properties such as consistency in this book. We generally feel that asymptotic properties, although theoretically intriguing, should be played down in practical application. Application of infinite-time results to finite data is an approximation, one that is sometimes useful, but sometimes gives completely misleading conclusions (see Section 8.2). The inconsistency should be evident in books that spend copious time arguing fine points of distinction between different kinds of convergence and then pass off application to finite data with cursory allusions to using large data samples.

Although we de-emphasize the "rigorous" treatment of asymptotic properties, some asymptotic results are crucial to practical implementation. This is not because of any improved rigor of the asymptotic results, but because the asymptotic results are often simpler, sometimes enough simpler to make the critical difference in usability. This is our primary use of asymptotic results: as simplifying approximations to the finite-time results. Introduction of complicated convergence arguments hides this essential role. The approximations work well in many cases and, as with most approximations, fail in some situations. Our emphasis in asymptotic results will center on justifying when they are appropriate and understanding when they fail.

4.3 COMMON ESTIMATORS

This section will define some of the commonly used general types of estimators. The list is far from complete; we mention only those estimators that will be used in this book. We also present a few general results characterizing the estimators.

4.3.1 A posteriori Expected Value

One of the most natural estimates is the a posteriori expected value. This estimate is defined as the mean of the posterior distribution:

$$\hat{\xi}(Z) = E\{\xi|Z\} = \int \xi\,p(\xi|Z)\,d\xi \qquad (4.3\text{-}1)$$

This estimator requires that p(ξ), the prior density of ξ, be known.

4.3.2 Bayesian Minimum Risk

Bayesian optimality was defined in Section 4.2.4. Any estimator which minimizes the a posteriori expected value of the loss function is a Bayesian minimum risk estimator. (In general, there can be more than one such estimator for a given problem.) The prior distribution of ξ must be known to define Bayesian estimators.

Theorem 4.3-1 The a posteriori expected value (Section 4.3.1) is the unique Bayesian minimum risk estimator for the loss function

$$J(\xi, \hat{\xi}) = (\xi - \hat{\xi})^* R\,(\xi - \hat{\xi}) \qquad (4.3\text{-}2)$$

where R is any positive definite symmetric matrix.

Proof A Bayesian minimum risk estimator must minimize

$$E\{J|Z\} = E\{(\xi - \hat{\xi}(Z))^* R\,(\xi - \hat{\xi}(Z))|Z\}$$

Since R is symmetric, the gradient of this function is

$$\nabla_{\hat{\xi}}\,E\{J|Z\} = -2E\{R(\xi - \hat{\xi}(Z))|Z\}^*$$

Setting this expression to zero gives

$$0 = R\,E\{\xi - \hat{\xi}(Z)|Z\} = R[E\{\xi|Z\} - \hat{\xi}(Z)]$$

Therefore

$$\hat{\xi}(Z) = E\{\xi|Z\}$$

is the unique stationary point of E{J|Z}. The second gradient is

$$\nabla_{\hat{\xi}}^2\,E\{J|Z\} = 2R > 0$$

so the stationary point is the global minimum.

Theorem (4.3-1) applies only for the quadratic loss function of Equation (4.3-2). The following very similar theorem applies to a much broader class of loss functions, but requires the assumption that p(ξ|Z) is symmetric about its mean. Theorem (4.3-1) makes no assumptions about p(ξ|Z) except that it has finite mean and variance.

Theorem 4.3-2 Assume that p(ξ|Z) is symmetric about its mean for each Z; i.e.,

$$p_{\xi|Z}(\hat{\xi}(Z) + \zeta|Z) = p_{\xi|Z}(\hat{\xi}(Z) - \zeta|Z) \qquad (4.3\text{-}8)$$

where ξ̂(Z) is the expected value of ξ given Z. Then the a posteriori expected value is the unique Bayesian minimum risk estimator for any loss function of the form

$$J(\xi, \hat{\xi}) = J_1(\xi - \hat{\xi}) \qquad (4.3\text{-}9)$$

where J₁ is symmetric about 0 and is strictly convex.

Proof We need to demonstrate that

$$D(a) \equiv E\{J(\xi, \hat{\xi}(Z) + a)|Z\} - E\{J(\xi, \hat{\xi}(Z))|Z\} > 0 \qquad (4.3\text{-}10)$$

for all a ≠ 0. Using Equation (4.3-9) and the definition of expectation,

$$D(a) = \int p(\xi|Z)\left[J_1(\xi - \hat{\xi}(Z) - a) - J_1(\xi - \hat{\xi}(Z))\right]d\xi \qquad (4.3\text{-}11)$$

Because of the symmetry of p(ξ|Z), we can replace the integral in Equation (4.3-11) by an integral over the region

$$S = \{\xi : (\xi - \hat{\xi}(Z), a) \ge 0\}$$

Using the symmetry of J₁ gives

$$D(a) = \int_S p(\xi|Z)\left[J_1(\xi - \hat{\xi}(Z) - a) + J_1(\xi - \hat{\xi}(Z) + a) - 2J_1(\xi - \hat{\xi}(Z))\right]d\xi$$

By the strict convexity of J₁,

$$J_1(\xi - \hat{\xi}(Z) - a) + J_1(\xi - \hat{\xi}(Z) + a) > 2J_1(\xi - \hat{\xi}(Z))$$

for all a ≠ 0. Therefore D(a) > 0 for all a ≠ 0, as we desired to show.
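Theorem (4.3-1) is easy to illustrate numerically: for any posterior density, the quadratic a posteriori risk is minimized at the posterior mean. The sketch below uses a deliberately skewed posterior on a grid (the density is an arbitrary illustrative choice, not from the text):

```python
import numpy as np

grid = np.linspace(0.0, 10.0, 10001)
d = grid[1] - grid[0]

# An arbitrary skewed posterior density on the grid (illustrative only)
post = grid**2 * np.exp(-grid)
post /= post.sum() * d

post_mean = (grid * post).sum() * d

def risk(xhat):
    """A posteriori expected quadratic loss E{(xi - xhat)^2 | Z}."""
    return (((grid - xhat) ** 2) * post).sum() * d

candidates = np.linspace(0.0, 10.0, 2001)
best = candidates[np.argmin([risk(c) for c in candidates])]
print(best, post_mean)  # the minimizing estimate sits at the posterior mean
```

Note the posterior here is not symmetric, yet the quadratic-loss minimizer is still the mean; symmetry is needed only for the broader class of loss functions in Theorem (4.3-2).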

Note that if J₁ is convex, but not strictly convex, Theorem (4.3-2) still holds except for the uniqueness. Theorems (4.3-1) and (4.3-2) are two of the basic results in the theory of estimation. They motivate the use of a posteriori expected value estimators.

4.3.3 Maximum a posteriori Probability

The maximum a posteriori probability (MAP) estimate is defined as the mode of the posterior distribution of ξ (i.e., the value which maximizes the posterior density function). If the distribution is not unimodal, the MAP estimate may not be unique. As with the previously discussed estimators, the prior distribution of ξ must be known in order to define the MAP estimate.

The MAP estimate is equal to the a posteriori expected value (and thus to the Bayesian minimum risk for loss functions meeting the conditions of Theorem (4.3-2)) if the posterior distribution is symmetric about its mean and unimodal, since the mode and the mean of such distributions are equal. For nonsymmetric distributions, this equality does not hold.

The MAP estimate is generally much easier to calculate than the a posteriori expected value. The a posteriori expected value is (from Equation (4.3-1))

$$\hat{\xi}(Z) = \frac{\int \xi\,p(Z|\xi)\,p(\xi)\,d\xi}{\int p(Z|\xi)\,p(\xi)\,d\xi} \qquad (4.3\text{-}16)$$

This calculation requires the evaluation of two integrals over Ξ. The MAP estimate requires the maximization of

$$p(\xi|Z) = \frac{p(Z|\xi)\,p(\xi)}{p(Z)} \qquad (4.3\text{-}17)$$

with respect to ξ. The p(Z) is not a function of ξ, so the MAP estimate can also be obtained by

$$\hat{\xi} = \arg\max_{\xi}\,p(Z|\xi)\,p(\xi) \qquad (4.3\text{-}18)$$

The "arg max" notation indicates that ξ̂ is the value of ξ that maximizes the density function p(Z|ξ)p(ξ). The maximization in Equation (4.3-18) is generally much simpler than the integrations in Equation (4.3-16).
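Equation (4.3-18) reduces MAP estimation to a single maximization, which can even be done by brute force on a grid in the scalar case. A sketch for a Gaussian prior and likelihood (all numbers hypothetical; here the posterior is symmetric and unimodal, so the grid maximum should match the conjugate closed form):

```python
import numpy as np

m0, s0 = 0.0, 2.0   # hypothetical Gaussian prior on xi: N(m0, s0^2)
s = 1.0             # measurement noise standard deviation
Z = 1.6             # observed response

grid = np.linspace(-8.0, 8.0, 160001)

# ln p(Z|xi) + ln p(xi), dropping additive constants; p(Z) is omitted since it
# does not depend on xi (Equation (4.3-18))
log_post = -0.5 * ((Z - grid) / s) ** 2 - 0.5 * ((grid - m0) / s0) ** 2

xi_map = grid[np.argmax(log_post)]
xi_closed = (Z / s**2 + m0 / s0**2) / (1 / s**2 + 1 / s0**2)
print(xi_map, xi_closed)  # grid maximization recovers the closed-form estimate
```

Working with the logarithm of p(Z|ξ)p(ξ) is the usual practice: it turns products into sums and avoids numerical underflow, without moving the maximizing value.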

4.3.4 Maximum Likelihood

The previous estimators have all required that the prior distribution of ξ be known. When ξ is not random, or when its distribution is not known, there are far fewer reasonable estimators to choose from. Maximum likelihood estimators are the only type that we will discuss. The maximum likelihood estimate is defined as the value of ξ which maximizes the likelihood functional p(Z|ξ); in other words,

$$\hat{\xi} = \arg\max_{\xi}\,p(Z|\xi) \qquad (4.3\text{-}19)$$
The maximum likelihood estimator is closely related to the MAP estimator. The MAP estimator maximizes p(ξ|Z); heuristically, we could say that the MAP estimator selects the most probable value of ξ, given the data. The maximum likelihood estimator maximizes p(Z|ξ); i.e., it selects the value of ξ which makes the observed data most plausible. Although these may sound like two statements of the same concept, there are crucial differences. One of the most central differences is that maximum likelihood is defined whether or not the prior distribution of ξ is known. Comparing Equation (4.3-18) with Equation (4.3-19) reveals that the maximum likelihood estimate is identical to the MAP estimate if p(ξ) is a constant. If the parameter space Ξ has finite size, this implies that p(ξ) is the uniform distribution. For infinite Ξ, such as Rⁿ, there are no uniform distributions, so a strict equivalence cannot be established. If we relax our definition of a probability distribution to allow arbitrary density functions which need not integrate to 1 (sometimes called generalized probabilities), the equivalence can be established for any Ξ. Alternately, the uniform distribution for infinite-size Ξ can be viewed as a limiting case of distributions with variance going to infinity (less and less prior certainty about the value of ξ). The maximum likelihood estimator places no preference on any value of ξ over any other value of ξ; the estimate is solely a function of the data. The MAP estimate, on the other hand, considers both the data and the preference defined by the prior distribution.
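The limiting argument can be made concrete in the scalar Gaussian case, where the MAP estimate has a closed form: as the prior variance grows, the MAP estimate approaches the maximum likelihood estimate. A minimal sketch (illustrative numbers):

```python
import numpy as np

s = 1.0                          # measurement noise standard deviation (illustrative)
Z = np.array([1.2, 2.0, 0.7])    # hypothetical observations, Z_i ~ N(xi, s^2)

xi_ml = Z.mean()                 # the maximum likelihood estimate for this model

m0 = 0.0                         # prior mean; prior is N(m0, s0^2)
maps = {}
for s0 in (0.5, 2.0, 50.0):      # prior standard deviation growing: prior flattens
    # closed-form MAP estimate for the Gaussian prior and likelihood
    maps[s0] = (Z.sum() / s**2 + m0 / s0**2) / (len(Z) / s**2 + 1 / s0**2)

print(maps, xi_ml)  # the MAP estimates approach the ML estimate as s0 grows
```

A tight prior pulls the MAP estimate strongly toward m0; a nearly flat prior leaves the estimate almost entirely to the data, reproducing the variance-to-infinity limit described above.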
Maximum likelihood estimators have many interesting properties, which we will cover later. One of the most basic is given by the following theorem:

Theorem 4.3-3  If an efficient estimator exists for a problem, that estimator is a maximum likelihood estimator.

Proof  (This proof requires the use of the full notation for probability density functions to avoid confusion.) Assume that ξ̂(Z) is any efficient estimator. An estimator will be efficient if and only if equality holds in Lemma (4.2-1). Equality holds if and only if X = AY in Equation (4.2-6). Substituting for A from Equation (4.2-8) gives

ξ̂(Z) - E{ξ̂(Z)|ξ} = [I + ∇_ξ b(ξ)]M(ξ)⁻¹ ∇*_ξ ln p_{Z|ξ}(Z|ξ)

Efficient estimators must be unbiased, so b(ξ) is zero and

ξ̂(Z) - ξ = M(ξ)⁻¹ ∇*_ξ ln p_{Z|ξ}(Z|ξ)    (4.3-22)

For an efficient estimator, Equation (4.3-22) must hold for all values of Z and ξ. In particular, for each Z, the equation must hold for ξ = ξ̂(Z). The left-hand side is then zero, so we must have

∇*_ξ ln p_{Z|ξ}(Z|ξ)|_{ξ=ξ̂(Z)} = 0    (4.3-23)

The estimate is thus at a stationary point of the likelihood functional. Taking the gradient of Equation (4.3-22) gives

-I = M(ξ)⁻¹ ∇²_ξ ln p_{Z|ξ}(Z|ξ) - M(ξ)⁻¹[∇_ξ M(ξ)]M(ξ)⁻¹ ∇*_ξ ln p_{Z|ξ}(Z|ξ)

Evaluating this at ξ = ξ̂(Z) and using Equation (4.3-23) gives

∇²_ξ ln p_{Z|ξ}(Z|ξ)|_{ξ=ξ̂(Z)} = -M(ξ̂(Z))

Since M is positive definite, the stationary point is a local maximum. In fact, it is the only local maximum, because a local maximum at any point other than ξ = ξ̂(Z) would violate Equation (4.3-22). The requirement for M(ξ) to be finite implies that p_{Z|ξ}(Z|ξ) → 0 as ξ → ∞, so the local maximum will be a global maximum. Therefore ξ̂ is a maximum likelihood estimator.


Corollary  All efficient estimators for a problem are equivalent (i.e., if an efficient estimator exists, it is unique).

This theorem and its corollary are not as useful as they might seem at first glance, because efficient estimators do not exist for many problems. Therefore, it is not always true that a maximum likelihood estimator is efficient. The theorem does apply to some simple problems, however, and motivates the more widely applicable asymptotic results which will be discussed later.

Maximum likelihood estimates have the following natural invariance property: let ξ̂ be the maximum likelihood estimate of ξ; then f(ξ̂) is the maximum likelihood estimate of f(ξ) for any function f. The proof of this statement is trivial if f is invertible. Let L_ξ(ξ,Z) be the likelihood functional of ξ given Z. Define

x = f(ξ)

Then the likelihood functional of x is

L_x(x,Z) = L_ξ(f⁻¹(x),Z)

This is the crucial equation. By definition, the left-hand side is maximized by x = x̂, and the right-hand side is maximized by f⁻¹(x) = ξ̂. Therefore

x̂ = f(ξ̂)    (4.3-26)

The extension to noninvertible f is straightforward: simply realize that f⁻¹(x) is then a set of values, rather than a single value. The same argument still holds, regarding L_x(x,Z) as a one-to-many function (set-valued function).

Finally, let us emphasize that, although maximum likelihood estimates are formally identical to MAP estimates with uniform prior distributions, there is a basic theoretical difference in interpretation. Maximum likelihood makes no statements about distributions of ξ, prior or posterior. Stating that a parameter has a uniform prior distribution is drastically different from saying that we have no information about the parameter. Several classic "paradoxes" of probability theory resulted from ignoring this difference. The paradoxes arise in transformations of variable. Let a scalar ξ have a uniform prior distribution, and let f be any continuous invertible function. Then, by Equation (3.4-1), x = f(ξ) has the density function

p_x(x) = p_ξ(f⁻¹(x))|∇_x f⁻¹(x)|

which is not a uniform distribution on x (unless f is linear). Thus if we say that there is no prior information (uniform distribution) about ξ, then this gives us prior information (nonuniform distribution) about x, and vice versa. This apparent paradox results from equating a uniform distribution with the idea of "no information." Therefore, although we can formally derive the equations for maximum likelihood estimators by substituting uniform prior distributions in the equations for MAP estimators, we must avoid misinterpretations. Fisher (1921, p. 326) discussed this subject at length:

    There would be no need to emphasize the baseless character of the assumptions made under the titles of inverse probability and BAYES' Theorem in view of the decisive criticism to which they have been exposed.... I must indeed plead guilty in my original statement of the Method of Maximum Likelihood (9) to having based my argument upon the principle of inverse probability; in the same paper, it is true, I emphasized the fact that such inverse probabilities were relative only. That is to say, that while one might speak of one value of p as having an inverse probability three times that of another value of p, we might on no account introduce the differential element dp, so as to be able to say that it was three times as probable that p should lie in one rather than the other of two equal elements. Upon consideration, therefore, I perceive that the word probability is wrongly used in such a connection: probability is a ratio of frequencies, and about the frequencies of such values we can know nothing whatever. We must return to the actual fact that one value of p, of the frequency of which we know nothing, would yield the observed result three times as frequently as would another value of p. If we need a word to characterize this relative property of different values of p, I suggest that we may speak without confusion of the likelihood of one value of p being thrice the likelihood of another, bearing always in mind that likelihood is not here used loosely as a synonym of probability, but simply to express the relative frequencies with which such values of the hypothetical quantity p would in fact yield the observed sample.
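The transformation-of-variables paradox described above can be made concrete with a small sketch (the choice f(ξ) = ξ² on (0,1) is an assumed example, not from the text): a uniform density on ξ induces a distinctly nonuniform density on x = f(ξ).

```python
# A uniform prior on xi over (0,1) does not stay uniform under x = f(xi).
# With the assumed example f(xi) = xi**2, the change-of-variables formula
# gives p_x(x) = p_xi(f^{-1}(x)) * |d f^{-1}(x)/dx|.
def p_x(x):
    # f^{-1}(x) = sqrt(x), d f^{-1}/dx = 1/(2 sqrt(x)); p_xi is 1 on (0,1)
    return 1.0 / (2.0 * x ** 0.5)

# The induced density is far from constant, so a "uniform" prior on xi is
# informative prior knowledge about x.
print(p_x(0.01), p_x(0.25), p_x(1.0))
```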

CHAPTER 5

5.0  THE STATIC ESTIMATION PROBLEM

This chapter begins the application of the general types of estimators defined in Chapter 4 to specific problems. The problems discussed in this chapter are static estimation problems; that is, problems where time is not explicitly involved. Subsequent chapters on dynamic systems draw heavily on these static results. Our treatment is far from complete; it is easy to spend an entire book on static estimation alone (Sorenson, 1980). The material presented here was selected largely on the basis of relevance to dynamic systems. We concentrate primarily on linear systems with additive Gaussian noise, where there are simple, closed-form solutions. We also cover nonlinear systems with additive Gaussian noise, which will prove of major importance in Chapter 8. Non-Gaussian and nonadditive noise are mentioned only briefly, except for the special problem of estimation of variance. We will initially treat nonsingular problems, where we assume that all relevant distributions have density functions. The understanding and handling of singular and ill-conditioned problems then receive special attention. Singularities and ill-conditioning are crucial issues in practical application, but are insufficiently treated in much of the current literature. We also discuss partitioning of estimation problems, an important technique for simplifying the computational task and treating some singularities.

The general form of a static system model is

Z = f(ξ,U,ω)    (5.0-1)

We apply a known specific input U (or a set of inputs) to the system, and measure the response Z. The vector ω is a random vector contaminating the measured system response. We desire to estimate the value of ξ.

The estimators discussed in Chapter 4 require knowledge of the conditional distribution of Z given ξ and U. We assume, for now, that the distribution is nonsingular, with density p(Z|ξ,U). If ξ is considered random, you must know the joint density p(Z,ξ|U). In some simple cases, these densities might be given directly, in which case Equation (5.0-1) is not necessary; the estimators of Chapter 4 apply directly. More typically, p(Z|ξ,U) is a complicated density which is derived from Equation (5.0-1) and p(ω|ξ,U); it is often reasonable to assume quite simple distributions for ω, independent of ξ and U. In this chapter, we will look at several specific cases.

5.1  LINEAR SYSTEMS WITH ADDITIVE GAUSSIAN NOISE

The simplest and most classic results are obtained for linear static systems with additive Gaussian noise. The system equations are assumed to have the form

Z = C(U)ξ + D(U) + G(U)ω    (5.1-1)

For any particular U, Z is a linear combination of ξ, ω, and a constant vector. Note that there are no assumptions about linearity with respect to U; the functions C, D, and G can be arbitrarily complicated. Throughout this section, we omit the explicit dependence on U from the notation. Similarly, all distributions and expectations are implicitly understood to be conditioned on U.

The random noise vector ω is assumed to be Gaussian and independent of ξ. By convention, we will define the mean of ω to be 0, and the covariance to be identity. This convention does not limit the generality of Equation (5.1-1), for if ω has a mean m and a finite covariance FF*, we can define G̃ = GF and D̃ = D + Gm to obtain

Z = Cξ + D̃ + G̃ω̃    (5.1-2)

where ω̃ has zero mean and identity covariance.
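The normalization above can be checked numerically. The following sketch (all matrices and values are made-up assumptions, not from the text) builds the same response from the original noise ω and from the standardized noise with G̃ = GF and D̃ = D + Gm:

```python
import numpy as np

# Original model: Z = C xi + D + G w, where w has mean m_w and covariance F F*.
rng = np.random.default_rng(0)
C = np.array([[1.0, 0.5], [0.0, 2.0], [1.0, 1.0]])
D = np.array([0.1, -0.2, 0.3])
G = np.array([[0.5, 0.0], [0.2, 0.4], [0.0, 0.3]])
m_w = np.array([1.0, -1.0])
F = np.array([[2.0, 0.0], [0.5, 1.0]])
xi = np.array([0.7, -1.2])

v = rng.standard_normal(2)        # v has zero mean and identity covariance
w = m_w + F @ v                   # so w has mean m_w and covariance F F*

Z_original = C @ xi + D + G @ w

# Normalized model: G~ = G F, D~ = D + G m_w, driven by the standardized v.
G_tilde = G @ F
D_tilde = D + G @ m_w
Z_normalized = C @ xi + D_tilde + G_tilde @ v

print(np.allclose(Z_original, Z_normalized))   # the two models are identical
```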

When ξ is considered as random, we will assume that its marginal (prior) distribution is Gaussian with mean m_ξ and covariance P:

p(ξ) = |2πP|^(-1/2) exp(- (1/2)(ξ - m_ξ)*P⁻¹(ξ - m_ξ))    (5.1-3)

Equation (5.1-3) assumes that P is nonsingular. We will discuss the implications and handling of singular cases later.

5.1.1  Joint Distribution of Z and ξ

Several d l s t r l b u t i o n s which can be derived from Equation (5.1-1) w i l l be r q u l n d I n order t o analyze t h l s s y s t w . I * t us f i r s t conslder p(ZIc), the conditional density of Z given C. This d i s t r l b u t i o n i s defined whether C I s r a n d m o r not. I f { i s given, then Equation (5.1-1) i s simply the sun o f a copstant vector and r constant n a t r l x times a (irusslrn vector. Uslng the p m p e r t l e s o f Iirussian distributions discussed i n Chapter 3, we see t h a t the conditional d i s t r l b u t l o n o f Z glven { l s Qussfrn w l t h man and covarirncr.

Thus. assuming t h s t

GG*

i s nonsingular,

~ ( ~ 1 6 )I ? ~ G G * I - ' / ~exl,(-

i( Z - cc - D ) * ( G G * ) - ~ ( z- CL - D)]

(5.1-6) . I l y defrne the distribution


c' ir,dependent

i s random, w i t h m a r g i n ~ ldensity given by Eqbation (5.1-3), we can a l s o mea-.'7 If j o i n t d i s t r i b u t i o n o f Z and 6 , the conditional d i s t t :::stion o f 6 givcn I . and t h r ma -in: o f 2. For the marglnal d i s t r i b u t i o n o f Z, note t h a t Equation (5.1-11 i s a l i n e a r comb.inr:ion Qussian vectors. Therefore Z 1s Gaussian w f t h mean and covariance

cov(Z) = CPC* + GG* For the j o i n t d i s t r i b u t f o n o f 6 and

( 5 1-8)

Z. we now r e q u ~ r etne cross-correlation

E([Z The j o i n t d i s t r i b u t i o n o f 5 and

- E ( Z ) I [ t - E(01'1

* C P

Z i s thus Gaussian w f t h mean and covariance

PC*

r , .rote t h s t t h i s j o i n t d i s t r i b u t i o n could a l s o be derived b m u l t i p l y i n g Equations (5.1-3) and (5.1-6) according t o Aayes r u l e . That d e r i v a t i o n a r r l v e s a t the same r e s u l t s f o r Fquations (5.1-10) and ( 5 . 1 - l l ) , b u t i s much more tedious. o f i n a l l y , we can deri,ie the conditional d l s t r i h u t i o n o f 5 given Z ( t h e p o s t e r i o r d + s t ~ , i b u t i o n f 6 ) from the j o i n t d i s t r i b u t i o n o f and Z. Applying Theorem (3.5-9) t o Equations (5.1-10) and (5.1-11). we see t h a t the conditional O i s t r i b u t i o l o f F given Z i s Gacssian w i t h m a n and covariance

, Equations (5.1-12; and (5.1-13) assume t h a t CPC* + GG* i s nonsingular. IC t h i s matrix i singular. the problem i s i l l - p o s e d and should Le restated. W w i l l discuss the s i n s u l a r case l a t e r . e

Assuming t h a t P. GG*. and (C*(GG*)-'C + P - I ) are nonsingslar, we can use the m a t r i x illversion l e m s . ( l e n m s (-1.1-3) and (A.l-4)). t o put Equations (5.1-12) and (5.1-13) i n t o forms t h a t w i l l prove i n t u i t i v r : y useful.

the form o f W w i l l have much occasion t o contrast the form o f E uations (5.1-12) and (5.1-13) ~ 4 t h e Equations (5.1-141 tnd (5.1-151. W w i l l c a l l Equations 15.1-12, and (5.1-13) the covariance form because they e r r e i n t e n s of the uninverted covariances P and GG*. E q u a t i o ~ ~(5.1-14) and (5.1-15) are c a l l e d the i n f o r s n a t i o n Corm because they are i n t e n s o f the inverses P-' and (GG*]'l, which are r e l a t e d t o , t r amo:lt o f infotn:ation. (The l a r g e r the covariance, thc less information you have, and v i c e versa.) Equation (5.1-15) has an i n t e r p r e t a t i o n as a d d i t i o n o f information: P-I i s the amount of p r i o r informatisn about c. and CC(GG*)"C i s the amount o f informat,ion i n the measurement; the t o t a l i n f o m t i o n a f t e r the ~,.rasurement i s thc sum o f these two terns. 5.1.2
A Posteriori Estimators

Let us first examine the three types of estimators that are based on the posterior distribution p(ξ|Z). These three types of estimators are a posteriori expected value, maximum a posteriori probability, and Bayesian minimum risk.

We previously derived the expression for the a posteriori expected value in the process of defining the posterior distribution. Either the covariance or information form can be used. We will use the information form because it ties in with other approaches, as will be seen below. The a posteriori expected value estimator is thus

ξ̂ = m_ξ + (C*(GG*)⁻¹C + P⁻¹)⁻¹C*(GG*)⁻¹(Z - Cm_ξ - D)    (5.1-16)

The maximum a posteriori probability estimate is equal to the a posteriori expected value because the posterior distribution is Gaussian (and thus unimodal and symmetric about its mean). This fact suggests an alternate derivation of Equation (5.1-16) which is quite enlightening. To find the maximum point of the posterior distribution of ξ given Z, write

p(ξ|Z) = p(Z|ξ)p(ξ)/p(Z)    (5.1-17)

Expanding the logarithm of this equation using Equations (5.1-3) and (5.1-6) gives

ln p(ξ|Z) = - (1/2)(Z - Cξ - D)*(GG*)⁻¹(Z - Cξ - D) - (1/2)(ξ - m_ξ)*P⁻¹(ξ - m_ξ) + a(Z)    (5.1-18)

where a(Z) is a function of Z only. Equation (5.1-18) shows the problem in its "least squares" form. We are attempting to choose ξ to minimize the weighted squares of (ξ - m_ξ) and (Z - Cξ - D). The matrices P⁻¹ and (GG*)⁻¹ are weightings used in the cost function. The larger the value of (GG*)⁻¹, the more importance is placed on minimizing (Z - Cξ - D), and vice versa.

Obtain the estimate by setting the gradient of Equation (5.1-18) to zero, as suggested by Equation (3.5-17):

0 = C*(GG*)⁻¹(Z - Cξ̂ - D) - P⁻¹(ξ̂ - m_ξ)    (5.1-19)

Write this as

0 = C*(GG*)⁻¹(Z - Cm_ξ - D) - P⁻¹(ξ̂ - m_ξ) - C*(GG*)⁻¹C(ξ̂ - m_ξ)    (5.1-20)

and the solution is

ξ̂ = m_ξ + (C*(GG*)⁻¹C + P⁻¹)⁻¹C*(GG*)⁻¹(Z - Cm_ξ - D)    (5.1-21)

assuming that the inverses exist.
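As a numerical check of the matrix inversion lemma step (with illustrative matrices, not from the text), the covariance form (5.1-12)/(5.1-13) and the information form (5.1-14)/(5.1-15) give identical posterior quantities:

```python
import numpy as np

# Made-up matrices for a small linear Gaussian problem Z = C xi + D + G w.
C = np.array([[1.0, 2.0], [0.5, -1.0], [1.0, 0.0]])
D = np.array([0.2, -0.1, 0.3])
GG = np.diag([0.5, 0.8, 0.4])          # GG* (noise covariance), nonsingular
P = np.array([[2.0, 0.3], [0.3, 1.0]]) # prior covariance, nonsingular
m = np.array([1.0, -1.0])              # prior mean m_xi
Z = np.array([1.5, 0.7, -0.2])         # a measurement

# Covariance form, Equations (5.1-12) and (5.1-13)
S = C @ P @ C.T + GG
mean_cov = m + P @ C.T @ np.linalg.solve(S, Z - C @ m - D)
cov_cov = P - P @ C.T @ np.linalg.solve(S, C @ P)

# Information form, Equations (5.1-14) and (5.1-15)
M = C.T @ np.linalg.inv(GG) @ C + np.linalg.inv(P)
mean_inf = m + np.linalg.solve(M, C.T @ np.linalg.inv(GG) @ (Z - C @ m - D))
cov_inf = np.linalg.inv(M)

print(np.allclose(mean_cov, mean_inf), np.allclose(cov_cov, cov_inf))
```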

For Gaussian distributions, Equation (3.5-18) gives the covariance as

cov(ξ|Z) = [-∇²_ξ ln p(ξ|Z)]⁻¹ = (C*(GG*)⁻¹C + P⁻¹)⁻¹    (5.1-22)

Note that the second gradient is negative definite (and the covariance positive definite), verifying that the solution is a maximum of the posterior probability density function. This derivation does not require the use of matrix inversion lemmas, or the expression from Chapter 3 for the Gaussian conditional distribution. For more complicated problems, such as conditional distributions of N jointly Gaussian vectors, the alternate derivation as in Equations (5.1-17) to (5.1-22) is much easier than the straightforward derivation as in Equations (5.1-10) to (5.1-15).

Because of the symmetry of the posterior distribution, the Bayesian optimal estimate is also equal to the a posteriori expected value estimate if the Bayes loss function meets the criteria of Theorem (4.3-1).

We will now examine the statistical properties of the estimator given by Equation (5.1-16). Since the estimator is a linear function of Z, the bias is easy to compute:

b(ξ) = E{ξ̂ - ξ|ξ}
     = E{m_ξ + (C*(GG*)⁻¹C + P⁻¹)⁻¹C*(GG*)⁻¹(Z - Cm_ξ - D) - ξ|ξ}
     = [I - (C*(GG*)⁻¹C + P⁻¹)⁻¹C*(GG*)⁻¹C](m_ξ - ξ)    (5.1-23)

The estimator is biased for ξ ≠ m_ξ, given nonsingular P and GG*. The scalar case gives some insight into this bias. If ξ is scalar, the factor in brackets in Equation (5.1-23) lies between 0 and 1. As GG* decreases and/or P increases, the factor approaches 0, as does the bias. In this case, the estimator obtains less information from the initial guess of ξ (which has large covariance), and more information from the measurement (which has small covariance). If the situation is reversed, GG* increasing and/or P decreasing, the bias becomes larger. In this case, the estimator shows an increasing predilection to ignore the measured response and to keep the initial guess of ξ.

The variance and mean square error are also easy to compute. The variance of ξ̂ follows directly from Equations (5.1-16) and (5.1-5):

cov(ξ̂|ξ) = (C*(GG*)⁻¹C + P⁻¹)⁻¹C*(GG*)⁻¹GG*(GG*)⁻¹C(C*(GG*)⁻¹C + P⁻¹)⁻¹
          = (C*(GG*)⁻¹C + P⁻¹)⁻¹C*(GG*)⁻¹C(C*(GG*)⁻¹C + P⁻¹)⁻¹    (5.1-24)

The mean square error is then

mse(ξ̂|ξ) = cov(ξ̂|ξ) + b(ξ)b(ξ)*    (5.1-25)

which is evaluated using Equations (5.1-23) and (5.1-24).
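The bias expression can be verified numerically. Since the estimator is linear in Z, E{ξ̂|ξ} follows by substituting the noise-free response Cξ + D for Z in Equation (5.1-16); the sketch below (with made-up matrices) compares this with the closed form of Equation (5.1-23):

```python
import numpy as np

# Illustrative matrices for the linear Gaussian problem (assumptions).
C = np.array([[1.0, 2.0], [0.5, -1.0], [1.0, 0.0]])
GG = np.diag([0.5, 0.8, 0.4])
P = np.array([[2.0, 0.3], [0.3, 1.0]])
m = np.array([1.0, -1.0])
D = np.array([0.2, -0.1, 0.3])
xi = np.array([0.4, 2.0])              # an arbitrary true value

S = C.T @ np.linalg.inv(GG) @ C        # measurement information C*(GG*)^-1 C
A = np.linalg.inv(S + np.linalg.inv(P))

Z_mean = C @ xi + D                    # E{Z | xi}, by linearity of E
xi_hat_mean = m + A @ C.T @ np.linalg.inv(GG) @ (Z_mean - C @ m - D)
bias_direct = xi_hat_mean - xi

bias_formula = (np.eye(2) - A @ S) @ (m - xi)   # Equation (5.1-23)
print(np.allclose(bias_direct, bias_formula))
```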

The most obvious question to ask in relation to Equations (5.1-24) and (5.1-25) is how they compare with other estimators and with the Cramer-Rao bound. Let us evaluate the Cramer-Rao bound. The Fisher information matrix (Equation (4.2-19)) is easy to compute using Equation (5.1-6):

M(ξ) = C*(GG*)⁻¹C    (5.1-26)

Thus the Cramer-Rao bound for unbiased estimators is

mse(ξ̂|ξ) ≥ (C*(GG*)⁻¹C)⁻¹    (5.1-27)
Note that, for some values of ξ, the a posteriori expected value estimator has a lower mean-square error than the Cramer-Rao bound for unbiased estimators; naturally, this is because the estimator is biased. To compute the Cramer-Rao bound for an estimator with bias given by Equation (5.1-23), we need to evaluate

I + ∇_ξ b(ξ) = (C*(GG*)⁻¹C + P⁻¹)⁻¹C*(GG*)⁻¹C    (5.1-28)

The Cramer-Rao bound is then (from Equation (4.2-10))

mse(ξ̂|ξ) ≥ (C*(GG*)⁻¹C + P⁻¹)⁻¹C*(GG*)⁻¹C(C*(GG*)⁻¹C + P⁻¹)⁻¹    (5.1-29)

Note that the estimator does not achieve the Cramer-Rao bound except at the single point ξ = m_ξ. At every other point, the second term in Equation (5.1-25) is positive, and the first term is equal to the bound; therefore, the mse is greater than the bound.

For a single observation, we can say in summary that the a posteriori estimator is optimal Bayesian for a large class of loss functions, but it is biased and does not achieve the Cramer-Rao lower bound. It remains to investigate the asymptotic properties.

The asymptotic behavior of estimators for static systems is defined in terms of N independent repetitions of the experiment, where N approaches infinity. We must first define the application of the a posteriori estimator to repeated experiments. Assume that the system model is given by Equation (5.1-1), with ξ distributed according to Equation (5.1-3). Perform N experiments U₁...U_N. (It does not matter whether the U_i are distinct.) The corresponding system matrices are C_i, D_i, and G_i, and the measurements are Z_i. The random noise ω_i is an independent, zero-mean, identity-covariance, Gaussian vector for each i. The maximum a posteriori estimate of ξ is given by

ξ̂ = m_ξ + [Σᵢ C_i*(G_iG_i*)⁻¹C_i + P⁻¹]⁻¹ Σᵢ C_i*(G_iG_i*)⁻¹(Z_i - C_im_ξ - D_i)    (5.1-30)

assuming that the inverses exist.

The asymptotic properties are defined for repetition of the same experiment, so we do not need the full generality of Equation (5.1-30). If U_i = U_j, C_i = C_j, D_i = D_j, and G_i = G_j for all i and j, Equation (5.1-30) can be written

ξ̂ = m_ξ + [NC*(GG*)⁻¹C + P⁻¹]⁻¹C*(GG*)⁻¹ Σᵢ₌₁ᴺ (Z_i - Cm_ξ - D)    (5.1-31)
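For identical repeated experiments, the per-experiment sums collapse as claimed, since the common factor C*(GG*)⁻¹ can be pulled outside the sum. A small check (illustrative matrices and data, assumed for the example):

```python
import numpy as np

# N identical experiments: the general form (5.1-30) collapses to (5.1-31).
C = np.array([[1.0, 2.0], [0.5, -1.0]])
D = np.array([0.2, -0.1])
GG = np.diag([0.5, 0.8])
P = np.array([[2.0, 0.3], [0.3, 1.0]])
m = np.array([1.0, -1.0])
Zs = [np.array([1.5, 0.7]), np.array([0.9, -0.3]), np.array([2.1, 0.4])]
N = len(Zs)

Wi = C.T @ np.linalg.inv(GG)           # C*(GG*)^-1, common to every term

# Equation (5.1-30): sum information and weighted residuals term by term
info = sum(Wi @ C for _ in Zs) + np.linalg.inv(P)
resid = sum(Wi @ (Z - C @ m - D) for Z in Zs)
xi_30 = m + np.linalg.solve(info, resid)

# Equation (5.1-31): pull the common factor outside the sum
xi_31 = m + np.linalg.solve(N * (Wi @ C) + np.linalg.inv(P),
                            Wi @ sum(Z - C @ m - D for Z in Zs))
print(np.allclose(xi_30, xi_31))
```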

Compute the bias, covariance, and mse of this estimate in the same manner as Equations (5.1-23) to (5.1-25):

b(ξ) = [I - (NC*(GG*)⁻¹C + P⁻¹)⁻¹NC*(GG*)⁻¹C](m_ξ - ξ)    (5.1-32)

cov(ξ̂|ξ) = [NC*(GG*)⁻¹C + P⁻¹]⁻¹NC*(GG*)⁻¹C[NC*(GG*)⁻¹C + P⁻¹]⁻¹    (5.1-33)

mse(ξ̂|ξ) = cov(ξ̂|ξ) + b(ξ)b(ξ)*    (5.1-34)

The Cramer-Rao bound for unbiased estimators is

mse(ξ̂|ξ) ≥ (NC*(GG*)⁻¹C)⁻¹    (5.1-35)

As N increases, Equation (5.1-32) goes to zero, so the estimator is asymptotically unbiased. The effect of increasing N is exactly comparable to increasing (GG*)⁻¹; as we take more and better quality measurements, the estimator depends more heavily on the measurements and less on its initial guess. The estimator is also asymptotically efficient as defined by Equation (4.2-28), because

N cov(ξ̂|ξ) → (C*(GG*)⁻¹C)⁻¹    (5.1-36)

N b(ξ)b(ξ)* → 0    (5.1-37)

5.1.3  Maximum Likelihood Estimator
The derivation of the expression for the maximum likelihood estimator is similar to the derivation of the maximum a posteriori probability estimator done in Equations (5.1-17) to (5.1-22). The only difference is that instead of ln p(ξ|Z), we maximize

ln p(Z|ξ) = - (1/2)(Z - Cξ - D)*(GG*)⁻¹(Z - Cξ - D) + a(Z)    (5.1-38)

The only relevant difference between Equation (5.1-38) and Equation (5.1-18) is the inclusion of the term based on the prior distribution of ξ in Equation (5.1-18). (The a(Z) are also different, but this is of no consequence at the moment.) The maximum likelihood estimate does not make use of the prior distribution; indeed it does not require that such a distribution exist. We will see that many of the MLE results are equal to the MAP results with the terms from the prior distribution omitted.

Find the maximum point of Equation (5.1-38) by setting the gradient to zero:

0 = C*(GG*)⁻¹(Z - Cξ̂ - D)    (5.1-39)

The solution, assuming that C*(GG*)⁻¹C is nonsingular, is given by

ξ̂ = (C*(GG*)⁻¹C)⁻¹C*(GG*)⁻¹(Z - D)    (5.1-40)

This is the same form as that of the MAP estimate, Equation (5.1-21), with P⁻¹ set to zero.
A particularly simple case occurs when C = I and D = 0. In this event, Equation (5.1-40) reduces to

ξ̂ = Z    (5.1-41)

Note that the expression (C*(GG*)⁻¹C)⁻¹C*(GG*)⁻¹ is a left-inverse of C; that is,

[(C*(GG*)⁻¹C)⁻¹C*(GG*)⁻¹]C = I

We can view the estimator given by Equation (5.1-40) as a pseudo-inverse of the system given by Equation (5.1-1). Using both equations, write

ξ̂ = (C*(GG*)⁻¹C)⁻¹C*(GG*)⁻¹(Cξ + D + Gω - D)
   = ξ + (C*(GG*)⁻¹C)⁻¹C*(GG*)⁻¹Gω    (5.1-42)

Although we must use Equation (5.1-40) to compute ξ̂, because ξ and ω are not known, Equation (5.1-42) is useful in analyzing and understanding the behavior of the estimator. One interesting point is immediately obvious from Equation (5.1-42): the estimate is simply the sum of the true value plus the effect of the contaminating noise ω. For the particular realization ω = 0, the estimate is exactly equal to the true value. This property, which is not shared by the a posteriori estimators, is closely related to the bias. Indeed, the bias of the maximum likelihood estimator is immediately evident from Equation (5.1-42).

b(ξ) = 0    (5.1-43)

The maximum likelihood estimate is thus unbiased. Note that Equation (5.1-32) for the MAP bias gives the same result if we substitute 0 for P⁻¹.

Since the estimator is unbiased, the covariance and mean square error are equal. Using Equation (5.1-42), they are given by

mse(ξ̂|ξ) = cov(ξ̂|ξ) = (C*(GG*)⁻¹C)⁻¹    (5.1-44)

We can also obtain this result from Equations (5.1-33) and (5.1-34) for the MAP estimator by substituting 0 for P⁻¹.

We previously computed the Cramer-Rao bound for unbiased estimators for this problem (Equation (5.1-27)). The mean square error of the maximum likelihood estimator is exactly equal to the Cramer-Rao bound. The maximum likelihood estimator is thus efficient and is, therefore, a minimum variance unbiased estimator. The maximum likelihood estimator is not, in general, Bayesian optimal. Bayesian optimality may not even be defined, since ξ need not be random.
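These results can be exercised numerically. The sketch below (made-up matrices) confirms the exact-recovery property of Equation (5.1-42) for ω = 0 and the left-inverse property of the estimator matrix:

```python
import numpy as np

# Illustrative matrices for Z = C xi + D + G w (assumptions, not from the text).
C = np.array([[1.0, 2.0], [0.5, -1.0], [1.0, 0.0]])
D = np.array([0.2, -0.1, 0.3])
GG = np.diag([0.5, 0.8, 0.4])
xi_true = np.array([0.7, -1.2])

Z = C @ xi_true + D                    # response for the realization w = 0

W = C.T @ np.linalg.inv(GG)
xi_ml = np.linalg.solve(W @ C, W @ (Z - D))    # Equation (5.1-40)
print(np.allclose(xi_ml, xi_true))             # exact recovery when w = 0

# The same expression is a left-inverse of C:
left_inv = np.linalg.inv(W @ C) @ W
print(np.allclose(left_inv @ C, np.eye(2)))
```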

The MLE results for repeated experiments can be obtained from the corresponding MAP equations by substituting zero for P⁻¹ and m_ξ. We will not repeat these equations here.

5.1.4  Comparison of Estimators

We have seen that the maximum likelihood estimator is unbiased and efficient, whereas the a posteriori estimators are only asymptotically unbiased and efficient. On the other hand, the a posteriori estimators are Bayesian optimal for a large class of loss functions. Thus neither estimator emerges as an unchallenged favorite. The reader might reasonably expect some guidance as to which estimator to choose for a given problem.

The roles of the two estimators are actually quite distinct and well-defined. The maximum likelihood estimator does the best possible job (in the sense of minimum mean square error) of estimating the value of ξ based on the measurements alone, without prejudice (bias) from any preconceived guess about the value. The maximum likelihood estimator is thus the obvious choice when we have no prior information. Having no prior information is analogous to having a prior distribution with infinite variance; i.e., P⁻¹ = 0. In this regard, examine Equation (5.1-16) for the a posteriori estimate as P⁻¹ goes to zero. The limit is (assuming that C*(GG*)⁻¹C is nonsingular)


ξ̂ = m_ξ + (C*(GG*)⁻¹C)⁻¹C*(GG*)⁻¹(Z - Cm_ξ - D)
   = m_ξ - (C*(GG*)⁻¹C)⁻¹C*(GG*)⁻¹Cm_ξ + (C*(GG*)⁻¹C)⁻¹C*(GG*)⁻¹(Z - D)
   = (C*(GG*)⁻¹C)⁻¹C*(GG*)⁻¹(Z - D)    (5.1-45)

which is equal to the maximum likelihood estimate. The maximum likelihood estimate is thus a limiting case of an a posteriori estimator as the variance of the prior distribution approaches infinity.

The a posteriori estimate combines the information from the measurements with the prior information to obtain the optimal estimate considering both sources. This estimator makes use of more information and thus can obtain more accurate estimates, on the average. With this improved average accuracy comes a bias in favor of the prior estimate. If the prior estimate is good, the a posteriori estimate will generally be more accurate than the maximum likelihood estimate. If the prior estimate is poor, the a posteriori estimate will be poor. The advantages of the a posteriori estimators thus depend heavily on the accuracy of the prior estimate of the value.

The basic criterion in deciding whether to use an MAP or MLE estimator is whether you want estimates based only on the current data or based on both the current data and the prior information. The MLE estimate is based only on the current data, and the MAP estimate is based on both the current data and the prior distribution.

The distinction between the MLE and MAP estimators often becomes blurred in practical application. The estimators are closely related in numerical computation, as well as in theory. An MAP estimate can be an intermediate computational step to obtaining a final MLE estimate, or vice versa. The following paragraphs describe one of these situations; the other situation is discussed in Section 5.2.2.

It is quite common to have a prior guess of the parameters, but to desire an independent verification of the value based on the measurements alone. In this case, the maximum likelihood estimator is the appropriate tool in order to make the estimates independent of the initial guess.

A two-step estimation is often the most appropriate way to obtain maximum insight into a problem. First, use the maximum likelihood estimator to obtain the best estimates based on the measurements alone, ignoring any prior information. Then consider the prior information in order to obtain a final best estimate based on both the measurements and the prior information. By this two-step approach, we can see where the information is coming from: the prior distribution, the measurements, or both sources. The two-step approach also allows the freedom to independently choose the methodology for each step. For instance, we might desire to use a maximum likelihood estimator for obtaining the estimates based on the measurements, but use engineering judgment to establish the best compromise between the prior expectations and the maximum likelihood results. This is often the best approach because it may be difficult to completely and accurately characterize the prior information in terms of a specific probability distribution. The prior information often includes heuristic factors such as the engineer's judgment of what would constitute reasonable results.
The theory o f s u f f i c i e n t s t a t i s t i c s (Ferguson, 1967; Cramer, 1940; and Fisher, 1921) i s useful i n t h e two-step cpproach i f we desire t o use s t a t i s t i c a l techniques for both steps. The maximum l i k e l i h o o d estimate and i t s covariance fcrm a s u f f i c i e n t s t a t i s t i c f o r t h i s problem. Although we w i l l not go i n t o d e t a i l here. i f we know the maximum l i k e l i h o o d estirrate and i t s covariance, we know a l l o f the s t a t i s t i c a l l y useful infonnat i o n t h a t can be extracted from the data. The specific a p p l i c a t i o n i s t h a t the a posteriori estimates can be w r i t t e n i n terms o f the maximum l i k e l i h o o d estimate and i t s covariance instead o f as a d i r e c t f u n c t i o n o f the data. The following expression i s easy t o v e r i f y using Equations (5.1-16). (5.1-40). and (5.1-44):
ξ̂ = (Q^-1 + P^-1)^-1 (Q^-1 ξ̂_m + P^-1 m_ξ)     (5.1-46)

where ξ̂ is the a posteriori estimate (Equation (5.1-16)), ξ̂_m is the maximum likelihood estimate (Equation (5.1-40)), and Q is the covariance of the maximum likelihood estimate (Equation (5.1-44)). In this form, the relationship between the a posteriori estimate and the maximum likelihood estimate is plain. The prior distribution is the only factor which enters into the relationship; it has nothing directly to do with the measured data or even with what experiment was performed.
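Numerically, this combination of the prior and the maximum likelihood result is an information-weighted average. A minimal sketch in Python, using hypothetical scalar values for the estimates and covariances:

```python
# Hypothetical scalar values: a maximum likelihood estimate xi_ml with
# covariance Q, and a prior with mean m_xi and covariance P.
xi_ml, Q = 2.0, 0.5
m_xi, P = 0.0, 1.0

# A posteriori estimate: information-weighted combination of the prior and
# the maximum likelihood result (the scalar case of this combination).
P_post = 1.0 / (1.0 / Q + 1.0 / P)
xi_post = P_post * (xi_ml / Q + m_xi / P)

print(xi_post)  # lies between m_xi and xi_ml, weighted toward the smaller covariance
```

The estimate is pulled toward whichever source has the smaller covariance, making the origin of the information visible.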

Equation (5.1-46) is closely related to the measurement-partitioning ideas of the next section. Both relate to combining data from two different sources.

5.2 PARTITIONING IN ESTIMATION PROBLEMS

Partitioning estimation problems has some of the same benefits as partitioning optimization problems. A problem half the size of the original typically takes well less than half the effort to solve. Therefore, we can often come out ahead by partitioning a problem into smaller subproblems. Of course, this trick only works if the solutions to the subproblems can easily be combined to give the solution to the original problem. Two kinds of partitioning applicable to parameter estimation problems are measurement partitioning and parameter partitioning. Both of these schemes permit easy combination of the subproblem solutions in some situations.

5.2.1 Measurement Partitioning

A problem with multiple measurements can often be partitioned into a sequence of subproblems processing the measurements one at a time. The same principle applies to partitioning a vector measurement into a series of scalar (or shorter vector) measurements; the only difference is notational.

The estimators under discussion are all based on p(Z|ξ) or, for a posteriori estimators, p(ξ|Z). We will initially consider measurement partitioning as a problem in factoring these density functions. Let the measurement Z be partitioned into two measurements, Z₁ and Z₂. (Extensions to more than two partitions follow the same principles.) We would like to factor p(Z|ξ) into separate factors dependent on Z₁ and Z₂. By Bayes' rule, we can always write

p(Z|ξ) = p(Z₂|Z₁,ξ) p(Z₁|ξ)     (5.2-1)

This form does not directly achieve the required separation because p(Z₂|Z₁,ξ) involves both Z₁ and Z₂. To achieve the required separation, we introduce the requirement that

p(Z₂|Z₁,ξ) = p(Z₂|ξ)     (5.2-2)

We will call this the Markov criterion. Heuristically, the Markov criterion assures that p(Z₁|ξ) contains all of the useful information we can extract from Z₁. Therefore, having computed p(Z₁|ξ) at the measured value of Z₁, we have no further need for Z₁. If the Markov criterion does not hold, then there are interactions that require Z₁ and Z₂ to be considered together instead of separately. For systems with additive noise, the Markov criterion implies that the noise in Z₂ is independent of that in Z₁. Note that this does not mean that Z₂ is independent of Z₁. For systems where the Markov criterion holds, we can substitute Equation (5.2-2) into Equation (5.2-1) to get

p(Z|ξ) = p(Z₂|ξ) p(Z₁|ξ)     (5.2-3)

which is the desired factorization of p(Z|ξ). When ξ has a prior distribution, the factorization of p(ξ|Z) follows from that of p(Z|ξ):

p(ξ|Z) = p(Z₂|ξ) p(Z₁|ξ) p(ξ) / p(Z)     (5.2-4)

The mixing of Z₁ and Z₂ in the p(Z) in the denominator is not important, because the denominator is merely a normalizing constant, independent of ξ. It will prove convenient to write Equation (5.2-4) in the form

p(ξ|Z) = k p(Z₂|ξ) p(ξ|Z₁)     (5.2-5)

where k is a normalizing constant independent of ξ.
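The role of the Markov criterion in this factorization can be checked numerically. The sketch below assumes a hypothetical scalar system Z_i = ξ + ω_i with unit-variance Gaussian noise: when the noise samples are independent, the joint density factors as in Equation (5.2-3); when they are correlated, it does not:

```python
import math

def gauss1(x, mean, var):
    # Scalar Gaussian density
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gauss2(z1, z2, mean, rho):
    # Bivariate Gaussian density, unit variances, correlation rho
    q = ((z1 - mean) ** 2 - 2.0 * rho * (z1 - mean) * (z2 - mean)
         + (z2 - mean) ** 2) / (1.0 - rho ** 2)
    return math.exp(-q / 2.0) / (2.0 * math.pi * math.sqrt(1.0 - rho ** 2))

xi, z1, z2 = 0.7, 2.0, -1.0   # hypothetical parameter value and measurements

# Independent noise (rho = 0): the joint density equals the product of factors
assert abs(gauss2(z1, z2, xi, 0.0)
           - gauss1(z1, xi, 1.0) * gauss1(z2, xi, 1.0)) < 1e-12

# Correlated noise (rho = 0.5): the factorization fails; the measurements
# would have to be processed together
print(gauss2(z1, z2, xi, 0.5), gauss1(z1, xi, 1.0) * gauss1(z2, xi, 1.0))
```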

Let us now consider measurement partitioning of an MAP estimator for a system with p(ξ|Z) factored as in Equation (5.2-5). The MAP estimate is

ξ̂ = arg max p(Z₂|ξ) p(ξ|Z₁)     (5.2-6)

This equation is identical in form to Equation (4.3-1a), with p(ξ|Z₁) playing the role of the prior distribution. We have, therefore, the following two-step process for obtaining the MAP estimate by measurement partitioning: First, evaluate the posterior distribution of ξ given Z₁. This is a function of ξ, rather than a single value. Practical application demands that this distribution be easily representable by a few statistics, but we put off such considerations until the next section. Then use this as the prior distribution for an MAP estimator with the measurement Z₂. Provided that the system meets the Markov criterion, the resulting estimate should be identical to that obtained by the unpartitioned MAP estimator.

Measurement partitioning of the MLE estimator follows similar lines, except for some issues of interpretation. The MLE estimate for a system factored as in Equation (5.2-3) is

ξ̂ = arg max p(Z₂|ξ) p(Z₁|ξ)     (5.2-7)

This equation is identical in form to Equation (4.3-18), with p(Z₁|ξ) playing the role of the prior distribution. The two steps of the partitioned MLE estimator are therefore as follows: first, evaluate p(Z₁|ξ) at the measured value of Z₁, giving a function of ξ. Then use this function as the prior density for an MAP estimator with measurement Z₂. Provided that the system meets the Markov criterion, the resulting estimate should be identical to that obtained by the unpartitioned MLE estimator.
The partitioned MLE estimator raises an issue of interpretation of p(Z₁|ξ). It is not a probability density function of ξ. The vector ξ need not even be random. We can avoid the issue of ξ not being random by using information terminology, considering p(Z₁|ξ) to represent the state of our knowledge of ξ based on Z₁ instead of being a probability density function of ξ. Alternately, we can simply consider p(Z₁|ξ) to be a function of ξ that arises at an intermediate step of computing the MLE estimate. The process described gives the correct MLE estimate of ξ, regardless of how we choose to interpret the intermediate steps.

The close connection between MAP and MLE estimators is illustrated by the appearance of an MAP estimator as a step in obtaining the MLE estimate with partitioned measurements. The result can be interpreted either as an MAP estimate based on the measurement Z₂ and the prior density p(Z₁|ξ), or as an MLE estimate based on both Z₁ and Z₂.

5.2.2 Application to Linear Gaussian Systems

We now consider the application of measurement partitioning to linear systems with additive Gaussian noise. We will first consider the partitioned MAP estimator, followed by the partitioned MLE estimator. Let the partitioned system be

Z₁ = C₁ξ + D₁ + G₁ω₁     (5.2-8)
Z₂ = C₂ξ + D₂ + G₂ω₂

where ω₁ and ω₂ are independent Gaussian random variables with mean 0 and covariance I. The Markov criterion requires that ω₁ and ω₂ be independent for measurement partitioning to apply. The prior distribution of ξ is Gaussian with mean m_ξ and covariance P, and is independent of ω₁ and ω₂.

The first step of the partitioned MAP estimator is to compute p(ξ|Z₁). We have previously seen that this is a Gaussian density with mean and covariance given by Equations (5.1-12) and (5.1-13). Denote the mean and covariance of p(ξ|Z₁) by m₁ and P₁. Then, Equations (5.1-12) and (5.1-13) give

m₁ = m_ξ + PC₁*(C₁PC₁* + G₁G₁*)^-1 (Z₁ - C₁m_ξ - D₁)     (5.2-9)
P₁ = P - PC₁*(C₁PC₁* + G₁G₁*)^-1 C₁P     (5.2-10)

The second step is to compute the MAP estimate of ξ using the measurement Z₂ and the prior density p(ξ|Z₁). This step is another application of Equation (5.1-12), using m₁ for m_ξ and P₁ for P. The result is

ξ̂ = m₁ + P₁C₂*(C₂P₁C₂* + G₂G₂*)^-1 (Z₂ - C₂m₁ - D₂)     (5.2-11)

The ξ̂ defined by Equation (5.2-11) is the MAP estimate. It should exactly equal the MAP estimate obtained by direct application of Equation (5.1-12) to the concatenated system. You can consider Equations (5.2-9) through (5.2-11) to be an algebraic rearrangement of the original Equation (5.1-12); indeed, they can be derived in such terms.

Example 5.2-1  Consider a system z = ξ + ω, where ω is Gaussian with mean 0 and covariance 1, and ξ has a Gaussian prior distribution with mean 0 and covariance 1. We make two independent measurements of Z (i.e., the two samples of ω are independent) and desire the MAP estimate of ξ. Suppose the Z₁ measurement is 2 and the Z₂ measurement is -1. Without measurement partitioning, we could proceed as follows: write the concatenated system

Z = [1 1]* ξ + ω

Directly apply Equation (5.1-12) with m_ξ = 0, P = 1, C = [1 1]*, D = 0, G = I, and Z = [2, -1]*. The MAP estimate is then

ξ̂ = PC*(CPC* + GG*)^-1 (Z - Cm_ξ - D) = 1/3

Now consider this same problem with measurement partitioning. To get p(ξ|Z₁), apply Equations (5.2-9) and (5.2-10) with m_ξ = 0, P = 1, C₁ = 1, D₁ = 0, G₁ = 1, and Z₁ = 2.

m₁ = 1(2)^-1 2 = 1
P₁ = 1 - 1(2)^-1 1 = 1/2

For the second step, apply Equation (5.2-11) with m₁ = 1, P₁ = 1/2, C₂ = 1, D₂ = 0, G₂ = 1, and Z₂ = -1.

ξ̂ = 1 + (1/2)(3/2)^-1 (-1 - 1) = 1/3

We see that the results of the two approaches are identical in this example, as claimed. Note that the partitioning removes the requirement to invert a 2-by-2 matrix, substituting two 1-by-1 inversions.
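The two routes through Example 5.2-1 can be reproduced numerically. The sketch below is one way to code the covariance-form updates; the scalar values are those of the example:

```python
import numpy as np

# Example 5.2-1 recomputed. Prior: mean 0, covariance 1; scalar channels with
# C1 = C2 = 1, D1 = D2 = 0, G1 = G2 = 1; measurements Z1 = 2, Z2 = -1.
m, P = 0.0, 1.0
Z1, Z2 = 2.0, -1.0

# Unpartitioned MAP estimate, covariance form, on the concatenated system.
C = np.array([[1.0], [1.0]])
Z = np.array([Z1, Z2])
S = C @ (P * C.T) + np.eye(2)                # CPC* + GG*
xi_full = m + (P * C.T) @ np.linalg.solve(S, Z - C.flatten() * m)

# Partitioned form: condition on Z1 first, then treat the result as the prior.
m1 = m + P / (P + 1.0) * (Z1 - m)            # first-step mean
P1 = P - P / (P + 1.0) * P                   # first-step covariance
xi_part = m1 + P1 / (P1 + 1.0) * (Z2 - m1)   # second-step MAP estimate

print(xi_full.item(), xi_part)  # both equal 1/3
```

The same code pattern extends to vector measurements by replacing the scalar channel quantities with matrices.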

The computational advantages of using the partitioned form of the MAP estimator vary depending on numerous factors. There are numerous other rearrangements of Equations (5.1-12) and (5.1-13). The information form of Equations (5.1-14) and (5.1-15) is often preferable if the required inverses exist. The information form can also be used in the partitioned estimator, replacing Equations (5.2-9) through (5.2-11) with corresponding information forms. Equation (5.1-30) is another alternative, which is often the most efficient. There is at least one circumstance in which a partitioned form is mandatory. This is when the data comes in two separate batches and the first batch of data must be discarded (for any of several reasons, perhaps unavailability of enough computer memory) before processing the second batch. Such circumstances occur regularly. Partitioned estimators are also particularly appropriate when you have already computed the estimate based on the first batch of data before receiving the second batch.

Let us now consider the partitioned MLE estimator. The first step is to compute p(Z₁|ξ). Equation (5.1-38) gives a formula for p(Z₁|ξ). It is immediately evident that the logarithm of p(Z₁|ξ) is a quadratic form in ξ. Therefore, although p(Z₁|ξ) need not be interpreted as a probability density function of ξ, it has the algebraic form of a Gaussian density function, except for an irrelevant constant multiplier. Applying Equations (3.5-17) and (3.5-18) gives the mean and covariance of this function as

m₁ = (C₁*(G₁G₁*)^-1 C₁)^-1 C₁*(G₁G₁*)^-1 (Z₁ - D₁)     (5.2-12)
P₁ = (C₁*(G₁G₁*)^-1 C₁)^-1     (5.2-13)

The second step of the partitioned MLE estimator is identical to the second step of the partitioned MAP estimator. Apply Equation (5.2-11), using the m₁ and P₁ from the first step. For the partitioned MLE estimator, it is most natural (although not required) to use the information form of Equation (5.2-11), which is

P₂ = (C₂*(G₂G₂*)^-1 C₂ + P₁^-1)^-1     (5.2-14)
ξ̂ = m₁ + P₂C₂*(G₂G₂*)^-1 (Z₂ - C₂m₁ - D₂)     (5.2-15)

This form is more parallel to Equations (5.2-12) and (5.2-13).

Example 5.2-2  Consider a maximum likelihood estimator for the problem of Example 5.2-1, ignoring the prior distribution of ξ. To get the MLE estimate for the concatenated system, apply Equation (5.1-40) with C = [1 1]*, D = 0, G = I, and Z = [2, -1]*.

ξ̂ = (2)^-1 [1 1] Z = (1/2)(Z₁ + Z₂) = 1/2

Now consider the same problem with measurement partitioning. For the first step, apply Equations (5.2-12) and (5.2-13) with C₁ = 1, D₁ = 0, G₁ = 1, and Z₁ = 2.

m₁ = (1)^-1 (1)(2) = 2
P₁ = (1(1)^-1 1)^-1 = 1

For the second step, apply Equations (5.2-14) and (5.2-15) with C₂ = 1, D₂ = 0, G₂ = 1, and Z₂ = -1.

P₂ = (1(1)^-1 1 + (1)^-1)^-1 = 1/2
ξ̂ = 2 + (1/2)(1)^-1 (Z₂ - 2 - 0) = 2 + (1/2)(-1 - 2) = 1/2
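Example 5.2-2 can be checked the same way. The sketch below recomputes both the concatenated least-squares estimate and the two-step information-form update, using the scalar values of the example:

```python
# Example 5.2-2 recomputed: scalar channels with C1 = C2 = 1, D1 = D2 = 0,
# G1 = G2 = 1; measurements Z1 = 2, Z2 = -1; no prior on xi.
Z1, Z2 = 2.0, -1.0

# Unpartitioned MLE: least squares on the concatenated system.
xi_full = 0.5 * (Z1 + Z2)

# Partitioned form: summarize p(Z1|xi) by its mean and covariance, then use
# the information-form update for the second measurement.
m1, P1 = Z1, 1.0               # first step: mean Z1, covariance 1
P2 = 1.0 / (1.0 + 1.0 / P1)    # (C2*(G2 G2*)^-1 C2 + P1^-1)^-1
xi_part = m1 + P2 * (Z2 - m1)  # information-form second step

print(xi_full, xi_part)  # both equal 1/2
```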

The partitioned algorithm thus gives the same result as the original unpartitioned algorithm.

There is often confusion on the issue of the bias of the partitioned MLE estimator. This is an MLE estimate of ξ based on both Z₁ and Z₂. It is, therefore, unbiased like all MLE estimators for linear systems with additive Gaussian noise. On the other hand, the last step of the partitioned estimator is an MAP estimate based on Z₂ with a prior distribution described by m₁ and P₁. We have previously shown that MAP estimators are biased. There is no contradiction in these two viewpoints. The estimate is biased based on the measurement Z₂ alone, but unbiased based on Z₁ and Z₂. Therefore, it is overly simplistic to universally condemn MAP estimators as biased. The bias is not always so clear an issue, but requires you to define exactly on what data you are basing the bias definition. The primary basis for deciding whether to use an MAP or MLE estimator is whether you want estimates based only on the current set of data, or estimates based on the current data and prior information combined. The bias merely reflects this decision; it does not give you independent help in deciding.

5.2.3 Parameter Partitioning

In parameter partitioning, we write the parameter vector ξ as a function of two (or more; the generalizations are obvious) smaller vectors ξ₁ and ξ₂:

ξ = f(ξ₁, ξ₂)     (5.2-16)

The function f must be invertible to obtain ξ₁ and ξ₂ from ξ, or the solution to the partitioned problem will not be unique. The simplest kinds of partitions are those in which ξ₁ and ξ₂ are partitions of the ξ vector. With the parameter ξ partitioned into ξ₁ and ξ₂, we have a partitioned optimization problem. Two possible solution methods apply. The best method, if it can be used, is generally to solve for ξ₁ in terms of ξ₂ (or vice versa) and substitute this relationship into the original problem. Axial iteration is another reasonable method if the solutions for ξ₁ and ξ₂ are nearly independent so that few iterations are required.
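Axial iteration can be sketched on a small example. The cost function below is hypothetical, chosen so that the cross-coupling between ξ₁ and ξ₂ is weak and the alternating one-parameter minimizations converge in a few sweeps:

```python
# Axial iteration on a hypothetical cost with a weak cross-coupling term:
#   J(xi1, xi2) = (xi1 - 1)**2 + (xi2 + 2)**2 + 0.2 * xi1 * xi2
# Each step minimizes J over one partition with the other held fixed.
xi1, xi2 = 0.0, 0.0
for _ in range(20):
    xi1 = 1.0 - 0.1 * xi2    # dJ/dxi1 = 0 with xi2 fixed
    xi2 = -2.0 - 0.1 * xi1   # dJ/dxi2 = 0 with xi1 fixed

print(xi1, xi2)  # converged to the joint minimizer of J
```

With strong coupling between the partitions, each sweep would make less progress and a joint solution would be preferable.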

5.3 LIMITING CASES AND SINGULARITIES

In the previous discussions, we have simply assumed that all of the required matrix inverses exist. We made this assumption to present some of the basic results without getting sidetracked on fine points. We will now take a comprehensive look at all of the singularities and limiting cases, explaining both the circumstances that give rise to the various special cases, and how to handle such cases when they occur.

The reader will recognize that most of the special cases are idealizations which are seldom literally true. We almost never know any value perfectly (zero covariance). Conversely, it is rare to have absolutely no information about the value of a parameter (infinite covariance). There are very few parameters that would not be viewed with suspicion if an estimate many orders of magnitude outside the expected range were obtained. These idealizations are useful in practice for two reasons. First, they avoid the necessity to quantify statements such as "virtually perfect" when the difference between virtually perfect and perfect is not of measurable consequence (although one must be careful: sometimes even an extremely small difference can be crucial). Second, numerical problems with finite arithmetic can be alleviated by recognizing essentially singular situations and treating them specially as though they were exactly singular.

We will address two kinds of singularities. The first kind of singularity involves Gaussian distributions with singular covariance matrices. These are perfectly valid probability distributions conforming to the usual definition.
The distributions, however, do not have density functions; therefore the maximum a posteriori probability and maximum likelihood estimates cannot be defined as we have done. The singularity implies that the probability distribution is entirely concentrated on a subspace of the originally defined probability space. If the problem statement is redefined to include only the subspace, the restricted problem is nonsingular. You can also address this singularity by looking at limits as the covariance approaches the singular matrix, provided that the limits exist.

The second kind of singularity involves Gaussian variables with infinite covariance. Conceptually, the meaning of infinite covariance is easily stated: we have no information about the value of the variable (but we must be careful about generalizing this idea, particularly in nonlinear transformations; see the discussion at the end of Section 4.3.4). Unluckily, infinite covariance Gaussians do not fit within the strict definition of a probability distribution. (They cannot meet axiom 2 in Section 3.1.1.) For current purposes, we need only recognize that an infinite covariance Gaussian distribution can be considered as a limiting case (in some sense that we will not precisely define here) of finite covariance Gaussians. The term "generalized probability distribution" is sometimes used in connection with such limiting arguments. The equations which apply to the infinite covariance case are the limits of the corresponding finite covariance cases, provided that the limits exist.
The primary concern in practice is thus how to compute the appropriate limits.

We could avoid several of the singularities by retreating to a higher level of abstraction in the mathematics. The theory can consistently treat Gaussian variables with singular covariances by replacing the concept of a probability density function with the more general concept of a Radon-Nikodym derivative. (A probability density function is a specific case of a Radon-Nikodym derivative.) Although such variables do not have probability density functions, they do have Radon-Nikodym derivatives with respect to appropriate measures. Substituting the more general and more abstract concept of σ-finite measures in place of probability measures allows strict definition of infinite covariance Gaussian variables within the same context. This level of abstraction requires considerable depth of mathematical background, but changes little in the practical application. We can derive the identical computational methods at a lower level of abstraction. The abstract theory serves to place all of the theoretical results in a common framework. In many senses the general abstract theory is simpler than the more concrete approach; there are fewer exceptions and special cases to consider. In implementing the abstract theory, the same computational issues arise, but the simplified viewpoint can help indicate how to resolve these issues. Simply knowing that the problem does have a well-defined solution is a major aid to finding the solution. The conceptual simplification gained by the abstract theory requires significantly more background than we assume in this book.
Our emphasis will be on the computations required to deal with the singularities, rather than on the abstract theory. Royden (1968), Rudin (1974), and Lipster and Shiryayev (1977) treat such subjects as σ-finite measures and Radon-Nikodym derivatives.

We will consider two general computational methods for treating singularities. The first method is to use alternate forms of the equations which are not affected by the singularity. The covariance form (Equations (5.1-12) and (5.1-13)) and the information form (Equations (5.1-14) and (5.1-15)) of the posterior distribution are equivalent, but have different points of singularity. Therefore, a singularity in one form can often be handled simply by switching to the other form. This simple method fails if a problem statement has singularities in both forms. Also, we may desire to stick with a particular form for other reasons. The second method is to partition the estimation problem into two parts: the totally singular part and the nonsingular part. This partitioning allows us to use one means of solving the singular part and another means of solving the nonsingular part; we then combine the partial solutions to give the final result.

5.3.1 Singular P

The first case that we will consider is singular P. A singular P matrix indicates that some parameter or linear combination of parameters is known perfectly before the experiment is performed. For instance, we might know that ξ₁ = 5ξ₂ + 3, even though ξ₁ and ξ₂ are unknown. In this case, we know the linear combination ξ₁ - 5ξ₂ exactly. The singular P matrix creates no problems if we use the covariance form instead of the information form. If we specifically desire to use the information form, we can handle the singularity as follows.

Since P is always symmetric, the range and the null space of P form an orthogonal decomposition of the space of ξ. The singular eigenvectors of P span the null space, and the nonsingular eigenvectors span the range. Use the eigenvectors to decompose the parameter estimation problem into the totally singular subproblem and the totally nonsingular subproblem. This is a parameter partitioning as discussed in Section 5.2. The totally singular subproblem is trivial because we know the exact solution when we start (by definition). Substitute the solution of the singular problem in the original problem and solve the nonsingular subproblem in the normal manner. A specific implementation of this decomposition is as follows: let X_S be the matrix of orthonormal singular eigenvectors of P, and X_NS be the matrix of orthonormal nonsingular eigenvectors. Then define

ξ_S = X_S* ξ        ξ_NS = X_NS* ξ     (5.3-1)

The covariances of ξ_S and ξ_NS are

cov{ξ_S} = 0        cov{ξ_NS} = P_NS     (5.3-2)

where P_NS is nonsingular. Write

ξ = X_S ξ_S + X_NS ξ_NS     (5.3-3)

Substitute Equation (5.3-3) into the original problem. Use the exactly known value of ξ_S and restate the problem in terms of ξ_NS as the unknown parameter vector. Other decompositions derived from multiplying Equation (5.3-1) by nonsingular transformations can be used if they have advantages for specific situations.

We will henceforth assume that P is nonsingular. It is unimportant whether the original problem statement is nonsingular or we are working with the nonsingular subproblem. The implementation above is defined in very general terms, which would allow it to be done as an automatic computer subroutine. In practice, we usually know the fact of and reason for the singularity beforehand and can easily handle it more concretely. If an equation gives an exact relationship between two or more variables which we know prior to the experiment, we solve the equation for one variable and remove that variable from the problem by substitution.

Example 5.3-1  Assume that the unknowns are the force and moment at the output of a system. An unknown point force F is applied at a known position r referred to the origin. We thus know that

M = r × F

If F and M are both considered as unknowns, the P matrix is singular. But this singularity is readily removed by substituting for M in terms of F so that F is the only unknown.
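The eigenvector decomposition can be sketched directly. The prior covariance below is a hypothetical example in which one linear combination of the parameters has zero prior variance:

```python
import numpy as np

# Hypothetical singular prior covariance: the combination xi1 - xi2 has zero
# prior variance (known perfectly), while xi1 + xi2 is uncertain.
P = np.array([[1.0, 1.0],
              [1.0, 1.0]])

lam, X = np.linalg.eigh(P)
tol = 1e-10
X_S = X[:, lam <= tol]     # singular eigenvectors: the null space of P
X_NS = X[:, lam > tol]     # nonsingular eigenvectors: the range of P

# xi_S = X_S* xi is known exactly from the prior mean; the nonsingular
# subproblem is restated in terms of xi_NS = X_NS* xi alone.
P_NS = X_NS.T @ P @ X_NS
print(X_S.ravel(), P_NS)   # null direction proportional to [1, -1]
```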

5.3.2 Singular GG*

The treatment of singular GG* is similar in principle to that of singular P. A singular GG* matrix implies that some measurement or combination of measurements is made perfectly (i.e., noise-free). The covariance form does not involve the inverse of GG*, and thus can be used with no difficulty when GG* is singular. An alternate approach involves a sequential decomposition of the original problem into totally singular (GG* = 0) and nonsingular subproblems. The totally singular subproblem must be handled in the covariance form; the nonsingular subproblem can then be handled in either form. This is a measurement partitioning as described in Section 5.2. Divide the measurement into two portions, called the singular and the nonsingular measurements, Z_S and Z_NS. First ignore Z_S and find the posterior distribution of ξ given only Z_NS. Then use this result as the distribution prior to Z_S. We specifically implement this decomposition as follows:

For the first step of the decomposition, let X_NS be the matrix of nonsingular eigenvectors of GG*. Multiply Equation (5.1-1) on the left by X_NS*, giving

X_NS* Z = X_NS* Cξ + X_NS* D + X_NS* Gω     (5.3-4)

Define

Z_NS = X_NS* Z    C_NS = X_NS* C    D_NS = X_NS* D    G_NS = X_NS* G     (5.3-5)

Equation (5.3-4) then becomes

Z_NS = C_NS ξ + D_NS + G_NS ω     (5.3-6)

Note that G_NS G_NS* is nonsingular. Using the information form for the posterior distribution, the distribution of ξ conditioned on Z_NS is

m_NS = E{ξ|Z_NS} = m_ξ + (C_NS*(G_NS G_NS*)^-1 C_NS + P^-1)^-1 C_NS*(G_NS G_NS*)^-1 (Z_NS - C_NS m_ξ - D_NS)     (5.3-7a)
P_NS = cov{ξ|Z_NS} = (C_NS*(G_NS G_NS*)^-1 C_NS + P^-1)^-1     (5.3-7b)
For the second step, let X_S be the matrix of singular eigenvectors of GG*. Corresponding to Equation (5.3-6) is

Z_S = C_S ξ + D_S     (5.3-8)

where

Z_S = X_S* Z    C_S = X_S* C    D_S = X_S* D     (5.3-9)

Use Equation (5.3-7) for the prior distribution for this step. Since G_S is 0, we must use the covariance form for the posterior distribution, which reduces to

E{ξ|Z} = m_NS + P_NS C_S*(C_S P_NS C_S*)^-1 (Z_S - C_S m_NS - D_S)     (5.3-10a)
cov{ξ|Z} = P_NS - P_NS C_S*(C_S P_NS C_S*)^-1 C_S P_NS     (5.3-10b)

Equations (5.3-4), (5.3-6), (5.3-8), and (5.3-10) give an alternate expression for the posterior distribution of ξ given Z which we can use when GG* is singular. It does require that C_S P_NS C_S* be nonsingular. This is a special case of the requirement that CPC* + GG* be nonsingular, which we discuss later.

It is interesting to note that the covariance (Equation (5.3-10b)) of the estimate is singular. Multiply Equation (5.3-10b) on the right by C_S* and obtain

(cov{ξ|Z}) C_S* = P_NS C_S* - P_NS C_S*(C_S P_NS C_S*)^-1 C_S P_NS C_S* = 0     (5.3-11)

Therefore the columns of C_S* are all singular eigenvectors of the covariance of the estimate.

5.3.3 Singular CPC* + GG*

The next special case that we will consider is when CPC* + GG* is singular. Note first that this can happen only when GG* is also singular, because CPC* and GG* are both positive semi-definite, and the sum of two such matrices can be singular only if both terms are singular. Since both GG* and CPC* + GG* are singular, neither the covariance form nor the information form circumvents the singularity. In fact, there is no way to circumvent this singularity. If CPC* + GG* is singular, the problem is intrinsically ill-posed. The only solution is to restate the original problem.
If we examine what is implied by a singular CPC* + GG*, we will be able to see why it necessarily means that the problem is ill-posed, and what kinds of changes in the problem statement are required. Referring to Equation (5.1-6), we see that CPC* + GG* is the covariance of the measurement Z. GG* is the contribution of the measurement noise to this covariance, and CPC* is the contribution due to the prior variance of ξ. If CPC* + GG* is singular, we can exactly predict some part of the measured response. For this to occur, there must be neither measurement noise nor parameter uncertainty affecting that particular part of the response.

Clearly, there are serious mathematical difficulties in saying that we know exactly what the measured value will be before taking the measurement. At best, the measurement can agree with what we predicted, which adds no new information. If, however, there is any disagreement at all, even due to rounding error in the computations, there is an irresolvable contradiction: we said that we knew exactly what the value would be and we were wrong. This is one situation where the difference between almost perfect and perfect is extremely important. As CPC* + GG* approaches singularity, the corresponding estimators diverge; we cannot talk about the limiting case because the estimators do not converge to a limit in any meaningful sense.

5.3.4 Infinite P

Up to this point, the special cases considered have all involved singular covariance matrices, corresponding to perfectly known quantities. The remaining special cases all concern limits as eigenvalues of a covariance matrix approach infinity, corresponding to total ignorance of the value of a quantity.

The first such special case to discuss is when an eigenvalue of P approaches infinity. The problem is much easier to discuss in terms of the information matrix P^-1. As an eigenvalue of P approaches infinity, the corresponding eigenvalue of P^-1 approaches zero. At the limit, P^-1 is singular. To be cautious, we should not speak of P^-1 being singular but only of the limit as P^-1 goes to a singularity. Provided that we use the information form everywhere, all of the limits as P^-1 goes to a singularity are well-behaved and can be evaluated simply by substituting the singular value for P^-1. Thus this singularity poses no difficulties in practice, as long as we avoid the use of expressions involving a noninverted P. As previously mentioned, the limit as P^-1 goes to zero is particularly interesting and results in estimates identical to the maximum likelihood estimates. Using a singular P^-1 is tantamount to saying that there is no prior information about some parameter or set of parameters (or that we choose to discount any such information in order to obtain an independent check). There is no convenient way to decompose the problem so that the covariance form can be used with singular P^-1 matrices.
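The limit as P^-1 goes to zero is easy to check numerically: substituting P^-1 = 0 into the information form reproduces the least-squares (maximum likelihood) estimate. The system values below are hypothetical:

```python
import numpy as np

# Hypothetical two-measurement system Z = C xi + G w with GG* = I.
C = np.array([[1.0], [1.0]])
Z = np.array([2.0, -1.0])
m = 0.0        # prior mean (irrelevant once P^-1 = 0)
P_inv = 0.0    # singular information matrix: no prior information

# Information-form MAP estimate with the singular P^-1 substituted directly.
info = C.T @ C + P_inv                # C*(GG*)^-1 C + P^-1
xi_map = m + np.linalg.solve(info, C.T @ (Z - C.flatten() * m))

# Ordinary least-squares (maximum likelihood) estimate for comparison.
xi_mle, *_ = np.linalg.lstsq(C, Z, rcond=None)
print(xi_map.item(), xi_mle.item())   # identical: 0.5
```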
A confidence region is the area where the probability density function (really a generalized probability density function here) is greater than or equal to some given constant. (See Chapter 11 for a more detailed discussion of confidence regions.) Let the parameter vector consist of two elements, ξ₁ and ξ₂. Assume that the prior distribution has mean zero and

    P⁻¹ = [1  0]
          [0  0]

The prior confidence regions are given by

    p(ξ) ≥ constant

or equivalently

    ξ*P⁻¹ξ ≤ C₁

which reduces to

    ξ₁² ≤ C₂

where C₁ and C₂ are constants depending on the level of confidence desired. For current purposes, we are interested only in the shape of the confidence region, which is independent of the values of the constants. Figure (5.3-1) is a sketch of the shape. Note that this confidence region is a limiting case of an ellipse with major axis length going to infinity while the minor axis is fixed. This prior distribution gives information about ξ₁, but none about ξ₂.

Now consider a second example, which is identical to the first except that

    P⁻¹ = [ 1  -1]
          [-1   1]

In this case, the prior confidence region is

    (ξ₁ - ξ₂)² ≤ C₂

Figure (5.3-2) is a sketch of the shape of this confidence region. In this case, the difference between ξ₁ and ξ₂ is known with some confidence, but there is no information about the sum ξ₁ + ξ₂. The singular eigenvectors of P⁻¹ correspond to directions in the parameter space about which there is no prior knowledge.
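The information-form computation with a singular P⁻¹ can be sketched numerically for a linear system; the measurement matrix C and all numerical values below are illustrative assumptions, not from the text:

```python
import numpy as np

# Linear Gaussian estimate in information form:
#   xi_hat = (C*(GG*)^-1 C + P^-1)^-1 (C*(GG*)^-1 Z + P^-1 m_xi)
C = np.array([[1.0, 0.0], [1.0, 1.0]])   # illustrative measurement matrix
GG = np.eye(2) * 0.1                      # GG*, measurement noise covariance
Z = np.array([1.05, 2.10])                # illustrative measurement
m_xi = np.zeros(2)                        # prior mean

def info_estimate(P_inv):
    # Singular P_inv is perfectly acceptable here; only P^-1 appears.
    W = np.linalg.inv(GG)
    A = C.T @ W @ C + P_inv
    return np.linalg.solve(A, C.T @ W @ Z + P_inv @ m_xi)

# Singular P^-1: prior information about xi_1 only, none about xi_2.
print(info_estimate(np.diag([10.0, 0.0])))

# The limit P^-1 -> 0 recovers the maximum likelihood estimate,
# which for this invertible C is C^-1 Z = [1.05, 1.05].
print(info_estimate(np.zeros((2, 2))))
```

Note that no inversion of P is ever attempted, which is exactly why the singular limits cause no computational trouble.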

5.3.5  Infinite GG*

Corresponding to the case where P⁻¹ approaches a singular point is the similar case where (GG*)⁻¹ approaches a singularity. As in the case of singular P⁻¹, there are no computational problems. We can readily evaluate all of the limits simply by substituting the singular matrix for (GG*)⁻¹. The information form avoids the use of a noninverted GG*. A singular (GG*)⁻¹ matrix would indicate that some measurement or linear combination of measurements had infinite noise variance, which is rather unlikely. The primary use of singular (GG*)⁻¹ matrices in practice is to make the estimator ignore certain measurements if they are worthless or simply unavailable. It is mathematically cleaner to rewrite the system model so that the unused measurements are not included in the observation vector, but it is sometimes more convenient to simply use a singular (GG*)⁻¹ matrix. The two methods give the same result. (Not having a measurement at all is equivalent to having one and ignoring it.) One interesting specific case occurs when (GG*)⁻¹ approaches 0. This method then amounts to ignoring all of the measurements. As might be expected, the a posteriori estimate is then the same as the a priori estimate.

5.3.6  Singular C*(GG*)⁻¹C + P⁻¹

The final special case to be discussed is when the C*(GG*)⁻¹C + P⁻¹ in the information form approaches a singular value. Note that this can occur only if P⁻¹ is also approaching a singularity. Therefore, the problem cannot be avoided by using the covariance form. If C*(GG*)⁻¹C + P⁻¹ is singular, it means that there is no prior information about a parameter or combination of parameters, and that the experiment added no such information. The difficulty, then, is that there is absolutely no basis for estimating the value of the singular parameter or combination. The system is referred to as being unidentifiable when this singularity is present. Identifiability is an important issue in the theory of parameter estimation. The easiest computational solution is to restate the problem, deleting the parameter in question from the list of unknowns. Essentially the same result comes from using a pseudo-inverse in Equation (5.1-14) (but see the discussion in Section 2.4.3 on the blind use of pseudo-inverses to "solve" such problems). Of course, the best alternative is often to examine why the experiment gave no information about the parameter, and to redesign the experiment so that a usable estimate can be obtained.

5.4  NONLINEAR SYSTEMS WITH ADDITIVE GAUSSIAN NOISE

The general form of the system equations for a nonlinear system with additive Gaussian noise is

    Z = f(ξ,U) + G(U)ω

(5.4-1)

As in the case of linear systems, we will define by convention the mean of ω to be zero and the covariance to be identity. If ξ is random, we will assume that it is independent of ω and has the distribution given by Equation (5.1-3).

5.4.1  Joint Distribution of Z and ξ

To define the estimators of Chapter 4, we need to know the distribution p(Z|ξ,U). This distribution is easily derived from Equation (5.4-1). The expressions f(ξ,U) and G(U) are both constants if conditioned on specific values of ξ and U. Therefore we can apply the rules discussed in Chapter 3 for multiplication of Gaussian vectors by constants and addition of constants to Gaussian vectors. Using these rules, we see that the distribution of Z conditioned on ξ and U is Gaussian with mean f(ξ,U) and covariance G(U)G(U)*:

    p(Z|ξ,U) = |2πG(U)G(U)*|^(-1/2) exp{-(1/2)[Z - f(ξ,U)]*[G(U)G(U)*]⁻¹[Z - f(ξ,U)]}    (5.4-2)

This is the obvious nonlinear generalization of Equation (5.1-6); the nonlinearity does not change the basic method of derivation. If ξ is random, we will need to know the joint distribution p(Z,ξ|U). The joint distribution is computed by Bayes rule

    p(Z,ξ|U) = p(Z|ξ,U)p(ξ|U)    (5.4-3)

Using Equations (5.1-3) and (5.4-2) gives

    p(Z,ξ|U) = [|2πP| |2πGG*|]^(-1/2) exp{-(1/2)[Z - f(ξ,U)]*[G(U)G(U)*]⁻¹[Z - f(ξ,U)] - (1/2)[ξ - m_ξ]*P⁻¹[ξ - m_ξ]}    (5.4-4)

Note that p(Z,ξ|U) is not, in general, Gaussian. Although Z conditioned on ξ is Gaussian, and ξ is Gaussian, Z and ξ need not be jointly Gaussian. This is one of the major differences between linear and nonlinear systems with additive Gaussian noise.

Example 5.4-1

Let Z and ξ be scalars, P = 1, m_ξ = 0, G(U) = 1, and f(ξ,U) = ξ². Then

    p(Z|ξ,U) = (2π)^(-1/2) exp{-(1/2)(Z - ξ²)²}

and

    p(ξ) = (2π)^(-1/2) exp{-(1/2)ξ²}

This gives

    p(Z,ξ|U) = (2π)⁻¹ exp{-(1/2)(Z - ξ²)² - (1/2)ξ²}

The general form of a joint Gaussian distribution for two variables Z and ξ is

    p(Z,ξ) = d exp{aZ² + bZξ + cξ²}

where a, b, c, and d are constants. The joint distribution of Z and ξ cannot be manipulated into this form because a ξ⁴ term appears in the exponent. Thus Z and ξ are not jointly Gaussian, even though Z conditioned on ξ is Gaussian and ξ is Gaussian.
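The ξ⁴ term in Example 5.4-1 can be checked symbolically; this is a minimal sketch (the symbol names z and xi are illustrative):

```python
import sympy as sp

z, xi = sp.symbols("z xi", real=True)

# Exponent of the joint density p(Z, xi | U) in Example 5.4-1,
# with f(xi) = xi**2:  -(z - xi**2)**2/2 - xi**2/2
exponent = sp.expand(-(z - xi**2) ** 2 / 2 - xi**2 / 2)

# A jointly Gaussian density can have at most quadratic terms in its
# exponent; a degree-4 term in xi rules that form out.
assert sp.Poly(exponent, xi).degree() == 4
print(exponent)
```

The expansion makes the -ξ⁴/2 term explicit, which is what blocks the joint Gaussian form.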

Given Equation (5.4-4), we can compute the marginal distribution of Z, and the conditional distribution of ξ given Z, from the equations

    p(Z|U) = ∫ p(Z,ξ|U) dξ    (5.4-5)

and

    p(ξ|Z,U) = p(Z,ξ|U)/p(Z|U)    (5.4-6)

The integral in Equation (5.4-5) is not easy to evaluate in general. Since p(Z,ξ) is not necessarily Gaussian, or any other standard distribution, the only general means of computing p(Z) is to numerically integrate Equation (5.4-5) for a grid of Z values. If ξ and Z are vectors, this can be a quite formidable task. Therefore, we will avoid the use of p(Z) and p(ξ|Z) for nonlinear systems.

5.4.2  Estimators

The a posteriori expected value and Bayes optimal estimators are seldom used for nonlinear systems because their computation is difficult. Computation of the expected value requires the numerical integration of Equation (5.4-5) and the evaluation of Equation (5.4-6) to find the conditional distribution, and then the integration of ξ times the conditional distribution. Theorem (4.3-1) says that the Bayes optimal estimator for quadratic loss is equal to the a posteriori expected value estimator. The computation of the Bayes optimal estimates requires the same or equivalent multidimensional integrations, so Theorem (4.3-1) does not provide us with a simplified means of computing the estimates. Since the posterior distribution of ξ need not be symmetric, the MAP estimate is not equal to the a posteriori expected value for nonlinear systems. The MAP estimator does not require the use of Equations (5.4-5) and (5.4-6). The MAP estimate is obtained by maximizing Equation (5.4-6) with respect to ξ. Since p(Z) is not a function of ξ, we can equivalently maximize Equation (5.4-4). For general nonlinear systems, we must do this maximization using numerical optimization techniques.
It is usually convenient to work with the logarithm of Equation (5.4-4). Since standard optimization conventions are phrased in terms of minimization, rather than maximization, we will state the problem as minimizing the negative of the logarithm of the probability density:

    -ln p(Z,ξ|U) = (1/2)[Z - f(ξ,U)]*(GG*)⁻¹[Z - f(ξ,U)] + (1/2)[ξ - m_ξ]*P⁻¹[ξ - m_ξ] + (1/2) ln(|2πP| |2πGG*|)    (5.4-7)

Since the last term of Equation (5.4-7) is a constant, it does not affect the optimization. We can therefore define the cost functional to be minimized as

    J(ξ) = (1/2)[Z - f(ξ,U)]*(GG*)⁻¹[Z - f(ξ,U)] + (1/2)[ξ - m_ξ]*P⁻¹[ξ - m_ξ]    (5.4-8)

We have omitted the dependence of J on Z and U from the notation because it will be evaluated for specific Z and U in application; ξ is the only variable with respect to which we are optimizing. Equation (5.4-8) makes it clear that the MAP estimator is also a least-squares estimator for this problem. The (GG*)⁻¹ and P⁻¹ matrices are weightings on the squared measurement error and the squared error in the prior estimate of ξ, respectively. For the maximum likelihood estimate we maximize Equation (5.4-2) instead of Equation (5.4-4). As in the case of linear systems, the maximum likelihood estimate is equal to the limit of the MAP estimate as P⁻¹ goes to zero; i.e., the last term of Equation (5.4-8) is omitted. For a single measurement, or even for a finite number of measurements, the nonlinear MAP and ML estimators have none of the optimality properties discussed in Chapter 4. The estimates are neither unbiased, minimum variance, Bayes optimal, nor efficient. When there are a large number of measurements, the differences from optimality are usually small enough to ignore for practical purposes. The main benefits of the nonlinear MLE and MAP estimators are their relative ease of computation and their links to the intuitively attractive idea of least squares. These links give some reason to suspect that even if some of the assumptions about the noise distribution are questionable, the estimators still make sense from a nonstatistical viewpoint. The final practical judgment of an estimator is based on whether the estimates are adequate for their intended use, rather than on whether they are exactly optimum.
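The weighted least-squares character of Equation (5.4-8) can be sketched concretely; the scalar model f(ξ,U) = ξU and all numerical values here are illustrative assumptions, not from the text:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative scalar problem: f(xi, U) = xi * U with additive noise.
U = np.array([1.0, 2.0, 3.0])
Z = np.array([1.1, 1.9, 3.2])   # measurements
GG = 0.04                        # GG*, known noise covariance (scalar)
m_xi, P = 0.0, 1.0               # prior mean and covariance of xi

def J(xi):
    # Equation (5.4-8): weighted squared measurement error plus
    # weighted squared error in the prior estimate of xi.
    r = Z - xi * U
    return 0.5 * (r @ r) / GG + 0.5 * (xi - m_xi) ** 2 / P

xi_map = minimize_scalar(J).x
xi_mle = minimize_scalar(lambda xi: 0.5 * ((Z - xi * U) ** 2).sum() / GG).x
print(xi_map, xi_mle)   # MAP is pulled slightly toward the prior mean
```

Dropping the prior term (the P⁻¹ → 0 limit) reproduces the ML estimate, as the text states.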
The extension of Equation (5.4-8) to multiple independent experiments is straightforward:

    J(ξ) = (1/2) Σᵢ [Zᵢ - f(ξ,Uᵢ)]*(GG*)⁻¹[Zᵢ - f(ξ,Uᵢ)] + (1/2)[ξ - m_ξ]*P⁻¹[ξ - m_ξ]    (5.4-9)

where N is the number of experiments performed and the sum runs over i = 1, ..., N. The maximum likelihood estimator is obtained by omitting the last term. The asymptotic properties are defined as N goes to infinity. The maximum likelihood estimator can be shown to be asymptotically unbiased and asymptotically efficient (and thus also asymptotically minimum-variance unbiased) under quite general conditions. The estimator is also consistent. The rigorous proofs of these properties (Cramér, 1946), although not extremely difficult, are fairly lengthy and will not be presented here. The only condition required is that

    (1/N) Σᵢ [∇_ξ f(ξ,Uᵢ)]*(GG*)⁻¹[∇_ξ f(ξ,Uᵢ)]

converge to a positive definite matrix. Cramér (1946) also proves that the estimates asymptotically approach a Gaussian distribution.

Since the maximum likelihood estimates are asymptotically efficient, the Cramér-Rao inequality (Equation (4.2-20)) gives a good estimate of the covariance of the estimate for large N. Using Equation (4.2-19) for the information matrix gives

    M(ξ) = Σᵢ [∇_ξ f(ξ,Uᵢ)]*(GG*)⁻¹[∇_ξ f(ξ,Uᵢ)]    (5.4-10)

The covariance of the maximum likelihood estimate is thus approximated by

    cov(ξ̂) ≈ {Σᵢ [∇_ξ f(ξ̂,Uᵢ)]*(GG*)⁻¹[∇_ξ f(ξ̂,Uᵢ)]}⁻¹    (5.4-11)

When ξ has a prior distribution, the corresponding approximation for the covariance of the posterior distribution of ξ is

    cov(ξ|Z) ≈ {Σᵢ [∇_ξ f(ξ̂,Uᵢ)]*(GG*)⁻¹[∇_ξ f(ξ̂,Uᵢ)] + P⁻¹}⁻¹    (5.4-12)
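For a linear scalar model f(ξ,U) = ξU (an illustrative assumption, not from the text), the gradient ∇_ξ f is just U, and the approximations (5.4-11) and (5.4-12) reduce to one-line computations:

```python
import numpy as np

# Illustrative setup: f(xi, U) = xi * U, so grad_xi f = U.
U = np.array([1.0, 2.0, 3.0])
GG = 0.04        # GG*, known noise covariance (scalar)
P_inv = 1.0      # prior information 1/P; set to 0.0 for the ML case

# Information matrix, Equation (5.4-10): sum over experiments of
# U_i * (GG*)^-1 * U_i.
M = np.sum(U * U / GG)

cov_ml = 1.0 / M               # Equation (5.4-11)
cov_post = 1.0 / (M + P_inv)   # Equation (5.4-12)
print(cov_ml, cov_post)        # prior information can only shrink the covariance
```

The posterior covariance is never larger than the ML covariance, since P⁻¹ only adds information.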

5.4.3  Computation of the Estimates

The discussion of the previous section did not address the question of how to compute the MAP and ML estimates. Equation (5.4-9) (without the last term for the MLE) is the cost functional to minimize. Minimization of such nonlinear functions can be a difficult proposition, as discussed in Chapter 2. Equation (5.4-9) is in the form of a sum of squares. Therefore the Gauss-Newton method is often the best choice of optimization method. Chapter 2 discusses the details of the Gauss-Newton method.

The probabilistic background of Equation (5.4-9) allows us to apply the central limit theorem to strengthen one of the arguments used to support the Gauss-Newton method. For simplicity, assume that all of the Uᵢ are identical. Compare the limiting behavior of the two terms of the second gradient. The term retained by the Gauss-Newton approximation of the second gradient, as expressed by Equation (2.5-10), is N[∇_ξ f]*(GG*)⁻¹[∇_ξ f], which grows linearly with N. At the true value of ξ, Zᵢ - f(ξ,Uᵢ) is a Gaussian random variable with mean 0 and covariance GG*. Therefore, the omitted term of the second gradient is a sum of independent, identically distributed, random variables with zero mean. By the central limit theorem, the variance of 1/N times this term goes to zero as N goes to infinity. Since 1/N times the retained term goes to a nonzero constant, the omitted term is small compared to the retained one for large N. This conclusion is still true if the Uᵢ are not identical, as long as f and its gradients are bounded and the first gradient does not converge to zero.

This demonstrates that for large N the omitted term is small compared to the retained term if ξ is at the true value, and, by continuity, if ξ is sufficiently close to the true value. When ξ is far from the true value, the arguments of Chapter 2 apply.
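A minimal Gauss-Newton iteration for the ML case of Equation (5.4-9) can be sketched as follows; the scalar model f(ξ,U) = exp(ξU), the data, and the starting guess are illustrative assumptions:

```python
import numpy as np

U = np.array([0.1, 0.2, 0.3, 0.4])
xi_true = 2.0
Z = np.exp(xi_true * U)      # noise-free data keeps the sketch checkable
GG_inv = 1.0 / 0.01          # (GG*)^-1 (scalar)

def f(xi):
    return np.exp(xi * U)

def grad_f(xi):              # gradient of f with respect to xi
    return U * np.exp(xi * U)

xi = 0.5                     # starting guess
for _ in range(20):
    r = Z - f(xi)            # residuals
    g = grad_f(xi)
    # Gauss-Newton retains only the [grad f]*(GG*)^-1[grad f] term of the
    # second gradient, as in Equation (2.5-10).
    step = ((g * GG_inv) @ r) / ((g * GG_inv) @ g)
    xi += step
    if abs(step) < 1e-12:
        break

print(xi)   # converges to the true value 2.0 for noise-free data
```

With a scalar parameter and scalar GG*, the weighting cancels from the step, but the structure of the iteration is the same in the multivariable case.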

5.4.4  Singularities

The singular cases which arise for nonlinear systems are basically the same as for linear systems and have similar solutions. Limits as P⁻¹ or (GG*)⁻¹ approach singular values pose no difficulty. Singular P or GG* matrices are handled by reducing the problem to a nonsingular subproblem as in the linear case. The one singularity which merits some additional discussion in the nonlinear case corresponds to singular

    C*(GG*)⁻¹C + P⁻¹

in the linear case. The equivalent matrix in the nonlinear case, if we use the Gauss-Newton algorithm, is given by

    Σᵢ [∇_ξ f(ξ,Uᵢ)]*(GG*)⁻¹[∇_ξ f(ξ,Uᵢ)]    (5.4-13)
If Equation (5.4-13) is singular at the true value, the system is said to be unidentifiable. We discussed the computational problems of this singularity in Chapter 2. Even if the optimization algorithm correctly finds a unique minimum, Equation (5.4-11) indicates that the covariance of a maximum likelihood estimate would be very large. (The covariance is approximated by the inverse of a nearly singular matrix.) Thus the experimental data contain very little information about the value of some parameter or combination of parameters. Note that the covariance estimate is unrelated to the optimization algorithm; changes to the optimization algorithm might help you find the minimum, but will not change the properties of the resulting estimates. The singularity can be eliminated by using a prior distribution with a positive definite P⁻¹, but in this case, the estimated parameter values will be strongly influenced by the prior distribution, since the experimental data is lacking in information.

As with linear systems, unidentifiability is a serious problem. To obtain usable estimates, it is generally necessary to either reformulate the problem or redesign the experiment. With nonlinear systems, we have the additional difficulty of diagnosing whether identifiability problems are present or not. This difficulty arises because Equation (5.4-13) is a function of ξ and it is necessary to evaluate it at or near the minimum to ascertain whether the system is identifiable. If the system is not identifiable, it may be difficult for the algorithm to approach the (possibly nonunique) minimum because of convergence problems.

5.4.5  Partitioning

In both theory and computation, parameter estimation is much more difficult for nonlinear than for linear systems. Therefore, means of simplifying parameter estimation problems are particularly desirable for nonlinear systems. The partitioning ideas of Section 5.2 have this potential for some problems.

The parameter partitioning ideas of Section 5.2.3 make no linearity assumptions, and thus apply directly to nonlinear problems. We have little more to add to the earlier discussion of parameter partitioning except to say that parameter partitioning is often extremely important in nonlinear systems. It can make the critical difference between a tractable and an intractable problem formulation.

Measurement partitioning, as formulated in Section 5.2.1, is impractical for most nonlinear systems. For general nonlinear systems, the posterior density function p(ξ|Z₁) will not be Gaussian or any other simple form. The practical application of measurement partitioning to linear systems arises directly from the fact that Gaussian distributions are uniquely defined by their mean and covariance. The only practical method of applying measurement partitioning to nonlinear systems is to approximate the function p(ξ|Z₁) (or p(Z₁|ξ) for MLE estimates) by some simple form described by a few parameters. The obvious approximation in most cases is a Gaussian density function with the same mean and covariance. The exact covariance is difficult to compute, but Equations (5.4-11) and (5.4-12) give good approximations for this purpose.

5.5  MULTIPLICATIVE GAUSSIAN NOISE (ESTIMATION OF VARIANCE)

The previous sections of this chapter have assumed that the G matrix is known. The results are quite different when G is unknown because the noise multiplies G rather than adding to it. For convenience, we will work directly with GG* to avoid the necessity of taking matrix square roots. We compute the estimates of G by taking the positive semidefinite, symmetric-matrix square roots of the estimates of GG*. The general form of a nonlinear system with unknown G is

    Z = f(ξ,U) + G(ξ,U)ω    (5.5-1)
We will consider N independent measurements Zᵢ resulting from the experiments Uᵢ. The Zᵢ are then independent Gaussian vectors with means f(ξ,Uᵢ) and covariances G(ξ,Uᵢ)G(ξ,Uᵢ)*. We will use Equation (5.1-3) for the prior distribution of ξ. Bayes rule (Equation (5.4-3)) then gives us the joint distribution of ξ and the Zᵢ given the Uᵢ. Equations (5.4-5) and (5.4-6) define the marginal distribution of Z and the posterior distribution of ξ given Z. The latter distributions are cumbersome to evaluate and thus seldom used.

Because of the difficulty of computing the posterior distribution, the a posteriori expected value and Bayes optimal estimators are seldom used. We can compute the maximum likelihood estimates by minimizing the negative of the logarithm of the likelihood functional. Ignoring irrelevant constant terms, the resulting cost functional is

    J(ξ) = (1/2) Σᵢ {[Zᵢ - f(ξ)]*[G(ξ)G(ξ)*]⁻¹[Zᵢ - f(ξ)] + ln|G(ξ)G(ξ)*|}    (5.5-2)

or equivalently

    J(ξ) = (1/2) trace{[G(ξ)G(ξ)*]⁻¹ Σᵢ [Zᵢ - f(ξ)][Zᵢ - f(ξ)]*} + (N/2) ln|G(ξ)G(ξ)*|    (5.5-3)

We have omitted the explicit dependence on Uᵢ from the notation and assume that all of the Uᵢ are identical. (The generalization to different Uᵢ is easy and changes little of essence.) The MAP estimator minimizes a cost functional equal to Equation (5.5-2) plus the extra term (1/2)[ξ - m_ξ]*P⁻¹[ξ - m_ξ]. The MAP estimate of GG* is seldom used because the ML estimate is easier to compute and proves quite satisfactory.
We can use numerical methods to minimize Equation (5.5-2) and compute the ML estimates. In most practical problems, the following parameter partitioning greatly simplifies the computation required: assume that the ξ vector can be partitioned into independent vectors ξ_G and ξ_f such that

    f = f(ξ_f)    GG* = G(ξ_G)G(ξ_G)*    (5.5-4)

The partition ξ_f may be empty, in which case f is a constant (if ξ_G is empty we have a known GG* matrix, and the problem reduces to that discussed in the previous section). Assume further that the GG* matrix is completely unknown, except for the restriction that it be positive semidefinite. Set the gradients of Equation (5.5-2) with respect to GG* and ξ_f equal to zero in order to find the unconstrained minimum. Using the matrix differentiation results (A.2-5) and (A.2-6) from Appendix A, we get

    ∇_GG* J = (1/2) Σᵢ {(GG*)⁻¹ - (GG*)⁻¹[Zᵢ - f(ξ_f)][Zᵢ - f(ξ_f)]*(GG*)⁻¹} = 0    (5.5-5)

    ∇_ξf J = -Σᵢ [∇_ξf f(ξ_f)]*(GG*)⁻¹[Zᵢ - f(ξ_f)] = 0    (5.5-6)

Equation (5.5-5) gives

    ĜĜ* = (1/N) Σᵢ [Zᵢ - f(ξ̂_f)][Zᵢ - f(ξ̂_f)]*    (5.5-7)

which is the familiar sample second moment of the residuals. The estimate of GG* from Equation (5.5-7) is always positive semidefinite. It is possible for this estimate to be singular, in which case we must use the techniques previously discussed for handling singular GG* matrices. For a given ξ_f, Equation (5.5-7) is a simple noniterative estimator for GG*. This closed-form expression is the reason for the partition of ξ into ξ_f and ξ_G.
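For fixed ξ_f, Equation (5.5-7) is a direct computation; a small sketch with illustrative residual values (assumptions, not from the text):

```python
import numpy as np

# Residuals Z_i - f(xi_f) for N = 4 experiments with a 2-vector
# observation; the numbers are illustrative.
residuals = np.array([[ 0.3, -0.1],
                      [-0.2,  0.4],
                      [ 0.1,  0.0],
                      [-0.3, -0.2]])
N = residuals.shape[0]

# Equation (5.5-7): sample second moment of the residuals.
GG_hat = residuals.T @ residuals / N

# The estimate is symmetric and positive semidefinite by construction.
assert np.allclose(GG_hat, GG_hat.T)
assert np.all(np.linalg.eigvalsh(GG_hat) >= 0)
print(GG_hat)
```

Note that the positive semidefiniteness claimed in the text falls out of the outer-product form automatically.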

We can constrain GG* to be diagonal, in which case the solution is the diagonal elements of Equation (5.5-7). If we place other types of constraints on GG*, such as knowledge of the values of individual off-diagonal elements, such simple closed-form solutions are not apparent. In practice, such constraints are seldom required.

If ξ_f is empty, Equation (5.5-7) is the solution to the problem. If ξ_f is not empty, we need to combine this subproblem solution with a solution for ξ_f to get a solution of the entire problem. Let us investigate the two methods discussed in Section 5.2.3.
The first method is axial iteration. Axial iteration involves successively estimating ξ_G with fixed ξ_f, and estimating ξ_f with fixed ξ_G. Equation (5.5-7) gives the ξ_G estimate in closed form for fixed ξ_f. To estimate ξ_f with fixed ξ_G, we must minimize Equation (5.5-2) with respect to ξ_f. Unless the system is linear, this minimization requires an iterative method. For fixed G, Equation (5.5-2) is in the form of a sum of squares and the Gauss-Newton method is an appropriate choice (in fact this subproblem is identical to the problem discussed in Section 5.4). We thus have an inner iteration within the outer axial iteration of ξ_f and ξ_G. In such situations, efficiency is often improved by terminating the inner iteration before it converges, inasmuch as the largest changes in the ξ_f estimates occur on the early inner iterations. After these early iterations, more can be gained by revising GG* to reflect these large changes than by refining ξ_f. Since the estimates of ξ_f and GG* affect one another, there is no point in obtaining extremely accurate estimates of ξ_f until GG* is known to a corresponding accuracy. As Gauss (1809, p. 249) said concerning a different problem:

It then can only be worth while to aim at the highest accuracy, when the final correction is to be given to the orbit to be determined. But as long as it appears probable that new observations will give rise to new corrections, it will be convenient to relax, more or less, as the case may be, from extreme precision, if in this way the length of the computations can be considerably diminished.

Exploiting this concept to its fullest suggests using only one iteration of the Gauss-Newton algorithm for the inner "iteration." In this case the inner iteration is no longer iterative, and the overall algorithm would be as follows:

1. Estimate GG* using Equation (5.5-7) and the current guess of ξ_f.

2. Use one iteration of the Gauss-Newton algorithm to revise the estimate of ξ_f.

3. Repeat steps 1 and 2 until convergence.
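The three steps above can be sketched directly; the two-element model f(ξ_f) = [ξ_f, ξ_f²]*, the noise matrix, and the data are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative model: 2-vector observation f(xi) = [xi, xi**2], unknown GG*.
xi_true = 1.5
G_true = np.array([[0.2, 0.0], [0.1, 0.3]])
N = 200
Z = np.array([xi_true, xi_true**2]) + rng.standard_normal((N, 2)) @ G_true.T

def f(xi):
    return np.array([xi, xi**2])

def grad_f(xi):                      # gradient of f with respect to xi
    return np.array([1.0, 2.0 * xi])

xi = 1.0                             # starting guess for xi_f
for _ in range(50):
    r = Z - f(xi)                    # residuals
    GG = r.T @ r / N                 # step 1: Equation (5.5-7)
    W = np.linalg.inv(GG)
    g = grad_f(xi)
    # step 2: one Gauss-Newton iteration on Equation (5.5-2), GG* held fixed
    step = (r @ W @ g).sum() / (N * (g @ W @ g))
    xi += step
    if abs(step) < 1e-10:            # step 3: repeat until convergence
        break

print(xi, GG)
```

On this simulated data the loop settles near the generating values, with the usual sampling scatter in both ξ_f and GG*.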

In general, axial iteration is a very poor algorithm, as discussed in Chapter 2. The convergence is often extremely slow. Furthermore, the algorithm can converge to a point that is not a strict local minimum and yet give no hint of a problem. For this particular application, however, the performance of axial iteration borders on spectacular.

Let us consider, for a while, the alternative to axial iteration: substituting Equation (5.5-7) into Equation (5.5-3). This substitution gives

    J(ξ_f) = (1/2) N trace{I} + (1/2) N ln|(1/N) Σᵢ [Zᵢ - f(ξ_f)][Zᵢ - f(ξ_f)]*|    (5.5-8)

The first term is irrelevant to the minimization, so we will redefine the cost function as

    J(ξ_f) = (1/2) N ln|(1/N) Σᵢ [Zᵢ - f(ξ_f)][Zᵢ - f(ξ_f)]*|    (5.5-9)

You may sometimes see this cost function written in the equivalent (for our purposes) form

    J(ξ_f) = |Σᵢ [Zᵢ - f(ξ_f)][Zᵢ - f(ξ_f)]*|    (5.5-10)

Examine the gradient of Equation (5.5-9). Using the matrix differentiation results (A.2-3) and (A.2-6) from Appendix A, we obtain

    ∇_ξf J = -Σᵢ [∇_ξf f(ξ_f)]*{(1/N) Σⱼ [Zⱼ - f(ξ_f)][Zⱼ - f(ξ_f)]*}⁻¹[Zᵢ - f(ξ_f)]    (5.5-11)

This is more compactly expressed as

    ∇_ξf J = -Σᵢ [∇_ξf f(ξ_f)]*(ĜĜ*)⁻¹[Zᵢ - f(ξ_f)]    (5.5-12)

which is exactly the same as Equation (5.5-6) evaluated at G = Ĝ. Furthermore, the Gauss-Newton method used to solve Equation (5.5-6) is a good method for solving Equation (5.5-12) because

    ∇²_ξf J ≈ Σᵢ [∇_ξf f(ξ_f)]*(ĜĜ*)⁻¹[∇_ξf f(ξ_f)]    (5.5-13)

Equation (5.5-13) neglects the derivative of ĜĜ* with respect to ξ_f, but we can easily show that the term so neglected is even smaller than the term containing ∇²f(ξ_f), the omission of which we previously justified. Therefore, axial iteration is identical to substitution of Equation (5.5-7) as a constraint. It seems likely that we could use this equality to make deductions about the geometry of the cost function and thence about the behavior of various algorithms. (Perhaps there may be some kind of orthogonality property buried here.) Several computer programs, including the Iliff-Maine MMLE3 code (Maine and Iliff, 1980; and Maine, 1981), use axial iteration, or a modification thereof, often with little more justification than that it seems to work well. This is, of course, the final and most important justification, but it is best used as verification of analytical arguments. Although Equations (5.5-12) and (5.5-13) are derived in standard texts, we have not seen the relationship between these equations and axial iteration pursued in the literature. It is plain that this equivalence relates to the excellent performance of axial iteration on this problem. We will leave further inquiry along this line to the reader.

An important special case of Equation (5.5-1) occurs when f(ξ_f) is linear

    f(ξ_f) = Cξ_f    (5.5-14)

with invertible C. For linear f, Equation (5.5-6) is solved exactly in a single Gauss-Newton iteration, and the solution is

    ξ̂_f = (C*(GG*)⁻¹C)⁻¹C*(GG*)⁻¹ (1/N) Σᵢ Zᵢ    (5.5-15)

If C is invertible, this reduces to

    ξ̂_f = C⁻¹ (1/N) Σᵢ Zᵢ = C⁻¹Z̄    (5.5-16)

independent of GG*. This is, of course, C⁻¹ times the sample mean Z̄. Substituting Equations (5.5-14) and (5.5-16) into (5.5-7) gives

    ĜĜ* = (1/N) Σᵢ [Zᵢ - Z̄][Zᵢ - Z̄]*    (5.5-17)

which is the familiar sample variance.

Equation (5.5-17) can be manipulated into the alternate form

    ĜĜ* = (1/N) Σᵢ ZᵢZᵢ* - Z̄Z̄*    (5.5-18)

Because ξ̂_f is not a function of GG*, the computation of ξ̂_f and ĜĜ* does not require iteration for this system model.
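For the invertible linear case, Equations (5.5-16) through (5.5-18) are direct computations; a small sketch with illustrative values (the matrix C, the true parameters, and the noise are assumptions for the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)

C = np.array([[2.0, 0.0], [1.0, 1.0]])   # invertible C (illustrative)
xi_f = np.array([0.5, -1.0])
G = np.array([[0.3, 0.0], [0.1, 0.2]])
N = 500
Z = (C @ xi_f) + rng.standard_normal((N, 2)) @ G.T

Zbar = Z.mean(axis=0)                    # sample mean
xi_hat = np.linalg.solve(C, Zbar)        # Equation (5.5-16): C^-1 times sample mean

# Equation (5.5-17): sample variance about the sample mean ...
GG_17 = (Z - Zbar).T @ (Z - Zbar) / N
# ... and the alternate form, Equation (5.5-18).
GG_18 = Z.T @ Z / N - np.outer(Zbar, Zbar)

assert np.allclose(GG_17, GG_18)         # the two forms agree
print(xi_hat, GG_17)
```

No iteration appears anywhere: ξ̂_f never touches GG*, exactly as the text notes.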

In general, the maximum likelihood estimates are asymptotically unbiased and efficient, but they need have no such properties for finite N. For linear invertible systems, the biases are easy to compute. From Equation (5.5-16),

    E{ξ̂_f|ξ_f} = C⁻¹ (1/N) Σᵢ E{Zᵢ|ξ_f} = C⁻¹Cξ_f = ξ_f    (5.5-19)

This equation shows that ξ̂_f is unbiased for finite N for linear invertible systems. From Equation (5.5-18), using the fact that Σᵢ Zᵢ is Gaussian with mean NCξ_f and covariance NGG*,

    E{ĜĜ*|ξ} = [(N - 1)/N] GG*    (5.5-20)

Thus ĜĜ* is biased for finite N. Examining Equation (5.5-20), we see that the estimator defined by multiplying the ML estimate by N/(N - 1) is unbiased for finite N if N > 1. This unbiased estimate is often used instead of the maximum likelihood estimate. For large N, the difference is inconsequential.
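The (N - 1)/N factor in Equation (5.5-20) is easy to check by simulation; a scalar sketch, with all numerical values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

N = 5                 # small N makes the bias visible
GG_true = 4.0         # true GG* (scalar)
trials = 200_000

# Each row is one experiment of N scalar measurements with mean 0.
Z = rng.normal(0.0, np.sqrt(GG_true), size=(trials, N))
Zbar = Z.mean(axis=1, keepdims=True)

# ML estimate (5.5-17): second moment about the sample mean.
GG_ml = ((Z - Zbar) ** 2).mean(axis=1)

# Equation (5.5-20): E{GG_ml} = (N - 1)/N * GG*, so the
# N/(N - 1)-corrected estimator is unbiased.
print(GG_ml.mean())                    # near (N - 1)/N * 4.0 = 3.2
print((N / (N - 1)) * GG_ml.mean())    # near 4.0
```

With N = 5 the bias is a 20% effect, which is why the corrected estimator is commonly preferred at small sample sizes.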

In this discussion, we have assumed that both GG* and ξ_f are unknown. If ξ_f is known, then the maximum likelihood estimator for GG* is given by Equation (5.5-7) and this estimate is unbiased. The proof is left as an exercise. This result gives insight into the reasons for the bias of the estimator given by Equation (5.5-17). Note that Equations (5.5-17) and (5.5-7) are identical except that the sample mean is used in Equation (5.5-17) in place of the true mean in Equation (5.5-7). This substitution of the sample mean for the true mean has resulted in a bias. The difference between the estimates from Equations (5.5-7) and (5.5-17) can be written in the form

    [Z̄ - f(ξ_f)][Z̄ - f(ξ_f)]*

As this expression shows, the estimate of GG* using the sample mean is less than or equal to the estimate using the true mean for every realization (i.e., the difference is positive semidefinite), equality occurring only when the sample mean of the Zᵢ is equal to f(ξ_f). This is a stronger property than the bias difference; the bias difference implies only that the expected value using the sample mean is less.

5.6  NON-GAUSSIAN NOISE

Non-Gaussian noise is so general a classification that little can be said beyond the discussion in Chapter 1. The forms and properties of the estimators depend strongly on the types of noise distribution. The same comments apply to Gaussian noise if it is not additive or multiplicative, because the conditional distribution of Z given ξ is then non-Gaussian. In general, we apply the rules for transformation of variables to derive the conditional distribution of Z given ξ. Using this distribution, and the prior distribution of ξ if defined, we can derive the various estimators in principle.

The optimal estimators of Chapter 4 often require considerable computation for non-Gaussian noise. It is often possible to define much simpler estimators which have adequate performance. We will examine one situation where such simplification can occur. Let the system model be linear with additive noise

Z = Cξ + ω    (5.6-1)

The distribution of ω must have finite mean and variance independent of ξ, but is otherwise unrestricted. Call the mean m and the variance GG*. We will restrict ourselves to considering only linear estimators of the form

ξ̂ = KZ + D    (5.6-2)

Within this class, we will look for minimum-variance, unbiased estimators. We will require that the variance be minimized only over the class of unbiased linear estimators; there will be no guarantee that a smaller variance cannot be attained by a nonlinear estimator. The bias of an estimator of the form of Equation (5.6-2) is

b(ξ) = E{ξ̂ - ξ|ξ} = KCξ - ξ + D + Km    (5.6-3)

If the estimator is to be unbiased, we must have

D = -Km    (5.6-4a)

KC = I    (5.6-4b)

The variance of an unbiased estimator of the given form is

var(ξ̂) = KGG*K*    (5.6-5)

Note that the bias and variance of the estimate depend only upon the mean and variance of the noise distribution. The exact noise distribution need not even be known. If the noise distribution were Gaussian, a minimum-variance unbiased estimator would exist and be given by

ξ̂ = (C*(GG*)^-1 C)^-1 C*(GG*)^-1 (Z - m)    (5.6-6)

This estimator is linear. Since no unbiased estimator, linear or not, can have a lower variance for the Gaussian case, this estimator is the minimum-variance, unbiased linear estimator for Gaussian noise. Since the bias and variance of a linear estimator depend only on the mean and variance of the noise, this is the minimum-variance, unbiased linear estimator for any noise distribution with the same mean and variance.

The optimality of this estimator can also be easily proven without reference to Gaussian distributions (although the above proof is complete and rigorous). Let

A = K - (C*(GG*)^-1 C)^-1 C*(GG*)^-1    (5.6-7)

for any K. Then

0 ≤ AGG*A* = KGG*K* + (C*(GG*)^-1 C)^-1 C*(GG*)^-1 GG*(GG*)^-1 C(C*(GG*)^-1 C)^-1
             - KGG*(GG*)^-1 C(C*(GG*)^-1 C)^-1 - (C*(GG*)^-1 C)^-1 C*(GG*)^-1 GG*K*    (5.6-8)

Using Equation (5.6-4b) as a constraint on K, Equation (5.6-8) becomes

0 ≤ KGG*K* - (C*(GG*)^-1 C)^-1    (5.6-9)

or, using Equation (5.6-5),

var(ξ̂) ≥ (C*(GG*)^-1 C)^-1    (5.6-10)

Thus no K satisfying Equation (5.6-4b) can achieve a variance lower than that given by Equation (5.6-10). The variance is equal to the minimum if and only if A is zero; that is, if

K = (C*(GG*)^-1 C)^-1 C*(GG*)^-1    (5.6-11)

Therefore Equation (5.6-6) defines the unique minimum-variance, unbiased linear estimator. We are assuming that GG* and C*(GG*)^-1 C are nonsingular; Section 5.3 discusses the singular cases.

In summary, if the system is linear with additive noise, and the estimator is required to be linear and unbiased, the results for Gaussian distributions apply to any distribution with the same mean and variance.
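This result can be checked numerically. The sketch below (illustrative matrices, not from the text) forms K = (C*(GG*)^-1 C)^-1 C*(GG*)^-1, verifies the unbiasedness constraint KC = I and the variance bound of Equation (5.6-10), and shows that another unbiased linear estimator (unweighted least squares) exceeds the bound by a positive semidefinite amount:

```python
# Sketch of Equations (5.6-4) to (5.6-10): minimum-variance unbiased linear
# estimation. Matrices are arbitrary illustrative values, not from the text.
import numpy as np

C = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
G = np.diag([1.0, 0.5, 2.0])
R = G @ G.T                                   # noise covariance GG*

Rinv = np.linalg.inv(R)
K_opt = np.linalg.inv(C.T @ Rinv @ C) @ C.T @ Rinv

# Unbiasedness constraint (5.6-4b): KC = I.
assert np.allclose(K_opt @ C, np.eye(2))

# Variance (5.6-5) achieves the bound (5.6-10).
var_opt = K_opt @ R @ K_opt.T
bound = np.linalg.inv(C.T @ Rinv @ C)
assert np.allclose(var_opt, bound)

# Unweighted least squares is also unbiased but has larger variance:
# the difference is positive semidefinite.
K_ls = np.linalg.inv(C.T @ C) @ C.T
var_ls = K_ls @ R @ K_ls.T
eigs = np.linalg.eigvalsh(var_ls - var_opt)
print(eigs)
```

Because only the first two moments of the noise enter these computations, nothing in the sketch depends on the noise being Gaussian.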

The use of optimal nonlinear estimators is seldom justifiable in view of the current state of the art. Although exceptional cases exist, three factors argue against using optimal nonlinear estimators.

The first factor is the complexity and corresponding cost of deriving and implementing optimal nonlinear estimators. For some problems, we can construct fairly simple suboptimal nonlinear estimators that give better performance than the linear estimators (often by slightly modifying the linear estimator), but optimal nonlinear estimation is a difficult task.

The second factor is that linear estimators, perhaps slightly modified, often can give quite good estimates, even if they are not exactly optimal. Based on the central limit theorem, several results show that, under fairly general conditions, the linear estimates will approach the optimal nonlinear estimates as the number of samples increases. The precise conditions and proofs of these results are beyond the scope of this book.

The third factor is that we seldom have precise knowledge of the distribution anyway. The errors from inaccurate specification of the distribution are likely to be as large as the errors from using a suboptimal linear estimator. We need to consider this fact in deciding whether an optimal nonlinear estimator is really worth the cost. From Gauss (1809, p. 253):

    The investigation of an orbit having, strictly speaking, the maximum probability, will depend upon a knowledge of ...[the probability distribution]; but that depends upon so many vague and doubtful considerations, physiological included, which cannot be subjected to calculation, that it is scarcely, and indeed less than scarcely, possible...

Figure (5.3-1). Confidence region with singular P^-1.

Figure (5.3-2). Confidence region with another singular P^-1.

CHAPTER 6

6.0 STOCHASTIC PROCESSES

In simplest terms, a stochastic process is a random variable that is a function of time. Thus stochastic processes are basic to the study of parameter estimation for dynamic systems. A complete and rigorous study of stochastic process theory requires considerable depth of mathematical background, particularly for continuous-time processes. For the purposes of this book, such depth of background is not required. Our approach does not draw heavily on stochastic process theory. This chapter focuses on the few results that are needed for this document. Astrom (1970), Papoulis (1965), Liptser and Shiryayev (1977), and numerous other books give more complete treatments at varying levels of abstraction. The necessary results in this chapter are largely concerned with continuous-time models. Although we derive a few discrete-time equations in order to examine their continuous-time limits, the chapter can be omitted if you are studying only discrete-time analysis.

6.1 DISCRETE TIME

A discrete-time random process x is simply a collection of random variables xi, one for each time point, defined on the same probability space. There can be a finite or infinite number of time points. The stochastic process is completely characterized by the joint distributions of all of the xi. This can be a rather unwieldy means of characterizing the process, however, particularly if the number of time points is infinite.

If the xi are jointly Gaussian, the process can be characterized by its first and second moments. Non-Gaussian processes are often also analyzed in terms of their first two moments because exact analyses are too complicated. The first two moments of the process x are

m(i) = E{xi}    (6.1-1)

R(i,j) = E{xixj*}    (6.1-2)

The function R(i,j) is called the autocorrelation function of the process.

A process is called stationary if the joint distribution of any collection of the xi depends only on differences of the i values, not on the absolute time. This is called strict-sense stationarity. A process is stationary to second order, or wide-sense stationary, if the first moment is constant and the second moments depend only on time differences; i.e., if m(i + k) = m(i) and R(i + k, j + k) = R(i,j) for all i, j, and k. For Gaussian processes, wide-sense stationarity implies strict-sense stationarity. The autocorrelation function of a wide-sense stationary process can be written as a function of one variable, the time difference.

A process i f R(i . j ) = 0 t e r i z e d by the o f X i i s the process. 6.1.1

i s c a l l e d white i f X i i s independent o f x j f o r a l l i # j. Thus a Gaussian process i s white when i # j. Any process t h a t i s n o t white i s c a l l e d colored. A white process can be characd i s t r i b u t i o n cif x i fo- each i. I f a process i s bbth white and stationary, the d i s t r i b u t i o n same as t h a t o f X i f o r a l l i and j, and t h i s d i s t r i b u t i o n i s suff;cient t o chardcterize the

Linear System; Forced by Gaussian White Noise

Our primary interest in this chapter is in the results of passing random signals through dynamic systems. We will first look at the simplest case, stationary white Gaussian noise passing through a linear system. The system equation is

xi+1 = Φxi + Fni    (6.1-5)

where n is a stationary, Gaussian, white process with zero mean and identity covariance. The assumption of zero mean is made solely to simplify the equations. Results for nonzero mean can be obtained by linear superposition of the deterministic response to the mean and the stochastic response to the process with the mean removed. We are also given that x0 is Gaussian with mean 0 and covariance P0, and that x0 is independent of the ni. The xi form a stochastic process generated from the ni. We desire to examine the properties of the stochastic process x. It is immediately obvious that x is Gaussian because xi can be written as a linear combination of x0 and n0, n1, ..., ni-1. In fact, the joint distribution of the xi can be easily derived by explicitly writing this linear relation and using Theorem (3.5-5). We will leave this derivation as an exercise, and pursue instead a derivation using recursion along the lines that will be used in Chapter 7.

Assume we know that xi has mean 0 and covariance Pi. Then the distribution of xi+1 follows immediately from Equation (6.1-5):

E{xi+1} = ΦE{xi} = 0    (6.1-6)

E{xi+1 xi+1*} = ΦE{xixi*}Φ* + FE{nini*}F* + ΦE{xini*}F* + FE{nixi*}Φ* = ΦPiΦ* + FF*    (6.1-7)

The cross terms in Equation (6.1-7) drop out because xi is a function only of x0 and n0, ..., ni-1, all of which are independent of ni by assumption. We now have a recursive formula for the covariance of xi:

Pi+1 = ΦPiΦ* + FF*    i = 0,1,...    (6.1-8)

P0 is a given point from which we can start the recursion.
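The recursion (6.1-8) is straightforward to implement. A minimal sketch with illustrative matrices (not from the text); for a stable Φ the iteration settles to the fixed point of P = ΦPΦ* + FF*, which is a discrete Lyapunov equation:

```python
# Sketch of the covariance recursion (6.1-8), P_{i+1} = Φ P_i Φ* + F F*,
# for an illustrative stable 2-state system (values are not from the text).
import numpy as np

Phi = np.array([[0.9, 0.2],
                [0.0, 0.7]])
F = np.array([[0.5],
              [0.3]])
P = np.zeros((2, 2))          # P0 = 0: initial condition known exactly

history = []
for i in range(200):
    P = Phi @ P @ Phi.T + F @ F.T
    history.append(P.copy())

# For a stable Φ the recursion approaches the steady-state solution of
# P = Φ P Φ* + F F* (a discrete Lyapunov equation).
assert np.allclose(history[-1], history[-2], atol=1e-10)
print(history[-1])
```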

We know that the xi are jointly Gaussian zero-mean variables with covariances given by the recursion (6.1-8). To complete the characterization of the joint distribution of the xi, we need only the cross-covariances E{xixj*} for i ≠ j. Assume without loss of generality that i > j. Then xi can be written as

xi = Φ^(i-j) xj + Σ(k=j to i-1) Φ^(i-1-k) F nk    (6.1-9)

Then

E{xixj*} = Φ^(i-j) E{xjxj*} + Σ(k=j to i-1) Φ^(i-1-k) F E{nkxj*} = Φ^(i-j) Pj    (6.1-10)

The cross terms in Equation (6.1-10) are all zero by the same reasoning as used for Equation (6.1-7). For i < j, the same derivation (or transposition of the above result) gives

E{xixj*} = Pi (Φ*)^(j-i)    (6.1-11)
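These cross-covariance results can be verified deterministically by writing the stacked state history as an explicit linear function of (x0, n0, ..., nN-1), as in the exercise mentioned in Section 6.1.1, and computing its covariance. A sketch with illustrative matrices (not from the text):

```python
# Sketch verifying (6.1-10)/(6.1-11): E{x_i x_j*} = Φ^(i-j) P_j for i > j.
# Build the exact linear map from (x_0, n_0, ..., n_{N-1}) to (x_0, ..., x_N)
# and compare its covariance blocks with the recursion (6.1-8).
import numpy as np

Phi = np.array([[0.8, 0.1], [0.0, 0.6]])
F = np.array([[1.0], [0.5]])
P0 = np.diag([2.0, 0.5])
N = 6
nx, nn = 2, 1

# Row block i of L maps (x_0, n_0, ..., n_{N-1}) to x_i.
L = np.zeros(((N + 1) * nx, nx + N * nn))
L[0:nx, 0:nx] = np.eye(nx)
for i in range(1, N + 1):
    prev = L[(i - 1) * nx:i * nx, :]
    L[i * nx:(i + 1) * nx, :] = Phi @ prev            # x_i = Φ x_{i-1} + F n_{i-1}
    L[i * nx:(i + 1) * nx, nx + (i - 1) * nn:nx + i * nn] = F

cov_xi = np.eye(nx + N * nn)       # noise blocks have identity covariance
cov_xi[0:nx, 0:nx] = P0
joint = L @ cov_xi @ L.T           # covariance of the stacked state history

# Recursion (6.1-8) for the marginal covariances P_j.
P = [P0]
for _ in range(N):
    P.append(Phi @ P[-1] @ Phi.T + F @ F.T)

i, j = 5, 2
block = joint[i * nx:(i + 1) * nx, j * nx:(j + 1) * nx]
assert np.allclose(block, np.linalg.matrix_power(Phi, i - j) @ P[j])
print(block)
```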

This completes the derivation of the joint distribution of the xi. Note that x is neither stationary nor white (except in special cases).

6.1.2 Nonlinear Systems and Non-Gaussian Noise

If the noise is not Gaussian, analyzing the system becomes much more difficult. Except in special cases, we then have to work with the probability distributions as functions instead of simply using the means and covariances. Similar problems arise for nonlinear systems or nonadditive noise even if the noise is Gaussian, because the distributions of the xi will not then be Gaussian. Consider the system

xi+1 = f(xi, ni)    (6.1-12)

Assume that f has continuous partial derivatives almost everywhere, and can be inverted to obtain ni (trivial if the noise is additive):

ni = f^-1(xi, xi+1)    (6.1-13)

The ni are assumed to be white and independent of x0, but not necessarily Gaussian. Then the conditional distribution of xi+1 given xi can be obtained from Equation (3.4-1):

p(xi+1|xi) = p(f^-1(xi, xi+1)) |J|    (6.1-14)

where J is the Jacobian of the transformation. The joint distribution of the xi can then be obtained from

p(x0, x1, ..., xN) = p(x0) Π(i=0 to N-1) p(xi+1|xi)    (6.1-15)
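For a scalar system the transformation-of-variables formula is easy to check numerically. The sketch below uses illustrative choices (a nonlinear f, a state-dependent noise gain g, and Laplace noise, none of which are from the text) and verifies that the resulting conditional density integrates to 1:

```python
# Numerical sketch of Equation (6.1-14) for a scalar system
# x_{i+1} = f(x_i) + g(x_i) n_i: the noise is recovered as
# n_i = (x_{i+1} - f(x_i)) / g(x_i), and the conditional density of x_{i+1}
# is the noise density at that point times the Jacobian 1/|g(x_i)|.
# f, g, and the Laplace noise are illustrative, not from the text.
import numpy as np

def f(x):
    return np.sin(x) + 0.5 * x

def g(x):
    return 1.0 + 0.25 * x ** 2

def p_noise(n):                      # zero-mean Laplace density (non-Gaussian)
    return 0.5 * np.exp(-np.abs(n))

def p_next(x_next, x):               # Equation (6.1-14) for this scalar case
    n = (x_next - f(x)) / g(x)
    return p_noise(n) / np.abs(g(x))

# A density must integrate to 1 over x_{i+1} for each fixed x_i.
x = 1.3
grid = np.linspace(-60.0, 60.0, 240001)
dx = grid[1] - grid[0]
total = p_next(grid, x).sum() * dx
print(total)
```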

Equations (6.1-14) and (6.1-15) are, in general, too unwieldy to work with in practice. Practical work with nonlinear systems or non-Gaussian noise usually involves simplifying approximations.

6.2 CONTINUOUS TIME

We will look at continuous-time stochastic processes by looking at limits of discrete-time processes with the time interval going to 0. The discussion will focus on how to take the limit so that a useful result is obtained. We will not get involved in the intricacies of Ito or Stratonovich calculus (Astrom, 1970; Jazwinski, 1970; and Liptser and Shiryayev, 1977).

6.2.1 Linear Systems Forced by White Noise

Consider a linear continuous-time dynamic system driven by white, zero-mean noise

ẋ(t) = Ax(t) + Fc n(t)    (6.2-1)

We would like to look at this system as a limit (in some sense) of the discrete-time systems

x(ti + Δ) = (I + ΔA)x(ti) + ΔFc n(ti)    (6.2-2)

as Δ, the time interval between samples, goes to zero. Equation (6.2-2) is in the form of Euler's method for approximating the solution of Equation (6.2-1). For the moment we will consider the discrete n(ti) to be Gaussian. The distribution of the n(ti) is not particularly important to the end result, but our argument is somewhat easier if the n(ti) are Gaussian. Equation (6.2-2) corresponds to Equation (6.1-5) with I + ΔA substituted for Φ, ΔFc substituted for F, and some changes in notation to make the discrete and continuous notations more similar.

If n were a reasonably behaved deterministic process, we would get Equation (6.2-1) as a limit of Equation (6.2-2) when Δ goes to zero. For the stochastic system, however, the situation is quite different. Substituting I + ΔA for Φ and ΔFc for F in Equation (6.1-8) gives

P(ti + Δ) = (I + ΔA)P(ti)(I + ΔA)* + Δ²Fc Fc*    (6.2-3)

Subtracting P(ti) and dividing by Δ gives

[P(ti + Δ) - P(ti)]/Δ = AP(ti) + P(ti)A* + ΔAP(ti)A* + ΔFc Fc*    (6.2-4)

Thus in the limit

Ṗ(t) = AP(t) + P(t)A*    (6.2-5)
Note that Fc has completely dropped out of Equation (6.2-5). The distribution of x does not depend on the distribution of the forcing noise. In particular, if P0 = 0, then P(t) = 0 for all t. The system simply does not respond to the forcing noise.

A model in which the system does not respond to the noise is not very useful. A useful model would be one that gives a finite nonzero covariance. Such a model is achieved by multiplying the noise by Δ^(-1/2) (and thus its covariance by Δ^-1). We rewrite Equation (6.2-2) as

x(ti + Δ) = (I + ΔA)x(ti) + Δ^(1/2) Fc n(ti)    (6.2-6)

The Δ in the ΔFc Fc* term of Equation (6.2-4) then disappears and the limit becomes

Ṗ(t) = AP(t) + P(t)A* + Fc Fc*    (6.2-7)

Note that only a Δ^-1 behavior of the covariance (or something asymptotic to Δ^-1) will give finite nonzero results in the limit.

We will thus define the continuous-time white-noise process in Equation (6.2-1) as a limit, in some sense, of discrete-time processes with covariance Δ^-1. The autocorrelation function of the continuous-time process is

R(t,τ) = I δ(t - τ)    (6.2-8)

The impulse function δ(s) is zero for s ≠ 0 and infinite for s = 0, and its integral over any finite range including the origin is 1. We will not go through the mathematical formalism required to rigorously define the impulse function; suffice it to say that the concept can be defined rigorously.

This model for a continuous-time white-noise process requires further discussion. It is obviously not a faithful representation of any physical process because the variance of n(t) is infinite at every time point. The total power of the process is also infinite. The response of a dynamic system to this process, however, appears well-behaved. The reasons for this apparently anomalous behavior are most easily understood in the frequency domain. The power spectrum of the process n is flat; there is the same power in every frequency band of the same width. There is finite power in any finite frequency range, but because the process has infinite bandwidth, the total power is infinite. Because any physical system has finite bandwidth, the system response to the noise will be finite. If, on the other hand, we kept the total power of the noise finite as we originally tried to do, the power in any finite frequency band would go to zero as we approached infinite bandwidth; thus, a physical system would have zero response.

The preceding paragraph explains why it is necessary to have infinite power in a meaningful continuous-time white-noise process. It also suggests a rationale for justifying such a model even though any physical noise source must have finite power.
We can envision the physical noise as being band limited, but with a band limit much larger than the system band limit. If the noise band limit is large enough, its exact value is unimportant because the system response to inputs at a very high frequency is negligible. Therefore, we can analyze the system with white noise of infinite bandwidth and obtain results that are very good approximations to the finite-bandwidth results. The analysis is much simpler in the infinite-bandwidth white-noise model (even though some fairly abstract mathematics is required to make it rigorous). In summary, continuous-time white noise is not physically realizable but can give results that are good approximations to physical systems.
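The Δ^-1 scaling argument can be illustrated numerically. The sketch below (illustrative matrices, Euler discretization as in Equation (6.2-2)) propagates the discrete covariance recursion with and without the Δ^(-1/2) noise scaling; with scaling, the steady-state covariance is essentially independent of Δ and matches the steady state of Equation (6.2-7), while without scaling it collapses toward zero:

```python
# Sketch of the Δ-scaling argument: with noise covariance scaled by 1/Δ
# (Equation (6.2-6)) the discrete covariance recursion has a Δ-independent
# limit; with unscaled noise the system does not respond. Matrices illustrative.
import numpy as np

A = np.array([[-1.0, 0.5], [0.0, -2.0]])
Fc = np.array([[1.0], [1.0]])
T = 8.0                                   # long enough to reach steady state

def steady_cov(delta, scaled):
    P = np.zeros((2, 2))
    Phi = np.eye(2) + delta * A
    # Scaled: noise term Δ^(1/2) Fc n contributes Δ Fc Fc* per step.
    # Unscaled: noise term Δ Fc n contributes Δ² Fc Fc* per step.
    Q = delta * (Fc @ Fc.T) if scaled else delta ** 2 * (Fc @ Fc.T)
    for _ in range(int(T / delta)):
        P = Phi @ P @ Phi.T + Q
    return P

P_coarse = steady_cov(1e-2, scaled=True)
P_fine = steady_cov(1e-3, scaled=True)
P_unscaled = steady_cov(1e-3, scaled=False)

print(P_coarse)
print(P_fine)       # nearly the same as P_coarse: limit independent of Δ
print(P_unscaled)   # nearly zero: no response to the noise
```

For these matrices the continuous steady state, solving AP + PA* + Fc Fc* = 0, is P = [[0.6875, 0.375], [0.375, 0.25]], which the scaled recursion approaches as Δ shrinks.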

6.2.2 Additive White Measurement Noise

We saw in the previous section that continuous-time white noise driving a dynamic system must have infinite power in order to obtain useful results. We will show in this section that the same conclusion applies to continuous-time white measurement noise. We suppose that noise-corrupted measurements z are made of the system of Equation (6.2-1). The measurement equation is assumed to be linear with additive white noise:

z(t) = Cx(t) + Gc n(t)    (6.2-9)

For convenience, we will assume that the mean of the noise is 0. We then ask what else must be said about n(t) in order to obtain useful results from this model.

Presume that we have measured z(τ) over the interval 0 ≤ τ ≤ T, and we want to estimate some characteristic of the system, say x(T). This is a filtering problem, which we will discuss further in Chapter 7. For current purposes, we will simplify the problem by assuming that A = 0 and F = 0 in Equation (6.2-1). Thus x(t) is a constant over the interval, and dynamics do not enter the problem. We can consider this a static problem with repeated observations of a random variable, like those situations we covered in Chapter 5.

Let us look at the limit of the discrete-time equivalents to this problem. If samples are taken every Δ seconds, there are Δ^-1 T total samples. Equation (5.1-31) is the MAP estimator for the discrete-time problem. The mean square error of the estimate is given by Equations (5.1-32) to (5.1-34). As Δ decreases to 0 and the number of samples increases to infinity, the mean square error decreases to 0. This result would imply that continuous-time estimates are always exact; it is thus not a very useful model. To get a useful model, we must let the covariance of the measurement noise go to infinity like Δ^-1 as Δ decreases to 0. This argument is very similar to that used in the previous section. If the measurement noise had finite variance, each measurement would give us a finite amount of information, and we would have an infinite amount of information (no uncertainty) when the number of measurements was infinite. Thus the discrete-time equivalent of Equation (6.2-9) is

z(ti) = Cx(ti) + Δ^(-1/2) Gc n(ti)    (6.2-10)

where n(ti) has identity covariance.
Because any measurement is made using a physical device with a finite bandwidth, we stop getting much new information as we take samples faster than the response time of the instrument. In fact, the measurement equation is sometimes written as a differential equation for the instrument response instead of in the more idealized form of Equation (6.2-9). We need a noise model with a finite power in the bandwidth of the measurements because this is the frequency range that we are really working in. This argument is essentially the same as the one we used in the discussion of white noise forcing the system. The white noise can again be viewed as an approximation to band-limited noise with a large bandwidth. The lack of fidelity in representing very high-frequency characteristics is not too important, because high frequencies will tend to be filtered out when we operate on the data. (For instance, most operations on continuous-time data will have integrations at some point.) As a consequence of this modeling, we should be dubious of the practical application of any algorithm which results from this analysis and does not filter out high-frequency data in some manner.

We can generalize the conclusions in this and the previous section. Continuous-time white noise with finite variance is generally not a useful concept in any context. We will therefore take as part of the definition of continuous-time white noise that it have infinite covariance. We will use the spectral density rather than the covariance as a meaningful measure of the noise amplitude. White noise with autocorrelation

R(t,τ) = Gc Gc* δ(t - τ)    (6.2-11)

has spectral density Gc Gc*.

6.2.3 Nonlinear Systems

As with discrete-time nonlinearities, exact analysis of nonlinear continuous-time systems is generally so difficult as to be impossible for most practical intents and purposes. The usual approach is to use a linearization of the system or some other approximation. Let the system equation be

ẋ(t) = f(x(t), t) + Fc n(t)    (6.2-12)

where n is zero-mean white noise with unity power spectral density. For compactness of notation, let p represent the distribution of x at time t, given that x was x0 at time t0. The evolution of this distribution is described by the following parabolic partial differential equation:

∂p/∂t = -Σ(i=1 to n) ∂(fi p)/∂xi + (1/2) Σ(i=1 to n) Σ(j=1 to n) ∂²[(Fc Fc*)ij p]/∂xi ∂xj    (6.2-13)

where n is the length of the x vector. The initial condition for this equation at t = t0 is p = δ(x - x0). See Jazwinski (1970) for the derivation of Equation (6.2-13). This equation is called the Fokker-Planck equation or the forward Kolmogorov equation. It is considered one of the basic equations of nonlinear filtering theory. In principle, this equation completely describes the behavior of the system and thus the problem is "solved." In practice, the solution of this multidimensional partial differential equation is usually too formidable to consider seriously.
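For the scalar linear case the Fokker-Planck equation can be checked directly. In the sketch below (illustrative a and F, finite differences for the derivatives), the system is ẋ = -a·x + F·n; the Gaussian density with variance F²/(2a), which is the steady state of Equation (6.2-7) in one dimension, makes the right-hand side of Equation (6.2-13) vanish:

```python
# Sketch: for dx/dt = -a*x + F*n(t), the stationary density is Gaussian with
# variance F^2/(2a). Check that it zeros the right-hand side of the scalar
# Fokker-Planck equation, evaluated by finite differences. Values illustrative.
import numpy as np

a, Fc = 1.5, 0.8
var = Fc ** 2 / (2 * a)                      # claimed stationary variance

x = np.linspace(-4.0, 4.0, 2001)
dx = x[1] - x[0]
p = np.exp(-x ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Drift term -d(f p)/dx with f = -a*x, and diffusion term (F^2/2) d^2p/dx^2.
drift = np.gradient(a * x * p, dx)
diffusion = 0.5 * Fc ** 2 * np.gradient(np.gradient(p, dx), dx)
residual = drift + diffusion                 # dp/dt: should vanish

print(np.abs(residual).max())
```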

CHAPTER 7

7.0 STATE ESTIMATION FOR DYNAMIC SYSTEMS

In this chapter, we address the estimation of the state of dynamic systems. The emphasis is on linear dynamic systems with additive Gaussian noise. We will initially develop the theory for discrete-time systems and then extend it to continuous-time and mixed continuous/discrete models. The general form of a linear discrete-time system model is

xi+1 = Φxi + Ψui + Fni    (7.0-1a)

zi = Cxi + Dui + Gηi    (7.0-1b)

The ni and ηi are assumed to be independent Gaussian noise vectors with zero mean and identity covariance. The noise n is called process noise or state noise; η is called measurement noise. The input vectors ui are assumed to be known exactly. The state of the system at the ith time point is xi. The initial condition x0 is a Gaussian random variable with mean m0 and covariance P0. (P0 can be zero, meaning that the initial condition is known exactly.) In general, the system matrices Φ, Ψ, F, C, D, and G can be functions of time. This chapter will assume that the system is time-invariant in order to simplify the notation. Except for the discussion of steady-state forms in Section 7.3, the results are easily generalized to time-varying systems by adding appropriate time subscripts to the matrices.

The state estimation problem is defined as follows: based on the measurements z1, z2, ..., zN, estimate the state xM. To shorten the notation, we define

ZN = (z1, z2, ..., zN)    (7.0-2)

State estimation problems are commonly divided into three classes, depending on the relationship of M and N.

If M is equal to N, the problem is called a filtering problem. Based on all of the measurements taken up to the current time, we desire to estimate the current state. This type of problem is typical of those encountered in real-time applications. It is the most widely treated one, and the one on which we will concentrate.
If M is greater than N, we have a prediction problem. The data are available up to the current time N, and we desire to predict the state at some future time M. We will see that once the filtering problem is solved, the prediction problem is trivial.

If M is less than N, the problem is called a smoothing problem. This type of problem is most commonly encountered in postexperiment batch processing, in which all of the data are gathered before processing begins. In this case, the estimate of xM can be based on all of the data gathered, both before and after time M. By using all values of M from 1 to N - 1, plus the filtered solution for M = N, we can construct the estimated state time history for the interval being processed. This is referred to as fixed-interval smoothing. Smoothing can also be used in a real-time environment where a few time points of delay in obtaining current state estimates is an acceptable price for the improved accuracy gained. For instance, it might be acceptable to gather data up to time N = M + 2 before computing the estimate of xM. This is called fixed-lag smoothing. A third type of smoothing is fixed-point smoothing; in this case, it is desired to estimate xM for a particular fixed M in a real-time environment, using new data to improve the estimate.

In all cases, xM will have a prior distribution derived from Equation (7.0-1a) and the noise distributions. Since Equation (7.0-1) is linear in the noise, and the noise is assumed Gaussian, the prior and posterior distributions of xM will be Gaussian. Therefore, the a posteriori expected value, MAP, and mean Bayes' minimum risk estimators will be identical. These are the obvious estimators for a problem with a well-defined prior distribution. The remainder of the chapter assumes the use of these estimators.

7.1 EXPLICIT FORMULATION

By manipulating Equation (7.0-1) into an appropriate form, we can write the state estimation problem as a special case of the static estimation problem studied in Chapter 5. In this section, we will solve the problem by such manipulation; the fact that a dynamic system is involved will thus play no special role in the meaning of the estimation problem. We will examine only the filtering problem here. Our aim is to manipulate the state estimation problem into the form of Equation (5.1-1).

The most obvious approach to this problem is to define the ξ of Equation (5.1-1) to be xN, the vector which we desire to estimate. The observation Z would be a concatenation of z1, ..., zN; and the input U would be a concatenation of u0, ..., uN-1. The noise vector ω would then have to be a concatenation of n0, ..., nN-1, η1, ..., ηN. The problem can indeed be written in this manner. Unfortunately, the prior distribution of xN is not independent of ω (except for the case N = 0); therefore, Equation (5.1-16) is not the correct expression for the estimate of xN. Of course, we could derive an appropriate expression allowing for the correlation, but we will take an alternate approach which allows the direct use of Equation (5.1-16).

Let the unknown parameter vector ξ be the concatenation of the initial condition and all of the process noise vectors:

ξ = (x0*, n0*, n1*, ..., nN-1*)*    (7.1-1)

The vector xN, which we really desire to estimate, can be written as an explicit function of the elements of ξ; in particular, Equation (7.0-1a) expands into

xN = Φ^N x0 + Σ(i=0 to N-1) Φ^(N-1-i) (Ψui + Fni)    (7.1-2)

We can compute the MAP estimate of xN by using the MAP estimates of x0 and the ni in Equation (7.1-2). Note that we can freely treat the ni as noise or as unknown parameters with prior distributions without changing the essential nature of the problem. The probability distribution of Z is identical in either case. The only distinction is whether or not we want estimates of the ni. For this choice of ξ, the remaining items of Equation (5.1-1) must be constructed to match.

We get an explicit formula for zi by substituting Equation (7.1-2) into Equation (7.0-1b), giving

zi = CΦ^i x0 + Σ(j=0 to i-1) CΦ^(i-1-j) (Ψuj + Fnj) + Dui + Gηi    (7.1-4)

which can be written in the form of Equation (5.1-1) with the concatenated matrices of Equation (7.1-5). You can easily verify these matrices by substituting them into Equation (5.1-1). The mean of the prior distribution of ξ is the concatenation of m0 with zeros, and its covariance is block diagonal with blocks P0, I, ..., I.

The MAP estimate of ξ is then given by Equation (5.1-16). The MAP estimate of xN, which we seek, is obtained from that of ξ by using Equation (7.1-2).

The filtering problem is thus "solved." This solution, however, is unacceptably cumbersome. If the system state is an ℓ-vector, the inversion of an (N + 1)ℓ-by-(N + 1)ℓ matrix is required in order to estimate xN. The computational costs become unacceptable after a very few time points. We could investigate whether it is possible to take advantage of the structure of the matrices given in Equation (7.1-5) in order to simplify the computation. We can more readily achieve the same ends, however, by adopting a different approach to solving the problem from the start.

7.2 RECURSIVE FORMULATION

To find a simpler solution to the filtering problem than that derived in the preceding section, we need to take better advantage of the special structure of the problem. The above derivation used the linearity of the problem and the Gaussian assumption on the noise, which are secondary features of the problem structure. The fact that the problem involves a dynamic state-space model is much more basic, but was not used above to any special advantage; the first step in the derivation was to recast the system in the form of a static model. Let us reexamine the problem, making use of the properties of dynamic state-space systems. The defining property of a state-space model is as follows: the future output is dependent only on the current state and the future input. In other words, provided that the current state of the system is known, knowledge of any previous states, inputs, or outputs is irrelevant to the prediction of future system behavior; all relevant facts about previous behavior are subsumed in the knowledge of the current state. This is essentially the definition of the state of a system. The probabilistic expression of this idea is

It is this property that allows the system to be described in a recursive form, such as that of Equation (7.0-1). The recursive form involves much less computation than the mathematically equivalent explicit form of Equation (7.1-4).

This reasoning suggests that recursion might be used to some advantage in obtaining a solution to the filtering problem. The estimators under consideration (MAP, etc.) are all defined from the conditional distribution of x_N given Z_N. We will seek a recursive expression for the conditional distribution, and thus for the estimates. We will prove that such an expression exists by deriving it. In the nature of recursive forms, we start by assuming that the conditional distribution of x_N given Z_N is known for some N, and then we attempt to derive an expression for the conditional distribution of x_{N+1} given Z_{N+1}. (We recognize this task as similar to the measurement partitioning of Section 5.2.2, in that we want to simplify the solution by processing the measurements one at a time. Equations (5.2-1) and (7.2-1) express similar ideas and give the basis for the simplifications in both cases. The x_N of Equation (7.2-1) corresponds to the ξ of Equation (5.2-2).)

Our task then is to derive p(x_{N+1}|Z_{N+1}). We will divide this task into two steps. First, derive p(x_{N+1}|Z_N) from p(x_N|Z_N). This is called the prediction step, because we are predicting x_{N+1} based on previous information. It is also called the time update because we are updating the estimate to a new time point based on the same data. The second step is to derive p(x_{N+1}|Z_{N+1}) from p(x_{N+1}|Z_N). This is called the correction step, because we are correcting the predicted estimate of x_{N+1} based on the new information in z_{N+1}. It is also called the measurement update because we are updating the estimate based on the new measurement.

Since all of the distributions are assumed to be Gaussian, they are completely defined by their means and covariance matrices. Denote the (presumed known) mean and covariance of the distribution p(x_N|Z_N) by x̂_N and P_N, respectively. In general, x̂_N and P_N are functions of Z_N, but we will not encumber the notation with this information. Likewise, denote the mean and covariance of p(x_{N+1}|Z_N) by x̄_{N+1} and Q_{N+1}. The task is thus to derive expressions for x̄_{N+1} and Q_{N+1} in terms of x̂_N and P_N, and expressions for x̂_{N+1} and P_{N+1} in terms of x̄_{N+1} and Q_{N+1}.

7.2.1 Prediction Step

The prediction step (time update) is straightforward. For x̄_{N+1}, simply take the expected value of Equation (7.0-1a) conditioned on Z_N:

E{x_{N+1}|Z_N} = Φ E{x_N|Z_N} + F E{n_N|Z_N} + Ψ u_N     (7.2-2)

The quantities E{x_{N+1}|Z_N} and E{x_N|Z_N} are, by definition, x̄_{N+1} and x̂_N, respectively. Z_N is a function of x_0, n_0,...,n_{N-1}, η_1,...,η_N, and deterministic quantities; n_N is independent of all of these, and therefore independent of Z_N. Thus

E{n_N|Z_N} = E{n_N} = 0     (7.2-3)

Substituting this into Equation (7.2-2) gives

x̄_{N+1} = Φ x̂_N + Ψ u_N     (7.2-4)

In order to evaluate Q_{N+1}, take the covariance of both sides of Equation (7.0-1a). Since the three terms on the right-hand side of the equation are independent, the covariance of their sum is the sum of their covariances:

cov{x_{N+1}|Z_N} = Φ cov{x_N|Z_N} Φ* + cov{F n_N|Z_N} + cov{Ψ u_N|Z_N}     (7.2-5)

The terms cov{x_{N+1}|Z_N} and cov{x_N|Z_N} are, by definition, Q_{N+1} and P_N, respectively. Ψu_N is deterministic and, thus, has zero covariance. By the independence of n_N and Z_N,

cov{F n_N|Z_N} = cov{F n_N} = FF*     (7.2-6)

Substituting these relationships into Equation (7.2-5) gives

Q_{N+1} = Φ P_N Φ* + FF*     (7.2-7)


Equations (7.2-4) and (7.2-7) constitute the results desired for the prediction step (time update) of the filtering problem. They readily generalize to predicting more than one sample ahead. These equations justify our earlier statement that, once the filtering problem is solved, the prediction problem is easy; for suppose we desire to estimate x_M based on Z_N with M > N. If we can solve the filtering problem to obtain x̂_N, the filtered estimate of x_N, then, by a straightforward extension of Equation (7.2-4),

x̄_M = Φ^{M-N} x̂_N + Σ_{i=N}^{M-1} Φ^{M-1-i} Ψ u_i     (7.2-8)

is the desired MAP estimate of x_M.

7.2.2 Correction Step

For the correction step (measurement update), assume that we know the mean, x̄_{N+1}, and covariance, Q_{N+1}, of the distribution of x_{N+1} given Z_N. We seek the distribution of x_{N+1} given both Z_N and z_{N+1}. From Equation (7.0-1b),

z_{N+1} = C x_{N+1} + D u_{N+1} + G η_{N+1}     (7.2-9)

The distribution of η_{N+1} is Gaussian with zero mean and identity covariance. By the same argument as used for n_N, η_{N+1} is independent of Z_N. Thus, we can say that

p(η_{N+1}|Z_N) = p(η_{N+1})     (7.2-10)

This trivial-looking statement is the key to the problem, for now everything in the problem is conditioned on Z_N: we know the distributions of x_{N+1} and η_{N+1} conditioned on Z_N, and we seek the distribution of x_{N+1} conditioned on Z_N and additionally conditioned on z_{N+1}. This problem is thus exactly in the form of Equation (5.1-1), except that all of the distributions involved are conditioned on Z_N. This amounts to nothing more than restating the problem of Chapter 5 on a different probability space, one conditioned on Z_N. The previous results apply directly to the new probability space. Therefore, from Equations (5.1-14) and (5.1-15),

P_{N+1} = [Q_{N+1}^{-1} + C*(GG*)^{-1}C]^{-1}     (7.2-11)

x̂_{N+1} = x̄_{N+1} + P_{N+1}C*(GG*)^{-1}(z_{N+1} - Cx̄_{N+1} - Du_{N+1})     (7.2-12)

In obtaining Equations (7.2-11) and (7.2-12) from Equations (5.1-14) and (5.1-15), we have identified the following quantities: the parameter ξ corresponds to x_{N+1}; the prior mean and covariance correspond to x̄_{N+1} and Q_{N+1}; the measurement z and the term Du correspond to z_{N+1} and Du_{N+1}; the posterior mean E{ξ|z} and covariance cov{ξ|z} correspond to x̂_{N+1} and P_{N+1}; and C, D, and GG* correspond directly.

This completes the derivation of the correction step (measurement update), which we see to be a direct application of the results from Chapter 5.
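As a concrete sketch, the correction step can be written in a few lines of NumPy. The function name, signature, and test matrices below are our own illustrations, not from the text; the body implements Equations (7.2-11) and (7.2-12) directly.

```python
import numpy as np

def measurement_update(x_bar, Q, z, u, C, D, GG):
    """Correction step in the information form of Equations (7.2-11), (7.2-12).

    x_bar : predicted state mean (the text's x-bar)
    Q     : predicted state covariance
    z, u  : measurement and known input at the current time point
    C, D  : observation matrices; GG is the measurement-noise covariance GG*.
    """
    GG_inv = np.linalg.inv(GG)
    # P = [Q^-1 + C* (GG*)^-1 C]^-1                          (7.2-11)
    P = np.linalg.inv(np.linalg.inv(Q) + C.T @ GG_inv @ C)
    # x-hat = x-bar + P C* (GG*)^-1 (z - C x-bar - D u)      (7.2-12)
    x_hat = x_bar + P @ C.T @ GG_inv @ (z - C @ x_bar - D @ u)
    return x_hat, P
```

Note that the form above requires Q and GG* to be invertible; the covariance form discussed later in this chapter avoids inverting Q.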
7.2.3 Kalman Filter

To complete the recursive solution to the filtering problem, we need only know the solution for some value of N, and we can now propagate that solution to larger N. The solution for N = 0 is immediate from the initial problem statement. The distribution of x_0, conditioned on Z_0 (i.e., conditioned on nothing, because Z_0 contains no measurements), is given to be Gaussian with mean m_0 and covariance P_0.

Let us now fit together the pieces derived above to show how to solve the filtering problem:

Step 1: Initialization. Define x̂_0 = m_0; P_0 is given.

Step 2: Prediction (time update), starting with i = 0:

x̄_{i+1} = Φ x̂_i + Ψ u_i     (7.2-13)

z̄_{i+1} = C x̄_{i+1} + D u_{i+1}     (7.2-14)

Q_{i+1} = Φ P_i Φ* + FF*     (7.2-15)

Step 3: Correction (measurement update):

P_{i+1} = [Q_{i+1}^{-1} + C*(GG*)^{-1}C]^{-1}     (7.2-16)

x̂_{i+1} = x̄_{i+1} + P_{i+1}C*(GG*)^{-1}(z_{i+1} - z̄_{i+1})     (7.2-17)

We have defined the quantity z̄_{i+1} by Equation (7.2-14) in order to make the form of Equation (7.2-17) more apparent; z̄_{i+1} can easily be shown to be E{z_{i+1}|Z_i}. Repeat the prediction and correction steps for i = 0, 1,..., N - 1 in order to obtain x̂_N, the MAP estimate of x_N based on z_1,...,z_N.
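The three steps above can be sketched as a NumPy loop. The function name is our own, and for brevity this sketch pairs each measurement z_{i+1} with a single input sample, whereas Equations (7.2-13) and (7.2-14) use u_i and u_{i+1}, respectively.

```python
import numpy as np

def kalman_filter(z_seq, u_seq, Phi, Psi, C, D, FF, GG, m0, P0):
    """Run Equations (7.2-13) to (7.2-17) over a measurement sequence.

    FF and GG are the process- and measurement-noise covariances FF* and GG*.
    Returns the filtered estimates x-hat_1, ..., x-hat_N.
    """
    x_hat, P = m0, P0                       # Step 1: initialization
    GG_inv = np.linalg.inv(GG)
    estimates = []
    for z, u in zip(z_seq, u_seq):
        # Step 2: prediction (time update)
        x_bar = Phi @ x_hat + Psi @ u       # (7.2-13)
        z_bar = C @ x_bar + D @ u           # (7.2-14)
        Q = Phi @ P @ Phi.T + FF            # (7.2-15)
        # Step 3: correction (measurement update)
        P = np.linalg.inv(np.linalg.inv(Q) + C.T @ GG_inv @ C)   # (7.2-16)
        x_hat = x_bar + P @ C.T @ GG_inv @ (z - z_bar)           # (7.2-17)
        estimates.append(x_hat)
    return estimates
```

The per-step cost is fixed, which is the point made in the following paragraph: the work to go from x̂_i to x̂_{i+1} does not grow with i.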

Equations (7.2-13) to (7.2-17) constitute the Kalman filter for discrete-time systems. The recursive form of this filter is particularly suited to real-time applications. Once x̂_i has been computed, it is not necessary, as it was using the methods of Section 7.1, to start from scratch in order to compute x̂_{i+1}; we need do only one more prediction step and one more correction step. It is extremely important to note that the computational cost of obtaining x̂_{i+1} from x̂_i is not a function of i. This means that real-time Kalman filters can be implemented using fixed finite resources to run for arbitrarily long time intervals. This was not the case using the methods of Section 7.1, where the estimator started from scratch for each time point, and each new estimate required more computation than the previous estimate. For some applications, it is also important that the P_i and Q_i do not depend on the measurements, and can thus be precomputed. Such precomputation can significantly reduce real-time computational requirements. None of these advantages should obscure the fact that the Kalman filter obtains the same estimates as were obtained in Section 7.1. The advantages of the Kalman filter lie in the easier computation of the estimates, not in improvements in the accuracy of the estimates.

7.2.4 Alternate Forms

The filter Equations (7.2-13) to (7.2-17) can be algebraically manipulated into several equivalent alternate forms. Although all of the variants are formally equivalent, different ones have computational advantages in different situations. Some of the advantages lie in different points of singularity and different size matrices to invert. We will show a few of the possible alternate forms in this section. The first variant comes from using Equations (5.1-12) and (5.1-13) (the covariance form) instead of (5.1-14) and (5.1-15) (the information form). Equations (7.2-16) and (7.2-17) then become

P_{i+1} = Q_{i+1} - Q_{i+1}C*(CQ_{i+1}C* + GG*)^{-1}CQ_{i+1}     (7.2-18)

x̂_{i+1} = x̄_{i+1} + Q_{i+1}C*(CQ_{i+1}C* + GG*)^{-1}(z_{i+1} - z̄_{i+1})     (7.2-19)

The covariance form is particularly useful if GG* or any of the Q_i are singular. The exact conditions under which the Q_i become singular are fairly complicated, but we can draw some simple conclusions from looking at Equation (7.2-15). First, if FF* is nonsingular, then Q_i can never be singular. Second, a singular P_0 (particularly P_0 = 0) is likely to cause problems if FF* is also singular. The only matrix to invert in Equations (7.2-18) and (7.2-19) is CQ_{i+1}C* + GG*. If this matrix is singular, the problem is ill-posed; the situation is the same as that discussed in Section 5.1.3. Note that the covariance form involves inversion of an ℓ-by-ℓ matrix, where ℓ is the length of the observation vector. On the other hand, the information form involves inversion of a p-by-p matrix, where p is the length of the state vector. For some systems, the difference between ℓ and p may be significant, resulting in a strong preference for one form or the other.

If GG* is diagonal (or if GG* is diagonalizable, so that the system can be rewritten with a diagonal G), Equations (7.2-18) and (7.2-19) can be manipulated into a form that involves no matrix inversions. The key to this manipulation is to consider the system to have ℓ independent scalar observations at each time point instead of a single vector observation of length ℓ. The scalar observations can then be processed one at a time. The Kalman filter partitions the estimation problem by processing the measurements one time point at a time; with this modification, we extend the same partitioning concept to process one element of the measurement vector at a time. The derivation of the measurement-update Equations (7.2-18) and (7.2-19) applies without change to a system with several independent observations at a time point. We need only apply the measurement-update equations ℓ times with no intervening time updates. We do need a little more complicated notation to keep track of the process, but the equations are basically the same.

Let C^{(j)} and D^{(j)} be the jth rows of the C and D matrices, G^{(j,j)} be the jth diagonal element of G, and z^{(j)}_{i+1} be the jth element of z_{i+1}. Define x̂_{i+1,j} to be the estimate of x_{i+1} after the jth scalar observation at time i + 1 has been processed, and define P_{i+1,j} to be the covariance of x̂_{i+1,j}. We start the measurement update at each time point with

x̂_{i+1,0} = x̄_{i+1}     (7.2-20)

P_{i+1,0} = Q_{i+1}     (7.2-21)

Then, for each scalar measurement, we do the update

x̂_{i+1,j+1} = x̂_{i+1,j} + P_{i+1,j}C^{(j+1)*}[C^{(j+1)}P_{i+1,j}C^{(j+1)*} + (G^{(j+1,j+1)})²]^{-1}(z^{(j+1)}_{i+1} - z̄^{(j+1)}_{i+1})     (7.2-22)

P_{i+1,j+1} = P_{i+1,j} - P_{i+1,j}C^{(j+1)*}[C^{(j+1)}P_{i+1,j}C^{(j+1)*} + (G^{(j+1,j+1)})²]^{-1}C^{(j+1)}P_{i+1,j}     (7.2-23)

where

z̄^{(j+1)}_{i+1} = C^{(j+1)}x̂_{i+1,j} + D^{(j+1)}u_{i+1}     (7.2-24)
Note that the inversions in Equations (7.2-22) and (7.2-23) are scalar inversions rather than matrix inversions. None of these scalars will be 0 unless CQ_{i+1}C* + GG* is singular. After processing all ℓ of the scalar measurements for the time point, we have x̂_{i+1} = x̂_{i+1,ℓ} and P_{i+1} = P_{i+1,ℓ}.
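A sketch of this scalar sequential processing follows; the function name and example values are ours. Only scalar divisions appear, as promised, and the result agrees with the vector update of Equations (7.2-18) and (7.2-19) when GG* is diagonal.

```python
import numpy as np

def sequential_update(x_bar, Q, z, u, C, D, G_diag):
    """Process a length-l measurement vector one scalar element at a time.

    Assumes GG* is diagonal with entries G_diag[j]**2, so the
    measurement update can be applied row by row with no matrix inversion.
    """
    x, P = x_bar.copy(), Q.copy()
    for j in range(len(z)):
        c = C[j]                               # jth row of C
        s = float(c @ P @ c) + G_diag[j] ** 2  # scalar innovation variance
        k = P @ c / s                          # gain for this scalar channel
        z_bar_j = c @ x + D[j] @ u             # predicted jth observation
        x = x + k * (z[j] - z_bar_j)
        P = P - np.outer(k, c @ P)
    return x, P
```

Processing ℓ scalar channels this way costs O(ℓp²) operations per time point and avoids forming or inverting the ℓ-by-ℓ matrix CQC* + GG*.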

7.2.5 Innovations

A discussion of the Kalman filter would be incomplete without some mention of the innovations. The innovation at sample point i, also called the residual, is

v_i = z_i - z̄_i     (7.2-27)

where

z̄_i = E{z_i|Z_{i-1}} = C x̄_i + D u_i     (7.2-28)

Following the notation for Z_i, we define

V_i = (v_1,...,v_i)*     (7.2-29)

Now V_i is a linear function of Z_i. This is shown by Equations (7.2-13) to (7.2-17) and (7.2-27), which give formulae for computing the v_i in terms of the z_i. It may not be immediately obvious that this function is invertible. We will prove invertibility by writing the inverse function; i.e., by expressing Z_i in terms of V_i. Repeating Equations (7.2-13) and (7.2-14):

x̄_{i+1} = Φ x̂_i + Ψ u_i     (7.2-30a)

z̄_{i+1} = C x̄_{i+1} + D u_{i+1}     (7.2-30b)

Substituting Equation (7.2-27) into Equation (7.2-17) gives

x̂_{i+1} = x̄_{i+1} + P_{i+1}C*(GG*)^{-1}v_{i+1}     (7.2-30c)

Finally, from Equation (7.2-27),

z_{i+1} = z̄_{i+1} + v_{i+1}     (7.2-30d)

Equation (7.2-30) is called the innovations form of the system. It gives the recursive formula for computing the z_i from the v_i.

Let us examine the distribution of the innovations. The innovations are obviously Gaussian, because they are linear functions of Z, which is Gaussian. Using Equation (3.3-10), it is immediate that the mean of the innovation is 0:

E{v_i} = E{z_i - E{z_i|Z_{i-1}}} = E{z_i} - E{E{z_i|Z_{i-1}}} = 0     (7.2-31)

Derive the covariance matrix of the innovation by writing

v_i = z_i - z̄_i = C(x_i - x̄_i) + Gη_i     (7.2-32)

The two terms on the right are independent, so

cov{v_i} = C cov{x_i - x̄_i}C* + GG* = CQ_iC* + GG*     (7.2-33)
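The covariance result of Equation (7.2-33) is easy to check by simulation. The following sketch (our own construction, with illustrative values Φ = 0.9 and C = F = G = 1) simulates a scalar system, runs the covariance-form filter, and compares the sample statistics of the innovations against CQ_iC* + GG*.

```python
import numpy as np

# Scalar system x_{i+1} = phi x_i + f n_i, z_i = c x_i + g eta_i
# (illustrative values; phi = 0.9 gives a stable system).
rng = np.random.default_rng(0)
phi, f, c, g = 0.9, 1.0, 1.0, 1.0
N = 20000

x = 0.0
x_hat, P = 0.0, 1.0
innovations = []
for _ in range(N):
    x = phi * x + f * rng.standard_normal()   # process
    z = c * x + g * rng.standard_normal()     # measurement
    x_bar = phi * x_hat                       # time update (7.2-13)
    Q = phi * P * phi + f * f                 # (7.2-15)
    v = z - c * x_bar                         # innovation (7.2-27)
    innovations.append(v)
    s = c * Q * c + g * g                     # innovation variance (7.2-33)
    k = Q * c / s                             # covariance-form gain
    x_hat = x_bar + k * v                     # (7.2-19)
    P = Q - k * c * Q                         # (7.2-18)

v = np.array(innovations)
print(v.mean(), v.var(), c * Q * c + g * g)   # sample stats vs. (7.2-33)
```

For a long run, the sample mean is near 0 and the sample variance is near CQC* + GG* evaluated at the (quickly reached) steady-state Q.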

The most interesting property of the innovations is that v_i is independent of v_j for i ≠ j. To prove this, it is sufficient to show that v_i is independent of V_{i-1}. Let us examine E{v_i|V_{i-1}}. Since V_{i-1} is obtained from Z_{i-1} by an invertible continuous transformation, conditioning on V_{i-1} is the same as conditioning on Z_{i-1}. (If one is known, so is the other.) Therefore,

E{v_i|V_{i-1}} = E{v_i|Z_{i-1}} = 0     (7.2-34)

as shown in Equation (7.2-31). Comparing this equation with the formula for the Gaussian conditional mean given in Theorem (3.5-9), we see that this can be true only if v_i and V_{i-1} are uncorrelated (A_{12} = 0 in the theorem). Then by Theorem (3.5-8), v_i and V_{i-1} are independent.

The innovation is thus a discrete-time white-noise process (i.e., each time point is independent of all of the others). Thus, the Kalman filter is often called a whitening filter; it creates a white process (V) as a function of a nonwhite process (Z).

7.3 STEADY-STATE FORM

The largest computational cost of the Kalman filter is in the computation of the covariance matrices P_i and Q_i using Equations (7.2-15) and (7.2-16) (or any of the alternate forms). For a large and important class of problems, we can replace P_i and Q_i by constants P and Q, independent of time. This approach significantly lowers the computational cost of the filter. We will restrict the discussion in this section to time-invariant systems; in only a few special cases do time-invariant filters make sense for time-varying systems. The equations that a time-invariant filter must satisfy are easily derived. Using Equations (7.2-18) and (7.2-15), we can express Q_{i+1} as a function of Q_i:

Q_{i+1} = Φ[Q_i - Q_iC*(CQ_iC* + GG*)^{-1}CQ_i]Φ* + FF*     (7.3-1)

Thus, for Q_{i+1} to equal a constant Q, we must have

Q = Φ[Q - QC*(CQC* + GG*)^{-1}CQ]Φ* + FF*     (7.3-2)
This is the algebraic matrix Riccati equation for discrete-time systems. (An alternate form can be obtained by using Equation (7.2-16) in place of Equation (7.2-18); the condition can also be written in terms of P instead of Q.)

If Q is a scalar, the algebraic Riccati equation is a quadratic equation in Q and the solution is simple. For nonscalar Q, the solution is far more difficult and has been the subject of numerous papers. We will not cover the details of deriving and implementing numerical methods for solving the Riccati equation. The most widely used methods are based on eigenvector decomposition (Potter, 1966; Vaughan, 1970; and Geyser and Lehtinen, 1975). When a unique solution exists, these methods give accurate results with small computational costs.
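When Q_i converges, one can also solve Equation (7.3-2) by simply iterating Equation (7.3-1) to a fixed point. This is not the eigenvector-decomposition method cited above, and it is slower, but it makes the equation concrete. The function name and defaults below are our own.

```python
import numpy as np

def solve_discrete_riccati(Phi, C, FF, GG, Q0=None, tol=1e-12, max_iter=10000):
    """Solve Equation (7.3-2) by iterating Equation (7.3-1) to convergence.

    FF and GG are the noise covariances FF* and GG*.  A simple fixed-point
    iteration, adequate when Theorem (7.3-1) guarantees convergence.
    """
    n = Phi.shape[0]
    Q = np.eye(n) if Q0 is None else Q0.copy()
    for _ in range(max_iter):
        S = C @ Q @ C.T + GG
        # Q' = Phi [Q - Q C* S^-1 C Q] Phi* + FF*            (7.3-1)
        Q_next = Phi @ (Q - Q @ C.T @ np.linalg.solve(S, C @ Q)) @ Phi.T + FF
        if np.max(np.abs(Q_next - Q)) < tol:
            return Q_next
        Q = Q_next
    return Q
```

For the scalar example Φ = 0.9, C = F = G = 1 discussed below (Case 5), the iteration converges to the positive root of the quadratic.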

The derivation of the conditions under which Equation (7.3-2) has an acceptable solution is more complicated than would be appropriate for inclusion in this text. We therefore present the following result without proof:

Theorem 7.3-1 If all unstable or marginally stable modes of the system are controllable by the process noise and are observable, and if CFF*C* + GG* is invertible, then Equation (7.3-2) has a unique positive semidefinite solution and Q_i converges to this solution for all choices of the initial covariance, P_0.

Proof See Schweppe (1973, p. 142) for a heuristic argument, or Balakrishnan (1981) and Kailath and Ljung (1976) for more rigorous treatments.

The condition on CFF*C* + GG* ensures that the problem is well-posed. Without this condition, the inverse in Equation (7.3-1) may not exist for some initial P_0 (particularly P_0 = 0). Some statements of the theorem incorporate the stronger requirement that GG* be invertible, but the weaker condition is sufficient. Perhaps the most important point to note is that the system is not required to be stable. Although the existence and uniqueness of the solution are easier to prove for stable systems, the more general conditions of Theorem (7.3-1) are important in the estimation and control of unstable systems. We can achieve a heuristic understanding of the need for the conditions of Theorem (7.3-1) by examining one-dimensional systems, for which we can write the solutions to Equation (7.3-2) explicitly. If the system is one-dimensional, then it is observable if C is nonzero (and G is finite), and it is controllable by the process noise if F is nonzero. We will consider the problem in several cases.

Case 1: G = 0. In this case, we must have C ≠ 0 and F ≠ 0 in order for the problem to be well-posed. Equation (7.3-1) then reduces to Q_{i+1} = FF*, giving a unique time-invariant covariance satisfying Equation (7.3-2).

Case 2: G ≠ 0, C = 0, F = 0. In this case, Equation (7.3-1) becomes Q_{i+1} = Φ²Q_i. This converges to the steady state Q = 0 if |Φ| < 1 (stable system). If |Φ| = 1, Q_i remains at the starting value, and thus the steady-state covariance is not unique. If |Φ| > 1, the solution diverges or stays at 0, depending on the starting value.

Case 3: G ≠ 0, C = 0, F ≠ 0. In this case, Equation (7.3-2) reduces to

Q = Φ²Q + FF*     (7.3-3)

For |Φ| < 1, this equation has a unique, nonnegative solution

Q = FF*/(1 - Φ²)     (7.3-4)

and convergence of Equation (7.3-1) to this solution is easily shown. If |Φ| ≥ 1, the solution is negative, which is not an admissible covariance, or infinite; in either event, Equation (7.3-1) diverges to infinity.

Case 4: G ≠ 0, C ≠ 0, F = 0. In this case, Equation (7.3-2) is a quadratic equation with roots zero and (Φ² - 1)G²/C². If |Φ| < 1, the second root is negative, and thus there is a unique nonnegative root. If |Φ| = 1, there is a double root at zero, and the solution is still unique. In both of these events, convergence of Equation (7.3-1) to the solution at 0 is easy to show. If |Φ| > 1, there are two nonnegative roots, and the system can converge to either one, depending on whether or not the initial covariance is zero.
Case 5: G ≠ 0, C ≠ 0, F ≠ 0. In this case, Equation (7.3-2) is a quadratic equation with roots

Q = (1/2)H ± √[(1/4)H² + F²G²/C²]     (7.3-5)

where

H = (Φ² - 1)G²/C² + F²     (7.3-6)

Regardless of the value of Φ, the square-root term is always larger in magnitude than (1/2)H; therefore, there is one positive and one negative root. Convergence of Equation (7.3-1) to the positive root is easy to show.

Let us now summarize the results of these five cases. In all well-posed cases, the covariance converges to a unique value if the system is stable. For unstable or marginally stable systems, a unique converged value is assured if both C and F are nonzero. For one-dimensional systems, there is also a unique convergent solution for |Φ| = 1, G ≠ 0, C ≠ 0, F = 0; this case illustrates that the conditions of Theorem (7.3-1) are not necessary, although they are sufficient. Heuristically, we can say that observability (C ≠ 0) prevents the covariance from diverging to infinity for unstable systems. Controllability by the process noise (F ≠ 0) ensures uniqueness by eliminating the possibility of perfect prediction (Q = 0).
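The two-root behavior of Case 4 with an unstable system is easy to check numerically. In this sketch (values Φ = 2, G = C = 1, F = 0 are our own illustration), the iteration of Equation (7.3-1) settles on the nonzero root (Φ² - 1)G²/C² = 3 from any nonzero starting covariance, but stays at the zero root when started there.

```python
def scalar_riccati_step(Q, phi, f, c, g):
    """One step of Equation (7.3-1) for a one-dimensional system."""
    return phi ** 2 * (Q - Q * c * (c * Q * c + g * g) ** -1 * c * Q) + f * f

# Case 4 with an unstable system: phi = 2, G = C = 1, F = 0 (assumed values).
phi, f, c, g = 2.0, 0.0, 1.0, 1.0
roots = (0.0, (phi ** 2 - 1) * g ** 2 / c ** 2)   # zero and (phi^2 - 1)G^2/C^2

Q = 1.0                        # nonzero initial covariance
for _ in range(200):
    Q = scalar_riccati_step(Q, phi, f, c, g)
print(Q)                       # approaches the nonzero root, roots[1] = 3

Q = 0.0                        # zero initial covariance
for _ in range(200):
    Q = scalar_riccati_step(Q, phi, f, c, g)
print(Q)                       # stays at the zero root
```

This is exactly the nonuniqueness that controllability by the process noise (F ≠ 0) rules out.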
An important related question to consider is the stability of the filter. We define the corrected error vector to be

x̃_i = x_i - x̂_i     (7.3-7)

Using Equations (7.0-1), (7.2-15), (7.2-16), and (7.2-19) gives the recursive relationship

x̃_{i+1} = (I - K_{i+1}C)(Φx̃_i + Fn_i) - K_{i+1}Gη_{i+1}     (7.3-8)

where K_{i+1} = Q_{i+1}C*(CQ_{i+1}C* + GG*)^{-1} is the Kalman gain of Equation (7.2-19). We can show that, given the conditions of Theorem (7.3-1), the system of Equation (7.3-8) is stable. This stability implies that, in the absence of new disturbances (noise), errors in the state estimate will die out with time; furthermore, for bounded disturbances, the errors will always be bounded. A rigorous proof is not presented here.

It i s i n t e r e s t i n g t o examine the s t a b i l i t y o f the one-dimensional example w i t h G # 0, C f 0, F = 0, and lo1 = 1. W previously noted t h a t Q i f o r t h i s case cor-lerges t o 2 f o r a l l i n i t i a l covariances. L e t us e examine the steady-state f i l t e r . For t h i s case. Equation (7.3-8) rsetlucer t o

which I s o n l y marginally stable. Recall t h a t t h i s case d i d not meet the conditions o f Theorem (7.3-1). so our s t a b i l i t y guarartee does n o t apply. Although a steady-state f I l t e r e x i s t s , i t does n o t perform a t a l l l f k e the time-varying f i l t e r . The time-iarying f i l t e r reduces the e r r o r t o zzro asymptotically w i t h time. The steadys t a t e f i l t e r has no feedback, and the e r r o r remains a t i t s i n i t i a l value. Balakrishnan (1984) discusses t h e steady-state f i l t e r I n m r e d e t a i l . Two special cases of time-invariant Kalman fi!ters deserve special note. The f i r s t case i s where F i s zero and the system I s stable (and GG* mrst be I n v e r t i b l e t o ensure a well-posed problem). I n t h i s case, the

steady s t a t e Kalman gain K i s zero. The Kalman f i l t e r simply integrates the s t a t e equation, ignoring any available measurements. Since the system i s stable and has no disturbances, the e r r o r w i l l decay t o zero. The same f i l t e r i s obtained f o r nonzero F if C i s zero o r if G i s i n f i n i t e . The e r r o r does tiot then decay t o zero, but t h e output contains no useful information t o feed back. Thc second special case i s where G i s zero and C ii square and i n v e r t i b l e . FF* must be i n v e r t i b l e ; e estimator then reduces t o h t o ensure a well-posed problem. For t h i s case, the Kalman gain i s C-'.

which ignores a l l previous information. The current s t a t e can be reconstructed exactly from the current measurement, so there i s no need t o consider past data. This i s the a n t i t h e s i s o f the case where F i s 0 and no information frm the current measurement i s used. Host r e a l i s t i c systems l i e somewhere betwet,, these two extremes. 7.4 CONTINUOUS TIME The fonn o f a l i n e a r continuous-time system m d e l i s

where n and η are assumed to be zero-mean white-noise processes with unity power spectral density. The input u is assumed to be known exactly. As in the discrete-time analysis, we will simplify the notation by assuming that the system is time-invariant. The same derivation applies to time-varying systems by evaluating the matrices at the appropriate time points. We will analyze Equation (7.4-1) as a limit of the discrete-time systems

x(t_i + Δ) = (I + ΔA)x(t_i) + ΔBu(t_i) + Δ^{1/2}F_c n_i     (7.4-2a)

z(t_i) = Cx(t_i) + Du(t_i) + Δ^{-1/2}G_c η_i     (7.4-2b)

where n and η are discrete-time white-noise processes with identity covariances. The reasons for the Δ factors were discussed previously.

The filter for the system of Equation (7.4-2) is obtained by making appropriate substitutions in Equations (7.2-13) to (7.2-17). We need to substitute (I + ΔA) in place of Φ, ΔB in place of Ψ, ΔF_cF_c* in place of FF*, and Δ^{-1}G_cG_c* in place of GG*. Combining Equations (7.2-13), (7.2-14), and (7.2-17) and making the substitutions gives

Subtracting x̂(t_i) and dividing by Δ gives a difference quotient whose limit as Δ → 0 gives the filter equation

dx̂(t)/dt = Ax̂(t) + Bu(t) + P(t)C*(G_cG_c*)⁻¹[z(t) - Cx̂(t) - Du(t)]     (7.4-5)

It remains to find the equation for P(t). First note that Equation (7.2-15) becomes

Q(t_i + Δ) = (I + ΔA)P(t_i)(I + ΔA)* + ΔF_cF_c*     (7.4-6)

and thus

[Q(t_i + Δ) - P(t_i)]/Δ = AP(t_i) + P(t_i)A* + F_cF_c* + O(Δ)     (7.4-7)

Equation (7.2-18) is a more convenient form for our current purposes than (7.2-16). Make the appropriate substitutions in Equation (7.2-18) to get

P(t_i + Δ) = Q(t_i + Δ) - Q(t_i + Δ)C*[CQ(t_i + Δ)C* + Δ⁻¹G_cG_c*]⁻¹CQ(t_i + Δ)     (7.4-8)

Subtract P(t_i) and divide by Δ to give

[P(t_i + Δ) - P(t_i)]/Δ = [Q(t_i + Δ) - P(t_i)]/Δ - Q(t_i + Δ)C*[ΔCQ(t_i + Δ)C* + G_cG_c*]⁻¹CQ(t_i + Δ)     (7.4-9)

For the first term on the right of Equation (7.4-9), substitute from Equation (7.4-7). Thus, in the limit as Δ → 0, Equation (7.4-9) becomes

dP(t)/dt = AP(t) + P(t)A* + F_cF_c* - P(t)C*(G_cG_c*)⁻¹CP(t)     (7.4-11)

Equation (7.4-11) is the continuous-time Riccati equation. The initial condition for the equation is P(0) = P_0, the covariance of the initial state, which is assumed to be known. Equations (7.4-5) and (7.4-11) constitute the solution to the continuous-time filtering problem for linear systems with white process and measurement noise. The continuous-time filter requires GG* to be nonsingular.

One point worth noting about the continuous-time filter is that the innovation z(t) - ẑ(t) is a white-noise process with the same power spectral density as the measurement noise. (They are not, however, the same process.) The power spectrum of the innovation can be found by looking at the limit of Equation (7.2-33). Making the appropriate substitutions gives

cov{v_i} = CQ(t_i)C* + Δ⁻¹G_cG_c*     (7.4-12)

The power spectral density of the innovation is then the limit of Δ cov{v_i}, which is G_cG_c*     (7.4-13)

The disappearance of the first term of Equation (7.4-12) in the limit makes the continuous-time filter simpler than the discrete-time one in many ways.
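Equations (7.4-5) and (7.4-11) can be integrated numerically; the following fixed-step Euler sketch (function name and step size are our own, and a production implementation would use a better integrator) propagates the state estimate and covariance together.

```python
import numpy as np

def propagate_continuous_filter(x_hat, P, z_fn, u_fn, A, B, C, D,
                                FcFc, GcGc, t0, t1, dt):
    """Euler integration of the filter Equations (7.4-5) and (7.4-11).

    z_fn(t) and u_fn(t) supply the measurement and input signals.
    FcFc and GcGc are the noise covariances Fc Fc* and Gc Gc*.
    """
    Ginv = np.linalg.inv(GcGc)
    t = t0
    while t < t1 - 1e-12:
        K = P @ C.T @ Ginv                 # gain P C* (Gc Gc*)^-1
        u, z = u_fn(t), z_fn(t)
        x_dot = A @ x_hat + B @ u + K @ (z - C @ x_hat - D @ u)   # (7.4-5)
        P_dot = A @ P + P @ A.T + FcFc - K @ C @ P                # (7.4-11)
        x_hat = x_hat + dt * x_dot
        P = P + dt * P_dot
        t += dt
    return x_hat, P
```

For the scalar case A = 0, F_c = 0, C = G_c = 1, Equation (7.4-11) is dP/dt = -P², whose exact solution P(t) = 1/(1 + t) from P(0) = 1 provides a convenient accuracy check.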

For time-invariant continuous-time systems, we can investigate the possibility that the filter reaches a steady state. As in the discrete-time steady-state filter, this outcome would result in a significant computational advantage. If the steady-state filter exists, it is obvious that the steady-state P(t) must satisfy the equation

0 = AP + PA* + F_cF_c* - PC*(G_cG_c*)⁻¹CP     (7.4-14)

obtained by setting dP/dt to 0 in Equation (7.4-11). The eigenvector decomposition methods referenced after Equation (7.3-2) are also the best practical numerical methods for solving Equation (7.4-14). The following theorem, comparable to Theorem (7.3-1), is not proven here.

Theorem 7.4-1 If all unstable or neutrally stable modes of the system are controllable by the process noise and are observable, and if G_cG_c* is invertible, then Equation (7.4-14) has a unique positive semidefinite solution, and P(t) converges to this solution for all choices of the initial covariance P_0.

Proof See Kailath and Ljung (1976), Balakrishnan (1981), or Kalman and Bucy (1961).

7.5 CONTINUOUS/DISCRETE TIME

Many practical applications of filtering involve discrete sampled measurements of systems with continuous-time dynamics. Since this problem has elements of both discrete and continuous time, there is often debate over whether the discrete- or continuous-time filter is more appropriate. In fact, neither of these filters is appropriate, because they are both based on models that are not realistic representations of the true system. As Schweppe (1973, p. 206) says,

    Some rather interesting arguments sometimes result when one asks the
    question, Are the discrete- or the continuous-time results more useful?
    The answer is, of course, that the question is stupid.... Neither is
    superior in all cases.

The appropriate model for a continuous-time dynamic system with discrete-time measurements is a continuous-time model with discrete-time measurements. Although this statement sounds like a tautology, its point has been missed enough to make it worth emphasizing. Some of the confusion may be due to the mistaken impression that such a mixed model could not be analyzed with the available tools. In fact, the derivation of the appropriate filter is trivial, given the pure continuous- and pure discrete-time results. The filter for this class of problems simply involves an appropriate combination of the discrete- and continuous-time filters previously derived. It takes only a few lines to show how the previously derived results fit this problem. We will spend most of this section talking about implementation issues in a little more detail.

Let the system be described by

dx(t)/dt = Ax(t) + Bu(t) + F_c n(t)     (7.5-1a)

z(t_i) = Cx(t_i) + Du(t_i) + Gη_i     (7.5-1b)

Equation (7.5-1a) is identical to Equation (7.4-1a); and, except for a notation change, Equation (7.5-1b) is identical to Equation (7.0-1b). Note that the observation is only defined at the discrete points t_i, although the state is defined in continuous time.

Between the times of two observations, the analysis of Equation (7.5-1) is identical to that of Equation (7.4-1) with an infinite G matrix or a zero C matrix; either of these conditions is equivalent to having no useful observation. Let x̂(t_i) be the state estimate at time t_i based on the observations up to and including z(t_i). Then the predicted estimate in the interval (t_i, t_i + Δ] is obtained from

   x̄(t_i) = x̂(t_i)                                    (7.5-2)
   dx̄(t)/dt = Ax̄(t) + Bu(t)                           (7.5-3)

The covariance of the prediction is given by

   Q(t_i) = P(t_i)                                     (7.5-4)
   dQ(t)/dt = AQ(t) + Q(t)A* + Fc Fc*                  (7.5-5)

Equations (7.5-3) and (7.5-5) are obtained directly by substituting C = 0 in Equations (7.4-5) and (7.4-11). The notation has been changed to indicate that, because there is no observation in the interval, these are predicted estimates; whereas, in the pure continuous-time filter, the observations are continuously used and filtered estimates are obtained. Integrate Equations (7.5-3) and (7.5-5) over the interval (t_i, t_i + Δ) to obtain the predicted estimate x̄(t_i + Δ) and its covariance Q(t_i + Δ).

In practice, although u(t) is defined continuously, it will often be measured (or otherwise known) only at the time points t_i. Furthermore, the integration will likely be done by a digital computer, which cannot integrate continuous-time data exactly. Thus Equation (7.5-3) will be integrated numerically. The simplest integration approximation would give

   x̄(t_i + Δ) = x̄(t_i) + Δ[Ax̄(t_i) + Bu(t_i)]        (7.5-6)

This approximation may be adequate for some purposes, but it is more often a little too crude. If the A matrix is time-varying, there are several reasonable integration schemes which we will not discuss here; the most common are based on Runge-Kutta algorithms (Acton, 1970). For systems with time-invariant A matrices and constant sample intervals, the transition matrix is by far the most efficient approach. First define

   Φ = exp(AΔ)                                         (7.5-7)
   Ψ = ∫ from 0 to Δ of exp(At) dt B                   (7.5-8)

Then

   x̄(t_i + Δ) = Φx̄(t_i) + Ψu(t_i)                     (7.5-9)

This approximation is the exact solution to Equation (7.5-3) if u(t) holds its value between samples. Wiberg (1971) and Zadeh and Desoer (1963) derive this solution. Moler and Van Loan (1978) discuss various means of numerically evaluating Equations (7.5-7) and (7.5-8). Equation (7.5-9) has the advantage of being in the exact form in which discrete-time systems are usually written (Equation (7.0-1a)).

Equation (7.5-9) introduces about a half-sample delay in the modeling of the response to the control input unless the continuous-time u(t) holds its value between samples; this delay is often unacceptable. Figure (7.5-1) shows a sample input signal and the signal as modeled by Equation (7.5-9). A better approximation is usually

   x̄(t_i + Δ) = Φx̄(t_i) + Ψ(1/2)[u(t_i) + u(t_i + Δ)]        (7.5-10)

This equation models u(t) between samples as being constant at the average of the two sample values. Figure (7.5-2) illustrates this model. There is little phase lag in the model represented by Equation (7.5-10), and the difference in implementation cost between Equations (7.5-9) and (7.5-10) is negligible. Equation (7.5-10) is probably the most commonly used approximation method with time-invariant A matrices.

The high-frequency content introduced by the jumps in the above models can be removed by modeling u(t) as a linear interpolation between the measured values, as illustrated in Figure (7.5-3). This model adds another term to Equation (7.5-10), proportional to u(t_i + Δ) - u(t_i). In our experience, this degree of fidelity is usually unnecessary, and is not worth the extra cost and complication. There are some applications where the accuracy required might justify this or even more complicated methods, such as higher-order spline fits. (The linear interpolation is a first-order spline.)
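Equations (7.5-7) and (7.5-8) can be evaluated together by exponentiating an augmented matrix, one of the approaches surveyed by Moler and Van Loan (1978). The sketch below is ours, not from the text; `expm_series` and `discretize` are our own names, and the truncated power series is adequate only because Δ is short relative to the system time constants.

```python
import numpy as np

def expm_series(M, terms=30):
    # Truncated power series for exp(M); adequate when ||M|| is small,
    # as it is here because the sample interval is short.
    out = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

def discretize(A, B, delta):
    # Augmented-matrix trick:
    #   expm([[A, B], [0, 0]] * delta) = [[Phi, Psi], [0, I]]
    # giving Phi = exp(A*delta) (Eq. 7.5-7) and
    # Psi = integral_0^delta exp(A s) ds B (Eq. 7.5-8).
    n, m = A.shape[0], B.shape[1]
    M = np.zeros((n + m, n + m))
    M[:n, :n] = A * delta
    M[:n, n:] = B * delta
    E = expm_series(M)
    return E[:n, :n], E[:n, n:]
```

For a double integrator with Δ = 0.1, this reproduces the familiar closed-form Φ = I + AΔ and Ψ = [Δ²/2, Δ].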

If you are using a Runge-Kutta algorithm instead of a transition-matrix algorithm for solving the differential equation, linear interpolation of the input introduces negligible extra cost and is common practice.

Equation (7.5-5) does not involve measured data and thus does not present the problems of interpolating between the measurements. The exact solution of Equation (7.5-5) is

   Q(t_i + Δ) = ΦQ(t_i)Φ* + ∫ from 0 to Δ of exp(At)Fc Fc* exp(A*t) dt        (7.5-11)

as can be verified by substitution. Note that Equation (7.5-11) is exactly in the form of a discrete-time update of the covariance (Equation (7.2-15)) if F is defined as a square root of the integral term. For small Δ, the integral term is well approximated by ΔFc Fc*, resulting in

   Q(t_i + Δ) = ΦQ(t_i)Φ* + ΔFc Fc*                                           (7.5-12)

The errors in this approximation are usually far smaller than the uncertainty in the value of Fc, and can thus be neglected. This approximation is significantly better than the alternate approximation

   Q(t_i + Δ) = Q(t_i) + Δ[AQ(t_i) + Q(t_i)A* + Fc Fc*]                       (7.5-13)
obtained by inspection from Equation (7.5-5).

The above discussion has concentrated on propagating the estimate between measurements, i.e., the time update. It remains only to discuss the measurement update for the discrete measurements. We have x̄(t_i) and Q(t_i) at some time point. We need to use these and the measured data at the time point to obtain x̂(t_i) and P(t_i). This is identical to the discrete-time measurement update problem solved by Equations (7.2-16) and (7.2-17). We can also use the alternate forms discussed in Section 7.2.4.

To start the filter, we are given the a priori mean x̄(t_0) and covariance Q(t_0) of the state at time t_0. Use Equations (7.2-16) and (7.2-17) (or alternates) to obtain x̂(t_0) and P(t_0). Integrate Equations (7.5-2) to (7.5-5) from t_0 to t_1 by some means (most likely Equations (7.5-10) and (7.5-12)) to obtain x̄(t_1) and Q(t_1). This completes one time step of the filter; processing of subsequent time points uses the same procedure.

The solution for the steady-state form of the discrete/continuous filter follows immediately from that of the discrete-time filter, because the equations for the covariance updates are identical for the two filters with the appropriate substitution of F in terms of Fc. Theorem (7.3-1) therefore applies.

We can summarize this section by saying that there is a continuous/discrete-time filter derived from appropriate results in the pure discrete- and pure continuous-time analyses. If the input u holds its value between samples, then the form of the continuous/discrete filter is identical to that of the pure discrete-time filter with an appropriate substitution for the equivalent discrete-time process noise covariance.
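One cycle of the continuous/discrete filter described above can be sketched as follows. This is a minimal numpy sketch of our own, not from the text: the function names are ours, the time update uses the approximations of Equations (7.5-10) and (7.5-12), and the measurement update is the standard discrete-time form.

```python
import numpy as np

def time_update(xf, Pf, Phi, Psi, u_avg, Fc, delta):
    # Predicted estimate and covariance over one sample interval:
    # Eq. (7.5-10) for the state (u_avg is the average of the two
    # input samples), Eq. (7.5-12) for the covariance.
    xp = Phi @ xf + Psi @ u_avg
    Qp = Phi @ Pf @ Phi.T + delta * (Fc @ Fc.T)
    return xp, Qp

def measurement_update(xp, Qp, z, C, D, u, GGt):
    # Discrete-time measurement update (the form of Eqs. 7.2-16/7.2-17).
    S = C @ Qp @ C.T + GGt          # innovation covariance
    K = Qp @ C.T @ np.linalg.inv(S) # filter gain
    xf = xp + K @ (z - C @ xp - D @ u)
    Pf = (np.eye(len(xp)) - K @ C) @ Qp
    return xf, Pf
```

A scalar example makes the arithmetic easy to check by hand: with unit prediction covariance 1.5 and measurement-noise covariance 1.5, the gain is 0.5 and the filtered covariance is halved.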
For more realistic behavior of u, we must adopt approximations if the analysis is done on a digital computer. It is also possible to view the continuous-time filter equations as giving reasonable approximations to the continuous/discrete-time filter in some situations. In any event, we will not go wrong as long as we recognize that we can write the exact filter equations for the continuous/discrete-time system and that we must consider any other equations used as approximations to the exact solution. With this frame of mind we can objectively evaluate the adequacy of the approximations involved for specific problems.

7.6 SMOOTHING

The derivation of optimal smoothers draws heavily on the derivation of the Kalman filter. Starting from the filter results, only a single step is required to compute the smoothed estimates. In this section, we briefly derive the fixed-interval smoother for discrete-time linear systems with additive Gaussian noise. Fixed-interval smoothers are the most widely used. The same general principles apply to deriving fixed-point and fixed-lag smoothers. See Meditch (1969) for derivations and equations for fixed-point and fixed-lag smoothers and for continuous-time forms.

There are alternate computational forms for the fixed-interval smoother; these forms give mathematically equivalent results. We will not discuss computational advantages of the various forms. See Bierman (1977) and Bach and Wingrove (1983) for alternate forms and discussions of their advantages.

Consider the fixed-interval smoothing problem on an interval with N time points. As in the filter derivation, we will concentrate on two time points at a time in order to get a recursive form. It is straightforward to write an explicit formulation for the smoother, like the explicit filter form of Section 7.1, but such a form is impractical.

In the nature of recursive derivations, assume that we have previously computed x̃_{i+1}, the smoothed estimate of x_{i+1}, and S_{i+1}, the covariance of x̃_{i+1} given Z_N. We seek to derive an expression for x̃_i and S_i. Note that this recursion runs backwards in time instead of forwards; a forward recursion will not work, for reasons which we will see later.

The smoothed estimates, x̃_i and x̃_{i+1}, are defined by

   x̃_i = E{x_i | Z_N}        x̃_{i+1} = E{x_{i+1} | Z_N}        (7.6-1)
We will use the measurement partitioning ideas of Section 5.2.2, with the measurement Z_N partitioned into Z_i and the remaining measurements (z_{i+1},...,z_N).

From the derivation of the Kalman filter, we can write the joint distribution of x_i and x_{i+1} conditioned on Z_i. It is Gaussian, with mean and covariance given by Equations (7.6-3) and (7.6-4). We did not previously derive the cross term in the above covariance matrix; it follows by direct expansion (Equation (7.6-5)).
For the second step of the partitioned algorithm, we consider the measurements z_{i+1},...,z_N, using Equations (7.6-3) and (7.6-4) for the prior distribution. These measurements can be written in a linear form (Equation (7.6-6)) for some matrices and some Gaussian, zero-mean, identity-covariance noise vector. Although we could laboriously write out expressions for the matrices in Equation (7.6-6), this step is unnecessary; we need only know that such a form exists. The important thing about Equation (7.6-6) is that x_i does not appear in it.

Using Equations (7.6-3) and (7.6-4) for the prior distribution and Equation (7.6-6) for the measurement equation, we can now obtain the joint posterior distribution of x_i and x_{i+1} given Z_N. This distribution is Gaussian, with mean and covariance given by Equations (5.1-12) and (5.1-13), substituting Equation (7.6-3) for m_ξ, Equation (7.6-4) for P, and the matrices of Equation (7.6-6) for the measurement-equation quantities.

By definition (Equation (7.6-1)), the mean of this distribution gives the smoothed estimates x̃_i and x̃_{i+1}. Making the substitutions into Equation (5.1-12) and expanding gives Equation (7.6-8).
We can solve Equation (7.6-8) for x̃_i in terms of x̃_{i+1}, which we assume to have been computed in the previous step of the backwards recursion.
Equation (7.6-9) is the backwards recursive form sought. Note that the equation does not depend explicitly on the measurements or on the matrices in Equation (7.6-6). That information is all subsumed in x̃_{i+1}. The "initial" condition for the recursion is

   x̃_N = x̂_N        (7.6-10)

which follows directly from the definitions. We do not have a corresponding known boundary condition at the beginning of the interval, which is why we must propagate the smoothing recursion backwards, instead of forwards.

We can now describe the complete process of computing the smoothed state estimates for a fixed time interval. First propagate the Kalman filter through the entire interval, saving all of the values x̄_i, x̂_i, P_i, and Q_i. Then propagate Equation (7.6-9) backwards in time, using the saved values from the filter, and starting from the boundary condition given by Equation (7.6-10). We can derive a formula for the smoother covariance by substituting appropriately into Equation (5.1-13) to get Equation (7.6-11).
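The backward pass just described can be sketched over the saved filter quantities. This sketch is ours, not from the text: it assumes the standard Rauch-Tung-Striebel gain form P_i Φ* Q_{i+1}^{-1} for the recursion of Equation (7.6-9) and a time-invariant Φ; the function and variable names are our own.

```python
import numpy as np

def smooth_states(x_filt, x_pred, P_filt, Q_pred, Phi):
    # Backward recursion for the smoothed estimates, started from
    # Eq. (7.6-10): the smoothed estimate at the last point equals
    # the filtered one.
    #   x_filt[i], P_filt[i]: filtered estimate/covariance at t_i
    #   x_pred[i], Q_pred[i]: predicted estimate/covariance at t_i
    N = len(x_filt)
    x_smooth = [None] * N
    x_smooth[N - 1] = x_filt[N - 1]
    for i in range(N - 2, -1, -1):
        Ai = P_filt[i] @ Phi.T @ np.linalg.inv(Q_pred[i + 1])
        x_smooth[i] = x_filt[i] + Ai @ (x_smooth[i + 1] - x_pred[i + 1])
    return x_smooth
```

Note that, consistent with the text, only the filter outputs enter the recursion; the measurements themselves have dropped out.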

( T k off-dfagondl blocks are not relevant t o t h i s derivation.) terms of Sf+,, g i v i n g

Ye can solve Equation (7.6-11)

for

Sf

in

This gives us a kckwards recursion f o r t h e smoother covariance.

The " i n i t i a l " c o n d i t i o n

f o l l o w s from the d e f i n i t i o n s . Note that, as i n the recursion f o r the smoothed estimate, the measurements and A l l the necessary data about the the measurement equation matrices have dropped out o f Equation (?.6-12). f u t u r e process i s subsumed i n Si+,. Note a l s o t h a t i t i s n o t necessary t o compute the smother covariance S i i n order t o compute the smoothed estimates. 7.7 NONLINEAR SYSTEMS AND NON-GAUSSIAN NOISE

Optimal state estimation for nonlinear dynamic systems is substantially more difficult than for linear systems. Only in rare special cases are there tractable exact solutions for optimal filters for nonlinear systems. The same comments apply to systems with non-Gaussian noise. Practical implementations of filters for nonlinear systems invariably involve approximations. The most common approximations are based on linearizing the system and using the optimal filter for the linearized system. Similarly, non-Gaussian noise is approximated, to first order, by Gaussian noise with the same mean and covariance.

Consider a nonlinear dynamic system with additive noise

   dx(t)/dt = f(x(t),u(t)) + n(t)           (7.7-1a)
   z(t_i) = g(x(t_i),u(t_i)) + eta_i        (7.7-1b)

Assume that we have some nominal estimate, x_n(t), of the state time history. Then the linearization of Equation (7.7-1) about this nominal trajectory is

   dx(t)/dt = A(t)x(t) + B(t)u(t) + f_n(t) + n(t)        (7.7-2a)

where A(t), B(t), and f_n(t) are defined by the partial derivatives of f evaluated along the nominal trajectory.

For a given nominal trajectory, Equations (7.7-2) to (7.7-4) define a time-varying linear system. The Kalman filter/smoother algorithms derived in previous sections of this chapter give optimal state estimates for this linearized system. The filter based on this linearized system is called a linearized Kalman filter or an extended Kalman filter (EKF). Its adequacy as an approximation to the optimal filter for the nonlinear system depends on several factors which we will not analyze in depth. It is a reasonable supposition that if the system is nearly linear, then the linearized Kalman filter will be a close approximation to the optimal filter for the system. If, on the other hand, nonlinearities play a major role in defining the characteristic system responses, the reasonableness of the linearized Kalman filter is questionable.

The above description is intended only to introduce the simplest ideas of linearized Kalman filters. Starting from this point, there are numerous extensions, modifications, and nuances of application. Nonlinear filtering is an area of current research. See Bach and Wingrove (1983) and Cox and Bryson (1980) for a few of the many investigations in this field. Schweppe (1973) and Jazwinski (1970) have fairly extensive discussions of nonlinear state estimation.
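When analytic partial derivatives of f are inconvenient, the Jacobians A(t) and B(t) of the linearization can be approximated numerically about a point on the nominal trajectory. A minimal sketch of our own (the function names are ours, and the use of central differences is our choice, not prescribed by the text):

```python
import numpy as np

def linearize(f, xn, un, eps=1e-6):
    # Numerical Jacobians A = df/dx and B = df/du about the nominal
    # point (xn, un), approximating the partial derivatives that
    # define the linearized system matrices.
    n, m = len(xn), len(un)
    f0 = f(xn, un)
    A = np.zeros((len(f0), n))
    B = np.zeros((len(f0), m))
    for j in range(n):
        dx = np.zeros(n); dx[j] = eps
        A[:, j] = (f(xn + dx, un) - f(xn - dx, un)) / (2 * eps)
    for j in range(m):
        du = np.zeros(m); du[j] = eps
        B[:, j] = (f(xn, un + du) - f(xn, un - du)) / (2 * eps)
    return A, B
```

The Jacobians must be re-evaluated as the nominal trajectory changes, which is the main cost of the extended Kalman filter relative to the linear one.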

Figure (7.5-1).  Hold-last-value input model.

Figure (7.5-2).  Average value input model.

Figure (7.5-3).  Linear interpolation input model.

CHAPTER 8

8.0 OUTPUT ERROR METHOD FOR DYNAMIC SYSTEMS

In previous chapters, we have covered the static estimation problem and the estimation of the state of dynamic systems. With this background, we can now begin to address the principal subject of this book, estimation of the parameters of dynamic systems. Before addressing the more difficult parameter estimation problems posed by more general system models, we will consider the simplified case that leads to the algorithm called output error.

The simplification that leads to the output-error method is to omit the process-noise term from the state equation. For this reason, the output-error method is often described by terms like "the no-process-noise algorithm" or "the measurement-noise-only algorithm."

We will first discuss mixed continuous/discrete-time systems, which are most appropriate for the majority of the practical applications. We will follow this discussion by a brief summary of any differences for pure discrete-time systems, which are useful for some applications. The derivation and results are essentially identical. The pure continuous-time results, although similar in expression, involve extra complications. We have never seen an appropriate practical application of the pure continuous-time results; we therefore feel justified in omitting them.

In mixed continuous/discrete time, the most general system model that we will seriously consider is

   x(t_0) = x_0                               (8.0-1a)
   dx(t)/dt = f[x(t),u(t),ξ]                  (8.0-1b)
   z(t_i) = g[x(t_i),u(t_i),ξ] + G eta_i      (8.0-1c)

The measurement noise eta is assumed to be a sequence of independent Gaussian random variables with zero mean and identity covariance. The input u is assumed to be known exactly. The initial condition x_0 can be treated in several ways, as discussed in Section 8.2. In general, the functions f and g can also be explicit functions of t. We omit this from the notation for simplicity. (In any event, explicit time dependence can be put in the notation of Equation (8.0-1) by defining an extra control equal to t.)

The corresponding nonlinear model for pure discrete-time systems is

   x(t_0) = x_0                                   (8.0-2a)
   x(t_{i+1}) = f[x(t_i),u(t_i),ξ]                (8.0-2b)
   z(t_i) = g[x(t_i),u(t_i),ξ] + G eta_i          (8.0-2c)

The assumptions are the same as in the continuous/discrete case. Although the output-error method applies to nonlinear systems, we will give special attention to the treatment of linear systems. The linear form of Equation (8.0-1) is

   x(t_0) = x_0                               (8.0-3a)
   dx(t)/dt = Ax(t) + Bu(t)                   (8.0-3b)
   z(t_i) = Cx(t_i) + Du(t_i) + G eta_i       (8.0-3c)
The matrices A, B, C, D, and G are functions of ξ; we will not complicate the notation by explicitly indicating this relationship. Of course, x and z are also functions of ξ through their dependence on the system matrices.

In general, the matrices A, B, C, D, and G can also be functions of time. For notational simplicity, we have not explicitly indicated this dependence. In several places, time invariance of the matrices introduces significant computational savings. The text will indicate such situations. Note that ξ cannot be a function of time. Problems with time-varying ξ must be reformulated with a time-invariant ξ in order for the techniques of this chapter to be applicable.

The linear form of Equation (8.0-2) is

   x(t_0) = x_0                                   (8.0-4a)
   x(t_{i+1}) = Φx(t_i) + Ψu(t_i)                 (8.0-4b)
   z(t_i) = Cx(t_i) + Du(t_i) + G eta_i           (8.0-4c)

The transition matrices Φ and Ψ are functions of ξ, and possibly of time.

For any of the model forms, a prior distribution for ξ may or may not exist, depending on the particular application. When there is no prior distribution, or when you desire to obtain an estimate independent of the

prior distribution, use a maximum-likelihood estimator. When a prior distribution is considered, MAP estimators are appropriate. For the parameter estimation problem, a posteriori expected-value estimates and Bayesian optimal estimates are impractical to compute, except in special cases. The posterior distribution of ξ is not, in general, symmetric; thus the a posteriori expected value need not equal the MAP estimate.

The basic method of derivation for the output-error method is to reduce the problem to the static form of Chapter 5. We will see that the dynamic system makes the models fairly complicated, but not different in any essential way from those of Chapter 5.

We first consider the case where G and the initial condition are assumed to be known. Choose an arbitrary value of ξ. Given the initial condition x_0 and a specified input time-history u, the state equation (8.0-1b) can be solved to give the state as a function of time. We assume that f is sufficiently smooth to guarantee the existence and uniqueness of the solution (Brauer and Nohel, 1969). For complicated f functions, the solution may be difficult or impossible to express in closed form, but that aspect is irrelevant to the theory. (The practical implication is that the solution will be obtained using numerical approximation methods.) The important thing to note is that, because of the elimination of the process noise, the solution is deterministic. For a specified input u, the system state is thus a deterministic function of ξ and time. For consistency with the notation of the filter-error method discussed later, denote this function by x̃_ξ(t). The ξ subscript emphasizes its dependence on ξ. The dependence on u is not relevant to the current discussion, so the notation ignores this dependence for simplicity. Assuming known G, Equation (8.0-1c) then becomes

   z(t_i) = g[x̃_ξ(t_i),u(t_i),ξ] + G eta_i        (8.1-1)

Equation (8.1-1) is in the form of Equation (2.4-1); it is a static nonlinear model with additive noise. There are multiple experiments, one at each t_i. The estimators of Section 5.4 apply directly. The assumptions adopted have allowed us to solve the system dynamics, leaving an essentially static problem.

The MAP estimate is obtained by minimizing Equation (5.4-9). In the notation of this chapter, this equation becomes

   J(ξ) = (1/2) Σ from i=1 to N of [z(t_i) - z̃_ξ(t_i)]*(GG*)^-1[z(t_i) - z̃_ξ(t_i)] + (1/2)(ξ - m_ξ)*P^-1(ξ - m_ξ)        (8.1-2)

where

   z̃_ξ(t_i) = g[x̃_ξ(t_i),u(t_i),ξ]                                  (8.1-3a)
   dx̃_ξ(t)/dt = f[x̃_ξ(t),u(t),ξ]        x̃_ξ(t_0) = x_0             (8.1-3b)

The quantities m_ξ and P are the mean and covariance of the prior distribution of ξ, as in Chapter 5. For the MLE estimator, omit the last term of Equation (8.1-2), giving

   J(ξ) = (1/2) Σ from i=1 to N of [z(t_i) - z̃_ξ(t_i)]*(GG*)^-1[z(t_i) - z̃_ξ(t_i)]        (8.1-4)
Equation (8.1-4) is a quadratic form in the difference between z, the measured response (output), and z̃, the response computed from the deterministic part of the system model. This motivates the name "output error." The minimization of Equation (8.1-4) is an intuitively plausible estimator, defensible even without statistical derivation. The minimizing value of ξ gives the system model that best approximates (in a least-squares sense) the actual system response to the test input. Although this does not necessarily guarantee that the model response and the system response will be similar for other test inputs, the minimizing value of ξ is certainly a plausible estimate.

The estimates that result from minimizing Equation (8.1-4) are sometimes called "least squares" estimates, in reference to the quadratic form of the equation. We prefer to avoid the use of this terminology because it is potentially confusing. Many of the estimators applicable to dynamic systems have a least-squares form, so the term is not definitive. Furthermore, the term "least squares" is most often applied to Equation (8.1-4) to contrast it with other forms labeled "maximum likelihood" (typically the estimators of Section 8.4, which apply to unknown G, or the estimators of Chapter 9, which account for process noise). This contrast is misleading, because Equation (8.1-4) describes a completely rigorous, maximum-likelihood estimator for the problem as posed. The differences between Equation (8.1-4) and the estimators of Sections 8.4 and Chapter 9 are differences in the problem statement, not differences in the statistical principles used for solution.
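As a concrete illustration, the MLE cost of Equation (8.1-4) can be evaluated directly from the measured and computed responses. This is a minimal numpy sketch of our own (the function name and interface are assumptions; the computed responses `z_model` would come from integrating Equations (8.1-3)):

```python
import numpy as np

def output_error_cost(z, z_model, GGt):
    # Eq. (8.1-4): quadratic form in the output error, weighted by
    # the inverse measurement-noise covariance GG*.
    # z, z_model: lists of measured and computed response vectors.
    W = np.linalg.inv(GGt)
    J = 0.0
    for zi, zmi in zip(z, z_model):
        e = zi - zmi            # output error at one time point
        J += 0.5 * e @ W @ e
    return J
```

Adding the prior term (1/2)(ξ - m_ξ)*P^-1(ξ - m_ξ) to this value gives the MAP cost of Equation (8.1-2).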


To derive the output-error method for pure discrete-time systems, substitute the discrete-time Equation (8.0-2b) in place of Equation (8.0-1b). The derivation and the result are unchanged except that Equation (8.1-3b) becomes

   x̃_ξ(t_{i+1}) = f[x̃_ξ(t_i),u(t_i),ξ]

8.2 INITIAL CONDITIONS

The above derivation of the output-error method assumed that the initial condition was known exactly. This assumption is seldom strictly true, except when using forms where the initial condition is zero by definition. The initial condition is typically based on imperfectly measured data. This characteristic suggests treating the initial condition as a random variable with some mean and covariance. Such treatment, however, is incompatible with the output-error method. The output-error method is predicated on a deterministic solution of the state equation. Treatment of a random initial condition requires the more complex filter-error method discussed later.

If the system is stable, then initial condition effects decay to a negligible level in a finite time. If this decay is sufficiently fast and the error in the initial condition is sufficiently small, the initial condition error will have negligible effect on the system response and can be ignored.

If the errors in the initial condition are too large to justify neglecting them, there are several ways to resolve the problem without sacrificing the relative simplicity of the output-error method. One way is to simply improve the initial-condition values. This is sometimes trivially easy if the initial-condition value is computed from the measurement at the first time point of the maneuver (a common practice): change the start time by one sample to avoid an obvious wild point, average the first few data points, or draw a fairing through the noise and use the faired value.

When these methods are inapplicable or insufficient, we can include the initial condition in the list of unknown parameters to estimate. The initial condition is then a deterministic function of ξ. The solution of the state equation is thus still a deterministic function of ξ and time, as required for the output-error method. The equations of Section 8.1 still apply, provided that we substitute

   x̃_ξ(t_0) = x_0(ξ)

for the initial condition.

It is easy to show that the initial-condition estimates have poor asymptotic properties as the time interval increases. The initial-condition information is all near the beginning of the maneuver, and increasing the time interval does not add to this information. Asymptotically, we can and should ignore initial conditions for stable systems. This is one case where asymptotic results are misleading. For real data with finite time intervals we should always carefully consider initial conditions. Thus, we avoid making the mistake of one published paper (which we will leave anonymous) which blithely set the model initial condition to zero in spite of clearly nonzero data. It is not clear whether this was a simple oversight or whether the author thought that asymptotic results justified the practice; in any event, the resulting errors were so egregious as to render the results worthless (except as an object lesson).

8.3 COMPUTATIONS

Equations (8.1-2) and (8.1-3) define the cost function that must be minimized to obtain the MAP estimates (or, in the special case that P^-1 is zero, the MLE estimates). This is a fairly complicated function of ξ. Therefore we must use an iterative minimization scheme.
It is easy to become overwhelmed by the apparent complexity of J as a function of ξ; z̃_ξ(t_i) is itself a complicated function of ξ, involving the solution of a differential equation. To get J as a function of ξ we must substitute this function for z̃_ξ(t_i) in Equation (8.1-2). You might give up at the thought of evaluating first and second gradients of this function, as required by most iterative optimization methods. The complexity, however, is only apparent. It is crucial to recognize that we do not need to develop a closed-form expression, the development of which would be difficult at best. We are only required to develop a workable procedure for computing the result.

To evaluate the gradients of J, we need only proceed one step at a time; each step is quite simple, involving nothing more complicated than chain-rule differentiation. This step-by-step process follows the advice from Alice in Wonderland:

   The White Rabbit put on his spectacles. "Where shall I begin, please your Majesty?" he asked. "Begin at the beginning," the King said, very gravely, "and go on till you come to the end: then stop."

8.3.1 Gauss-Newton Method

The cost function is in the form of a sum of squares, which makes Gauss-Newton the preferred optimization algorithm. Sections 2.5.2 and 5.4.3 discussed the Gauss-Newton algorithm. To gather together all the important equations, we repeat the basic equations of the Gauss-Newton algorithm in the notation of this chapter. Gauss-Newton is a quasi-Newton algorithm. The full Newton-Raphson algorithm is

   ξ_{L+1} = ξ_L - [∇²_ξ J(ξ_L)]^-1 [∇_ξ J(ξ_L)]*        (8.3-1)

The first gradient is

   ∇_ξ J(ξ) = - Σ from i=1 to N of [z(t_i) - z̃_ξ(t_i)]*(GG*)^-1[∇_ξ z̃_ξ(t_i)] + (ξ - m_ξ)*P^-1        (8.3-2)

For the Gauss-Newton algorithm, we approximate the second gradient by

   ∇²_ξ J(ξ) ≈ Σ from i=1 to N of [∇_ξ z̃_ξ(t_i)]*(GG*)^-1[∇_ξ z̃_ξ(t_i)] + P^-1        (8.3-3)

which corresponds to Equation (2.5-11) applied to the cost function of this chapter. Equations (8.3-1) through (8.3-3) are the same, whether the system is in pure discrete time or mixed continuous/discrete time. The only quantities in these equations requiring any discussion are z̃_ξ(t_i) and ∇_ξ z̃_ξ(t_i).

8.3.2 System Response

The methods f o r computation of the system response depend on whether the tystem i s pure discrete time o r mixed continuous/discrete tlme. The choice of method i s also influenced by whether the system i s linear o r nonlinear. Computation of the response of discrete-time systems i s simply a matter o f plugging i n t o the equations. The general equations f o r a nonlinear system are

x̂ξ(t₀) = x₀(ξ)   (8.3-4a)

x̂ξ(t_{i+1}) = f[x̂ξ(ti), u(ti), ξ]     i = 0,1,...   (8.3-4b)

ẑξ(ti) = g[x̂ξ(ti), u(ti), ξ]     i = 1,2,...   (8.3-4c)

The more specific equations for a linear discrete-time system are

x̂ξ(t₀) = x₀(ξ)   (8.3-5a)

x̂ξ(t_{i+1}) = Φ x̂ξ(ti) + Ψ u(ti)     i = 0,1,...   (8.3-5b)

ẑξ(ti) = C x̂ξ(ti) + D u(ti)     i = 1,2,...   (8.3-5c)
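As a concrete illustration, the linear discrete-time response recursion of Equation (8.3-5) can be sketched as follows. The matrices, initial condition, and input history here are illustrative values, not taken from the text.

```python
# Sketch of the linear discrete-time response recursion (Equation (8.3-5)).
import numpy as np

def discrete_response(Phi, Psi, C, D, x0, u):
    """Propagate x(t_{i+1}) = Phi x(t_i) + Psi u(t_i) and return the
    predicted responses z(t_i) = C x(t_i) + D u(t_i) for i = 1..N."""
    x = x0
    z = []
    for i in range(len(u) - 1):
        x = Phi @ x + Psi @ u[i]          # state update, i = 0,1,...
        z.append(C @ x + D @ u[i + 1])    # response at t_{i+1}
    return np.array(z)

# Illustrative scalar system with a unit step input.
Phi = np.array([[0.9]])
Psi = np.array([[0.1]])
C = np.array([[1.0]])
D = np.array([[0.0]])
u = np.ones((5, 1))
z = discrete_response(Phi, Psi, C, D, np.zeros(1), u)
```

Note that the recursion needs no matrix inversions or closed-form solution; it simply marches forward in time.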

For mixed continuous/discrete-time systems, numerical methods for approximate integration are required. You can use any of numerous numerical methods, but the utility of the more complicated methods is often limited by the available data. It makes little sense to use a high-order method to integrate the system equations between the time points where the input is measured. The errors implicit in interpolating the input measurements are probably larger than the errors in the integration method. For most purposes, a second-order Runge-Kutta algorithm is probably an appropriate choice:
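A minimal sketch of such a second-order Runge-Kutta step, with the midpoint input obtained by linearly interpolating the measured input samples; the dynamics function, input values, and step size here are illustrative assumptions, not the specific variant used in the text.

```python
# Midpoint (second-order) Runge-Kutta step between input samples.
def rk2_step(f, x, u_i, u_ip1, dt):
    """One step from t_i to t_{i+1}; the input at the midpoint is taken as
    the linear interpolation of the two measured samples."""
    u_mid = 0.5 * (u_i + u_ip1)
    k1 = f(x, u_i)
    return x + dt * f(x + 0.5 * dt * k1, u_mid)

# Illustrative example: xdot = -x + u with constant unit input, 10 steps of
# dt = 0.1; the exact value at t = 1 is 1 - exp(-1), about 0.6321.
f = lambda x, u: -x + u
x = 0.0
for _ in range(10):
    x = rk2_step(f, x, 1.0, 1.0, 0.1)
```

The second-order accuracy is consistent with linear interpolation of the input, which is the point made above: a higher-order integrator would be limited by the input-interpolation error anyway.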

For linear systems, a transition matrix method is more accurate and efficient than Equation (8.3-6):

x̂ξ(t₀) = x₀(ξ)   (8.3-7a)

x̂ξ(t_{i+1}) = Φ x̂ξ(ti) + Ψ u(ti)   (8.3-7b)

where

Φ = exp[A(t_{i+1} - ti)]   (8.3-8)

Ψ = ∫_{ti}^{t_{i+1}} exp[A(t_{i+1} - τ)] dτ B   (8.3-9)

Section 7.5 discusses the form of Equation (8.3-7b). Moler and Van Loan (1978) describe several ways of numerically evaluating Equations (8.3-8) and (8.3-9). In this application, because t_{i+1} - ti is small compared to the system natural periods, simple series expansion works well.
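The series-expansion evaluation of Φ and Ψ can be sketched as follows; the example matrices, interval, and truncation length are illustrative assumptions.

```python
# Truncated series expansion of the transition matrices for a small interval:
#   Phi = sum_k (A dt)^k / k!
#   Psi = (integral over the interval of exp(A tau) dtau) B
#       = sum_k A^k dt^{k+1} / (k+1)!  times  B
import numpy as np

def transition_matrices(A, B, dt, terms=10):
    n = A.shape[0]
    Phi = np.eye(n)
    S = np.eye(n) * dt            # S accumulates the integral of exp(A tau)
    term = np.eye(n)
    for k in range(1, terms):
        term = term @ A * dt / k          # (A dt)^k / k!
        Phi = Phi + term
        S = S + term * dt / (k + 1)       # A^k dt^{k+1} / (k+1)!
    return Phi, S @ B

# Illustrative second-order system.
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
Phi, Psi = transition_matrices(A, B, 0.05)
```

Because the interval is short compared to the system time constants, a few series terms suffice; the semigroup property exp(A·2Δ) = exp(A·Δ)² gives a convenient accuracy check.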

8.3.3 Finite-Difference Response Gradient

It remains to discuss the computation of ∇ξ ẑξ(ti), the gradient of the system response. There are two basic methods for evaluating this gradient: finite-difference differentiation and analytic differentiation. This section discusses the finite-difference approach, and the next section discusses the analytic approach.

Finite-difference differentiation is applicable to any model form. The method is easy to describe and equally easy to code. Because it is easy to code, finite-difference differentiation is appropriate for programs where quick results are needed or the production workload is small enough that saving program development time is more important than improving program efficiency. Because it applies with equal ease to all model forms, finite-difference differentiation is also appropriate for programs that must handle nonlinear models, for which analytic differentiation is numerically complicated (Jategaonkar and Plaetschke, 1983). To use finite-difference differentiation, perturb the first element of the ξ vector by some small amount δξ⁽¹⁾. Recompute the system response using this perturbed ξ vector, obtaining the perturbed system response ẑp. The partial derivative of the response with respect to ξ⁽¹⁾ is then approximately

∂ẑξ(ti)/∂ξ⁽¹⁾ ≈ [ẑp(ti) - ẑξ(ti)]/δξ⁽¹⁾   (8.3-13)

Repeat this process, perturbing each element of ξ in turn, to approximate the partial derivatives with respect to each element of ξ. The finite-difference gradient is then the concatenation of the partial derivatives:

∇ξ ẑξ(ti) ≈ [∂ẑξ(ti)/∂ξ⁽¹⁾ ... ∂ẑξ(ti)/∂ξ⁽ᵖ⁾]   (8.3-14)

Selection of the size of the perturbations requires some thought. If the perturbation is too large, Equation (8.3-13) becomes a poor approximation of the partial derivative. If the perturbation is too small, roundoff errors become a problem.
Some people have reported excellent results using simple perturbation-size rules such as setting the perturbation magnitude at 1% of a typical expected magnitude of the corresponding ξ element (assuming that you understand the problem well enough to be able to establish such typical magnitudes). You could alternatively consider percentages of the current iteration estimates (with some special provision for handling zero or essentially zero estimates). Another reasonable rule, after the first iteration, would be to use percentages of the diagonal elements of the second gradient, raised to the -1/2 power. As a final resort (it takes more computer time and is more complex), you could try several perturbation sizes, using the results to gauge the degree of nonlinearity and roundoff error, and adaptively selecting the best perturbation size.
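The finite-difference gradient with the 1%-of-typical-magnitude perturbation rule can be sketched as follows; the response function and the "typical magnitudes" are illustrative assumptions, not values from the text.

```python
# One-sided finite-difference gradient of a vector-valued response.
import numpy as np

def fd_response_gradient(response, xi, typical, rel=0.01):
    """Approximate the gradient of response(xi) (an N-vector) with respect
    to each element of xi, using perturbations of rel * typical[j]."""
    z0 = response(xi)
    grad = np.zeros((z0.size, xi.size))
    for j in range(xi.size):
        dxi = rel * typical[j]            # perturbation-size rule
        xi_p = xi.copy()
        xi_p[j] += dxi
        grad[:, j] = (response(xi_p) - z0) / dxi
    return grad

# Illustrative two-parameter exponential response sampled at three times.
t = np.array([0.0, 1.0, 2.0])
response = lambda xi: xi[0] * np.exp(-xi[1] * t)
G = fd_response_gradient(response, np.array([2.0, 0.5]),
                         typical=np.array([1.0, 1.0]))
```

Each parameter costs one extra response evaluation per iteration; the accuracy trade-off against roundoff is exactly the perturbation-size question discussed above.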

Due to our limited experience with the finite-difference approach, we defer making specific recommendations on perturbation sizes, but offer the opinion that the problem is amenable to reasonable solution. A little experimentation should suffice to establish an adequate perturbation-size rule for a specific class of problems. Note that the higher the precision of your computer, the more margin you have between the boundaries of linearity problems and roundoff problems. Those of us with 60- and 64-bit computers (or 32-bit computers in double precision) seldom have serious roundoff problems and can use simple perturbation-size rules with impunity. If you try to get by with single precision on a 32-bit computer, careful perturbation-size selection will be more important.

8.3.4 Analytic Response Gradient

The other approach to computing the gradient of the system response is to analytically differentiate the system equations. For linear systems, this approach is sometimes far more efficient than finite-difference differentiation. For nonlinear systems, analytic differentiation is impractically clumsy (partially because you have to redo it for each new nonlinear model form). We will, therefore, restrict our discussion of analytic differentiation to linear systems.
We first consider pure discrete-time linear systems in the form of Equation (8.3-5). It is crucial to recall that we do not need a closed form for the gradient; we only need a method for computing it. A closed-form expression would be formidable, unlike the following equation, which is the almost embarrassingly obvious gradient of Equation (8.3-5), obtained by using nothing more complicated than the chain rule:

∇ξ x̂ξ(t₀) = ∇ξ x₀(ξ)   (8.3-15a)

∇ξ x̂ξ(t_{i+1}) = Φ[∇ξ x̂ξ(ti)] + (∇ξΦ)x̂ξ(ti) + (∇ξΨ)u(ti)   (8.3-15b)

∇ξ ẑξ(ti) = C[∇ξ x̂ξ(ti)] + (∇ξC)x̂ξ(ti) + (∇ξD)u(ti)   (8.3-15c)

Equation (8.3-15b) gives a recursive formula for ∇ξ x̂ξ(ti), with Equation (8.3-15a) as the initial condition. Equation (8.3-15c) expresses ∇ξ ẑξ(ti) in terms of the solution of Equation (8.3-15b).

The quantities ∇ξΦ, ∇ξΨ, ∇ξC, and ∇ξD in Equation (8.3-15) are gradients of matrices with respect to the vector ξ. The results are vectors, the elements of which are matrices (if you are fond of buzz words, these are third-order tensors). If this starts to sound complicated, you will be pleased to know that the products like (∇ξD)u(ti) are ordinary matrices (and indeed sparse matrices; they have lots of zero elements). You can compute the products directly without ever forming the vector of matrices in your program. A program to implement Equation (8.3-15) takes fewer lines than the explanation.

We could write Equation (8.3-15) without using gradients or matrices. Simply replace ∇ξ by ∂/∂ξ⁽ʲ⁾ throughout, and then concatenate the partial derivatives to get the gradient of ẑ(ti). We then have, at worst, partial derivatives of matrices with respect to scalars; these partial derivatives are matrices. The only difference between writing the equations with partial derivatives or gradients is notational. We choose to use the gradient notation because it is shorter and more consistent with the rest of the book.

Let us look at Equation (8.3-15c) in detail to see how these equations would be implemented in a program, and perhaps to better understand the equations. The left-hand side is a matrix. Each column of the matrix is the partial derivative of ẑ(ti) with respect to one element of ξ:

∇ξ ẑξ(ti) = [∂ẑξ(ti)/∂ξ⁽¹⁾ ... ∂ẑξ(ti)/∂ξ⁽ᵖ⁾]

The quantity ∇ξ x̂ξ(ti) is a similar matrix, computed from Equation (8.3-15b); thus C[∇ξ x̂ξ(ti)] is a multiplication of a matrix times a matrix, and this is a calculation we can handle. The quantity ∇ξC is the vector of matrices

∇ξC = [∂C/∂ξ⁽¹⁾ ... ∂C/∂ξ⁽ᵖ⁾]

and the product (∇ξC)x̂ξ(ti) is

(∇ξC)x̂ξ(ti) = [(∂C/∂ξ⁽¹⁾)x̂ξ(ti) ... (∂C/∂ξ⁽ᵖ⁾)x̂ξ(ti)]

(Our notation does not indicate explicitly that this is the intended product formula, but the other conceivable interpretation of the notation is obviously wrong because the dimensions are incompatible. Formal tensor notation would make the intention explicit, but we do not really need to introduce tensor notation here because the correct interpretation is obvious.) In many cases the matrix ∂C/∂ξ⁽ʲ⁾ will be sparse. Typically these matrices are either zero or have only one nonzero element. We can take advantage of such sparseness in the computation. If C is not a function of ξ⁽ʲ⁾ (presumably ξ⁽ʲ⁾ affects other of the system matrices), then ∂C/∂ξ⁽ʲ⁾ is a zero matrix. If only the (k,m) element of C is affected by ξ⁽ʲ⁾, then [∂C/∂ξ⁽ʲ⁾]x̂ξ(ti) is a vector with [∂C⁽ᵏ'ᵐ⁾/∂ξ⁽ʲ⁾]x̂ξ⁽ᵐ⁾(ti) in the kth element and zeros elsewhere. If more than one element of C is affected by ξ⁽ʲ⁾, then the result is a sum of such terms. This approach directly forms [∂C/∂ξ⁽ʲ⁾]x̂ξ(ti), taking advantage of sparseness, instead of forming the full ∂C/∂ξ⁽ʲ⁾ matrix and using a general-purpose matrix multiply routine. The terms (∇ξD)u(ti), (∇ξΦ)x̂ξ(ti), and (∇ξΨ)u(ti) are all similar in form to (∇ξC)x̂ξ(ti). The initial condition ∇ξx₀ is a zero matrix if x₀ is known; otherwise it has a nonzero element for each unknown element of x₀.
We now know how to evaluate all of the terms in Equation (8.3-15). This is significantly faster than finite differences for some applications. The speed-up is most significant if Φ, Ψ, C, and D are functions of time requiring significant work to evaluate at each point; straightforward finite-difference methods would have to reevaluate these matrices for each perturbation.
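The recursive chain-rule propagation described above can be sketched for a small linear discrete-time system. In this illustrative example (an assumption, not from the text) the two unknown parameters are the (0,0) elements of Φ and Ψ, so C is known and its gradient terms vanish.

```python
# Recursive propagation of the response gradient for a linear discrete system:
# each parameter j contributes (dPhi/dxi_j) x and (dPsi/dxi_j) u to the
# state-gradient recursion; the output gradient is C times the state gradient.
import numpy as np

def analytic_gradient(Phi, Psi, C, dPhi, dPsi, x0, u):
    """dPhi[j], dPsi[j]: partial derivatives of Phi, Psi w.r.t. parameter j.
    Returns, for each time step, the matrix whose columns are dz/dxi_j."""
    p = len(dPhi)
    x = x0.copy()
    dx = [np.zeros_like(x0) for _ in range(p)]   # initial condition known
    out = []
    for ui in u:
        for j in range(p):                       # uses x at time t_i
            dx[j] = Phi @ dx[j] + dPhi[j] @ x + dPsi[j] @ ui
        x = Phi @ x + Psi @ ui                   # state at t_{i+1}
        out.append(np.stack([C @ dxj for dxj in dx], axis=1))
    return out

Phi = np.array([[0.9]]); Psi = np.array([[0.1]]); C = np.array([[1.0]])
dPhi = [np.array([[1.0]]), np.array([[0.0]])]    # d/d(phi00), d/d(psi00)
dPsi = [np.array([[0.0]]), np.array([[1.0]])]
grads = analytic_gradient(Phi, Psi, C, dPhi, dPsi, np.zeros(1), np.ones((3, 1)))
```

Note that the state-gradient update must use the state at time ti, before the state itself is advanced; the sparse products (∂Φ/∂ξ⁽ʲ⁾)x̂ are formed directly, as the text recommends.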

Gupta and Mehra (1974) discuss a method that is basically a modification of Equation (8.3-15) for computing ∇ξ ẑξ(ti). Depending on the number of inputs, states, outputs, and unknown parameters, this method can sometimes save computer time by reducing the length of the gradient vector needed for propagation in Equation (8.3-15b).

We now have everything needed to implement the basic Gauss-Newton minimization algorithm. Practical application will typically require some kind of start-up algorithm and methods for handling cases where the algorithm converges slowly or diverges. The Iliff-Maine code, MMLE3 (Maine and Iliff, 1980; and Maine, 1981), incorporates several such modifications. The line-search ideas (Foster, 1983) briefly discussed at the end of Section 2.5.2 also seem appropriate for handling convergence problems. We will not cover the details of such practical issues here. The discussions of singularities in Section 5.4.4 and of partitioning in Section 5.4.5 apply directly to the problem of this chapter, so we will not repeat them.

8.4 UNKNOWN G

The previous discussion in this chapter has assumed that the G-matrix is known. Equations (8.1-2) and (8.1-4) are derived based on this assumption. For unknown G, the methods of Section 5.5 apply directly. Equation (5.5-2) substitutes for Equation (8.1-4). In the terminology of this chapter, Equation (5.5-2) becomes

J(ξ) = ½ Σ_{i=1}^{N} [z(ti) - ẑξ(ti)]*[G(ξ)G(ξ)*]⁻¹[z(ti) - ẑξ(ti)] + (N/2) ln|G(ξ)G(ξ)*| plus a constant.   (8.4-1)

If G is known, this reduces to Equation (8.1-4).
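A minimal sketch of evaluating the unknown-G cost of Equation (8.4-1) for a fixed set of residuals; the residual values here are illustrative assumptions.

```python
# Evaluate the unknown-G cost: quadratic form plus the log-determinant term.
import numpy as np

def unknown_g_cost(residuals, G):
    """residuals: (N, m) array of z(t_i) - zhat_xi(t_i); G: (m, m) matrix.
    Returns 0.5 * sum of v' (GG*)^{-1} v + 0.5 * N * ln|GG*|."""
    GG = G @ G.T
    GGinv = np.linalg.inv(GG)
    N = residuals.shape[0]
    quad = 0.5 * sum(v @ GGinv @ v for v in residuals)
    return quad + 0.5 * N * np.log(np.linalg.det(GG))

# Illustrative residual history for a two-signal system.
v = np.array([[0.3, -0.1], [0.2, 0.4], [-0.5, 0.1]])
J = unknown_g_cost(v, np.eye(2))
```

For a usage check, a G chosen so that GG* equals the sample residual covariance (the revised estimate of Equation (8.4-2)) should never give a larger cost than an arbitrary G, since it is the minimizer over G for fixed residuals.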

As discussed in Section 5.5, the best approach to minimizing Equation (8.4-1) is to partition the parameter vector into a part ξG affecting G and a part ξf affecting ẑ. For each fixed ξG, the Gauss-Newton equations of Section 8.3 apply to revising the estimate of ξf. For each fixed ξf, the revised estimate of G is given by Equation (5.5-7), which becomes

Ĝ Ĝ* = (1/N) Σ_{i=1}^{N} [z(ti) - ẑξ(ti)][z(ti) - ẑξ(ti)]*   (8.4-2)

in the current notation. Section 5.5 describes the axial iteration method, which alternately applies the Gauss-Newton equations of Section 8.3 for ξf and Equation (8.4-2) for G.

The cost function for estimation with unknown G is often written in alternate forms. Although the above form is usually the most useful for computation, the following forms provide some insight into the relations of the estimators with unknown G versus those with fixed G. When G is completely unknown, the minimization of Equation (8.4-1) is equivalent to the minimization of

J(ξ) = |Σ_{i=1}^{N} [z(ti) - ẑξ(ti)][z(ti) - ẑξ(ti)]*|   (8.4-3)

which corresponds to Equation (5.5-9). Section 5.5 derives this equivalence by eliminating G. It is common to restrict G to be diagonal, in which case Equation (8.4-3) becomes

J(ξ) = Π_j Σ_{i=1}^{N} [z⁽ʲ⁾(ti) - ẑξ⁽ʲ⁾(ti)]²   (8.4-4)

This form is a product of the errors in the different signals, instead of the weighted sum-of-the-errors form of Equation (8.1-4).

8.5 CHARACTERISTICS

We have shown that the output-error estimator is a direct application of the estimators derived in Section 5.4 for nonlinear static systems. To describe the statistical characteristics of output-error estimates, we need only apply the corresponding Section 5.4 results to the particular form of output error.

In most cases, the corresponding static system is nonlinear, even for linear dynamic systems. Therefore, we must use the forms of Section 5.4 instead of the simpler forms of Section 5.1, which apply to linear static systems. In particular, the output-error MLE and MAP estimators are both biased for finite time. Asymptotically, they are unbiased and efficient. From Equation (5.4-11), the covariance of the MLE output-error estimate is approximated by

cov(ξ̂) ≈ { Σ_{i=1}^{N} [∇ξ ẑξ(ti)]*(GG*)⁻¹[∇ξ ẑξ(ti)] }⁻¹   (8.5-1)

From Equation (5.4-12), the corresponding approximation for the posterior distribution of ξ in an MAP estimator is

cov(ξ|ZN) ≈ { Σ_{i=1}^{N} [∇ξ ẑξ(ti)]*(GG*)⁻¹[∇ξ ẑξ(ti)] + P⁻¹ }⁻¹   (8.5-2)
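These covariance approximations can be sketched as follows; the same information matrix is the Gauss-Newton second-gradient approximation, so the pieces are already available at convergence. The per-sample response gradients (sensitivities) and the prior precision used here are illustrative values.

```python
# Information matrix from per-sample sensitivities, and the resulting
# covariance approximations: MLE (no prior term) and MAP (adds the prior
# precision P^{-1}).
import numpy as np

def information_matrix(sensitivities, GG):
    """sensitivities: list of (m, p) gradients of zhat at each time point."""
    GGinv = np.linalg.inv(GG)
    return sum(S.T @ GGinv @ S for S in sensitivities)

# Illustrative sensitivities for two time points, two outputs, two parameters.
S = [np.array([[1.0, 0.0], [0.0, 2.0]]),
     np.array([[1.0, 1.0], [0.0, 1.0]])]
M = information_matrix(S, np.eye(2))
cov_mle = np.linalg.inv(M)             # MLE covariance approximation
cov_map = np.linalg.inv(M + np.eye(2))  # MAP form with assumed P^{-1} = I
```

The MAP covariance is never larger than the MLE covariance, reflecting the extra information carried by the prior.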

CHAPTER 9

9.0 FILTER ERROR METHOD FOR DYNAMIC SYSTEMS

In this chapter, we consider the parameter estimation problem for dynamic systems with both process and measurement noise. We restrict the consideration to linear systems with additive Gaussian noise, because the exact analysis of more general systems is impractically complicated except in special cases like output error (no process noise). The easiest way to handle nonlinear systems with both measurement and process noise is usually to linearize the system and apply the linear results. This method does not give exact results for nonlinear systems, but can give adequate approximations in some cases.

In mixed continuous/discrete time, the linear system model is

ẋ(t) = Ax(t) + Bu(t) + Fn(t)   (9.0-1)
z(ti) = Cx(ti) + Du(ti) + Gη(ti)

The measurement noise η is assumed to be a sequence of independent Gaussian random variables with zero mean and identity covariance. The process noise n is a zero-mean, white-noise process, independent of the measurement noise, with identity spectral density. The initial condition x(t₀) is assumed to be a Gaussian random variable, independent of n and η, with mean x₀ and covariance P₀. As special cases, P₀ can be 0, implying that the initial condition is known exactly, or infinite, implying complete ignorance of the initial condition. The input u is assumed to be known exactly. As in the case of output error, the system matrices A, B, C, D, F, and G are functions of ξ and may be functions of time.

The corresponding pure discrete-time model is

x(t_{i+1}) = Φx(ti) + Ψu(ti) + Fn(ti)   (9.0-2)
z(ti) = Cx(ti) + Du(ti) + Gη(ti)

All of the same assumptions apply, except that n is a sequence of independent Gaussian random variables with zero mean and identity covariance.

9.1 DERIVATION

In order to obtain the maximum likelihood estimate of ξ, we need to choose ξ to maximize

L(ξ,ZN) = p(ZN|ξ)   (9.1-1)

where ZN denotes the set of measurements [z(t₁),...,z(tN)]. For the MAP estimate, we need to maximize p(ZN|ξ)p(ξ). In either event, the crucial first step is to find a tractable expression for p(ZN|ξ). We will discuss three ways of deriving this density function.

9.1.1 Static Derivation

The first means of deriving an expression for p(ZN|ξ) is to solve the system equations, reducing them to the static form of Equation (5.0-1). This technique, although simple in principle, does not give a tractable solution. We briefly outline the approach here in order to illustrate the principle, before considering the more fruitful approaches of the following sections.

For a pure discrete-time linear system described by Equation (9.0-2), the explicit static expression for z(ti) is

z(ti) = CΦⁱx(t₀) + Σ_{j=0}^{i-1} CΦ^{i-j-1}[Ψu(tj) + Fn(tj)] + Du(ti) + Gη(ti)   (9.1-2)

This is a nonlinear static model in the general form of Equation (5.5-1). However, the separation of ξ into ξG and ξf as described by Equation (5.5-4) does not apply. Note that Equation (9.1-2) is a nonlinear function of ξ, even if the matrices are linear functions. In fact, the order of nonlinearity increases with the number of time points. The use of estimators derived directly from Equation (9.1-2) is unacceptably difficult for all but the simplest special cases, and we will not pursue it further.

For mixed continuous/discrete-time systems, similar principles apply, except that the ω of Equation (5.0-1) must be generalized to allow vectors of infinite dimension. The process noise in a mixed continuous/discrete-time system is a function of time, and cannot be written as a finite-dimensional random vector. The material of Chapter 5 covered only finite-dimensional vectors. The Chapter 5 results generalize

nicely to infinite-dimensional vector spaces (function spaces), but we will not find that level of abstraction necessary. Application to pure continuous-time systems would require further generalization to allow infinite-dimensional observations.

9.1.2 Derivation by Recursive Factoring

We will now consider a derivation based on factoring p(ZN|ξ) by means of Bayes rule (Equation (3.3-12)). The derivation applies either to pure discrete-time or mixed continuous/discrete-time systems; the derivation is identical in both cases. For the first step, write

p(ZN|ξ) = p(z(tN)|Z_{N-1},ξ) p(Z_{N-1}|ξ)   (9.1-3)

Recursive application of this formula gives

p(ZN|ξ) = Π_{i=1}^{N} p(z(ti)|Z_{i-1},ξ)   (9.1-4)

For any particular ξ, the distribution of z(ti) given Z_{i-1} is known from the Chapter 7 results; it is Gaussian with mean

ẑξ(ti) = E{z(ti)|Z_{i-1},ξ} = E{Cx(ti) + Du(ti) + Gη(ti)|Z_{i-1},ξ} = Cx̃ξ(ti) + Du(ti)   (9.1-5)

where x̃ξ(ti) is the one-step-ahead predicted state from the Kalman filter, and covariance

Ri = cov[z(ti)|Z_{i-1},ξ] = CPiC* + GG*   (9.1-6)

where Pi is the predicted state covariance. Note that ẑξ(ti) and x̃ξ(ti) are functions of ξ because they are obtained from the Kalman filter based on a particular value of ξ; that is, they are conditioned on ξ. We use the ξ subscript notation to emphasize this dependence. Ri is also a function of ξ, although our notation does not explicitly indicate this. Substituting the appropriate Gaussian density functions characterized by Equations (9.1-5) and (9.1-6) into Equation (9.1-4) gives

L(ξ,ZN) = p(ZN|ξ) = Π_{i=1}^{N} |2πRi|^{-1/2} exp{-½[z(ti) - ẑξ(ti)]*Ri⁻¹[z(ti) - ẑξ(ti)]}   (9.1-7)

This is the desired expression for the likelihood functional.

9.1.3 Derivation Using the Innovation

Another derivation involves the properties of the innovation. This derivation also applies either to mixed continuous/discrete-time or to pure discrete-time systems.

We proved in Chapter 7 that the innovations are a sequence of independent, zero-mean Gaussian variables with covariances Ri given by Equation (7.2-33). This proof was done for the pure discrete-time case, but extends directly to mixed continuous/discrete-time systems. The Chapter 7 results assumed that the system matrices were known; thus the results are conditioned on ξ. The conditional probability density function of the innovations is therefore

p(VN|ξ) = Π_{i=1}^{N} |2πRi|^{-1/2} exp{-½ νi*Ri⁻¹νi}   (9.1-8)

where νi denotes the innovation at time ti.

We also showed in Chapter 7 that the innovations are an invertible linear function of the observations. Furthermore, it is easy to show that the determinant of the Jacobian of the transformation equals 1. (The Jacobian is triangular with 1's on the diagonal.) Thus by Equation (3.4-1), we can substitute

νi = z(ti) - ẑξ(ti)   (9.1-9)

into Equation (9.1-8) to give

p(ZN|ξ) = Π_{i=1}^{N} |2πRi|^{-1/2} exp{-½[z(ti) - ẑξ(ti)]*Ri⁻¹[z(ti) - ẑξ(ti)]}   (9.1-10)

which is identical to Equation (9.1-7). We see that the derivation by Bayes factoring and the derivation using the innovation give the same result.

9.1.4 Steady-State Form

For many applications, we can use the steady-state Kalman filter in the cost functional, resulting in major computational savings. This usage requires, of course, that the steady-state filter exist. We discussed the criteria for the existence of the steady-state filter in Chapter 7. The most important criterion is obviously that the system be time-invariant. The rest of this section assumes that a steady-state form exists.

When a steady-state form exists, two approaches can be taken to justifying its use. The first justification is that the steady-state form is a good approximation if the time interval is long enough. The time-varying filter gain converges to the steady-state gain with time constants at least as fast as those of the open-loop system, and sometimes significantly faster. Thus, if the maneuver analyzed is long compared to the system time constants, the filter gain would converge to the steady-state gain in a small portion of the maneuver time. We could verify this behavior by computing time-varying gains for representative values of ξ. If the filter gain does converge quickly to the steady-state gain, then the steady-state filter should give a good approximation to the cost functional.

The second possible justification for the use of the steady-state filter involves the choice of the initial state covariance P₀. The time-varying filter requires P₀ to be specified. It is a common practice to set P₀ to zero. This practice arises more from a lack of better ideas than from any real argument that zero is a good value. It is seldom that we know the initial state exactly as implied by the zero covariance.
One circumstance which would justify the zero initial covariance would be the case where the initial condition is included in the list of unknown parameters. In this case, the initial covariance is properly zero because the filter is conditioned on the values of the unknown parameters. Any prior information about the initial condition is then reflected in the prior distribution of ξ instead of in P₀. Unless one has a specific need for estimates of the initial condition, there are usually better approaches.

We suggest that the steady-state covariance is often a reasonable value for the initial covariance. In this case, the time-varying and steady-state filters are identical; arguments about the speed of convergence and the length of the data interval are not required. Since the time-varying form requires significantly more computation than the steady-state form, the steady-state form is preferable except where it is clearly and significantly inferior.

If the steady-state filter is used, Equation (9.1-7) becomes
L(ξ,ZN) = Π_{i=1}^{N} |2πR|^{-1/2} exp{-½[z(ti) - ẑξ(ti)]*R⁻¹[z(ti) - ẑξ(ti)]}   (9.1-11)

where R is the steady-state covariance of the innovation. In general, R is a function of ξ. The ẑξ(ti) in Equation (9.1-11) comes from the steady-state filter, unlike the ẑξ(ti) in Equation (9.1-7). We use the same notation for both quantities, distinguishing them by context. (The ẑξ(ti) from the steady-state filter is always associated with the steady-state covariance R, whereas the ẑξ(ti) from the time-varying filter is associated with the time-varying covariance Ri.)

9.1.5 Cost Function Discussion

The maximum-likelihood estimate of ξ is obtained by maximizing Equation (9.1-11) (or Equation (9.1-7) if the steady-state form is inappropriate) with respect to ξ. Because of the exponential in Equation (9.1-11), it is more convenient to work with the logarithm of the likelihood functional, called the log likelihood functional for short. The log likelihood functional is maximized by the same value of ξ that maximizes the likelihood functional because the logarithm is a monotonic increasing function. By convention, most optimization theory is written in terms of minimization instead of maximization. We therefore define the negative of the log likelihood functional to be a cost functional which is to be minimized. We also omit the ln(2π) term from the cost functional, because it does not affect the minimization. The most convenient expression for the cost functional is then

J(ξ) = ½ Σ_{i=1}^{N} [z(ti) - ẑξ(ti)]*R⁻¹[z(ti) - ẑξ(ti)] + (N/2) ln|R|   (9.1-12)

If R is known, then Equation (9.1-12) is in a least-squares form. This is sometimes called a prediction-error form because the quantity being minimized is the square of the one-step-ahead prediction error z(ti) - ẑξ(ti). The term "filter error" is also used because the quantity minimized is obtained from the Kalman filter.
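A minimal sketch of accumulating this one-step-ahead prediction-error cost with a steady-state filter gain K; the system matrices, gain, innovation covariance, and data here are illustrative assumptions (not the MMLE3 implementation), and the feedthrough matrix D is taken as zero for brevity.

```python
# Filter-error cost with a fixed (steady-state) Kalman gain: predict, form
# the innovation, accumulate the quadratic plus log-determinant terms, and
# apply the measurement update with the steady gain.
import numpy as np

def filter_error_cost(Phi, Psi, C, K, R, x0, u, z):
    """Accumulate 0.5 * nu' R^{-1} nu + 0.5 * ln|R| over the data record."""
    Rinv = np.linalg.inv(R)
    logdetR = np.log(np.linalg.det(R))
    x = x0.copy()
    J = 0.0
    for ui, zi in zip(u, z):
        x = Phi @ x + Psi @ ui          # one-step prediction
        nu = zi - C @ x                 # innovation (prediction error)
        J += 0.5 * nu @ Rinv @ nu + 0.5 * logdetR
        x = x + K @ nu                  # update with steady-state gain
    return J

Phi = np.array([[0.9]]); Psi = np.array([[0.1]]); C = np.array([[1.0]])
K = np.array([[0.5]]); R = np.array([[0.04]])
u = np.ones((4, 1))
z = np.array([[0.12], [0.20], [0.26], [0.33]])
J = filter_error_cost(Phi, Psi, C, K, R, np.zeros(1), u, z)
```

With zero process noise the gain K would be zero and the loop reduces to an open-loop integration, recovering the output-error special case noted below.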

Note that this form of the likelihood functional involves the Kalman filter, not a smoother. There is sometimes a temptation to replace the filter in this cost function by a smoother, assuming that this will give improved results. The smoother gives better state estimates than the filter, but the problem considered in this chapter is not state estimation. The state estimates are an incidental side-product of the algorithm for estimating the parameter vector ξ. There are ways of deriving and writing the parameter estimation problem which involve smoothers (Cox and Bryson, 1980), but the direct use of a smoother in Equation (9.1-12) is simply incorrect.

For MAP estimates, we modify the cost functional by adding the negative of the logarithm of the prior probability density of ξ. If the prior distribution of ξ is Gaussian with mean mξ and covariance P, the cost functional of Equation (9.1-12) becomes (ignoring constant terms)

J(ξ) = ½ Σ_{i=1}^{N} [z(ti) - ẑξ(ti)]*R⁻¹[z(ti) - ẑξ(ti)] + (N/2) ln|R| + ½(ξ - mξ)*P⁻¹(ξ - mξ)   (9.1-13)

The filter-error forms of Equations (9.1-12) and (9.1-13) are parallel to the output-error forms of Equations (8.1-4) and (8.1-2). When there is no process noise, the steady-state Kalman filter becomes an integration of the system equations, and the innovation covariance R equals the measurement noise covariance GG*. Thus the output-error equations of the previous chapter are special cases of the filter-error equations with zero process noise.

9.2 COMPUTATION

The best methods for minimizing Equation (9.1-12) or (9.1-13) are based on the Gauss-Newton algorithm. Because these equations are so similar in form to the output-error equations of Chapter 8, most of the Chapter 8 material on computation applies directly or with only minor modification. The primary differences between computational methods for filter error and those for output error center on the treatment of the noise covariances, particularly when the covariances are unknown. Maine and Iliff (1981a) discuss the implementation details of the filter-error algorithm. The Iliff-Maine code, MMLE3 (Maine and Iliff, 1980; and Maine, 1981), implements the filter-error algorithm for linear continuous/discrete-time systems.

We generally presume the use of the steady-state filter in the filter-error algorithm. Implementation is significantly more complicated using the time-varying filter.

9.3 FORMULATION AS A FILTERING PROBLEM

An alternative to the direct approach of the previous section is to recast the parameter estimation problem into the form of a filtering problem. The techniques of Chapter 7 then apply.

Suppose we start with the system model

ẋ(t) = A(ξ)x(t) + B(ξ)u(t) + F(ξ)n(t)   (9.3-1a)
z(ti) = C(ξ)x(ti) + D(ξ)u(ti) + G(ξ)η(ti)   (9.3-1b)

This is the same as Equation (9.0-1), except that here we explicitly indicate the dependence of the matrices on ξ. The problem is to estimate ξ. In order to apply state estimation techniques to this problem, ξ must be part of the state vector. Therefore, we define an augmented state vector

xa = [x]
     [ξ]   (9.3-2)

We can combine Equation (9.3-1) with the trivial differential equation

ξ̇ = 0   (9.3-3)

to write a system equation with xa as the state vector. Note that the resulting system is nonlinear in xa (because it has products of ξ and x), even though Equation (9.3-1) is linear in x. In principle, we can apply the extended Kalman filter, discussed in Section 7.7, to the problem of estimating xa.

Unfortunately, the nonlinearity in the augmented system is crucial to the system behavior. The adequacy of the extended Kalman filter for this problem has seldom been analyzed in detail. Schweppe (1973, p. 433) says on this subject:

...the system identification problem has been transformed into a problem which has already been discussed extensively. The discussions are not terminated at this point for the simple reason that Part IV did not provide any "best" one way to solve a nonlinear state estimation problem. A major conclusion of Part IV was that the best way to proceed depends heavily on the explicit nature of the problem. System identification leads to special types of nonlinear estimation problems, so specialized discussions are needed. ...the state augmentation approach is not emphasized, as the author feels that it is much more appropriate to approach the system identification problem directly. However, there are special cases where state augmentation works very well.
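The state-augmentation construction can be sketched for a scalar example; the model ẋ = -ξx + u used here is an illustrative assumption, chosen to make the nonlinearity of the augmented dynamics visible.

```python
# Augmented-state dynamics: append the unknown parameter xi to the state with
# trivial dynamics xi_dot = 0. The augmented system is nonlinear because of
# the product xi * x, even though the original model is linear in x alone.
import numpy as np

def augmented_dynamics(xa, u):
    """xa = [x, xi]; original model: x_dot = -xi * x + u."""
    x, xi = xa
    return np.array([-xi * x + u,    # depends on the product xi * x
                     0.0])           # xi_dot = 0

xa = np.array([2.0, 0.5])
rate = augmented_dynamics(xa, 1.0)
```

Doubling the augmented state does not double the rate vector, which is exactly the failure of superposition that makes an extended (rather than linear) Kalman filter necessary here.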

CHAPTER 10

10.0 EQUATION ERROR METHOD FOR DYNAMIC SYSTEMS

This chapter discusses the equation error approach to parameter estimation for dynamic systems. We will first define a restricted form of equation error, parallel to the treatments of output error and filter error in the previous chapters. This form of equation error is a special case of filter error where there is process noise, but no measurement noise. It therefore stands in counterpoint to output error, which is the special case where there is measurement noise, but no process noise. We will then extend the definition of equation error to a more general form. Some of the practical applications of equation error do not fit precisely into the overly restrictive form based on process noise only. In its most general forms, the term equation error encompasses output error and filter error, in addition to the forms most commonly associated with the term. The primary distinguishing feature of the methods emphasized in this chapter is their computational simplicity.

10.1 PROCESS-NOISE APPROACH

In this section, we consider equation error in a manner parallel to the previous treatments of output error and filter error. The filter-error method treats systems with both process noise and measurement noise, and output error treats the special case of systems with measurement noise only. Equation error completes this triad of algorithms by treating the special case of systems with process noise only.

The equation-error method applies to nonlinear systems with additive Gaussian process noise. We will restrict the discussion of this section to pure discrete-time models, for which the derivation is straightforward. Mixed continuous/discrete-time models can be handled by converting them to equivalent pure discrete-time models. Equation error does not strictly apply to pure continuous-time models. (The problem becomes ill-posed.) The general form of the nonlinear, discrete-time system model we will consider is

    x(t_{i+1}) = f[x(t_i),u(t_i),ξ] + F n_i                        (10.1-1)

    z(t_i) = g[x(t_i),u(t_i),ξ]                                    (10.1-2)

The process noise, n, is a sequence of independent Gaussian random variables with zero mean and identity covariance. The matrix F can be a function of ξ, although the simplified notation ignores this possibility. It will prove convenient to assume that the measurements z(t_i) are defined for i = 0,...,N; previous chapters have defined them only for i = 1,...,N.
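As a concrete illustration of this model class, the sketch below simulates a scalar discrete-time system x(t_{i+1}) = f[x(t_i),u(t_i),ξ] + F n_i with a noise-free, invertible observation z(t_i) = g[x(t_i),u(t_i),ξ]. The particular f, g, parameter value, and input sequence are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Illustrative scalar instance of the discrete-time model with additive
# Gaussian process noise and a noise-free observation.
rng = np.random.default_rng(0)

xi = 0.8          # unknown parameter (true value, used here only to simulate)
F = 0.1           # process-noise distribution matrix (scalar here)
N = 50

def f(x, u, xi):
    return xi * x + u      # state equation, linear in x, u, and xi

def g(x, u, xi):
    return x               # invertible (identity) observation, no noise

u = np.sin(0.3 * np.arange(N + 1))
x = np.zeros(N + 1)
z = np.zeros(N + 1)
z[0] = g(x[0], u[0], xi)
for i in range(N):
    n_i = rng.standard_normal()          # zero mean, identity covariance
    x[i + 1] = f(x[i], u[i], xi) + F * n_i
    z[i + 1] = g(x[i + 1], u[i + 1], xi)
```

Because g is invertible and the measurements are noise-free, the state history is recovered exactly from the measurements, which is the property the equation-error derivation exploits.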


The following derivation of the equation-error method closely parallels the derivation of the filter-error method in Section 9.1.3. Both are based primarily on application of the transformation-of-variables formula, Equation (3.4-1), starting from a process known to be a sequence of independent Gaussian variables. By assumption, the probability density function of the process noise is

    p(n_N) = ∏ from i=0 to N-1 of |2πI|^{-1/2} exp{-(1/2) n_i* n_i}        (10.1-3)

where n_N is the concatenation of the n_i. We further assume that F is invertible for all permissible values of ξ; this assumption is necessary to ensure that the problem is well-posed. We define X_N to be the concatenation of the x(t_i). Then, for each value of ξ, X_N is an invertible linear function of n_N. The inverse function is

    n_i = F^{-1}[x(t_{i+1}) - x̂_ξ(t_{i+1})]        i = 0,...,N-1

where, for convenience and for consistency with the notation of previous chapters, we have defined

    x̂_ξ(t_{i+1}) = f[x(t_i),u(t_i),ξ]                        (10.1-4)

The determinant of the Jacobian of the inverse transformation is |F^{-1}|^N, because the inverse transformation matrix is block-triangular with F^{-1} in the diagonal blocks. Direct application of the transformation-of-variables formula, Equation (3.4-1), gives

    p(X_N|ξ) = ∏ from i=1 to N of |2πFF*|^{-1/2} exp{-(1/2)[x(t_i) - x̂_ξ(t_i)]*(FF*)^{-1}[x(t_i) - x̂_ξ(t_i)]}        (10.1-5)

In order to derive a simple expression for p(Z_N|ξ), we require that g be a continuous, invertible function of x for each value of ξ. The invertibility is critical to the simplicity of the equation-error

algorithm. This assumption, combined with the lack of measurement noise, means that we can reconstruct the state vector perfectly, provided that we know ξ. The inverse function gives this reconstruction:

    x̃_ξ(t_i) = g^{-1}[z(t_i),u(t_i),ξ]                        (10.1-6)

where g^{-1} denotes the inverse of g with respect to its first argument.

If g is not invertible, a recursive state estimator becomes imbedded in the algorithm and we are again faced with something as complicated as the filter-error algorithm. For invertible g, the transformation-of-variables formula, Equation (3.4-1), gives

    p(Z_N|ξ) = ∏ from i=1 to N of |∇_z g^{-1}[z(t_i),u(t_i),ξ]| |2πFF*|^{-1/2} exp{-(1/2)[x̃_ξ(t_i) - x̂_ξ(t_i)]*(FF*)^{-1}[x̃_ξ(t_i) - x̂_ξ(t_i)]}        (10.1-7)

where x̃_ξ(t_i) is given by Equation (10.1-6), and

    x̂_ξ(t_i) = f[x̃_ξ(t_{i-1}),u(t_{i-1}),ξ]                        (10.1-8)

Most practical applications of equation error separate the problems of state reconstruction and parameter estimation. In the context defined above, this is possible when g is not a function of ξ. Then Equation (10.1-6) is also independent of ξ; thus, we can reconstruct the state exactly without knowledge of ξ. Furthermore, the estimates of ξ depend only on the reconstructed state vector and the control vector. There is no direct dependence on the actual measurements z(t_i) or on the exact form of the g-function. This is evident in Equation (10.1-7) because the Jacobian of g^{-1} is independent of ξ and, therefore, irrelevant to the parameter-estimation problem. In many practical applications, the state reconstruction is more complicated than a simple pointwise function as in Equation (10.1-6), but as long as the state reconstruction does not depend on ξ, the details do not matter to the parameter-estimation process.

You will seldom (if ever) see Equation (10.1-7) elsewhere in the form shown here, which includes the factor for the Jacobian of g^{-1}. The usual derivation ignores the measurement equation and starts from the assumption that the state is known exactly, whether by direct measurement or by some reconstruction. We have included the measurement equation only in order to emphasize the parallels between equation error, output error, and filter error.

For the rest of this section, we will assume that g is independent of ξ. We will specifically assume that the determinant of the Jacobian of g is 1 (the actual value being irrelevant to the estimator anyway), so that we can write Equation (10.1-7) in a more conventional form as

    p(Z_N|ξ) = ∏ from i=1 to N of |2πFF*|^{-1/2} exp{-(1/2)[x̃(t_i) - x̂_ξ(t_i)]*(FF*)^{-1}[x̃(t_i) - x̂_ξ(t_i)]}        (10.1-9)

where

    x̃(t_i) = g^{-1}[z(t_i),u(t_i)]

You can derive slight generalizations, useful in some cases, from Equation (10.1-7). The maximum-likelihood estimate of ξ is the value that maximizes Equation (10.1-9). As in previous chapters, it is convenient to work in terms of minimizing the negative-log-likelihood functional

    J(ξ) = (1/2) ∑ from i=1 to N of [x̃(t_i) - x̂_ξ(t_i)]*(FF*)^{-1}[x̃(t_i) - x̂_ξ(t_i)] + (N/2) ln|2πFF*|        (10.1-10)

If ξ has a Gaussian prior distribution with mean m_ξ and covariance P, then the MAP estimate minimizes

    J(ξ) = (1/2) ∑ from i=1 to N of [x̃(t_i) - x̂_ξ(t_i)]*(FF*)^{-1}[x̃(t_i) - x̂_ξ(t_i)] + (N/2) ln|2πFF*| + (1/2)(ξ - m_ξ)*P^{-1}(ξ - m_ξ)
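As a numerical sketch of this minimization, the code below evaluates the equation-error negative log likelihood for a scalar model x(t_{i+1}) = ξ x(t_i) + u(t_i) + F n_i, with states reconstructed exactly from noise-free measurements. The model and all numerical values are illustrative assumptions; the cost should be smaller near the true parameter than far from it.

```python
import numpy as np

def neg_log_like(xi, x, u, F):
    # one-step predictions xhat_xi(t_i) = f[x(t_{i-1}), u(t_{i-1}), xi]
    xhat = xi * x[:-1] + u[:-1]
    r = x[1:] - xhat                 # equation-error residuals
    N = len(r)
    FF = F * F                       # FF* (scalar case)
    return 0.5 * np.sum(r * r) / FF + 0.5 * N * np.log(2 * np.pi * FF)

# Simulate illustrative data with a known true parameter.
rng = np.random.default_rng(1)
xi_true, F = 0.8, 0.1
u = np.sin(0.3 * np.arange(61))
x = np.zeros(61)
for i in range(60):
    x[i + 1] = xi_true * x[i] + u[i] + F * rng.standard_normal()

J_true = neg_log_like(0.8, x, u, F)   # cost near the true parameter
J_far = neg_log_like(0.2, x, u, F)    # cost far from the true parameter
```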

10.1.2 Special Case of Filter Error

For linear systems, we can also derive state-equation error by plugging into the linear filter-error algorithm derived in Chapter 9. Assume that G is 0; FF* is invertible; C is square, invertible, and known exactly; and D is known exactly. These are the assumptions that mean we have perfect measurements of the state of the system. The Kalman filter for this case is (repeating Equation (7.3-11))

    x̂(t_i) = C^{-1}[z(t_i) - Du(t_i)]

and the covariance, P_i, of this filtered estimate is 0.

The one-step-ahead prediction is

The use of this form in an equation-error method presumes that the state x(t_i) can be reconstructed as a function of the z(t_i) and u(t_i). This presumption is identical to that for discrete-time state-equation error, and it implies the same conditions: there must be noise-free measurements of the state, independent of ξ. It is implicit that a known invertible transformation of such measurements is statistically equivalent. As in the discrete-time case, we can define the estimator even when the measurements are noisy, but it will no longer be a maximum-likelihood estimator.

Equation (10.2-7) also presumes that the derivative ẋ(t) can be reconstructed from the measurements. Neglecting for the moment the statistical implications, note that we can form a plausible equation-error estimator using any reasonable means of approximating a value for ẋ(t_i) independently of ξ. The simplest case of this is when the observation vector includes measurements of the state derivatives in addition to the measurements of the states. If such derivative measurements are not directly available, we can always approximate ẋ(t_i) by finite-difference differentiation of the state measurements, as in

    ẋ(t_i) ≈ [x(t_{i+1}) - x(t_{i-1})]/(t_{i+1} - t_{i-1})                (10.2-8)

Both direct measurement and finite-difference approximation are used in practice.

Rigorous statistical treatment is easiest for the case of finite-difference approximations. To arrive at such a form, we write the state equation in integrated form as

    x(t_{i+1}) = x(t_i) + ∫ from t_i to t_{i+1} of f[x(t),u(t),ξ] dt + ∫ from t_i to t_{i+1} of F dn(t)        (10.2-9)

An approximate solution (not necessarily the best approximation) to Equation (10.2-9) is

    x(t_{i+1}) ≈ x(t_i) + (t_{i+1} - t_i) f[x(t_i),u(t_i),ξ] + F_d n_i        (10.2-10)

where n_i is a sequence of independent Gaussian variables, and F_d is the equivalent discrete F-matrix. Sections 6.2 and 7.5 discuss such approximations. Equation (10.2-10) is in the form of a discrete-time state equation. The discrete-time state-equation error method based on this equation uses

    h[z(.),u(.),t_i,ξ] = x(t_{i+1}) - x(t_i) - (t_{i+1} - t_i) f[x(t_i),u(t_i),ξ]        (10.2-11)

Redefining h by dividing by t_{i+1} - t_i gives the form

    h[z(.),u(.),t_i,ξ] = ẋ(t_i) - f[x(t_i),u(t_i),ξ]                (10.2-12)

where the derivative is obtained from the finite-difference formula

    ẋ(t_i) = [x(t_{i+1}) - x(t_i)]/(t_{i+1} - t_i)                (10.2-13)

Other discrete-time approximations of Equation (10.2-9) result in different finite-difference formulae. The central-difference form of Equation (10.2-8) is usually better than the one-sided form of Equation (10.2-13), although Equation (10.2-8) has a lower bandwidth. If the bandwidth of Equation (10.2-8) presents problems, a better approach than Equation (10.2-13) is to use

    h[z(.),u(.),t_i,ξ] = ẋ(t_{i+1/2}) - f[x(t_{i+1/2}),u(t_{i+1/2}),ξ]        (10.2-14)

where we have used the notation

    t_{i+1/2} = (1/2)[t_i + t_{i+1}]
    x(t_{i+1/2}) = (1/2)[x(t_i) + x(t_{i+1})]
    u(t_{i+1/2}) = (1/2)[u(t_i) + u(t_{i+1})]

and

    ẋ(t_{i+1/2}) = [x(t_{i+1}) - x(t_i)]/(t_{i+1} - t_i)
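The finite-difference reconstructions discussed above (one-sided, central, and midpoint forms) can be sketched numerically. The sampled signal below is an illustrative stand-in for state measurements with a known derivative.

```python
import numpy as np

t = np.linspace(0.0, 2.0, 201)           # uniform sampling, dt = 0.01
x = np.sin(t)                            # "state measurements"; true xdot = cos(t)
dt = t[1] - t[0]

# one-sided (forward) difference, as in the one-sided formula
xdot_fwd = (x[1:] - x[:-1]) / dt                     # estimate at t_i

# central difference over two intervals
xdot_ctr = (x[2:] - x[:-2]) / (t[2:] - t[:-2])       # estimate at interior t_i

# midpoint interpretation: the same forward difference, read at t_{i+1/2}
t_mid = 0.5 * (t[:-1] + t[1:])

err_fwd = np.max(np.abs(xdot_fwd - np.cos(t[:-1])))    # O(dt) error
err_ctr = np.max(np.abs(xdot_ctr - np.cos(t[1:-1])))   # O(dt^2) error
err_mid = np.max(np.abs(xdot_fwd - np.cos(t_mid)))     # O(dt^2) at midpoints
```

The central and midpoint forms are second-order accurate, which is the sense in which they are "usually better" than the one-sided form.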
There are several other reasonable finite-difference formulae applicable to this problem.

Rigorous statistical treatment of the case in which direct state derivative measurements are available raises several complications. Furthermore, it is difficult to get a rigorous result in the form typically used, an equation-error method based on ẋ measurements substituted into Equation (10.2-7). It is probably best to regard this approach as an equation-error estimator derived from plausible, but ad hoc, reasoning.
We will briefly outline the statistical issues raised by state derivative measurements, without attempting a complete analysis. The first problem is that, for systems with white process noise, the state derivative is infinite at every point in time. (Careful argument is required even to define the derivative.) We could avoid this problem by requiring the process noise to be band-limited, or by other means, but the resulting estimator

will not be in the desired form. A heuristic explanation is that the x measurements contain implicit information about the derivative (from the finite differences), and simple use of the measured derivative ignores this information. A rigorous maximum-likelihood estimator would use both sources of information. This statement assumes that the ẋ measurements and the finite-difference derivatives are independent data. It is conceivable that the x "measurements" are obtained as sums of the ẋ measurements (for instance, in an inertial navigation unit). Such cases are merely integrated versions of the finite-difference approach, not really comparable to cases of independent ẋ measurements.

The lack of a rigorous derivation for the state-equation error method with independently measured state derivatives does not necessarily mean that it is a poor estimator. If the information in the state derivative measurements is much better than the information in the finite-difference state derivatives, we can justify the approach as a good approximation. Furthermore, as expressed in our discussions in Section 1.4, an estimator does not have to be statistically derived to be a good estimator. For some problems, this estimator gives adequate results with low computational costs; when this result occurs, it is sufficient justification in itself.

Another specific case of the equation-error method is observation-equation error. In this case, the specific form of h comes from the observation equation, ignoring the noise. The equation is the same for pure discrete-time or mixed continuous/discrete-time systems. The observation equation for a system with additive noise is

    z(t_i) = g[x(t_i),u(t_i),ξ] + G η_i

The h function based on this equation is

    h[z(.),u(.),t_i,ξ] = z(t_i) - g[x(t_i),u(t_i),ξ]

As in the case of state-equation error, observation-equation error requires measurements or reconstructions of the state, because x(t_i) appears in the equation. The comments in Section 10.2.1 about noise in the state measurement apply here also. Observation-equation error does not require measurements of the state derivative.

The observation-equation error method also requires that there be some measurements in addition to the states, or the method reduces to triviality. If the states were the only measurements, the observation equation would reduce to

    z(t_i) = x(t_i)

which has no unknown parameters. There would, therefore, be nothing to estimate.

The observation-equation error method applies only to estimating parameters in the observation equation. Unknown parameters in the state equation do not enter this formulation. In fact, the existence of the state equation is largely irrelevant to the method. This irrelevance perhaps explains why observation-equation error is usually neglected in discussions of estimators for dynamic systems. The method is essentially a direct application of the static estimators of Chapter 5, taking no advantage of the dynamics of the system (the state equation). From a theoretical viewpoint, it may seem out of place in this chapter. In practice, the observation-equation-error method is widely used, sometimes contorted to look like a state-equation-error method. The observation-equation-error method is often a competitor to an output-error method. Our treatment of observation-equation error is intended to facilitate a fair evaluation of such choices and to avoid unnecessary contortions into state-equation error forms.
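As a sketch of this point, observation-equation error reduces to a static least-squares fit of the observation parameters, using measured states and inputs as regressors and ignoring the state equation entirely. The linear g and all numerical values below are illustrative assumptions.

```python
import numpy as np

# Observation model z = theta_1*x + theta_2*u + noise, with x measured directly.
rng = np.random.default_rng(2)
N = 200
x = rng.standard_normal(N)              # measured states
u = rng.standard_normal(N)              # measured inputs
theta_true = np.array([1.5, -0.7])
z = theta_true[0] * x + theta_true[1] * u + 0.01 * rng.standard_normal(N)

# Static least-squares fit of the observation-equation parameters.
A = np.column_stack([x, u])             # regressors of the observation equation
theta_hat, *_ = np.linalg.lstsq(A, z, rcond=None)
```

Nothing about the system dynamics enters the fit, which is why the method is essentially a static estimator.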

We have previously mentioned that a unifying characteristic of the methods discussed in this chapter is their computational simplicity. We have not, however, given much detail on the computational issues. Equation (10.2-3), which encompasses all equation-error forms, is in the form of Equation (2.5-1) if the weighting matrix W is known. Therefore, the Gauss-Newton optimization algorithm applies directly. Unknown W matrices can be handled by the method discussed in Sections 5.5 and 8.4.

In the most general definition of equation error, this is nearly the limit of what we can state about computation. The definition of Equation (10.2-3) is general enough to allow output error and filter error as special cases. Both output error and filter error have the special property that the dependence of h on z and u can be cast in a recursive form, significantly lowering the computational costs. Because of this recursive form, the total computational cost is roughly proportional to the number of time points, N. The general definition of equation error also encompasses nonrecursive forms, which could have computational costs proportional to N² or higher powers.

The equation-error methods discussed in this chapter have the property that, for each t_i, the dependence of h on z(.) and u(.) is restricted to one or two time points. Therefore, the computational effort for each evaluation of h is independent of N, and the total computational cost is roughly proportional to N. In this regard, state-equation error and output-equation error are comparable to output error and filter error. For a completely general, nonlinear system, the computational cost of state-equation error or output-equation

error is roughly similar to the cost of output error. (General nonlinear models are currently impractical for filter error without using linearized approximations.)

In the large majority of practical applications, however, the f and g functions have special properties which make the computational costs of state-equation error and output-equation error far smaller than the computational costs of output error or filter error.

The first property is that the f and g functions are linear in ξ. This property holds true even for systems described as nonlinear; the nonlinearity meant by the term "nonlinear system" is as a function of x and u, not as a function of ξ. Equation (1.3-2) is a simple example of a static system nonlinear in the input, but linear in the parameters. The output-error method can seldom take advantage of linearity in the parameters, even when the system is also linear in x and u, because the system response is usually a nonlinear function of ξ. (There are some significant exceptions in special cases.) State-equation error and output-equation error methods, in contrast, can take excellent advantage of linearity in the parameters, even when the system is nonlinear in x and u. In this situation, state-equation error and output-equation error meet the conditions of Section 2.5.1 for the Gauss-Newton algorithm to attain the exact minimum in a single iteration. This is both a quantitative and a qualitative computational improvement relative to output error. The quantitative improvement is a division of the computational cost by the number of iterations required for the output-error method. The qualitative improvement is the elimination of the issues associated with iterative methods: starting values, convergence-testing criteria,
failure to converge, convergence accelerators, multiple local solutions, and other issues. The most commonly cited of these benefits is that there is no need for reasonable starting values. You can evaluate the equations at any arbitrary point (zero is often convenient) without affecting the result.

Another simplifying property of f and g, not quite as universal, but true in the majority of cases, is that each element of ξ affects only one element of f or g. The simplest example of this is a linear system where the unknown parameters are individual elements of the system matrices. With this structure, if we constrain W to be diagonal, Equation (10.2-3) separates into a sum of independent minimization problems with scalar h, one problem for each element of h. If ℓ is the number of elements of the h-vector, we now have ℓ independent functions in the form of Equation (10.2-3), each with scalar h. Each element of ξ affects one and only one of these scalar functions.

This partitioning has the obvious benefit, common to most partitioning algorithms, that the sum of the ℓ problems with scalar h requires less computation than the unpartitioned vector problem. The outer-product computation of Equation (2.5-11), often the most time-consuming part of the algorithm, is proportional to the square of the number of unknowns and to ℓ. Therefore, if the unknowns are evenly distributed among the ℓ elements of h, the computational cost of the vector problem could be as much as ℓ³ times the cost of each of the scalar problems. Other portions of the computational cost and overhead will reduce this factor somewhat, but the improvement is still dramatic.
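The single-iteration property can be sketched for an illustrative linear system: the state-equation error residuals are linear in the elements of A and B in x(t_{i+1}) = A x(t_i) + B u(t_i), so one linear least-squares solve (a single Gauss-Newton step, from any starting point) attains the exact minimum, with no starting values needed. All matrices and values below are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
A_true = np.array([[0.9, 0.1],
                   [-0.2, 0.8]])
B_true = np.array([[1.0],
                   [0.5]])
N = 100
u = rng.standard_normal((N, 1))
x = np.zeros((N + 1, 2))
for i in range(N):
    # simulate with small process noise
    x[i + 1] = A_true @ x[i] + B_true @ u[i] + 0.01 * rng.standard_normal(2)

# Stack the regressors [x(t_i), u(t_i)] and solve
#   x(t_{i+1}) = [A B] [x(t_i); u(t_i)]
# for all unknowns at once by linear least squares.
X = np.hstack([x[:-1], u])                           # N x 3 regressor matrix
Theta, *_ = np.linalg.lstsq(X, x[1:], rcond=None)    # 3 x 2 solution, [A B]^T
A_hat, B_hat = Theta[:2].T, Theta[2:].T
```

Note that each row of [A B] could equally be fit on its own, since each affects only one element of the residual.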
Another benefit of the partitioning is that it allows us to avoid iteration when the noise covariances are unknown. With this partitioning, the minimizing values of ξ are independent of W. The normal role of W is in weighing the importance of fitting the different elements of h. One value of ξ might fit one element of h best, while another value of ξ fits another element of h best; W establishes how to strike a compromise among these conflicting aims. Since the partitioned problem structure makes the different elements of h independent, W is largely irrelevant. Therefore we can estimate the elements of ξ using any arbitrary value of W (usually an identity matrix). If we want an estimate of W, we can compute it after we estimate the other unknowns.

The combined effect of these computational improvements is to make the computational cost of the state-equation error and output-equation error methods negligible in many applications. It is common for the computational cost of the actual equation-error algorithm to be dwarfed by the overhead costs of obtaining the data, plotting the results, and related computations.
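The W-independence claimed above can be sketched directly: for a partitioned problem in which disjoint parameters affect disjoint residual elements, the weighted least-squares estimate does not change when the diagonal weights change. The block-structured matrices below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
X1 = rng.standard_normal((50, 2))     # regressors for the first element of h
X2 = rng.standard_normal((50, 2))     # regressors for the second element of h
y1 = X1 @ np.array([1.0, -2.0]) + 0.05 * rng.standard_normal(50)
y2 = X2 @ np.array([0.3, 0.7]) + 0.05 * rng.standard_normal(50)

# Stack into one vector problem; disjoint parameters give a block-diagonal X.
X = np.block([[X1, np.zeros((50, 2))],
              [np.zeros((50, 2)), X2]])
y = np.concatenate([y1, y2])

def weighted_ls(X, y, w):
    # solve the weighted normal equations (X' W X) xi = X' W y, W = diag(w)
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y)

w_a = np.ones(100)                                     # W = I
w_b = np.concatenate([np.ones(50), 10 * np.ones(50)])  # unequal block weights
xi_a = weighted_ls(X, y, w_a)
xi_b = weighted_ls(X, y, w_b)
```

Because the normal equations decouple block by block, the per-block weights cancel and the two estimates agree to numerical precision.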

10.4 DISCUSSION

The undebated strong points of the state-equation-error and output-equation-error methods are their simplicity and low computational cost. Most important is that Gauss-Newton gives the exact minimum of the cost function without iteration. Because the methods are noniterative, they require no starting estimates. These methods have been used in many applications, sometimes under different names.

The weaknesses of these methods stem from their assumptions of perfect state measurements. Relatively small amounts of noise in the measurements can cause significant bias errors in the estimates. If a measurement of some state is unavailable, or if an instrument fails, these methods are not directly applicable (though such problems are sometimes handled by state reconstruction algorithms).

State-equation-error and output-equation-error methods can be used with either of two distinct approaches, depending upon the application. The first approach is to accept the problem of measurement-noise sensitivity and to emphasize the computational efficiency of the method. This approach is appropriate when computational cost is a more important consideration than accuracy. For example, state-equation error and output-equation error methods are popular for obtaining starting values for iterative procedures such as output error. In such applications, the estimates need only be accurate enough to cause the iterative methods to converge (presumably to better estimates). Another common use for state-equation error and output-equation error is to select a model from a large number of candidates by estimating the parameters in each candidate model. Once the model form is selected, the rough parameter estimates can be refined by some other method.

The second approach to using state-equation-error or output-equation-error methods is to spend the time and effort necessary to get accurate results from them, which first requires accurate state measurements with low noise levels. In many applications of these methods, most of the work lies in filtering the data and reconstructing estimates of unmeasured states. (A Kalman filter can sometimes be helpful here, provided that the filter does not depend upon the parameters to be estimated. This condition requires a special problem structure.) The total cost of obtaining good estimates from these methods, including the cost of data preprocessing, may be comparable to the cost of more complicated iterative algorithms that require less preprocessing. The trade-off is highly dependent on application variables such as the required accuracy of the estimates, the quality of the available instrumentation, and the existence of independent needs for accurate state measurements.

CHAPTER 11

11.0 ACCURACY OF THE ESTIMATES

Parameter estimates from real systems are, by their nature, imperfect. The accuracy of the estimates is a pervasive issue in the various stages of application, from the problem statement to the evaluation and use of the results.

We introduced the subject of parameter estimation in Section 1.4, using concepts of errors in the estimates and adequacy of the results. The subsequent chapters have largely concentrated on the derivation of algorithms. These derivations are all related to accuracy issues, based on the definitions and discussions in Chapter 4. However, the questions about accuracy have been largely overshadowed by the details of deriving and implementing the algorithms.

In this chapter, we return the emphasis to the critical issue of accuracy. The final judgment of the parameter estimation process for a particular application is based on the accuracy of the results. We examine the evaluation of the accuracy, factors contributing to inaccuracy, and means of improving accuracy. A truly comprehensive treatment of the subject of accuracy is impossible. We restrict our discussion largely to generic issues related to the theory and methodology of parameter estimation.

To make effective use of parameter estimates, we must have some gauge of their accuracy, be it a statistical measure, an intuitive guess, or some other source. If we absolutely cannot distinguish the extremes of accurate versus worthless estimates, we must always consider the possibility that the estimates are worthless, in which case the estimates could not be used in any application in which their validity was important. Therefore, measures of the estimate accuracy are as important as the estimates themselves. Various means of judging the accuracy of parameter estimates are in current use.
We will group the uses for measures of estimate accuracy into three general classes. The first class of use is in planning the parameter estimation. Predictions of the estimate accuracy can be used to evaluate the adequacy of the proposed experiments and instrumentation system for the parameter estimation on the proposed model. There are limitations to this usage because it involves predicting accuracy before the actual data are obtained. Unexpected problems can always cause degradation of the results compared to the predictions. The accuracy predictions are most useful in identifying experiments that have no hope of success.

The second use is in the parameter estimation process itself. Measures of accuracy can help detect various problems in the estimation, from modeling failures, data problems, program bugs, or other sources. Another facet of this class of use is the comparison of different estimates. The comparisons can be between two different models or methods applied to the same data set, between estimates from independent data sets, or between predictions and estimates from the experimental data. In any of these events, measures of accuracy can help determine which of the conflicting values is best, or whether some compromise between them should be considered. Comparison of the accuracy measures with the differences in the estimates is a means to determine if the differences are significant. The magnitude of the observed differences between the estimates is, in itself, an indicator of accuracy.

The third use of measures of accuracy is for presentation with the final estimates for the user of the results. If the estimates are to be used in a control system design, for instance, knowledge of their accuracy is useful in evaluating the sensitivity of the control system. If the estimates are to be used by an explicit adaptive or learning control system, then it is important that the accuracy evaluation be systematic enough to be automatically implemented. Such immediate use of the estimates precludes the intercession of engineering judgment; the evaluation of the estimates must be entirely automatic. Such control systems must recognize poor results and suitably discount them (or ensure that they never occur, an overly optimistic goal).

The single most critical contributor to getting accurate parameter estimates in practical problems is the analyst's understanding of the physical system and the instrumentation.
The most thorough knowledge of parameter estimation theory and the use of the most powerful techniques do not compensate for poor understanding of the system. This statement relates directly to the discussion in Chapter 1 about the "black box" identification problem and the roles of independent knowledge versus system identification. The principles discussed in this chapter, although no substitute for an understanding of the system, are a necessary adjunct to such understanding.

Before proceeding further, we need to review the definition of the term "accuracy" as it applies to real data. A system is never described exactly by the simplified models used for analysis. Regardless of the sophistication of the model, unexplained sources of modeling error will always remain. There is no unique, correct model. The concept of accuracy is difficult to define precisely if no correct model exists. It is easiest to approach by considering the problem in two parts: estimation and modeling.

For analyzing the estimation problem, we assume that the model describes the system exactly. The definition of accuracy is then precise and quantitative. Many results are available in the subject area of estimation accuracy. Sections 11.1 and 11.2 discuss several of them.

The modeling problem addresses the question of whether the form of the model can describe the system adequately for its intended use. There is little guide from the theory in this area. Studies such as those of Gupta, Hall, and Trankle (1978), Fiske and Price (1977), and Akaike (1974) discuss selection of the best model from a set of candidates, but do not consider the more basic issue of defining the candidate models. Section 11.4 considers this point in more detail.
For the most part, the determination of model adequacy is based on engineering judgment and problem-specific analysis relying heavily on the analyst's understanding of the physics of the system. In some cases,

we can test model adequacy by demonstration: if we try the model and it achieves its purpose, it was obviously adequate. Such tests are not always practical, however. This method assumes, of course, that the test was comprehensive. Such assumptions should not be made lightly; they have cost lives when systems encountered untested conditions.
After considering estimation and modeling as separate problems, we need to look at their interactions to complete the discussion of accuracy. We need to consider the estimates that result from a model judged to be adequate, although not exact. As in the modeling problem, this process involves considerable subjective judgment, although we can obtain some quantitative results. We can examine some specific, postulated sources of modeling error through simulations or analyses that use more complex models than are practical or desirable in the parameter estimation. Such simulations or analyses can include, for example, models of specific, postulated instrumentation errors (Hodge and Bryant, 1978; and Sorensen, 1972). Maine and Iliff (1981b) present some more general, but less rigorous, results.
11.1 CONFIDENCE REGIONS

The concept of a confidence region is central to the analytical study of estimation accuracy. In general terms, a confidence region is a region within which we can be reasonably confident that the true value of ξ lies. Accurate estimates correspond to small confidence regions for a given level of confidence. Note that small confidence regions imply large confidence; in order to avoid this apparent inversion of terminology, the term "uncertainty region" is sometimes used in place of the term "confidence region." The following subsections define confidence regions more precisely.

For continuous, nonsingular estimation problems, the probability of any point estimate's being exactly correct is zero. We need a concept such as the confidence region to make statements with a nonzero confidence. Throughout the discussion of confidence regions, we assume that the system model is correct; that is, we assume that ξ has a true value lying in the parameter space. In later sections we will consider issues relating to modeling error.

11.1.1 Random Parameter Vector

Let us consider first the case in which ξ is a random variable with a known prior distribution. This situation usually implies the use of an MAP estimator. In this case, ξ has a posterior distribution, and we can define the posterior probability that ξ lies in any fixed region. Although we will use the posterior distribution of ξ as the context for this discussion, we can equally well define prior confidence regions; none of the following development depends upon our working with a posterior distribution. For simplicity of exposition, we will assume that the posterior distribution of ξ has a density function. The posterior probability that ξ lies in a region R is then

    P(R) = ∫_R p(ξ|Z) dξ                                        (11.1-1)

We define R to be a confidence region for the confidence level α if P(R) = α, and no other region with the same probability is smaller than R. We use the volume of a region as a measure of its size.

Theorem 11.1  Let R be the set of all points with p(ξ|Z) ≥ c, where c is a constant. Then R is a confidence region for the confidence level α = P(R).

Proof  Let R be as defined above, and let R' be any other region with P(R') = α. We need to prove that the volume of R' must be greater than or equal to that of R. Define T = R ∩ R', S = R − R', and S' = R' − R. Then T, S, and S' are disjoint, R = T ∪ S, and R' = T ∪ S'. Because S ⊂ R, we must have p(ξ|Z) ≥ c everywhere in S. Conversely, S' lies outside R, so p(ξ|Z) ≤ c everywhere in S'. In order for P(R') = P(R), we must have P(S') = P(S). Therefore, the volume of S' must be greater than or equal to that of S. The volume of R' must then be greater than or equal to that of R, completing the proof.
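Theorem 11.1 translates directly into a numerical recipe: threshold the density at a level c chosen so that the retained probability mass equals α. The following is a minimal sketch for a scalar case; the standard Gaussian posterior and the grid are illustrative assumptions, not from the text.

```python
import numpy as np

# Highest-density confidence region per Theorem 11.1:
# R = {x : p(x|Z) >= c}, with c chosen so that P(R) = alpha.
x = np.linspace(-6.0, 6.0, 12001)
dx = x[1] - x[0]
p = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)  # illustrative posterior density

alpha = 0.95
order = np.argsort(p)[::-1]           # visit grid points from highest density down
mass = np.cumsum(p[order]) * dx       # probability mass of the retained set
k = np.searchsorted(mass, alpha)      # first index where the mass reaches alpha
c = p[order[k]]                       # the density threshold defining R

region = x[p >= c]
lo, hi = region.min(), region.max()   # for a unimodal density, R is one interval
```

For this Gaussian the construction recovers the familiar 95% interval of roughly ±1.96. For a multimodal posterior the same thresholding rule can return a union of disjoint intervals, which is why the theorem speaks of regions rather than intervals.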

It is often convenient to characterize a closed region by its boundary. The boundaries of the confidence regions defined by Theorem 11.1 are isoclines of the posterior density function p(ξ|Z).

We can write the confidence region derived in the above theorem as

    R = {x: p_{ξ|Z}(x|Z) ≥ c}                                   (11.1-2)

We must use the full notation for the probability density function to avoid confusion in the following manipulations. For consistency with the following section, it is convenient to re-express the confidence region in terms of the density function of the error

    e = ξ̂ − ξ                                                   (11.1-3)

The estimate ξ̂ is a deterministic function of Z; therefore, Equation (11.1-3) trivially gives

    p_{ξ|Z}(x|Z) = p_{e|Z}(ξ̂ − x|Z)                             (11.1-4)

Substituting this into Equation (11.1-2) gives the expression

    R = {x: p_{e|Z}(ξ̂ − x|Z) ≥ c}                               (11.1-5)

Substituting ξ̂ − x for x in Equation (11.1-5) gives the convenient form

    R = ξ̂ − {x: p_{e|Z}(x|Z) ≥ c}                               (11.1-6)

This form shows the boundaries of the confidence regions to be translated isoclines of the error-density function.

Exact determination of the confidence regions is impractical except in simple cases. One such case occurs when ξ is scalar and p(ξ|Z) is unimodal. An isocline then consists of two points, and the line segment between the two points is the confidence region. In this one-dimensional case, the confidence region is often called a confidence interval. Another simple case occurs when the posterior density function is in some standard family of density functions expressible in closed form. This is most commonly the family of Gaussian density functions. An isocline of a Gaussian density function with mean m and nonsingular covariance A is a set of x values satisfying

    (x − m)*A⁻¹(x − m) = constant                               (11.1-7)

This is the equation of an ellipsoid. For problems not fitting into one of these special cases, we usually must make approximations in the computation of the confidence regions. Section 11.1.3 discusses the most common approximation.

11.1.2 Nonrandom Parameter Vector

When ξ is simply an unknown parameter with no random nature, the development of confidence regions is more oblique, but the result is similar in form to the results of the previous section. The same comments apply when we wish to ignore any prior distribution of ξ and to obtain confidence regions based solely on the current experimental data. These situations usually imply the use of MLE estimators.

In neither of these situations can we meaningfully discuss the probability of ξ lying in a given region. We proceed as follows to develop a substitute concept: the estimate ξ̂ is a function of the observation Z, which has a probability distribution conditioned on ξ. Therefore, we can define a probability distribution of ξ̂ conditioned on ξ. We will assume that this distribution has a density function p_{ξ̂|ξ}.

For a given value of ξ, the isoclines of p_{ξ̂|ξ} define boundaries of confidence regions for ξ̂. Let R₁ be such a confidence region, with confidence level α.

    R₁ = {x: p_{ξ̂|ξ}(x|ξ) ≥ c}                                  (11.1-8)

It is convenient to define R₁ in terms of the error density function p_{e|ξ}, using the relation

    p_{ξ̂|ξ}(x|ξ) = p_{e|ξ}(x − ξ|ξ)                             (11.1-9)

This gives

    R₁ = ξ + {x: p_{e|ξ}(x|ξ) ≥ c}                              (11.1-10)

The estimate has probability α of being in R₁. For this chapter, we are more interested in the situation where we know the value of ξ̂ and seek to define a confidence region for ξ, which is unknown. We can define such a confidence region for ξ, given ξ̂, in two steps, starting with the region R₁.

The first step is to define a region R₂ which is a mirror image of R₁. A point ξ + x in R₁ reflects onto the point ξ̂ − x in R₂, as shown in Figure (11.1-1). We can thus write the region as

    R₂ = ξ̂ − {x: p_{e|ξ}(x|ξ) ≥ c}                              (11.1-11)

This reflection interchanges ξ and ξ̂; therefore, ξ is in R₂ if and only if ξ̂ is in R₁. Because there is probability α that ξ̂ lies in R₁, there is the same probability α that ξ lies in R₂.

To be technically correct, we must be careful about the phrasing of this statement. Because the true value is not random, it makes no sense to say that ξ has probability α of lying in R₂. The randomness is in the construction of the region R₂, because R₂ depends on the estimate ξ̂, which depends in turn on the noise-contaminated observations. We can sensibly say that the region R₂, constructed in this manner, has probability α of covering the true value ξ. This concept of a region covering the fixed point ξ replaces the concept of the point ξ lying in a fixed region. The distinction is more important in theory than in practice.

Although we have defined the region R₂ in principle, we cannot construct the region from the data available, because R₂ depends on the value of ξ, which is unknown. Our next step is to construct a region R₃ which approximates R₂, but does not depend on the true value of ξ. We base the approximation on the assumption that p_{e|ξ} is approximately invariant as a function of ξ; that is,

    p_{e|ξ}(x|ξ) ≈ p_{e|ξ̂}(x|ξ̂)                                 (11.1-12)

This approximation is unlikely to be valid for large values of the error ξ̂ − ξ, except in simple cases. For small values of the error, the approximation is usually reasonable. We define the confidence region R₃ for ξ by applying this approximation to Equation (11.1-11), using

    R₃ = ξ̂ − {x: p_{e|ξ̂}(x|ξ̂) ≥ c}                              (11.1-13)

The region R₃ depends only on ξ̂, p_{e|ξ̂}, and the arbitrary constant c. The function p_e is presumed known from the start, and ξ̂ is the estimate computed by the methods described in previous chapters. In principle, we therefore have sufficient information to compute the region R₃. Practical application requires either that p_{e|ξ̂} be in one of the simple forms described in Section 11.1.1, or that we make further approximations as discussed in Section 11.1.3.

If the error is small (that is, if the estimate is accurate), then R₃ will likely be a close approximation to R₂. If ξ̂ − ξ is large, then the approximation is questionable. The result is that we are unable to define large confidence regions accurately except in special cases. We can tell that the confidence region is large, but its precise size and shape are difficult to determine.

Note that the confidence region for nonrandom parameters, defined by Equation (11.1-13), is almost identical in form to the confidence region for random parameters, defined by Equation (11.1-6). The only difference in the form is what the density functions are conditioned on.

11.1.3 Gaussian Approximation

The previous sections have derived the boundaries of confidence regions for both random and nonrandom parameter vectors in terms of isoclines of probability density functions of the error vector. Except in special cases, these probability density functions are too complicated to allow practical computation of the exact isoclines. Extreme precision in the computation of the confidence regions is seldom necessary; we have already made approximations in the definition of confidence regions for nonrandom parameters. In this section, we introduce approximations which allow relatively easy computation of confidence regions.

The central idea of this section is to approximate the pertinent probability density functions by Gaussian density functions. As discussed in Section 11.1.1, the isoclines of Gaussian density functions are ellipsoids, which are easy to compute. We call these "confidence ellipsoids" or "uncertainty ellipsoids." In many cases, we can justify the Gaussian approximation with arguments that the distributions asymptotically approach Gaussians as the amount of data increases. Section 5.4.2 discusses some pertinent asymptotic results.
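The claim that Gaussian isoclines are ellipsoids is easy to verify numerically. In the sketch below, the mean and covariance are arbitrary illustrative values: points generated on the ellipsoid (x − m)*A⁻¹(x − m) = c all receive the same density value.

```python
import numpy as np

# Isoclines of a 2-D Gaussian: every point on (x - m)' A^{-1} (x - m) = c
# has the same density, so the ellipsoid is a level curve of the density.
m = np.array([1.0, -2.0])                  # illustrative mean
A = np.array([[2.0, 0.6],
              [0.6, 1.0]])                 # illustrative covariance
A_inv = np.linalg.inv(A)

def density(xpt):
    d = xpt - m
    return np.exp(-0.5 * d @ A_inv @ d) / (2.0 * np.pi * np.sqrt(np.linalg.det(A)))

# Parametrize the ellipsoid with the Cholesky factor L (A = L L'):
# x = m + sqrt(c) * L u, with u on the unit circle, satisfies the equation exactly.
c = 1.5
L = np.linalg.cholesky(A)
theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
unit = np.vstack([np.cos(theta), np.sin(theta)])
points = m + np.sqrt(c) * (L @ unit).T

quad_forms = np.array([(p - m) @ A_inv @ (p - m) for p in points])
densities = np.array([density(p) for p in points])
```

The Cholesky parametrization is also the standard way to draw such confidence ellipses in plotting programs.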

A Gaussian approximation is defined by its mean and covariance. We will consider appropriate choices for the mean and covariance to make the Gaussian density function a reasonable approximation. An obvious possibility is to set the mean and covariance of the Gaussian approximation to match the mean and covariance of the original density function; we are often forced to settle for approximations to the mean and covariance of the original density function, the exact values being impractical to compute. Another possibility is to use Equations (3.5-17) and (3.5-18). We will illustrate the use of both of these options.

Consider first the case of an MLE estimator. Equation (11.1-13) defines the confidence region. We will use covariance matching to define the Gaussian approximation to p_{e|ξ̂}. The exact mean and covariance of p_{e|ξ} are difficult to compute, but there are asymptotic results which give reasonable approximations. We use zero as an approximation to the mean of p_{e|ξ}; this approximation is based on MLE estimators being asymptotically unbiased. Because MLE estimators are efficient, the Cramer-Rao bound gives an asymptotic approximation for the covariance of p_{e|ξ} as the inverse of the Fisher information matrix M(ξ). We can use either Equation (4.2-19) or (4.2-24) as equivalent expressions for the Fisher information matrix. Equation (5.4-11) gives the particular form of M(ξ) for static nonlinear systems with additive Gaussian noise. Both ξ̂ and M(ξ̂) are readily available in practical application. The estimate ξ̂ is the primary output of a parameter-estimation program, and most MLE parameter-estimation programs compute M(ξ̂) or an approximation to it as a by-product of the iterative minimization of the cost function.
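As a sketch of how the Cramer-Rao bound is used in practice, consider an assumed linear model z = a·u + b + noise (not an example from the text). The Fisher information matrix is M = (1/σ²) Σ gᵢgᵢᵀ with gᵢ = [uᵢ, 1]ᵀ, and its inverse predicts the scatter of repeated least-squares estimates:

```python
import numpy as np

# Predicted error covariance from the Fisher information matrix, checked
# against the empirical scatter of Monte Carlo least-squares estimates.
rng = np.random.default_rng(0)
u = np.linspace(0.0, 1.0, 50)
sigma = 0.1
G = np.column_stack([u, np.ones_like(u)])   # gradient of the model w.r.t. (a, b)

M = (G.T @ G) / sigma**2                    # Fisher information matrix
predicted_cov = np.linalg.inv(M)            # Cramer-Rao bound on the error covariance

a_true, b_true = 2.0, -1.0
estimates = []
for _ in range(2000):
    z = a_true * u + b_true + sigma * rng.standard_normal(u.size)
    estimates.append(np.linalg.lstsq(G, z, rcond=None)[0])
empirical_cov = np.cov(np.array(estimates), rowvar=False)
```

For this linear-Gaussian case the bound is attained exactly, so the empirical covariance matches M⁻¹ up to Monte Carlo noise; for nonlinear models the same comparison holds only asymptotically, as the text notes.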

Now consider the case of an MAP estimator. We need a Gaussian approximation to p(e|Z). Equations (3.5-17) and (3.5-18) provide a convenient basis for such an approximation. By Equation (3.5-17), we set the mean of the Gaussian approximation equal to the point at which p(e|Z) is a maximum; by definition of the MAP estimator, this point is zero. We then set the covariance of the Gaussian approximation to

    A = [−∇²_e ln p(e|Z)]⁻¹                                     (11.1-14)

evaluated at ξ = ξ̂. For static nonlinear systems with additive Gaussian noise, Equation (11.1-14) reduces to the form of Equation (5.4-12), which we could also have obtained by approximate covariance-matching arguments. This form for the covariance is the same as that used in the MLE confidence ellipsoid, with the addition of the prior-covariance term. As the prior covariance goes to infinity, the confidence ellipsoid for the MAP estimator approaches that for the MLE estimator, as we would anticipate.

Both the MLE and MAP confidence ellipsoids take the form

    (x − ξ̂)*A⁻¹(x − ξ̂) = c                                      (11.1-15)

where A is an approximation to the error-covariance matrix. We have suggested suitable approximations in the above paragraphs, but most approximations to the error covariance are equally acceptable. The choice is usually dictated by what is conveniently available in a given program.

11.1.4 Nonstatistical Derivation

We can alternately derive the confidence ellipsoids for MAP and MLE estimators from a nonstatistical viewpoint. This derivation obtains the same result as the statistical approach and is easier to follow. Comparison of the ideas used in the statistical and nonstatistical derivations reveals the close relationships between the statistical characteristics of the estimates and the numerical problems of computing them. The nonstatistical approach generalizes easily to estimators and models for which precise statistical descriptions are difficult.

The nonstatistical derivation presumes that the estimate is defined as the minimizing point of some cost function. We examine the shape of this cost function as it affects the numerical minimization problem in the area of the minimum. For current purposes, we are not concerned with start-up problems, isolated local minima, and other problems manifested far from the solution point. A relatively flat, ill-defined minimum corresponds to a questionable estimate; the extreme case of this is a function without a discrete local minimum point. A steep, well-defined minimum corresponds to a reliable estimate. With this justification, we define a confidence region to be the set of points with cost-function values less than or equal to some constant. Different values of the constant give different confidence levels. The boundary of such a region is an isocline of the cost function.

We then approximate the cost function in the neighborhood of the minimum by a quadratic Taylor-series expansion about the minimum point.

    J(ξ) ≈ J(ξ̂) + (1/2)(ξ − ξ̂)*[∇²_ξ J(ξ̂)](ξ − ξ̂)              (11.1-16)

The isoclines of this quadratic approximation are the confidence ellipsoids.

    (ξ − ξ̂)*[∇²_ξ J(ξ̂)](ξ − ξ̂) = c                              (11.1-17)

The inverse of the second gradient of an MLE or MAP cost function is an asymptotic approximation to the appropriate error covariance. Therefore, Equation (11.1-17) gives the same shape confidence ellipsoids as we previously derived on a statistical basis. In practice, the Gauss-Newton or other approximation to the second gradient is usually used. The constant c determines the size of the confidence ellipsoid. The nonstatistical derivation gives no obvious basis for selecting a value of c. The value c = 1 gives the most useful correspondence to the statistical derivation, as we will see in Section 11.2.1. Figures (11.1-2) and (11.1-3) illustrate the construction of one-dimensional confidence ellipsoids using the nonstatistical definition.
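The nonstatistical construction needs only the cost value and its second gradient at the minimum, both of which can be obtained numerically. A sketch with an assumed exponential model (not from the text), using central finite differences for the second gradient and checking the quadratic model of Equation (11.1-16) near the minimum:

```python
import numpy as np

# Quadratic Taylor model of a nonlinear least-squares cost about its minimum.
t = np.linspace(0.0, 2.0, 40)
z = np.exp(-1.3 * t) + 0.2              # noise-free data; exact minimum at (1.3, 0.2)

def J(xi):
    a, b = xi
    r = z - (np.exp(-a * t) + b)
    return 0.5 * r @ r

xi_hat = np.array([1.3, 0.2])           # minimizing point (J = 0 here)
h = 1e-4
H = np.zeros((2, 2))                    # second gradient by central differences
for i in range(2):
    for j in range(2):
        ei = np.zeros(2); ei[i] = h
        ej = np.zeros(2); ej[j] = h
        H[i, j] = (J(xi_hat + ei + ej) - J(xi_hat + ei - ej)
                   - J(xi_hat - ei + ej) + J(xi_hat - ei - ej)) / (4.0 * h**2)

# Check Eq. (11.1-16): J(xi) ~= J(xi_hat) + 0.5 (xi - xi_hat)' H (xi - xi_hat)
d = np.array([0.01, -0.005])            # small probe step (arbitrary)
quadratic = J(xi_hat) + 0.5 * d @ H @ d
```

The positive-definite H defines the confidence ellipsoid of Equation (11.1-17); a nearly singular H would signal the flat, ill-defined minimum discussed above.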

11.2 ANALYSIS OF THE CONFIDENCE ELLIPSOID

The confidence ellipsoid gives a comprehensive picture of the theoretically likely errors in the estimate. It is difficult, however, to display the information content of the ellipsoid on a two-dimensional sheet of paper. In the applications we most commonly work on, there are typically 10 to 30 unknown parameters; that is, the ellipsoid is 10- to 30-dimensional. We can print the covariance matrix which defines the shape of the ellipsoid, but it is difficult to draw useful conclusions from such a presentation format. The problem of meaningful presentation is further compounded when analyzing hundreds of experiments to obtain parameter estimates under a wide variety of conditions.

In the following sections, we discuss simplified statistics that characterize important features of the confidence ellipsoids in ways that are easy to describe and present. The emphasis in these statistics is on reducing the dimensionality of the problem. Many important questions about accuracy reduce to one-dimensional forms, such as the accuracy of the estimate of each element of the parameter vector.

All of the statistics discussed here are functions of the matrix A, which defines the shape of the confidence ellipsoid. We have seen above that A is an approximation to the error-covariance matrix. These two viewpoints of A will provide us with geometrical and statistical interpretations. A third interpretation comes from viewing A as the inverse of the second gradient of the cost function. In practice, A is usually computed from the Gauss-Newton or other convenient approximation to the second gradient.

These statistics are closely linked to some of the basic sources of estimation errors and difficulties. We will illustrate the discussion with idealized examples of these classes of difficulties. The exact means of overcoming such difficulties depends on the problem, but the first step is to understand the mechanism causing the difficulty. In a surprising number of applications, the major difficulties are cases of the simple idealizations discussed here.

11.2.1 Sensitivity

The sensitivity is the simplest of the statistics relating to the confidence ellipsoid. Although the sensitivity has both a statistical and a nonstatistical interpretation, the use of the statistical interpretation is relatively rare. The term "sensitivity" comes from the nonstatistical interpretation, which we will discuss first.

From the nonstatistical viewpoint, the sensitivity is a measure of how much the cost-function value changes for a given change in a scalar parameter value. The most common definition of the sensitivity with respect to a parameter ξᵢ is the second partial derivative of the cost function with respect to the parameter.

    ∂²J(ξ)/∂ξᵢ²                                                 (11.2-1)

For the purposes of this chapter, we are interested in the sensitivity evaluated at the minimum point of the cost function; we will take this as part of the definition of the sensitivity. The ξᵢ in Equation (11.2-1) can be any scalar function of the ξ vector. In most cases, ξᵢ is one of the elements of the ξ vector. For simplicity, we will assume for the rest of this section that ξᵢ is the ith element of ξ. Generalizations are straightforward. When ξᵢ is the ith element of ξ, the second partial derivative with respect to ξᵢ is the ith diagonal element of the second-gradient matrix.

The sensitivity has a simple geometric interpretation based on the confidence ellipsoid. Use the value c = 1 in Equation (11.1-17) to define a confidence ellipsoid. Draw a line passing through ξ̂ (the center of the ellipsoid) and parallel to the ξᵢ axis. The sensitivity with respect to ξᵢ is related to the distance, Iᵢ, from the center of the ellipsoid to the intercept of this line and the ellipsoid. We call this distance the insensitivity with respect to ξᵢ. Figure (11.2-1) shows the construction of the insensitivities with respect to ξ₁ and ξ₂ on a two-dimensional example. The relationship between the sensitivity and the insensitivity is

    Iᵢ = [∂²J(ξ̂)/∂ξᵢ²]^(−1/2)                                   (11.2-2)

which follows immediately from Equation (11.1-17) for the confidence ellipsoid, and Equation (11.2-1) for the sensitivity.

We can rephrase the geometric interpretation of the insensitivity as follows: the insensitivity with respect to ξᵢ is the largest change that we can make in the ith element of ξ and still remain within the confidence ellipsoid. All other elements of ξ are constrained to remain equal to their estimated values during this search; that is, the search is constrained to a line parallel to the ξᵢ axis passing through ξ̂.

From the statistical viewpoint, the insensitivity with respect to ξᵢ is an approximation to the standard deviation of eᵢ, the corresponding component of the error, conditioned on all of the other components of the error. We can see this by recalling the results from Chapter 3 on conditional Gaussian distributions. If the covariance of e is A, then the covariance of eᵢ conditioned on all of the other components is [(A⁻¹)ᵢᵢ]⁻¹; therefore, the conditional standard deviation is [(A⁻¹)ᵢᵢ]^(−1/2). From Equations (11.2-2) and (11.2-3), we can see that this expression equals the insensitivity. Note that the conditioning on the other elements in the statistical viewpoint corresponds directly to the constraint on the other elements in the geometric viewpoint.

A sensitivity analysis will detect one of the most obvious kinds of estimation difficulty: parameters which have little or no effect on the system response. If a parameter has no effect on the system response, then it should be obvious that the system response data give no basis for an estimate of the parameter; in statistical terms, the system is unidentifiable. Similarly, if a parameter has little effect on the system response, then there is little basis for an estimate of the parameter; we can expect the estimates to be inaccurate.

Checking for parameters which have no effect on the system response may seem like an academic exercise, considering that practical problems would not be likely to have such irrelevant parameters. In fact, this seemingly trivial difficulty is extremely common in practical applications. It can arise from typographical or other errors in input to computer programs. Perhaps the most common example of this problem is attempting to estimate the effect of an input which is identically zero. The input might either be validly zero, in which case its effect cannot be estimated, or the input signal might have been destroyed or misplaced by sensor or programming problems.

The sensitivity is a reasonable indicator of accuracy only when we are estimating a single parameter, because the estimates of other parameters are never exact, as the sensitivity analysis assumes. The sensitivity analysis ignores all effects of correlation between parameters; we can evaluate the sensitivity with respect to a parameter without even knowing what other parameters are being estimated. When more than one parameter is estimated, the sensitivity gives only a lower bound for the error estimate. The error band is always at least as large as the insensitivity regardless of what other parameters are estimated; correlation effects between parameters can increase, but never decrease, the error band. In other words, high sensitivity is a necessary, but not sufficient, condition for an accurate estimate. In practice, correlation effects tend to increase the error band so much that the sensitivity is virtually useless as an indicator of accuracy. The sensitivity analysis is usually useful only for detecting the problem of completely irrelevant parameters. The sensitivity will not indicate when the effect of a parameter is indistinguishable from the effects of other parameters, a more common problem.

11.2.2 Correlation

We noted in the previous section that correlations among parameters result in much larger error bands than those indicated by the sensitivities alone. The inadequacy of the sensitivity as a measure of estimate accuracy has led to the widespread use of the statistical correlations to indicate accuracy. We will see in this section that the correlations also give an incomplete picture of the accuracy.

The statistical correlation between two error components eᵢ and eⱼ is defined to be

    corr(eᵢ,eⱼ) = E{eᵢeⱼ}/√(E{eᵢ²}E{eⱼ²})                        (11.2-4)

assuming that the means of eᵢ and eⱼ are zero. In terms of A, the covariance matrix of e, the correlation is

    corr(eᵢ,eⱼ) = Aᵢⱼ/√(AᵢᵢAⱼⱼ)                                  (11.2-5)
Geometrically, the correlations are related to the eccentricity of the confidence ellipsoid. If the sensitivities with respect to all of the unknown parameters are equal (which we can always arrange by a scale change), and if the correlations are all zero, then the confidence ellipsoid is spherical. As the magnitudes of the correlations become larger, the eccentricity of the scaled ellipsoid increases. The magnitude of the correlations can never exceed 1, except through approximations or round-off errors in the computation.

The definition above is for the unconditional or full correlations. Whenever the term correlation appears without a modifier, it implicitly means the unconditional correlation. We can also define conditional correlations, although they are less commonly used. The definition of the conditional correlation is identical to that of the unconditional correlations, except that the expected values are all conditioned on all of the parameters other than the two under consideration. We can express the conditional correlation of eᵢ and eⱼ as

    cond corr(eᵢ,eⱼ) = −Γᵢⱼ/√(ΓᵢᵢΓⱼⱼ)                            (11.2-6)

where Γ = A⁻¹. This is similar to the expression for the unconditional correlation, the difference being that Γ replaces A and the sign is changed.

If there are only two unknowns, the conditional and unconditional correlations are identical. If there are more than two unknowns, the conditional and unconditional correlations can give quite different pictures. Consider the case in which Γ is an N-by-N matrix with 1's on the diagonal and with all of the off-diagonal elements equal to X; the conditional correlations are then all −X. As X approaches −1/(N − 1), the full correlations approach 1. In the limit, when X equals −1/(N − 1), the matrix is singular. Thus, for large N, the full correlations can be quite high even when all of the conditional correlations are low. This same example inverts to show that the converse is also true.
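The N-dimensional example above is easy to reproduce numerically. A sketch with N = 10 and X chosen just above the singular value −1/(N − 1); the specific numbers are illustrative:

```python
import numpy as np

# Full correlations from A = Gamma^{-1} versus conditional correlations -X,
# for Gamma with 1's on the diagonal and X on all off-diagonal elements.
N = 10
X = -0.11                      # just above the singular value -1/(N-1) = -0.111...
Gamma = (1.0 - X) * np.eye(N) + X * np.ones((N, N))

A = np.linalg.inv(Gamma)
full_corr = A[0, 1] / np.sqrt(A[0, 0] * A[1, 1])              # Eq. (11.2-5)
cond_corr = -Gamma[0, 1] / np.sqrt(Gamma[0, 0] * Gamma[1, 1])  # Eq. (11.2-6)
```

Here the conditional correlations are only 0.11, yet the full correlations exceed 0.9; pushing X still closer to −1/9 drives the full correlations arbitrarily close to 1.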

There are three objections to using the correlations, full or conditional, as primary indicators of accuracy. First, although the correlations give information about the shape of the confidence ellipsoid, they completely ignore its size. Figure (11.2-2) shows two confidence ellipsoids. Ellipse A is completely contained within ellipse B and is, therefore, clearly preferable; yet ellipse B has zero correlation and ellipse A has significant correlation. From this example, it is obvious that accurate estimates can have high correlations and poor estimates can have low correlations. To evaluate the accuracy of the estimates, you need information about the sensitivities as well as about the correlations; neither alone is adequate.

As a more concrete example of the interplay between correlation and sensitivity, consider a scalar linear system:

    z(tᵢ) = Du(tᵢ) + H + n(tᵢ)

    z(t_i) = Du(t_i) + H + n(t_i)

We wish to estimate D. Both D and the bias H are unknown. The input u(t_i) is an angular position of some control device. Suppose that the input time history is as shown in Figure (11.2-3). A large portion of the energy in this input is from the steady-state value of 90°; the energy in the pulse is much smaller. This input is highly correlated with a constant bias input. Therefore, the estimate of D will be highly correlated with the estimate of H. (If this point is not obvious, we can choose a few time points on the figure and compute the corresponding covariance matrix.) The sensitivity with respect to D is high; because of the large values of u, small changes in D cause large changes in z.
Now we consider the same system, with the input shown in Figure (11.2-4). Both the correlation and the sensitivity are much lower than they were for the input of Figure (11.2-3). These changes balance each other, resulting in the same accuracy in estimating D. The inputs shown in the two figures are identical, but measured with respect to reference axes rotated by 90°. The choice of reference axis is a matter of convention which should not affect the accuracy; it does, however, affect both the sensitivity and correlation.
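The balance described here can be sketched numerically for the scalar model z(t_i) = Du(t_i) + H plus noise (the pulse shape and sample count below are invented for illustration; they are not the data of the figures):

```python
import numpy as np

# Invented pulse input: steady 90 deg with a brief 5-deg pulse (u1), and the
# same input measured from an axis rotated by 90 deg (u2).
u1 = np.array([90.0] * 20 + [95.0] * 5 + [90.0] * 20)
u2 = u1 - 90.0

def corr_and_bound(u):
    # Information matrix for the unknowns (D, H), unit noise variance assumed
    A = np.column_stack([u, np.ones_like(u)])
    cov = np.linalg.inv(A.T @ A)
    corr = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
    bound = np.sqrt(cov[0, 0])                  # Cramer-Rao bound on D
    return corr, bound

c1, b1 = corr_and_bound(u1)
c2, b2 = corr_and_bound(u2)
print(c1, c2)    # near -1 for u1, much smaller in magnitude for u2
print(b1, b2)    # identical: the accuracy of D does not depend on the axis
```

Shifting the reference axis changes the correlation from nearly -1 to about -1/3, yet the Cramér-Rao bound on D is exactly the same for both inputs.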

This example illustrates that the correlation alone is not a reasonable measure of accuracy. By redefining the reference axis of the input in this example, we can change the correlation at will to any value between -1 and 1.

The second objection to the use of correlations as indicators of accuracy is more serious because it cannot be answered by simply looking at sensitivities and correlations together. In the same way that sensitivities are one-dimensional tools, correlations are two-dimensional tools. The utility of a tool restricted to two-dimensional subspaces is limited. Three simple examples of idealized but realistic situations serve to illustrate the dimensional limitations of the correlations. These examples involve free lateral-directional oscillation of an aircraft.

For the first example, there is a yaw-rate feedback to the rudder and a rudder-to-aileron interconnect. Thus the aileron and rudder signals are both proportional to yaw rate. In this case, the conditional correlations of the aileron, rudder, and yaw-rate derivatives are 1 (or nearly so with imperfect data). Conditioned on the aileron derivatives being known exactly, changes in the rudder derivative estimates can be exactly compensated for by changes in the yaw-rate derivative estimates; thus, the conditional correlation is 1. The

unconditional correlations, however, are easily seen to be only 1/2. Changes in the rudder derivative estimates must be compensated for by some combination of changes in the aileron and yaw-rate derivative estimates. Since there are no constraints on how much of the compensation must come from the aileron and how much from the yaw-rate derivative estimates, the unconditional correlations would be 1/2 (because, on the average, 1/2 of the compensation would come from each source).

For the second example, no feedback is present and there is a neutrally damped dutch-roll oscillation (or a wing rock). The sideslip, roll-rate, and yaw-rate signals are thus all sinusoids of the same frequency, with different phases and amplitudes. Taken two at a time, these signals have low correlations. The conditional correlations consider only two parameters at a time, and thus the conditional correlations of the derivatives will be low. Nonetheless, the three signals are linearly dependent when all are considered together, because they can all be written as linear combinations of a sine wave and a cosine wave at the dutch-roll frequency. The unconditional correlations of the derivatives will be 1 (or nearly so with imperfect data).

Both of the above examples have three-dimensional correlation problems, which prevent the parameters from being identifiable. The conditional correlations are low in one case, and the unconditional correlations are low in the other. Although neither alone is sufficient, examination of both the conditional and unconditional correlations will always reveal three-dimensional correlation problems.

For the third example, suppose that a wing leveler feeds back bank angle to the aileron,
and that a neutrally damped dutch roll is present with the feedback on. There are then four pertinent signals (sideslip, roll rate, yaw rate, and aileron) that are sinusoids with the same frequency and different phases. In this case, both the conditional and the unconditional correlations will be low. Nonetheless, there is a correlation problem which results in unidentifiable parameters. This correlation problem is four-dimensional and cannot be seen using the two-dimensional correlations.

The full and conditional correlations are closely related to the eigenvalues of 2-by-2 submatrices of the Λ and Γ matrices, respectively, normalized to have unity diagonal elements. Specifically, the eigenvalues are 1 plus the correlation and 1 minus the correlation; thus, high correlations correspond to large eigenvalue spreads. Higher-order correlations would be investigated using eigenvalues of larger submatrices. Looked at in this light, the investigation of 2-by-2 submatrices is revealed as an arbitrary choice dictated by its familiarity more than by any objective criterion. The eigenvalues of the full normalized Λ and Γ matrices would seem more appropriate tools. These eigenvalues and the corresponding eigenvectors can provide some information, but they are seldom used. In principle, small eigenvalues of the normalized Γ matrix or large eigenvalues of the normalized Λ matrix indicate correlations among the parameters with significant components in the corresponding eigenvectors. Note that the eigenvalues of the unnormalized Γ and Λ matrices are of little use in studying correlations, because scaling effects tend to dominate.

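The dutch-roll example can be sketched numerically (the frequency, phases, and record length are invented). Pairwise, the three equal-frequency sinusoids show only moderate correlation, but the smallest eigenvalue of the normalized Gram matrix is essentially zero, revealing the three-dimensional dependence:

```python
import numpy as np

# Three sinusoids at one frequency with different phases (values invented)
t = np.linspace(0.0, 10.0, 500)
w = 2.0                                    # assumed oscillation frequency, rad/sec
phases = (0.0, 2.0 * np.pi / 3.0, 4.0 * np.pi / 3.0)
sigs = np.column_stack([np.sin(w * t + p) for p in phases])

G = sigs.T @ sigs                          # Gram (information-like) matrix
d = np.sqrt(np.diag(G))
Gn = G / np.outer(d, d)                    # normalized to unity diagonal

pairwise = [Gn[0, 1], Gn[0, 2], Gn[1, 2]]  # each near -0.5: no pair looks bad
min_eig = np.linalg.eigvalsh(Gn).min()     # near zero: jointly dependent
print(pairwise, min_eig)
```

Any sinusoid at frequency w is a linear combination of sin(wt) and cos(wt), so the three signals span only a two-dimensional space; the near-zero eigenvalue exposes this even though every pairwise correlation is only about -0.5.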
The last objection to the use of the correlations is the difficulty of presentation. It is impractical to display the estimated correlations graphically in a problem with more than a handful of unknowns. The most common presentation is simply to print the matrix of estimated correlations. This option offers little improvement in comprehensibility over simply printing the Λ matrix. If there are a large number of experiments, it is pointless to print all of the correlation matrices. Such a nongraphical presentation cannot reasonably give a coherent picture of the system analyzed.

11.2.3 Cramer-Rao Bound

The Cramer-Rao bound is the last of the statistics based on the confidence ellipsoid. It proves to be the most useful of these statistics. The Cramer-Rao bound is often referred to by other names, including the standard deviation and the uncertainty level. We will consider both statistical and nonstatistical interpretations of the Cramer-Rao bound.

The Cramer-Rao bound of an estimated scalar parameter is the standard deviation of the error in that parameter. Strictly speaking, the term Cramer-Rao bound applies only to the approximation to the standard deviation obtained from the Cramer-Rao inequality. For the purposes of this section, the properties are similar, regardless of the source of the standard deviation. In terms of the Λ matrix, the Cramer-Rao bound of the ith element of ξ is (Λ_ii)^(1/2).

The Cramer-Rao bound is closely related to the insensitivity. Both are standard deviations of the error, the only difference being that the insensitivity is the conditional standard deviation, whereas the Cramer-Rao bound is unconditional. They are also computationally similar, the difference being in whether the inversion is of the matrix or of the individual element.

The geometric relationship between the Cramer-Rao bound and the insensitivity is particularly revealing. The Cramer-Rao bound on ξ_i is the largest change that you can make in ξ_i and still remain within the confidence ellipsoid. During this search, the other components are free to take any values that keep the point within the confidence ellipsoid.
This definition is identical to the geometric definition of the insensitivity, except that the other components are constrained to the estimated values in the definition of insensitivity. This constraint is directly related to the statistical conditioning in the definition of the insensitivity; the Cramer-Rao bound has no such constraints and is an unconditional standard deviation. The Cramer-Rao bound must always be at least as large as the insensitivity, because releasing a constraint can never make the solution of a maximization problem smaller. This fact relates to our previous statement that correlation effects can increase, but never decrease, the error band defined by the insensitivity. Figure (11.2-5) illustrates the geometric interpretation of the Cramer-Rao bounds and insensitivities in a two-dimensional example.

To prove that the Cramer-Rao bound is the solution to the above optimization problem, we will state and prove a more general result. (The general result is actually easier to prove.)
Theorem (11.2-1) For a fixed vector y and a positive definite symmetric matrix H, the maximum of x*y, subject to the constraint that x*Hx ≤ 1, is (y*H⁻¹y)^(1/2).

Proof Since x*y has no unconstrained local extrema, the solution must lie on the constraint boundary; therefore, the inequality in the constraint can be replaced by an equality. This constrained optimization problem can be restated by the use of Lagrange multipliers (Luenberger, 1969) as the unconstrained maximization of

    L(x,λ) = x*y - (1/2)λ(x*Hx - 1)                                  (11.2-8)

where λ is the scalar Lagrange multiplier. The maximum is found by setting the gradients to zero as follows:

    ∇_x L(x,λ) = y - λHx = 0                                         (11.2-9)

    ∇_λ L(x,λ) = -(1/2)(x*Hx - 1) = 0                                (11.2-10)

From Equation (11.2-9) we have

    x = λ⁻¹H⁻¹y                                                      (11.2-11)

Substituting this into Equation (11.2-10) gives

    λ⁻²y*H⁻¹HH⁻¹y = 1

and thus

    λ = (y*H⁻¹y)^(1/2)

Substituting into Equation (11.2-11) gives

    x = (y*H⁻¹y)^(-1/2)H⁻¹y

and thus

    x*y = (y*H⁻¹y)^(1/2)

at the solution. This is the result sought.

The specific case of y being a unit vector along the ξ_i axis gives the form claimed for the Cramer-Rao bound of the ξ_i element. The general form of Theorem (11.2-1) has other applications. The value of any linear combination of the parameters can be expressed as ξ*y for some fixed y-vector. Thus the general form shows how to evaluate the accuracy of arbitrary linear combinations of parameters. This form applies to many situations where the sum, difference, or other combination of multiple parameters is of interest.
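Theorem (11.2-1) can be verified numerically. In the sketch below (H and y are arbitrary illustrative values), a search over points on the boundary of the ellipsoid x*Hx = 1 approaches the closed-form maximum (y*H⁻¹y)^(1/2) from below:

```python
import numpy as np

# Illustrative positive definite H and fixed vector y; not from the text.
rng = np.random.default_rng(0)
H = np.array([[4.0, 1.5, 0.5],
              [1.5, 2.0, 0.3],
              [0.5, 0.3, 1.0]])
y = np.array([1.0, -1.0, 2.0])

closed_form = np.sqrt(y @ np.linalg.inv(H) @ y)

# Sample many points, scale each onto the boundary x*Hx = 1, take max of x*y.
x = rng.standard_normal((100000, 3))
x /= np.sqrt(np.einsum('ij,jk,ik->i', x, H, x))[:, None]
sampled_max = (x @ y).max()

print(closed_form, sampled_max)   # sampled_max approaches closed_form from below
```

No sampled point on the ellipsoid ever exceeds the closed-form value, and the best samples come within a few percent of it.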
On the basis of this geometric picture, we can think of the Cramer-Rao bounds as insensitivities that are computed accounting for all parameter correlations. The computation and interpretation of the Cramer-Rao bounds are valid in any number of dimensions. In this respect, the Cramer-Rao bounds contrast with the insensitivities, which are one-dimensional tools, and the correlations, which are two-dimensional tools. The Cramer-Rao bounds are thus the best of the theoretical measures of accuracy that can be evaluated for a single experiment.
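The contrast between the two measures can be sketched for an illustrative information-type matrix M (values invented): the insensitivity inverts a single diagonal element, the Cramer-Rao bound inverts the full matrix, and the bound is never smaller:

```python
import numpy as np

# Illustrative information-type matrix for three parameters (values invented)
M = np.array([[4.0, 1.5, 0.5],
              [1.5, 2.0, 0.3],
              [0.5, 0.3, 1.0]])

insens = 1.0 / np.sqrt(np.diag(M))             # inverts individual elements
bounds = np.sqrt(np.diag(np.linalg.inv(M)))    # inverts the whole matrix

print(insens)
print(bounds)   # each bound is at least as large as the matching insensitivity
```

The gap between the two grows with the off-diagonal (correlation) terms; with a diagonal M the two measures coincide.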

11.3 OTHER MEASURES OF ACCURACY

The previous sections have discussed the Cramer-Rao bound and other accuracy statistics based on the confidence ellipsoid. Although the Cramer-Rao bound is the best single analytical measure of accuracy, overreliance on any single source of accuracy data is dangerous. Uncritical use of the Cramer-Rao bound can give extremely misleading results in realistic situations, as discussed by Maine and Iliff (1981b). This section discusses alternate accuracy measures, which can supplement the Cramer-Rao bound.

11.3.1 Bias

The bias of an estimator is occasionally cited as an indicator of accuracy. We do not consider it a useful indicator in most circumstances. This section is limited to a brief exposition of the reasons for this judgment.

Section 4.2.1 defines the bias of an estimator. Bias arises from several sources. Some estimators are intrinsically biased, regardless of the nature of the data. Random noise in the data often causes a bias. The bias from random noise sometimes goes to zero asymptotically for estimators matched to the noise characteristics. Finally, the inevitable modeling errors in analyzing real systems cause all estimators to be biased, even asymptotically. Most discussions of bias refer, implicitly or explicitly, to asymptotic bias. Even for idealized cases with no modeling error, estimators are seldom unbiased for finite time.

There are two reasons why the bias is of minimal use as a measure of accuracy. First, the bias reflects only the consistent errors; it ignores random scatter. As illustrated in Section 4.2.1, it is possible for an estimator to give ludicrous individual estimates which average out to a small or zero bias. This property is intrinsic to the definition of the bias. Second, the bias is difficult to compute in most cases. If we could compute the bias, we could subtract it from the estimates to obtain revised estimates that were unbiased. (Some estimators use this technique.) In some cases, it may be practical to compute a bound on the magnitude of the bias from a particular source, even when we cannot compute the actual bias. Although they are rarely used, such bounds can give a reasonable indication of the likely magnitude of the error from some sources. This is the most constructive use of bias information in evaluating accuracy. In contrast, the often-repeated statements that a given estimator is or is not asymptotically unbiased are of little practical use.
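The first point can be illustrated by a small Monte Carlo sketch (distributions and sample counts invented): an estimator that discards 49 of 50 available measurements is just as unbiased as the sample mean, yet its scatter is about seven times larger:

```python
import numpy as np

# Two unbiased estimators of the same parameter, one with large scatter.
rng = np.random.default_rng(1)
theta = 5.0
trials = 20000
z = theta + rng.standard_normal((trials, 50))   # 50 noisy measurements per trial

est_mean = z.mean(axis=1)          # sample-mean estimator
est_one = z[:, 0]                  # uses a single measurement

bias_mean = est_mean.mean() - theta
bias_one = est_one.mean() - theta
print(bias_mean, bias_one)                     # both biases near zero
print(est_mean.std(), est_one.std())           # scatter differs by about 7x
```

Near-zero bias says nothing about the random scatter, which is the point of the objection above.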
Most of the estimators considered in this document are asymptotically unbiased when the assumptions used in the derivation are true. The statement that other estimators are biased under the same conditions amounts to a restatement of the universal principle that estimators are biased in the presence of modeling error. Thus arguments about which of two estimators is biased are silly. These arguments reduce to the issue of what assumptions to use, an issue best addressed directly.

Although quantitative measures of bias may not be available, the analyst should always consider the issue of bias due to modeling error. Bias errors are added to all other types of error in the estimates. Unfortunately, some bias errors are impossible to detect solely by analyzing the data. The estimates can be repeatable with little scatter and appear to be accurate by all other measures, and still have large bias errors. An example of this type of problem is a calibration error in a nonredundant instrument. The only way to avoid such problems is to be meticulous in executing and documenting every step of the application, including modeling, instrumentation, and data handling. No automatic tests exist that adequately substitute for such care.

11.3.2 Scatter

When there are several experiments at the same condition, the scatter of the estimates is an indication of accuracy. We can also evaluate scatter about a smooth fairing of the estimates in a series of experiments with gradually changing conditions. This approach assumes that the parameters change smoothly as a function of experimental condition.

The scatter has a significant advantage over many of the theoretical measures of accuracy discussed above. The scatter measures the actual performance that most of the theoretical measures are trying to predict. Therefore the scatter includes several effects, such as random errors in measuring the experiment conditions, that are ignored in the theoretical predictions. You can gain the most information, of course, by considering both the observed scatter and the theoretical predictions.

An inherent weakness in the use of scatter as a gauge of accuracy is that several data points are required to define it. Depending on the application, this objection can range from inconsequential to insurmountable. A related problem is that the scatter does not show the accuracy of individual points, some of which may be better than others. For instance, if only two conflicting data points are available, the scatter gives no hint as to which is more reliable. Figure (11.3-1) shows estimates of the parameter Cnp obtained from flight data of a PA-30 aircraft. The scatter is large, showing estimates of both signs. Figure (11.3-2) shows the same data segregated into rudder and aileron maneuvers.
In this case, the scatter makes it evident that the aileron maneuvers result in far more consistent estimates of Cnp than do the rudder maneuvers. Had there been only one or two aileron and one or two rudder maneuvers available, there would have been no way to deduce from the scatter that the aileron maneuvers were superior for estimating this parameter.

The scatter shares a weakness with most of the theoretical accuracy measures in that it does not account for consistent errors (i.e., biases). Many occurrences can result in small scatter about an incorrect value. The scatter, therefore, should be regarded as a lower bound. The estimates can be worse than is indicated by the scatter, but are seldom better.

Maine and Iliff (1981b) discuss well-documented situations in which the scatter is significantly larger than the Cramer-Rao bounds. In all such cases, we regard the scatter as a more realistic measure of the magnitude of the errors. The Cramer-Rao bound is still a reasonable means of determining which individual experiments are most accurate, but may not give a reasonable magnitude of the error. In spite of its problems, the scatter is an easily used tool for evaluating accuracy, and it should always be examined when sufficient data points are available to define it.

11.3.3 Engineering Judgment

Engineering judgment is the oldest measure of estimate reliability. Even with the theoretical accuracy measures now available, the need for judgment remains; the theoretical measures are merely tools which supply more information on which to base the judgment. By definition, the process of applying engineering judgment cannot be described precisely and quantitatively, or there would be no judgment involved. Algorithms can be devised to search for specific problems, but the engineer still needs to make a final unautomated judgment. Therefore, this section will simply list some of the factors most often considered in making a judgment.

One of the most basic factors in judging the accuracy of the estimates is the anticipated accuracy. The engineer usually has a priori knowledge of how accurately one can reasonably expect to be able to estimate the parameters. This knowledge can be based on previous experience, awareness of the relative importance and linear dependence of the parameters, and the quality of experimental data obtained.

Another basic criterion is the reasonability of the estimated parameter values. Before analysis is begun, we usually know the approximate range of values of the parameters. Drastic deviations from this range are reason to suspect the estimates unless we discover the reason for the poor prediction or we independently verify the suspect value.

We have previously mentioned the role of engineering judgment in evaluating model adequacy. The engineer must look for violations of specific assumptions made in deriving the model, and for unexplained problems that may indicate modeling errors. Both the estimator and the theoretical measures of accuracy can be invalidated by modeling errors. The magnitude of the modeling-error effects must be judged.

The engineer judges the quality of the fit of the measured and estimated time histories. The characteristics of this fit can give indications of many problems. Many modeling error problems first become apparent as poor time-history fits. Failed sensors and data processing errors or omissions are among the other classes of problems which can be deduced from the fits.

Finally, engineering judgment is used to assemble and weigh all of the available information about the estimates. You must combine the judgmental factors with information from the theoretical tools in order to give a final best estimate of the parameters and of their accuracies.

11.4 MODEL STRUCTURE DETERMINATION

In the previous sections, we have largely assumed that the assumed model form is correct. This is never strictly true in practice. Therefore, we must always consider the possible effects of modeling error as a special issue. The tools discussed in Section 11.3 can help in the evaluation of these effects.

In this section, we specifically examine the question of determining the best model structure for parameter estimation. One approach to minimizing the effects of model structure errors is to use a model structure which is close to that of the true system. There are, however, definite limits to this principle. The limitations arise both in how accurate you can make the model and in how accurate you should make it.

In the field of simulation, it is almost axiomatic that the simulation fidelity improves as more detail is added to the model. Practical considerations of cost and the degree of required fidelity dictate the level of detail included in the model. Simulation and system identification are closely related fields, and we might expect that such a basic principle would be common to both. Contrary to this expectation, system identification sometimes obtains better results from a simple than from a detailed model. The use of too detailed a model is probably one of the most common sources of difficulty in the practical application of system identification.

The problems that arise from too detailed a model are best illustrated by a simple example. Presume that Figure (11.4-1) shows experimental data from a system with a scalar input U and a scalar output Z. The line in the figure is the best linear fit to the data. This line appears to be a reasonable representation of the system. To investigate possible nonlinear effects, consider the case of polynomial models. It is obvious that the error between the model output and the experimental data will become smaller as the order of the model increases. High-order polynomials include lower-order polynomials as specific cases (we have no requirement that the high-order coefficient be nonzero), so the best second-order fit is at least as good as the best linear fit, and so forth. When the order of the polynomial becomes one less than the number of data points, the model will exactly match the experimental data (unless input values were repeated). Figure (11.4-2) shows such a perfect match of the data from Figure (11.4-1). Although the data points are
matched perfectly, the curve oscillates wildly. The simple linear fit of Figure (11.4-1) is probably a much better representation of the system, even though the model of Figure (11.4-2) is more detailed. We could say that the model of Figure (11.4-2) is fitting the noise instead of the true response.

Essentially, as the model complexity increases, and more unknown parameters are estimated, the problem approaches the black-box system-identification problem where there are no assumptions about the model form. We have previously shown that the pure black-box problem is insoluble. One can deduce only a finite amount of information about the system from a finite amount of experimental data. The engineer provides, in the form of an assumed model structure, the rest of the information required to solve the system-identification problem. As the assumed model structure becomes more general, it provides less information, and thus more of the information must be deduced from the experimental data. Eventually, one reaches a point where the information available is insufficient; the estimation algorithms then perform poorly, giving ridiculous results.

The Cramer-Rao bound gives a statistical basis for estimating whether the experimental data contain sufficient information to reliably estimate the parameters in a model. This and related statistics can be used to determine the number and selection of terms to include in the model (Klein and Batterson, 1983; Gupta, Hall, and Trankle, 1978; and Trankle, Vincent, and Franklin, 1982). The basic principle is to include in the model only those terms that can be accurately estimated from the available experimental data. This process,
known as model structure determination, is described in further detail in the cited references. We will restrict our discussion to the general nature and applicability of model structure determination.
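The polynomial example above can be reproduced in miniature. Here the "experimental data" are a straight line plus small alternating disturbances standing in for noise (all values illustrative); the polynomial of order one less than the number of points matches the data essentially exactly, yet wanders far from the underlying line between the points:

```python
import numpy as np

# Ten data points from a linear system; alternating +/- 0.2 disturbances
# stand in for random noise so the result is deterministic.
u = np.linspace(-1.0, 1.0, 10)
true_line = 1.0 + 2.0 * u
z = true_line + 0.2 * (-1.0) ** np.arange(10)

lin = np.polyfit(u, z, 1)              # best linear fit
high = np.polyfit(u, z, u.size - 1)    # order 9: one less than the data count

u_fine = np.linspace(-1.0, 1.0, 1000)
lin_dev = np.abs(np.polyval(lin, u_fine) - (1.0 + 2.0 * u_fine)).max()
high_dev = np.abs(np.polyval(high, u_fine) - (1.0 + 2.0 * u_fine)).max()
resid_high = np.abs(np.polyval(high, u) - z).max()

print(resid_high)          # essentially zero: a "perfect" match at the data
print(lin_dev, high_dev)   # the order-9 curve strays far from the true line
```

The high-order model fits the disturbances rather than the true response: its residual at the data points is negligible, but its worst deviation from the underlying line is several times that of the simple linear fit.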

Automatic model structure determination is often viewed as a panacea that eliminates the necessity for model selection to be based on engineering judgment and knowledge of the phenomenology of the system. Since we have repeatedly emphasized that pure black-box system identification is impossible, such claims for automatic model determination must be viewed with suspicion.

There is a basic fallacy in the argument that automatic model structure determination can replace engineering judgment in selecting a model. The model structure determination algorithms are not creative; they can only test candidate models suggested by the engineer. In fact, the model structure determination algorithms are a type of parameter estimation in disguise, in which the parameter is an index indicating which model is to be used. In a way, model structure determination is easier than most parameter estimation. At each stage, there are only two possible values for a term, zero or nonzero; whereas most parameter estimation demands that a specific value be picked from the entire real line. This task does not approach the scope of the black-box system-identification problem in which the number of possible models is a high order of infinity.

Engineering judgment is still needed, therefore, to select the types of candidate models to be tested. If the candidate models are not appropriate, the results will be questionable. The very best that could be expected from an automatic algorithm in this circumstance would be rejection of all of the candidates (and not all automatic tests have even that much capability). No automatic algorithm can suggest creative improvements that it has not been specifically programmed for. Consider a system with an actual output of Z = sin(U).
Assume that a polynomial model has been selected by the engineer, and automatic structure determination has been used to determine what order polynomial to use. The task is hopeless in this form. The data can be fit arbitrarily well with a polynomial of a high enough order, but the polynomial form does not describe the essence of the system. In particular, the finite polynomial will not be valid for extrapolating system performance outside of the range of the experimental data.

In the above system, consider three ranges of U-values: |U| < 0.1, |U| < 1.0, and |U| < 10.0. In the range |U| < 0.1, the linear polynomial Z = U is a close approximation, as shown in Figure (11.4-3). The extrapolation of this approximation to the range |U| < 1.0 introduces noticeable errors, as shown in Figure (11.4-4). Over this range, the approximation Z = U - U³/6 is reasonable. If we expand our view to the range |U| < 10.0, as in Figure (11.4-5), then neither the linear nor the third-order polynomial is at all representative of the sine function. It would require at least a seventh-order polynomial to match even the gross characteristics of the sine function over this range; a good match would require a still higher order.
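A short check of the three ranges discussed above (the evaluation grid is arbitrary):

```python
import numpy as np

# Maximum error of each approximation to Z = sin(U) over a given range
def max_err(approx, lo, hi):
    U = np.linspace(lo, hi, 2001)
    return np.abs(approx(U) - np.sin(U)).max()

lin = lambda U: U
cubic = lambda U: U - U**3 / 6.0

print(max_err(lin, -0.1, 0.1))     # tiny: Z = U is close for |U| < 0.1
print(max_err(lin, -1.0, 1.0))     # noticeable error when extrapolated
print(max_err(cubic, -1.0, 1.0))   # Z = U - U^3/6 is reasonable for |U| < 1
print(max_err(cubic, -10.0, 10.0)) # neither low-order form works for |U| < 10
```

The worst-case error grows from about 2e-4 for the linear fit on the smallest range to more than 150 for the cubic on the largest, matching the qualitative picture in the figures.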

Another problem with automatic model-structure determination is that it gives only a statistical estimate. Like all estimates, it is imperfect. If no better information is available, it is appropriate to use automatic model structure determination as the best guess. If, however, facts about the model structure are deducible from the physics of the system, it is silly to throw away known facts and use imperfect estimates. (This is one of the most basic principles in the entire field of system identification, not just in model structure determination: if a fact is known, use it and save the estimation theory for cases in which it is needed.)

The most basic problem with automatic model structure determination lies in the statement of the problem. The very term "model structure determination" is misleading, because there is seldom a correct model to determine. Even when there is a correct model, it may be far too complicated for practical purposes. The real model structure determination problem is not to determine some nonexistent "correct" model structure, but to determine an adequate model structure. We discussed the idea of adequate models in Section 1.4; the idea of an adequate model structure is an intimate part of the idea of an adequate model.

This basic issue is addressed briefly, if at all, in most of the literature on model structure determination. Many papers generate simulated data with a specified model, and then demonstrate that a proposed model structure determination algorithm can determine the correct model. This approach has little to do with the real issue in model structure determination.
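A minimal sketch of Cramer-Rao-based term selection in the spirit of the cited methods (a simplified stand-in with invented data, not the algorithm of any particular reference): a cubic candidate model is fit to data from a truly linear system, and only terms whose coefficients stand well above their bounds are retained:

```python
import numpy as np

# Idealized sketch: the true system is linear, the candidate model is cubic.
# Noise is omitted so the result is exact; sigma is the assumed noise level
# for which the Cramer-Rao bounds are computed.
u = np.linspace(-1.0, 1.0, 200)
z = 0.5 + 2.0 * u
sigma = 0.1

A = np.column_stack([u**k for k in range(4)])      # candidate terms 1, u, u^2, u^3
coef, *_ = np.linalg.lstsq(A, z, rcond=None)
bounds = sigma * np.sqrt(np.diag(np.linalg.inv(A.T @ A)))

keep = np.abs(coef) > 2.0 * bounds   # retain only accurately estimable terms
print(keep)   # [ True  True False False]: the quadratic and cubic terms drop out
```

The decision at each term is binary, keep or drop, which is why the text describes model structure determination as parameter estimation in disguise; the 2-sigma threshold here is one common but arbitrary choice.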
The previous paragraphs have emphasized the numerous problems of automatic model structure determination. That these problems exist does not mean that automatic model-structure determination is worthless, only that the mindless application of it is dangerous. Automatic model structure determination can be a valuable tool when used with an appreciation of its limitations. Most good model structure determination programs allow the engineer to override the statistical decision and force specific terms to be included or omitted. This approach makes good use of both the theory and the judgment, so that the theory is used as a tool to aid the judgment and to warn against some types of poor judgment, but the end responsibility lies with the engineer.

11.5 EXPERIMENT DESIGN

The previous discussion has, for the most part, assumed that a specific set of experimental data has already been gathered. In some cases, this is a valid assumption. In other cases, the opportunity is available to specify the experiments to be performed and the measurements to be taken. This section gives a brief overview of the subject of designing experiments for parameter identification. We leave detailed discussion to works cited in the references.

Methods for experiment design fall into two major categories. The first category is that of methods based on numerical optimization. Such methods choose an input, subject to appropriate constraints, which minimizes the Cramer-Rao bound or some related error estimate. Goodwin (1982) and Plaetschke and Schulz (1979) give theoretical and practical details of some optimization approaches to input design.

Experiment design is often strongly constrained by practical considerations; in the extreme case, the constraints completely specify the input, leaving no latitude for design. In a design based on numerical optimization, the constraints must be expressed mathematically. The derivation of such expressions is sometimes straightforward, as when a control device is limited by a physical stop at a specific position. In other cases, the constraints involve issues such as safety that are difficult to quantify as precise limits.

Slight changes in the form of the constraints can change the entire character of the theoretical optimum input. Because the constraints are one of the major influences in the experiment design, adopting simplified constraint forms solely because they are easy to analyze is often inadvisable. In particular, "soft" constraints in the form of a cost penalty proportional to the square of the input are almost never accurate representations of practical constraints.

Most practical experiment design falls into the second major category, methods based more on heuristic design than on formal optimization of a cost function. Such designs draw heavily on the engineer's understanding of the system. There are several widely applicable rules of thumb to help heuristic experiment design; some of them consider issues such as frequency content, modal excitation, and independence. Plaetschke and Schulz (1979) describe some of these rules, and evaluate inputs based on them.
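As a toy illustration of the optimization-based category (a sketch of mine, not an algorithm from the text): for a scalar discrete-time system x(k+1) = a*x(k) + u(k) with noisy measurements z(k) = x(k) + v(k), the Fisher information for the parameter a under a given input is the summed squared sensitivity of the response, and its inverse square root is the Cramer-Rao bound on the standard deviation. Comparing two equal-amplitude candidate inputs shows how strongly the input shape affects the achievable accuracy:

```python
import numpy as np

def fisher_info_a(u, a=0.9, sigma=0.1):
    """Fisher information for parameter a of x(k+1) = a*x(k) + u(k),
    z(k) = x(k) + v(k), v ~ N(0, sigma^2), starting from x(0) = 0."""
    x, s, info = 0.0, 0.0, 0.0       # state, sensitivity dx/da, information
    for uk in u:
        info += (s / sigma) ** 2     # each measurement adds (dz/da)^2/sigma^2
        x_new = a * x + uk
        s = a * s + x                # sensitivity propagates with the state
        x = x_new
    return info

N = 50
step = np.ones(N)                                            # sustained unit step
doublet = np.r_[np.ones(5), -np.ones(5), np.zeros(N - 10)]   # brief doublet

for name, u in (("step", step), ("doublet", doublet)):
    info = fisher_info_a(u)
    print(f"{name:7s}: information = {info:10.1f}, CR bound = {info ** -0.5:.2e}")
```

Here the step wins because it builds a large steady-state response to which the pole a is highly sensitive; for other parameters or other constraints a different input wins, which is exactly why the constrained optimization is nontrivial.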

[Figures for Chapter 11; plots not reproduced. Captions:]

Figure (11.1-1). Construction of R.
Figure (11.1-2). Construction of one-dimensional confidence ellipsoid.
Figure (11.1-3). Construction of two-dimensional confidence ellipsoid.
Figure (11.2-1). Geometric interpretation of insensitivity.
Figure (11.2-2). Correlation and sensitivity.
Figure (11.2-3). High correlation and high sensitivity.
Figure (11.2-4). Low correlation and low sensitivity.
Figure (11.2-5). Cramer-Rao bounds and insensitivities.
Figure (11.3-1). Estimates of Cnp.
Figure (11.3-2). Estimates of Cnp, segregated by input (rudder maneuvers and aileron maneuvers; flight data with Cramer-Rao bounds, plotted against time).
Figure (11.4-1). Best linear fit of noise data.
Figure (11.4-2). Exact polynomial match of noise data.
Figure (11.4-3). Z = sin(U) in the range |U| < 0.1.
Figure (11.4-4). Z = sin(U) in the range |U| < 1.0.
Figure (11.4-5). Z = sin(U) in the range |U| < 10.0.
CHAPTER 12

In this document, we have presented the theoretical background of statistical estimators for dynamic systems, with particular emphasis on maximum-likelihood estimators. An understanding of this theoretical background is crucial to the practical application of the estimators; the analyst needs to know the capabilities and limitations of the estimators. There are several examples of artificially complicated problems that succumb to simple approaches, and seemingly trivial questions that have no answers.

A thorough understanding of the system being analyzed is necessary to complement this theoretical background. No amount of theoretical sophistication can compensate for the lack of such understanding. The entire theory rests on the basis of the assumptions made about the system characteristics. The theory can give only limited help in validating or refuting such assumptions.

Errors and unexpected difficulties are inevitable in any substantial parameter estimation project. The eventual success of the project hinges on the analyst's ability to recognize unreasonable results and diagnose their causes. This ability, in turn, requires an understanding of both estimation theory and the system being analyzed. Problems can range from obvious instrumentation failures to subtle modeling inconsistencies and identifiability problems.

Probably the most difficult part of parameter estimation is to straddle the fine line between models too simple to adequately represent the system and models too complicated to be identifiable. There is no conservative position on this issue; excesses in either direction can be fatal. The solution is typically iterative, using diagnostic skills to detect problems and make improvements until an adequate result is obtained. The problem is exacerbated by there being no correct answer.

Neither is there a single correct method to solve parameter estimation problems. Although we have castigated some practices as demonstrably poor, we make no attempt to establish as dogma any particular method. The material of this document is intended more as a set of tools for parameter estimation problems. The selection of the best tools for a particular task is influenced by factors other than the purely theoretical. Better results often come from a crude, but adequate, method that the analyst thoroughly understands than from a sophisticated, but unfamiliar, method. We recommend the attitude expressed by Gauss (1809, p. 108):

   It is always profitable to approach the more difficult problems in several
   ways, and not to despise the good although preferring the better.

APPENDIX A

A.0 MATRIX RESULTS

This appendix presents several matrix results used in the body of the book. The derivations are mostly exercises in simple matrix algebra. Various of these results are given in numerous other documents; Goodwin and Payne (1977, appendix E) present most of them.

A.1 MATRIX INVERSION LEMMAS

Consider a square, nonsingular matrix A, partitioned as

   A = [ A11  A12 ]
       [ A21  A22 ]                                                      (A.1-1)

where A11 and A22 are square. Define the inverse of A to be Γ, similarly partitioned as

   Γ = [ Γ11  Γ12 ]
       [ Γ21  Γ22 ]                                                      (A.1-2)

where Γ11 is the same size as A11. We want to express the partitions Γij in terms of the Aij. To derive such expressions, we need to assume that either A11 or A22 is invertible; if both are singular, there is no useful form. Consider first the case where A11 is invertible.

Lemma A.1-1  Given A and Γ partitioned as in Equations (A.1-1) and (A.1-2), assume that A and A11 are invertible. Then (A22 - A21 A11^-1 A12) is invertible and the partitions of Γ are given by

   Γ11 = A11^-1 + A11^-1 A12 (A22 - A21 A11^-1 A12)^-1 A21 A11^-1        (A.1-3)
   Γ12 = -A11^-1 A12 (A22 - A21 A11^-1 A12)^-1                           (A.1-4)
   Γ21 = -(A22 - A21 A11^-1 A12)^-1 A21 A11^-1                           (A.1-5)
   Γ22 = (A22 - A21 A11^-1 A12)^-1                                       (A.1-6)

Proof  The condition AΓ = I gives the four equations

   A11 Γ12 + A12 Γ22 = 0                                                 (A.1-7)
   A11 Γ11 + A12 Γ21 = I                                                 (A.1-8)
   A21 Γ11 + A22 Γ21 = 0                                                 (A.1-9)
   A21 Γ12 + A22 Γ22 = I                                                 (A.1-10)

and the condition ΓA = I gives the four equations

   Γ11 A11 + Γ12 A21 = I                                                 (A.1-11)
   Γ21 A11 + Γ22 A21 = 0                                                 (A.1-12)
   Γ11 A12 + Γ12 A22 = 0                                                 (A.1-13)
   Γ21 A12 + Γ22 A22 = I                                                 (A.1-14)

Equations (A.1-7) and (A.1-12), respectively, give

   Γ12 = -A11^-1 A12 Γ22                                                 (A.1-15)
   Γ21 = -Γ22 A21 A11^-1                                                 (A.1-16)

Substitute Equation (A.1-15) into Equation (A.1-10) and substitute Equation (A.1-16) into Equation (A.1-14) to get

   (A22 - A21 A11^-1 A12) Γ22 = I                                        (A.1-17)
   Γ22 (A22 - A21 A11^-1 A12) = I                                        (A.1-18)

By the assumption of invertibility of A, the Γij exist and satisfy Equations (A.1-7) to (A.1-14). The assumed invertibility of A11 then assures, through the above substitutions, that Γ22 satisfies Equations (A.1-17) and (A.1-18). Therefore (A22 - A21 A11^-1 A12) is invertible and Γ22 is given by Equation (A.1-6). Substituting Equation (A.1-6) into Equations (A.1-15) and (A.1-16) gives Equations (A.1-4) and (A.1-5). Finally, substituting Equation (A.1-5) into Equation (A.1-8) and solving for Γ11 gives Equation (A.1-3), completing the proof.

The case where A22 is nonsingular is simply a permutation of the same lemma.

Lemma A.1-2  Given A and Γ partitioned as in Equations (A.1-1) and (A.1-2), assume that A and A22 are invertible. Then (A11 - A12 A22^-1 A21) is invertible and the partitions of Γ are given by

   Γ11 = (A11 - A12 A22^-1 A21)^-1                                       (A.1-19)
   Γ12 = -(A11 - A12 A22^-1 A21)^-1 A12 A22^-1                           (A.1-20)
   Γ21 = -A22^-1 A21 (A11 - A12 A22^-1 A21)^-1                           (A.1-21)
   Γ22 = A22^-1 + A22^-1 A21 (A11 - A12 A22^-1 A21)^-1 A12 A22^-1        (A.1-22)

Proof  Define a reordered matrix

   A' = [ A22  A21 ]
        [ A12  A11 ]

The inverse of A' is given by the corresponding reordering of Γ. Then apply the previous lemma to A' and Γ'.

When both A11 and A22 are invertible, we can combine the above lemmas to obtain two other useful results.

Lemma A.1-3  Assume that two matrices A and C are invertible. Further assume that one of the expressions (A - B C^-1 D) or (C - D A^-1 B) is invertible. Then the other expression is also invertible and

   (A - B C^-1 D)^-1 = A^-1 + A^-1 B (C - D A^-1 B)^-1 D A^-1            (A.1-23)

Proof  Define A11 = A, A12 = B, A21 = D, and A22 = C. In order to apply Lemmas (A.1-1) and (A.1-2), we first need to show that A as defined by Equation (A.1-1) is invertible. If (C - D A^-1 B) is invertible, then the Γij defined by Equations (A.1-3) to (A.1-6) satisfy Equations (A.1-7) to (A.1-14). Therefore A is invertible. Lemma (A.1-2) then gives the invertibility of (A - B C^-1 D), which is one of the desired results.

Conversely, if we assume that (A - B C^-1 D) is invertible, then the Γij defined by Equations (A.1-19) to (A.1-22) satisfy Equations (A.1-7) to (A.1-14). Therefore A is invertible and Lemma (A.1-1) gives the invertibility of the expression (C - D A^-1 B).

Thus the invertibility of either expression implies invertibility of the other and of A. We can now apply both Lemmas (A.1-1) and (A.1-2). Equating the expressions for Γ11 given by Equations (A.1-3) and (A.1-19), and putting the result in terms of A, B, C, and D, gives Equation (A.1-23), completing the proof.

Lemma A.1-4  Given A, B, C, and D as in Lemma (A.1-3), with the same invertibility assumptions,

   A^-1 B (C - D A^-1 B)^-1 = (A - B C^-1 D)^-1 B C^-1                   (A.1-24)

Proof  The proof is identical to that of Lemma (A.1-3), except that we equate the expressions for Γ12 given by Equations (A.1-4) and (A.1-20), giving Equation (A.1-24) as a result.
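The two combined lemmas are easy to spot-check numerically. This sketch (mine, using NumPy; the matrices are random but diagonally shifted so the required inverses exist) verifies Equations (A.1-23) and (A.1-24) for nonsquare B and D:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 3
A = rng.standard_normal((n, n)) + 5.0 * np.eye(n)   # shifted to be safely invertible
C = rng.standard_normal((m, m)) + 5.0 * np.eye(m)
B = rng.standard_normal((n, m))
D = rng.standard_normal((m, n))
Ai, Ci = np.linalg.inv(A), np.linalg.inv(C)

# Lemma A.1-3, Equation (A.1-23):
lhs3 = np.linalg.inv(A - B @ Ci @ D)
rhs3 = Ai + Ai @ B @ np.linalg.inv(C - D @ Ai @ B) @ D @ Ai

# Lemma A.1-4, Equation (A.1-24):
lhs4 = Ai @ B @ np.linalg.inv(C - D @ Ai @ B)
rhs4 = np.linalg.inv(A - B @ Ci @ D) @ B @ Ci

print(np.allclose(lhs3, rhs3), np.allclose(lhs4, rhs4))
```

Note that when C is much smaller than A, the right sides require inverting only the small m-by-m matrix (C - D A^-1 B), which is the practical appeal of the lemma.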

A.2 MATRIX DIFFERENTIATION

For several of the following results, it is convenient to define the derivative of a scalar with respect to a matrix. If f is a scalar function of the matrix A, we define df/dA to be a matrix with elements equal to the derivatives of f with respect to corresponding elements of A:

   (df/dA)(i,j) = df/dA(i,j)                                             (A.2-1)

Two simple relations involving the trace function are useful in manipulating the matrix and vector quantities we work with.

Result A.2-1  If x and y are two vectors of the same length, then

   x*y = tr(yx*)                                                         (A.2-2)

Proof  Both sides expand to Σ_i x(i)y(i).

Result A.2-2  If A and B are two matrices of the same size, then

   tr(A*B) = Σ_i,j A(i,j)B(i,j)                                          (A.2-3)

Proof  Expand the right side, element by element.

Both of these results are special cases of the same relationship between inner products and outer products. The following result is a particular application of Result (A.2-2).

Result A.2-3  If f(A) is a scalar function of the matrix A, and A is a function of the scalar x, then

   df/dx = tr[(df/dA)*(dA/dx)]                                           (A.2-4)

Proof  Use the chain rule with the individual elements of A to write

   df/dx = Σ_i,j [df/dA(i,j)][dA(i,j)/dx]                                (A.2-5)

Equation (A.2-4) then follows from Result (A.2-2) and the definition given by Equation (A.2-1).

Result A.2-4  If the matrix A is a function of x, then

   (d/dx)A^-1 = -A^-1 (dA/dx) A^-1                                       (A.2-6)

wherever A is invertible.

Proof  By the definition of the inverse,

   A A^-1 = I                                                            (A.2-7)

Take the derivative, using the chain rule:

   (dA/dx) A^-1 + A (d/dx)A^-1 = 0                                       (A.2-8)

   A (d/dx)A^-1 = -(dA/dx) A^-1                                          (A.2-9)

Solving for (d/dx)A^-1 gives Equation (A.2-6), as desired.

Result A.2-5  If A is invertible, and x and y are vectors, then

   (d/dA)(x* A^-1 y) = -(A^-1 y x* A^-1)*                                (A.2-10)

Proof  Use Result (A.2-4) to get

   [d/dA(i,j)](x* A^-1 y) = -x* A^-1 [dA/dA(i,j)] A^-1 y                 (A.2-11)

Now

   dA/dA(i,j) = e_i e_j*                                                 (A.2-12)

where e_i is a vector with zeros in all but the ith element, which is 1. Therefore

   [d/dA(i,j)](x* A^-1 y) = -(x* A^-1)(i) (A^-1 y)(j)                    (A.2-13)

which is the (i,j) element of -(A^-1 y x* A^-1)*. The definition of the matrix derivative then gives Equation (A.2-10), as desired.

Result A.2-6  If A is invertible, then

   (d/dA) ln|A| = (A^-1)*                                                (A.2-14)

Proof  Expanding the determinant by cofactors of the ith row gives

   |A| = Σ_k A(i,k)(adj A)(k,i)                                          (A.2-15)

Taking the derivative with respect to A(i,j) gives

   d|A|/dA(i,j) = (adj A)(j,i)                                           (A.2-16)

because (adj A)(k,i) does not depend on A(i,j). Using Equation (A.2-16) and the expression for a matrix inverse in terms of the matrix of cofactors, we get

   d ln|A|/dA(i,j) = (adj A)(j,i)/|A| = (A^-1)(j,i)                      (A.2-17)

Equation (A.2-14) then follows, as desired, from the definition of the derivative with respect to a matrix.
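Both differentiation results check out against central finite differences. The following sketch (mine, NumPy-based, not part of the original appendix; * is read as transpose for real matrices) verifies Equations (A.2-6) and (A.2-14):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A0 = rng.standard_normal((n, n)) + 4.0 * np.eye(n)   # shifted to be safely invertible
eps = 1e-6

# Result A.2-4: with A(x) = A0 + x*dA, d/dx A(x)^-1 at x = 0 is -A0^-1 dA A0^-1
dA = rng.standard_normal((n, n))
numeric = (np.linalg.inv(A0 + eps * dA) - np.linalg.inv(A0 - eps * dA)) / (2 * eps)
analytic = -np.linalg.inv(A0) @ dA @ np.linalg.inv(A0)

# Result A.2-6: d ln|A| / dA = (A^-1)*, checked element by element
def logdet(M):
    return np.log(abs(np.linalg.det(M)))

grad = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = 1.0
        grad[i, j] = (logdet(A0 + eps * E) - logdet(A0 - eps * E)) / (2 * eps)

print(np.max(np.abs(numeric - analytic)), np.max(np.abs(grad - np.linalg.inv(A0).T)))
```

Equation (A.2-14) is the workhorse behind the determinant term of the log-likelihood gradients used in the body of the book, which is why it is singled out here.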

REFERENCES

Acton, Forman S.: Numerical Methods That Work. Harper & Row, New York, 1970.
Akaike, Hirotugu: A New Look at Statistical Model Identification. IEEE Trans. Automat. Contr., Vol. AC-19, No. 6, pp. 716-723, 1974.
Aoki, Masanao: Optimization of Stochastic Systems. Academic Press, New York, 1967.
Apostol, Tom M.: Calculus: Volume II. Xerox College Publishing, Waltham, Mass., 2nd ed., 1969.
Ash, Robert B.: Basic Probability Theory. John Wiley & Sons, Inc., New York, 1970.
Astrom, Karl J.: Introduction to Stochastic Control Theory. Academic Press, New York, 1970.
Astrom, Karl J. and Eykhoff, P.: System Identification - A Survey. Automatica, Vol. 7, pp. 123-162, 1970.
Bach, R. E. and Wingrove, R. C.: Applications of State Estimation in Aircraft Flight Data Analysis. AIAA Paper 83-2087, 1983.
Balakrishnan, A. V.: Stochastic Differential Systems I. Filtering and Control - A Function Space Approach. Lecture Notes in Economics and Mathematical Systems, Vol. 84, M. Beckmann, G. Goos, and H. P. Kunzi, eds., Springer-Verlag, Berlin, 1973.
Balakrishnan, A. V.: Stochastic Filtering and Control. Optimization Software, Inc., Los Angeles, 1981.
Balakrishnan, A. V.: Kalman Filtering Theory. Optimization Software, Inc., New York, 1984.
Barnard, G. A.: Thomas Bayes' Essay Toward Solving a Problem in the Doctrine of Chances. Biometrika, Vol. 45, 1958.
Bayes, Thomas: An Introduction to the Doctrine of Fluxions, and a Defence of the Mathematicians Against the Objections of the Author of The Analyst. John Noon, 1736. (See Barnard, 1958.)
Bierman, G. J.: Factorization Methods for Discrete Sequential Estimation. Mathematics in Science and Engineering, Vol. 128, Academic Press, New York, 1977.
Brauer, Fred and Nohel, John A.: Qualitative Theory of Ordinary Differential Equations. W. A. Benjamin, New York, 1969.
Cox, A. B. and Bryson, A. E.: Identification by a Combined Smoothing Nonlinear Programming Algorithm. Automatica, Vol. 16, pp. 689-694, 1980.
Cramer, Harald: Mathematical Methods of Statistics. Princeton University Press, Princeton, N.J., 1946.
Dixon, L. C. W.: Nonlinear Optimization. Crane, Russak & Co., New York, 1972.
Doetsch, K. H.: The Time Vector Method for Stability Investigations. A.R.C. R. & M. 2945, 1953.
Dongarra, J. J.; Moler, C. B.; Bunch, J. R.; and Stewart, G. W.: LINPACK User's Guide. SIAM, Philadelphia, 1979.
Etkin, B.: Dynamics of Atmospheric Flight. John Wiley & Sons, Inc., New York, 1958.
Eykhoff, P.: System Identification, Parameter and State Estimation. John Wiley & Sons, London, 1974.
Ferguson, Thomas S.: Mathematical Statistics: A Decision Theoretic Approach. Academic Press, New York, 1967.
Fisher, R. A.: On the Mathematical Foundations of Theoretical Statistics. Phil. Trans. Roy. Soc. London, Vol. 222, pp. 309-368, 1921.
Fiske, P. H. and Price, C. F.: A New Approach to Model Structure Identification. AIAA Paper 77-1171, 1977.
Flack, Nelson D.: AFFTC Stability and Control Technique. AFFTC-TN-59-21, Edwards, California, 1959.
Foster, G. W.: The Identification of Aircraft Stability and Control Parameters in Turbulence. RAE TR 83025, 1983.
Garbow, B. S.; Boyle, J. M.; Dongarra, J. J.; and Moler, C. B.: Matrix Eigensystem Routines - EISPACK Guide Extension. Springer-Verlag, Berlin, 1977.
Gauss, Karl Friedrich: Theory of the Motion of the Heavenly Bodies Moving About the Sun in Conic Sections. Translated by Charles Henry Davis. Dover Publications, Inc., New York, 1847. Translated from: Theoria Motus, 1809.
Geyser, Lucille C. and Lehtinen, Bruce: Digital Program for Solving the Linear Stochastic Optimal Control and Estimation Problem. NASA TN D-7820, 1975.
Goodwin, Graham C.: An Overview of the System Identification Problem - Experiment Design. Sixth IFAC Symposium on Identification and System Parameter Estimation, Washington, D.C., 1982.
Goodwin, Graham C. and Payne, Robert L.: Dynamic System Identification: Experiment Design and Data Analysis. Academic Press, New York, 1977.
Greenberg, H.: A Survey of Methods for Determining Stability Parameters of an Airplane from Dynamic Flight Measurements. NACA TN-2340, 1951.
Gupta, N. K.; Hall, W. E.; and Trankle, T. L.: Advanced Methods of Model Structure Determination from Test Data. AIAA J. Guidance and Control, Vol. 1, No. 3, 1978.
Gupta, N. K. and Mehra, R. K.: Computational Aspects of Maximum Likelihood Estimation and Reduction in Sensitivity Function Calculations. IEEE Trans. Automat. Contr., Vol. AC-19, No. 6, pp. 774-783, 1974.
Hajdasinski, A. K.; Eykhoff, P.; Damen, A. A. H.; and van den Boom, A. J. W.: The Choice and Use of Different Model Sets for System Identification. Sixth IFAC Symposium on Identification and System Parameter Estimation, Washington, D.C., 1982.
Hodge, Ward F. and Bryant, Wayne H.: Monte Carlo Analysis of Inaccuracies in Estimated Aircraft Parameters Caused by Unmodeled Flight Instrumentation Errors. NASA TN D-7712, 1975.
Jategaonkar, R. and Plaetschke, E.: Maximum Likelihood Parameter Estimation from Flight Test Data for General Nonlinear Systems. DFVLR-FB 83-14, 1983.
Jazwinski, Andrew H.: Stochastic Processes and Filtering Theory. Academic Press, New York, 1970.
Kailath, T. and Ljung, L.: Asymptotic Behavior of Constant-Coefficient Riccati Differential Equations. IEEE Trans. Automat. Contr., Vol. AC-21, pp. 385-388, 1976.
Kalman, R. E. and Bucy, R. S.: New Results in Linear Filtering and Prediction Theory. Trans. ASME, Series D, Journal of Basic Engineering, Vol. 83, pp. 95-107, 1961.
Klein, Vladislav: On the Adequate Model for Aircraft Parameter Estimation. CIT, Cranfield Report Aero No. 28, 1975.
Klein, Vladislav and Batterson, James G.: Determination of Airplane Model Structure from Flight Data Using Splines and Stepwise Regression. NASA TP-2126, 1983.
Kushner, Harold: Introduction to Stochastic Control. Holt, Rinehart and Winston, Inc., New York, 1971.
Levan, N.: Systems and Signals. Optimization Software, Inc., New York, 1983.
Liptser, R. S. and Shiryayev, A. N.: Statistics of Random Processes I: General Theory. Springer-Verlag, New York, 1977.
Luenberger, David G.: Optimization by Vector Space Methods. John Wiley & Sons, New York, 1969.
Luenberger, David G.: Introduction to Linear and Nonlinear Programming. Addison-Wesley, Reading, Mass., 1972.
Maine, Richard E.: Programmer's Manual for MMLE3, A General FORTRAN Program for Maximum Likelihood Parameter Estimation. NASA TP-1690, 1981.
Maine, Richard E. and Iliff, Kenneth W.: User's Manual for MMLE3, A General FORTRAN Program for Maximum Likelihood Parameter Estimation. NASA TP-1563, 1980.
Maine, Richard E. and Iliff, Kenneth W.: Formulation and Implementation of a Practical Algorithm for Parameter Estimation with Process and Measurement Noise. SIAM J. Appl. Math., Vol. 41, pp. 558-579, 1981(a).
Maine, Richard E. and Iliff, Kenneth W.: The Theory and Practice of Estimating the Accuracy of Dynamic Flight-Determined Coefficients. NASA RP-1077, 1981(b).
Meditch, J. S.: Stochastic Optimal Linear Estimation and Control. McGraw-Hill Book Co., New York, 1969.
Mehra, Raman K. and Lainiotis, Dimitri G., eds.: System Identification: Advances and Case Studies. Academic Press, New York, 1976.
Moler, C. B. and Stewart, G. W.: An Algorithm for Generalized Matrix Eigenvalue Problems. SIAM J. of Numerical Analysis, Vol. 10, pp. 241-256, 1973.
Moler, Cleve and Van Loan, Charles: Nineteen Dubious Ways to Compute the Exponential of a Matrix. SIAM Review, Vol. 20, No. 4, pp. 801-836, 1978.
Nering, Evar D.: Linear Algebra and Matrix Theory. John Wiley & Sons, Inc., New York, 2nd ed., 1970.
Paige, Lowell J.; Swift, J. Dean; and Slobko, Thomas A.: Elements of Linear Algebra. Xerox College Publishing, Lexington, Mass., 2nd ed., 1974.
Papoulis, Athanasios: Probability, Random Variables, and Stochastic Processes. McGraw-Hill Book Co., New York, 1965.
Penrose, R.: A Generalized Inverse for Matrices. Proc. Cambridge Phil. Soc., Vol. 51, pp. 406-413, 1955.
Pitman, E. J. G.: Some Basic Theory for Statistical Inference. Chapman and Hall, London, 1979.
Plaetschke, E. and Schulz, G.: Practical Input Signal Design. AGARD Lecture Series No. 104, 1979.
Polak, E.: Computational Methods in Optimization: A Unified Approach. Academic Press, New York, 1971.
Potter, James E.: Matrix Quadratic Solutions. SIAM J. Appl. Math., Vol. 14, pp. 496-501, 1966.
Rampy, John M. and Berry, Donald T.: Determination of Stability Derivatives from Flight Test Data by Means of High Speed Repetitive Operation Analog Matching. FTC-TDR-64-8, Edwards, Calif., 1964.
Rao, S. S.: Optimization, Theory and Applications. Wiley Eastern Limited, New Delhi, 1979.
Royden, H. L.: Real Analysis. The Macmillan Co., London, 1968.
Rudin, Walter: Real and Complex Analysis. McGraw-Hill Book Co., New York, 1974.
Schweppe, Fred C.: Uncertain Dynamic Systems. Prentice-Hall, Inc., Englewood Cliffs, N.J., 1973.
Sorensen, John A.: Analysis of Instrumentation Error Effects on the Identification Accuracy of Aircraft Parameters. NASA CR-112121, 1972.
Sorenson, Harold W.: Parameter Estimation: Principles and Problems. Marcel Dekker, Inc., New York, 1980.
Strang, Gilbert: Linear Algebra and Its Applications. Academic Press, New York, 1980.
Trankle, T. L.; Vincent, J. H.; and Franklin, S. N.: System Identification of Nonlinear Aerodynamic Models. AGARDograph, The Techniques and Technology of Nonlinear Filtering and Kalman Filtering, 1982.
Vaughan, David R.: A Nonrecursive Algebraic Solution for the Discrete Riccati Equation. IEEE Trans. Automat. Contr., Vol. AC-15, pp. 597-599, 1970.
Wiberg, Donald M.: State Space and Linear Systems. McGraw-Hill Book Co., New York, 1971.
Wilkinson, J. H.: The Algebraic Eigenvalue Problem. Clarendon Press, Oxford, 1965.
Wilkinson, J. H. and Reinsch, C.: Handbook for Automatic Computation. Volume II, Linear Algebra, Part 2. Springer-Verlag, New York, 1971.
Wolowicz, Chester H.: Considerations in the Determination of Stability and Control Derivatives and Dynamic Characteristics from Flight Data. AGARD Rep. 549-Part 1, 1966.
Wolowicz, Chester H. and Holleman, Euclid C.: Stability-Derivative Determination from Flight Data. AGARD Report 224, 1958.
Zacks, Shelemyahu: The Theory of Statistical Inference. John Wiley & Sons, New York, 1971.
Zadeh, Lotfi A. and Desoer, Charles A.: Linear System Theory. McGraw-Hill Book Co., New York, 1963.
