
Technical report from Automatic Control at Linköpings universitet

System Identification
Lennart Ljung
Division of Automatic Control
E-mail: [email protected]

29th June 2007

Report no.: LiTH-ISY-R-2809


Accepted for publication in Wiley Encyclopedia of Electrical and
Electronics Engineering

Address:
Department of Electrical Engineering
Linköpings universitet
SE-581 83 Linköping, Sweden

WWW: http://www.control.isy.liu.se

AUTOMATIC CONTROL
REGLERTEKNIK
LINKÖPINGS UNIVERSITET

Technical reports from the Automatic Control group in Linköping are available from
http://www.control.isy.liu.se/publications.
Abstract
This is a survey of System Identification.

Keywords: identification
System Identification
Lennart Ljung
Department of Electrical Engineering, Linköping University
S-581 83 Linköping, Sweden. e-mail [email protected]
April 27, 1997

1 Introduction
The process of going from observed data to a mathematical model is fundamental in science and engineering. In the control area this process has been termed "System Identification" and the objective is then to find dynamical models (difference or differential equations) from observed input and output signals. Its basic features are, however, common with general model building processes in statistics and other sciences.
System Identification covers the problem of building models of systems both when insignificant prior information is available and when the system's properties are known up to a few parameters (physical constants). Accordingly, one talks about black box and gray box models. Among black box models there are familiar linear models such as ARX and ARMAX, and among non-linear black box models we have, e.g., Artificial Neural Networks (ANN).

Figure 1: Results from test flights of the new Swedish aircraft JAS-Gripen, developed by SAAB Military Aircraft AB, Sweden. From above: a) Pitch rate. b) Elevator angle. c) Canard angle. d) Leading edge flap.

1.1 The Problem


The area of system identification begins and ends with real data. Data are required to build and to validate models. The result of the modeling process can be no better than what corresponds to the information contents in the data.
Let us take a look at two data sets:
Example 1 An unstable aircraft. Figure 1 shows some results from test flights of the new Swedish aircraft JAS-Gripen, developed by SAAB Military Aircraft AB, Sweden. The problem is to use the information in these data to determine the dynamical properties of the aircraft for fine-tuning regulators, for simulations, and so on. Of particular interest are the aerodynamical derivatives.

Example 2 Vessel dynamics. Figure 2 shows data from a pulp factory. They are collected from one of the buffer vessels. The problem is to determine the residence time in the vessel. The pulp spends about 48 hours total in the process, and knowing the residence time in the different vessels is important in order to associate various portions of the pulp with the different chemical actions that have taken place in the vessel at different times. (The κ-number is a quality property that in this context can be seen as a marker allowing us to trace the pulp.)

So, the bottom line of these examples is that we have collected input-output data from a process or a plant, and we need to extract information from these to find out (something about) the process's dynamical properties.

1.2 Background and Literature


System Identification has its roots in standard statistical techniques and many of the basic routines have direct interpretations as well known statistical methods such as Least Squares and Maximum Likelihood. The control community took an active part in the development and application of these basic techniques to dynamic systems right after the birth of "modern control theory" in the early 1960's. Maximum likelihood estimation was applied to difference equations (ARMAX models) by [Åström and Bohlin, 1965] and thereafter a wide range of estimation techniques and model parameterizations flourished. By now, the area is well matured with established and well understood techniques. Industrial use and application of the techniques has become standard. See [Ljung, 1995] for a common software package.
The literature on System Identification is extensive. For a practical user oriented introduction we may mention [Ljung and Glad, 1994]. Texts that go deeper into the theory and algorithms include [Ljung, 1987] and [Söderström and Stoica, 1989]. A classical treatment is [Box and Jenkins, 1970].
These books all deal with the "mainstream" approach to system identification, as described in this article. In addition, there is a substantial literature on other approaches, such as "set membership" (compute all those models that reproduce the observed data within a certain given error bound), estimation of models from given frequency response measurements [Schoukens and Pintelon, 1991],

Figure 2: From the pulp factory at Skutskär, Sweden. The pulp flows continuously through the plant via several buffer tanks. From above: a) The κ-number of the pulp flowing into a buffer vessel. b) The κ-number of the pulp coming out from the buffer vessel. c) Flow out from the buffer vessel. d) Level in the buffer vessel.
on-line model estimation [Ljung and Söderström, 1983], non-parametric frequency domain methods [Brillinger, 1981], etc. To follow the development in the field, the IFAC series of Symposia on System Identification (Budapest, Hungary (1991), Copenhagen, Denmark (1994), Fukuoka, Japan (1997)) is also a good source.

1.3 Outline
The system identification procedure is characterized by four basic ingredients:
1. The observed data
2. A set of candidate models
3. A criterion of fit
4. Validation
The problem can be expressed as finding the model in the candidate set that best describes the data, according to the criterion, and then evaluating and validating that model's properties. To do this we need to penetrate a number of things:
1. First, in Section 2 we give a preview of the whole process, as applied to the simplest set of candidate models.
2. Then, at some length, in Sections 3 and 4 we display and discuss the most common sets of candidate models used in system identification. In general terms, a model will be a predictor of the next output y(t) from the process, given past observations Z^{t-1}, and parameterized in terms of a finite-dimensional parameter vector \theta:
\hat{y}(t|\theta) = g(\theta, Z^{t-1})   (1)
3. We then, in Section 5, discuss the criterion of fit for general model sets. This will have the character
V_N(\theta) = \sum_t \|y(t) - \hat{y}(t|\theta)\|^2   (2)
We also discuss how to find the best model (minimize the criterion) and how to assess its properties.
4. In Section 6 we shall describe special methods for linear black-box models. This includes frequency analysis, spectral analysis and so called subspace methods for linear state-space models.
5. We then turn to the practical issues of system identification: how to assure good quality of the data by proper experiment design (Section 7), how to decide upon a good model structure (Section 8), and how to deal with the data (Section 9).

2 Displaying the Basic Ideas: ARX Models and the Linear Least Squares Method
The Model
We shall generally denote the system's input and output at time t by u(t) and y(t), respectively. Perhaps the most basic relationship between the input and output is the linear difference equation
y(t) + a_1 y(t-1) + \dots + a_n y(t-n) = b_1 u(t-1) + \dots + b_m u(t-m)   (3)
We have chosen to represent the system in discrete time, primarily since ob-
served data are always collected by sampling. It is thus more straightforward
to relate observed data to discrete time models. Nothing prevents us how-
ever from working with continuous time models: we shall return to that in
Section 3.4.
In (3) we assume the sampling interval to be one time unit. This is not
essential, but makes notation easier.
A pragmatic and useful way to see (3) is to view it as a way of determining the next output value given previous observations:
y(t) = -a_1 y(t-1) - \dots - a_n y(t-n) + b_1 u(t-1) + \dots + b_m u(t-m)   (4)
For more compact notation we introduce the vectors
\theta = [a_1, \dots, a_n, b_1, \dots, b_m]^T   (5)
\varphi(t) = [-y(t-1), \dots, -y(t-n), u(t-1), \dots, u(t-m)]^T   (6)
With these, (4) can be rewritten as
y(t) = \varphi^T(t)\theta
To emphasize that the calculation of y(t) from past data (4) indeed depends on the parameters in \theta, we shall rather call this calculated value \hat{y}(t|\theta) and write
\hat{y}(t|\theta) = \varphi^T(t)\theta   (7)

The Least Squares Method


Now suppose for a given system that we do not know the values of the parameters in \theta, but that we have recorded inputs and outputs over a time interval 1 \le t \le N:
Z^N = \{u(1), y(1), \dots, u(N), y(N)\}   (8)
An obvious approach is then to select \theta in (3) through (7) so as to fit the calculated values \hat{y}(t|\theta) as well as possible to the measured outputs by the least squares method:
\min_\theta V_N(\theta, Z^N)   (9)
where
V_N(\theta, Z^N) = \frac{1}{N} \sum_{t=1}^{N} (y(t) - \hat{y}(t|\theta))^2 = \frac{1}{N} \sum_{t=1}^{N} (y(t) - \varphi^T(t)\theta)^2   (10)
We shall denote the value of \theta that minimizes (9) by \hat{\theta}_N:
\hat{\theta}_N = \arg\min_\theta V_N(\theta, Z^N)   (11)
("arg min" means the minimizing argument, i.e., that value of \theta which minimizes V_N.)
Since V_N is quadratic in \theta, we can find the minimum value easily by setting the derivative to zero:
0 = \frac{d}{d\theta} V_N(\theta, Z^N) = \frac{2}{N} \sum_{t=1}^{N} \varphi(t)(y(t) - \varphi^T(t)\theta)
which gives
\sum_{t=1}^{N} \varphi(t)y(t) = \sum_{t=1}^{N} \varphi(t)\varphi^T(t)\,\theta   (12)
or
\hat{\theta}_N = \left[ \sum_{t=1}^{N} \varphi(t)\varphi^T(t) \right]^{-1} \sum_{t=1}^{N} \varphi(t)y(t)   (13)
Once the vectors \varphi(t) are defined, the solution can easily be found by modern numerical software, such as MATLAB.
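As a hedged illustration of (5), (6) and (13), the sketch below builds the ARX regression matrix and solves the least-squares problem numerically (NumPy assumed); the data arrays y, u and the orders n, m are placeholders to be supplied by the user, and the first max(n, m) samples are simply skipped rather than zero-padded.

import numpy as np

def arx_regressor(y, u, t, n, m):
    """Regression vector (6) at time t (0-indexed arrays)."""
    past_y = [-y[t - k] for k in range(1, n + 1)]
    past_u = [u[t - k] for k in range(1, m + 1)]
    return np.array(past_y + past_u)

def arx_least_squares(y, u, n, m):
    """Least-squares ARX estimate (13): theta = [a_1..a_n, b_1..b_m]^T."""
    start = max(n, m)                 # need this many past samples
    Phi = np.array([arx_regressor(y, u, t, n, m) for t in range(start, len(y))])
    Y = np.asarray(y)[start:]
    theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return theta

For the first-order model of Example 3 below this corresponds to n = m = 1 and returns the estimates of a and b directly.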
Example 3 First order difference equation
Consider the simple model
y(t) + a y(t-1) = b u(t-1).
This gives us the estimate according to (5), (6) and (13):
\begin{bmatrix} \hat{a}_N \\ \hat{b}_N \end{bmatrix} =
\begin{bmatrix} \sum y^2(t-1) & -\sum y(t-1)u(t-1) \\ -\sum y(t-1)u(t-1) & \sum u^2(t-1) \end{bmatrix}^{-1}
\begin{bmatrix} -\sum y(t)y(t-1) \\ \sum y(t)u(t-1) \end{bmatrix}
All sums are from t = 1 to t = N. A typical convention is to take values outside the measured range to be zero. In this case we would thus take y(0) = 0.

The simple model (3) and the well known least squares method (13) form the archetype of System Identification. Not only that: they also give the most commonly used parametric identification method and are much more versatile than perhaps perceived at first sight. In particular one should realize that (3) can directly be extended to several different inputs (this just calls for a redefinition of \varphi(t) in (6)) and that the inputs and outputs do not have to be the raw measurements. On the contrary, it is often most important to think over the physics of the application and come up with suitable inputs and outputs for (3), formed from the actual measurements.

Example 4 An immersion heater


Consider a process consisting of an immersion heater immersed in a cooling liquid. We measure:
- v(t): The voltage applied to the heater
- r(t): The temperature of the liquid
- y(t): The temperature of the heater coil surface
Suppose we need a model for how y(t) depends on r(t) and v(t). Some simple considerations based on common sense and high school physics ("Semi-physical modeling") reveal the following:
- The change in temperature of the heater coil over one sample is proportional to the electrical power in it (the inflow power) minus the heat loss to the liquid
- The electrical power is proportional to v^2(t)
- The heat loss is proportional to y(t) - r(t)
This suggests the model
y(t) = y(t-1) + \alpha v^2(t-1) - \beta (y(t-1) - r(t-1))
which fits into the form
y(t) + \theta_1 y(t-1) = \theta_2 v^2(t-1) + \theta_3 r(t-1)
This is a two input (v^2 and r) and one output model, and corresponds to choosing
\varphi(t) = [-y(t-1), v^2(t-1), r(t-1)]^T
in (7).
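To connect Example 4 with the least-squares machinery above, here is a minimal sketch (NumPy assumed, signal arrays as placeholders) that forms the regression vector from the transformed measurements, the squared voltage and the liquid temperature, and reuses the same linear least-squares step.

import numpy as np

def heater_regressors(y, v, r):
    """Regression matrix for the semi-physical model of Example 4:
    y(t) + th1*y(t-1) = th2*v(t-1)**2 + th3*r(t-1)."""
    Phi = np.column_stack([-y[:-1],       # -y(t-1)
                           v[:-1] ** 2,   # v^2(t-1): transformed input
                           r[:-1]])       # r(t-1)
    return Phi, y[1:]

# theta = [th1, th2, th3]; the estimate is the same least-squares step as
# for any ARX model:
# Phi, Y = heater_regressors(y, v, r)
# theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)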

Some Statistical Remarks


Model structures, such as (7), that are linear in \theta are known in statistics as linear regressions, and the vector \varphi(t) is called the regression vector (its components are the regressors). "Regress" here alludes to the fact that we try to calculate (or describe) y(t) by "going back" to \varphi(t). Models such as (3), where the regression vector \varphi(t) contains old values of the variable to be explained, y(t), are then partly auto-regressions. For that reason the model structure (3) has the standard name ARX-model (Auto-regression with extra inputs).
There is a rich statistical literature on the properties of the estimate \hat{\theta}_N under varying assumptions. See, e.g., [Draper and Smith, 1981]. So far we have just viewed (9) and (10) as "curve-fitting". In Section 5.2 we shall deal with a more comprehensive statistical discussion, which includes the ARX model as a special case. Some direct calculations will be done in the following subsection.

Model Quality and Experiment Design
Let us consider the simplest special case, that of a Finite Impulse Response (FIR) model. That is obtained from (3) by taking n = 0:
y(t) = b_1 u(t-1) + \dots + b_m u(t-m)   (14)

Suppose that the observed data really have been generated by a similar mechanism
y(t) = b_1^0 u(t-1) + \dots + b_m^0 u(t-m) + e(t)   (15)
where e(t) is a white noise sequence with variance \lambda, but otherwise unknown. (That is, e(t) can be described as a sequence of independent random variables with zero mean values and variances \lambda.) Analogous to (7), we can write this as
y(t) = \varphi^T(t)\theta_0 + e(t)   (16)

We can now replace y(t) in (13) by the above expression, and obtain
\hat{\theta}_N = \left[ \sum_{t=1}^{N} \varphi(t)\varphi^T(t) \right]^{-1} \sum_{t=1}^{N} \varphi(t)y(t)
= \left[ \sum_{t=1}^{N} \varphi(t)\varphi^T(t) \right]^{-1} \left[ \sum_{t=1}^{N} \varphi(t)\varphi^T(t)\,\theta_0 + \sum_{t=1}^{N} \varphi(t)e(t) \right]
or
\tilde{\theta}_N = \hat{\theta}_N - \theta_0 = \left[ \sum_{t=1}^{N} \varphi(t)\varphi^T(t) \right]^{-1} \sum_{t=1}^{N} \varphi(t)e(t)   (17)

Suppose that the input u is independent of the noise e. Then \varphi and e are independent in this expression, so it is easy to see that E\tilde{\theta}_N = 0, since e has zero mean. The estimate is consequently unbiased. Here E denotes mathematical expectation.
We can also form the expectation of \tilde{\theta}_N \tilde{\theta}_N^T, i.e., the covariance matrix of the parameter error. Denote the matrix within brackets by R_N. Take expectation with respect to the white noise e. Then R_N is a deterministic matrix and we have
P_N = E\, \tilde{\theta}_N \tilde{\theta}_N^T = R_N^{-1} \sum_{t,s=1}^{N} \varphi(t)\varphi^T(s)\, E e(t)e(s)\, R_N^{-1} = \lambda R_N^{-1}   (18)
since the double sum collapses to \lambda R_N.


We have thus computed the covariance matrix of the estimate ^N . It is
determined entirely by the input properties and the noise level. Moreover
dene
R" = Nlim 1R (19)
!1 N N

This will be the covariance matrix of the input, i.e. the i ; j -element of R" is
Ruu(i ; j ) = Eu(t + i)u(t + j ).
If the matrix R" is non-singular, we nd that the covariance matrix of the
parameter estimate is approximately (and the approximation improves as
N ! 1)

PN = N R" ;1 (20)

A number of things follow from this. All of them are typical of the general properties to be described in Section 5.2:
- The covariance decays like 1/N, so the parameters approach the limiting value at the rate 1/\sqrt{N}.
- The covariance is proportional to the Noise-To-Signal ratio. That is, it is proportional to the noise variance and inversely proportional to the input power.
- The covariance does not depend on the input's or noise's signal shapes, only on their variance/covariance properties.
- Experiment design, i.e., the selection of the input u, aims at making the matrix \bar{R}^{-1} "as small as possible". Note that the same \bar{R} can be obtained for many different signals u.
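As a hedged numerical illustration of (20), the sketch below (NumPy assumed) compares the Monte Carlo covariance of the FIR least-squares estimate with the asymptotic expression \lambda \bar{R}^{-1}/N for a unit-variance white-noise input, for which \bar{R} = I; the true parameters and noise level are illustrative.

import numpy as np

def fir_estimate(y, u, m):
    """Least-squares FIR estimate for (14): y(t) = b1*u(t-1) + ... + bm*u(t-m)."""
    Phi = np.column_stack([u[m - k:len(u) - k] for k in range(1, m + 1)])
    theta, *_ = np.linalg.lstsq(Phi, y[m:], rcond=None)
    return theta

rng = np.random.default_rng(0)
N, m, lam, b0 = 500, 2, 0.1, np.array([1.0, 0.5])
estimates = []
for _ in range(2000):                          # Monte Carlo runs
    u = rng.standard_normal(N)                 # white input, unit variance
    e = np.sqrt(lam) * rng.standard_normal(N)  # white noise e(t) with variance lam
    y = np.zeros(N)
    y[2:] = b0[0] * u[1:-1] + b0[1] * u[:-2] + e[2:]   # true system as in (15)
    estimates.append(fir_estimate(y, u, m))
P_mc = np.cov(np.array(estimates).T)           # empirical covariance of the estimates
P_asym = lam / N * np.eye(m)                   # formula (20) with R_bar = I
print(P_mc)
print(P_asym)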

3 Model Structures I: Linear Models


3.1 Output error models
Starting from (3) there is actually another, quite different, way to approach the calculation of good values of a_i and b_i from observed data (8).
Equation (3) describes a linear, discrete-time system with transfer function
G(z) = \frac{b_1 z^{n-1} + b_2 z^{n-2} + \dots + b_m z^{n-m}}{z^n + a_1 z^{n-1} + \dots + a_n}   (21)
(assuming n \ge m).
Here z is the z-transform variable, and one may simply think of the transfer function G as a shorthand notation for the difference equation (3).
We shall here use the shift operator q as an alternative for the variable z in (21). The shift operator q has the properties
q\,u(t) = u(t+1)   (22)
(just as multiplying a z-transform by z corresponds to a time shift).
Given only an input sequence
\{u(t), \; t = 1, \dots, N\}
we could then calculate the output for system (21) by running u as input to this system:
\hat{y}(t|\theta) = G(q, \theta)\,u(t)   (23)

Example 5 A first order system. Consider the system
y(t+1) + a y(t) = b u(t)
The output according to (23) is then obtained as
\hat{y}(t|\theta) = \frac{b}{q + a} u(t) = b \sum_{k=1}^{\infty} (-a)^{k-1} u(t-k)
or
\hat{y}(t+1|\theta) + a \hat{y}(t|\theta) = b u(t)   (24)

Notice the essential difference between (23) and (7)! In (7) we calculated \hat{y}(t|\theta) using both past measured inputs and also past measured outputs y(t-k). In (23) \hat{y}(t|\theta) is calculated from past inputs only. As soon as we use data from a real system (that does not exactly obey (3)) there will always be a difference between these two ways of obtaining the computed output.
Now, we could of course still say that a reasonable estimate of \theta is obtained by minimizing the quadratic fit:
\hat{\theta}_N = \arg\min_\theta \frac{1}{N} \sum_{t=1}^{N} [y(t) - \hat{y}(t|\theta)]^2   (25)
even when \hat{y}(t|\theta) is computed according to (23). Such an estimate is often called an output-error estimate, since we have formed the fit between a purely simulated output and the measured output. Note that \hat{y}(t|\theta) according to (23) is not linear in \theta, so the function to be minimized in (25) is not quadratic in \theta. Hence some numerical search schemes have to be applied in order to find \hat{\theta}_N in (25). Most often in practice a Gauss-Newton iterative minimization procedure is used. See Section 5.4.
It follows from the discussion that the estimate obtained by (25) will in general differ from the one from (9). What is the essential difference? To answer that question we will have to discuss various ways of perceiving and describing the disturbances that act on the system.
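A hedged sketch of the output-error idea in (23) and (25), assuming SciPy: the model output is a pure simulation driven by the input only, and the criterion is minimized by a numerical search of the kind discussed in Section 5.4. Signal arrays, orders and the starting point theta0 are placeholders.

import numpy as np
from scipy.signal import lfilter
from scipy.optimize import least_squares

def oe_residuals(theta, y, u, n, m):
    """Prediction errors y(t) - yhat(t|theta), with yhat = (B/F) u as in (23)."""
    f = np.r_[1.0, theta[:n]]          # F(q) = 1 + f1 q^-1 + ... + fn q^-n
    b = np.r_[0.0, theta[n:n + m]]     # B(q) = b1 q^-1 + ... + bm q^-m
    y_sim = lfilter(b, f, u)           # simulated output: past inputs only
    return y - y_sim

# sol = least_squares(oe_residuals, theta0, args=(y, u, n, m))
# theta_hat = sol.x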

3.2 Noise Models and Prediction Filters


(Readers who concentrate on the "bottom line" may skip directly to the end of this section.)
A linear, finite-dimensional dynamical system can be described by the equation
y(t) = \frac{B(q)}{A(q)} u(t)   (26)
See (21)-(22). Based on (26) we can predict the next output from previous measurements either as in (23)
\hat{y}(t|\theta) = \frac{B(q)}{A(q)} u(t)   (27)
or as in (4), (7):
\hat{y}(t|\theta) = (1 - A(q))y(t) + B(q)u(t)   (28)

Which one shall we choose? We can make the discussion more general by writing for (26)
y(t) = G(q, \theta)u(t)   (29)
to indicate that the transfer function depends on the (numerator and denominator) parameters \theta (as in (5)). We can multiply both sides of (29) by an arbitrary stable filter W(q, \theta), giving
W(q, \theta)y(t) = W(q, \theta)G(q, \theta)u(t)   (30)
then we can add y(t) to both sides of the equation and rearrange to obtain
y(t) = (1 - W(q, \theta))y(t) + W(q, \theta)G(q, \theta)u(t)   (31)
We assume that the filter W starts with a 1:
W(q, \theta) = 1 + w_1 q^{-1} + w_2 q^{-2} + \dots
so that 1 - W(q, \theta) actually contains a delay. We thus obtain the predictor
\hat{y}(t|\theta) = (1 - W(q, \theta))y(t) + W(q, \theta)G(q, \theta)u(t)   (32)
Note that this formulation is now similar to that of (28).
We see that the method used in (27) corresponds to the choice W(q, \theta) \equiv 1, while the procedure in (28) is obtained for W(q, \theta) = A(q).
Now, does the predictor (32) depend on the filter W(q, \theta)? Well, if the input-output data are exactly described by (29) and we know all relevant initial conditions, the predictor (32) produces identical predictions \hat{y}(t|\theta), regardless of the choice of stable filters W(q, \theta).
To bring out the relevant differences, we must accept the fact that there will always be disturbances and noise that affect the system, so instead of (29) we have a true system that relates the inputs and outputs by
y(t) = G_0(q)u(t) + v(t)   (33)
for some disturbance sequence \{v(t)\}. So (32) becomes
\hat{y}(t|\theta) = \{(1 - W(q, \theta))G_0(q) + W(q, \theta)G(q, \theta)\}u(t) + (1 - W(q, \theta))v(t)
Now, assume that there exists a value \theta_0 such that G(q, \theta_0) = G_0(q). Then the error of the above prediction becomes
\varepsilon(t, \theta_0) = y(t) - \hat{y}(t|\theta_0) = W(q, \theta_0)v(t)   (34)
To make this error as small as possible we must thus match the choice of the filter W(q, \theta_0) to the properties of the noise v(t). Suppose v(t) can be described as filtered white noise
v(t) = H_0(q)e(t)   (35)
where e(t) is a sequence of independent random variables. Here we assume H_0(q) to be normalized, so that H_0(q) = 1 + h_1 q^{-1} + \dots. Then it is easy to see from (34) that no filter W(q, \theta_0) can do better than 1/H_0(q), since this makes the prediction error \varepsilon(t, \theta_0) equal to the white noise source e(t).
All this leads to the following summarizing conclusion (which is the only thing one needs to understand from this section).

1. In order to distinguish between different predictors, one has to introduce descriptions of the disturbances that act on the process.
2. If the input-output description is assumed to be
y(t) = G(q)u(t) + H(q)e(t)   (36)
where \{e(t)\} is a white noise source, then the natural predictor of y(t) given previous observations of inputs and outputs will be
\hat{y}(t) = [1 - H^{-1}(q)]y(t) + H^{-1}(q)G(q)u(t)   (37)
This predictor gives the smallest possible error, if \{e(t)\} indeed is white noise.
3. Since the dynamics G(q) and the noise model H(q) are typically unknown, we will have to work with a parameterized description
y(t) = G(q, \theta)u(t) + H(q, \theta)e(t)   (38)
The corresponding predictor is then obtained from (37):
\hat{y}(t|\theta) = [1 - H^{-1}(q, \theta)]y(t) + H^{-1}(q, \theta)G(q, \theta)u(t)   (39)

We may now return to the question we posed at the end of Section 3.1. What is the practical difference between minimizing (10) and (25)? Comparing (23) with (29) we see that this predictor corresponds to the assumption that H = 1, i.e., that white measurement noise is added to the output. This also means that minimizing the corresponding prediction error, (25), will give a clearly better estimate, if this assumption is more or less correct.
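To make the general predictor (39) concrete, here is a hedged sketch (SciPy assumed) of the one-step-ahead predictor for an ARMAX-type model with G = B/A and H = C/A, the parameterization (41) of the next subsection; the polynomial coefficient arrays are placeholders.

import numpy as np
from scipy.signal import lfilter

def armax_predictor(a, b, c, y, u):
    """One-step-ahead predictor (39) for G = B/A, H = C/A:
    yhat(t|theta) = [1 - A/C] y(t) + (B/C) u(t).
    a and c are monic polynomials in q^-1; b has a leading zero (one delay)."""
    n = max(len(a), len(c))
    a_pad = np.r_[a, np.zeros(n - len(a))]
    c_pad = np.r_[c, np.zeros(n - len(c))]
    return lfilter(c_pad - a_pad, c_pad, y) + lfilter(b, c_pad, u)

# The ARX special case C(q) = 1 reduces to yhat = (1 - A) y + B u, cf. (28):
# y_hat = armax_predictor(a=[1.0, -0.9], b=[0.0, 0.5], c=[1.0], y=y, u=u)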

3.3 Linear Black-Box Model Parameterization


The model parameterization (38) contains a large number of much-used special cases. We have already seen that the ARX-model (3) corresponds to
G(q, \theta) = \frac{B(q)}{A(q)}, \quad H(q, \theta) = \frac{1}{A(q)}   (40)
That is, we assume the system (plant) dynamics and the noise model to have common poles, and no numerator dynamics for the noise. Its main feature is that the predictor \hat{y}(t|\theta) will be linear in the parameters \theta according to (11) or (7).
We can make (40) more general by allowing also numerator dynamics. We then obtain the parameterization
G(q, \theta) = \frac{B(q)}{A(q)}, \quad H(q, \theta) = \frac{C(q)}{A(q)}   (41)
The effect of the numerator C is that the current predicted value of y will depend upon previous predicted values, not just measured values. This is known as an ARMAX model, since the C(q)-term makes the noise model a Moving Average of a white noise source. Also, (41) assumes that the dynamics and the noise model have common poles, and is therefore particularly suited for the case where the disturbances enter together with the input, "early in the process" so to speak.
The output error (OE) model we considered in (23) corresponds to the case
G(q, \theta) = \frac{B(q)}{F(q)}, \quad H(q, \theta) = 1   (42)
(We use F in the denominator to distinguish the case from (40).) Its unique feature is that the prediction is based on past inputs only. It also concentrates on the model dynamics and does not bother about describing the noise.
We can also generalize this model by allowing a general noise model
G(q, \theta) = \frac{B(q)}{F(q)}, \quad H(q, \theta) = \frac{C(q)}{D(q)}   (43)
This particular model parameterization is known as the Box-Jenkins (BJ) model, since it was suggested in the well known book [Box and Jenkins, 1970]. It differs from the ARMAX-model (41) in that it assigns different dynamics (poles) to the noise characteristics than to the input-output properties. It is thus better suited for cases where the noise enters "late in the process", such as measurement noise. See Figure 3.
One might wonder why we need all these different model parameterizations. As has been mentioned in the text each has its advantages, which can be summarized as follows:
ARX: Gives a linear regression. Very simple to estimate.
ARMAX: Gives reasonable flexibility to the noise description. Assumes that noise enters like the inputs.
OE: Concentrates on the input-output dynamics.
BJ: Very flexible. Assumes no common characteristics between noise and input-output behavior.

Figure 3: Linear Black-Box Model structures.

3.4 Physically parameterized linear models


So far we have treated the parameters \theta only as vehicles to give reasonable flexibility to the transfer functions in the general linear model (38). This model can also be arrived at from other considerations.
Consider a continuous time state space model
\dot{x}(t) = A(\theta)x(t) + B(\theta)u(t)   (44a)
y(t) = C(\theta)x(t) + v(t)   (44b)
Here x(t) is the state vector and typically consists of physical variables (such as positions and velocities etc). The state space matrices A, B and C are parameterized by the parameter vector \theta, reflecting the physical insight we have into the process. The parameters could be physical constants (resistance, heat transfer coefficients, aerodynamical derivatives etc) whose values are not known. They could also reflect other types of insights into the system's properties.
Example 6 An electric motor
Consider an electric motor with the input u being the applied voltage and the output y being the angular position of the motor shaft.
A first, but reasonable, approximation of the motor's dynamics is as a first order system from voltage to angular velocity, followed by an integrator:
G(s) = \frac{b}{s(s + a)}
If we select the state variables
x(t) = \begin{bmatrix} y(t) \\ \dot{y}(t) \end{bmatrix}
we obtain the state space form
\dot{x} = \begin{bmatrix} 0 & 1 \\ 0 & -a \end{bmatrix} x + \begin{bmatrix} 0 \\ b \end{bmatrix} u, \qquad y = (1 \;\; 0)x + v   (45)
where v denotes disturbances and noise. In this case we thus have
\theta = \begin{bmatrix} a \\ b \end{bmatrix}, \qquad A(\theta) = \begin{bmatrix} 0 & 1 \\ 0 & -a \end{bmatrix}, \quad B(\theta) = \begin{bmatrix} 0 \\ b \end{bmatrix}, \quad C = (1 \;\; 0)   (46)
The parameterization reflects our insight that the system contains an integration, but is in this case not directly derived from detailed physical modeling. Basic physical laws would in this case have given us how \theta depends on physical constants, such as resistance of the wiring, amount of inertia, friction coefficients and magnetic field constants.
Now, how do we fit a continuous-time model (44) to sampled observed data? If the input u(t) has been piecewise constant over the sampling interval,
u(t) = u(kT), \quad kT \le t < (k+1)T
then the states, inputs and outputs at the sampling instants will be represented by the discrete time model
x((k+1)T) = \bar{A}(\theta)x(kT) + \bar{B}(\theta)u(kT)   (47)
y(kT) = C(\theta)x(kT) + v(kT)
where
\bar{A}(\theta) = e^{A(\theta)T}, \qquad \bar{B}(\theta) = \int_0^T e^{A(\theta)\tau} B(\theta)\, d\tau   (48)
This follows from solving (44) over one sampling period. We could also further model the added noise term v(kT) and represent the system in the innovations form
\bar{x}((k+1)T) = \bar{A}(\theta)\bar{x}(kT) + \bar{B}(\theta)u(kT) + \bar{K}(\theta)e(kT)   (49)
y(kT) = C(\theta)\bar{x}(kT) + e(kT)
where \{e(kT)\} is white noise. The step from (47) to (49) is really a standard Kalman filter step: \bar{x} will be the one-step ahead predicted Kalman states. A pragmatic way to think about it is as follows: In (47) the term v(kT) may not be white noise. If it is colored we may separate out that part of v(kT) that cannot be predicted from past values. Denote this part by e(kT): it will be the innovation. The other part of v(kT), the one that can be predicted, can then be described as a combination of earlier innovations, e(\ell T), \ell < k. Its effect on y(kT) can then be described via the states, by changing them from x to \bar{x}, where \bar{x} contains additional states associated with getting v(kT) from e(\ell T), \ell \le k.
Now (49) can be written in input-output form as (let T = 1)
y(t) = G(q, \theta)u(t) + H(q, \theta)e(t)   (50)
with
G(q, \theta) = C(\theta)(qI - \bar{A}(\theta))^{-1}\bar{B}(\theta)   (51)
H(q, \theta) = I + C(\theta)(qI - \bar{A}(\theta))^{-1}\bar{K}(\theta)
We are thus back at the basic linear model (38). The parameterization of G and H in terms of \theta is however more complicated than the ones we discussed in Section 3.3.
The general estimation techniques, model properties (including the characterization (85)), algorithms, etc., apply exactly as described in Section 5.
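A hedged sketch of the sampling formula (48), assuming SciPy: for given parameters it forms A(\theta), B(\theta) of the motor example (46) and computes the zero-order-hold discretization \bar{A}, \bar{B} via a matrix exponential of an augmented matrix (the values of a, b and T are illustrative).

import numpy as np
from scipy.linalg import expm

def sample_zoh(A, B, T):
    """Zero-order-hold sampling (48): Abar = exp(A*T), Bbar = int_0^T exp(A*s) B ds.
    Both blocks are read off from the exponential of an augmented matrix."""
    n, m = B.shape
    M = np.zeros((n + m, n + m))
    M[:n, :n] = A
    M[:n, n:] = B
    Phi = expm(M * T)
    return Phi[:n, :n], Phi[:n, n:]          # Abar, Bbar

a, b, T = 2.0, 1.5, 0.1                      # theta = (a, b), sampling interval T
A = np.array([[0.0, 1.0], [0.0, -a]])        # A(theta) from (46)
B = np.array([[0.0], [b]])                   # B(theta) from (46)
Abar, Bbar = sample_zoh(A, B, T)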
From these examples it is also quite clear that non-linear models with unknown parameters can be approached in the same way. We would then typically arrive at a structure
\dot{x}(t) = f(x(t), u(t), \theta)
y(t) = h(x(t), u(t), \theta) + v(t)   (52)
In this model, all noise effects are collected as additive output disturbances v(t), which is a restriction, but also a very helpful simplification. If we define \hat{y}(t|\theta) as the simulated output response to (52), for a given input, ignoring the noise v(t), everything that was said in Section 5 about parameter estimation, model properties, etc. is still applicable.

4 Model Structures II: Non-linear Black Box Models
In this section we shall describe the basic ideas behind model structures that have the capability to cover any non-linear mapping from past data to the predicted value of y(t). Recall that we defined a general model structure as a parameterized mapping in (1):
\hat{y}(t|\theta) = g(\theta, Z^{t-1})   (53)
We shall consequently allow quite general non-linear mappings g. This section will deal with some general principles for how to construct such mappings, and will cover Artificial Neural Networks as a special case. See [Sjöberg et al., 1995] and [Juditsky et al., 1995] for recent and more comprehensive surveys.

4.1 Non-Linear Black-Box Structures


Now, the model structure family (53) is really too general, and it turns out to be useful to write g as a concatenation of two mappings: one that takes the increasing number of past observations Z^{t-1} and maps them into a finite dimensional vector \varphi(t) of fixed dimension, and one that takes this vector to the space of the outputs:
\hat{y}(t|\theta) = g(\theta, Z^{t-1}) = g(\varphi(t), \theta)   (54)
where
\varphi(t) = \varphi(Z^{t-1})   (55)
Let the dimension of \varphi be d. As before, we shall call this vector the regression vector and its components will be referred to as the regressors. We also allow the more general case that the formation of the regressors is itself parameterized:
\varphi(t) = \varphi(Z^{t-1}, \eta)   (56)
which we for short write \varphi(t, \eta). For simplicity, the extra argument \eta will however be used explicitly only when essential for the discussion.
The choice of the non-linear mapping in (53) has thus been reduced to two partial problems for dynamical systems:
1. How to choose the non-linear mapping g(\varphi) from the regressor space to the output space (i.e., from R^d to R^p).
2. How to choose the regressors \varphi(t) from past inputs and outputs.
The second problem is the same for all dynamical systems, and it turns out that the most useful choices of regression vectors are to let them contain past inputs and outputs, and possibly also past predicted/simulated outputs. The regression vector will thus be of the character (6). We now turn to the first problem.
4.2 Non-Linear Mappings: Possibilities
Now let us turn to the nonlinear mapping
g(\varphi, \theta)   (57)
which for any given \theta maps from R^d to R^p. For most of the discussion we will use p = 1, i.e., the output is scalar-valued. At this point it does not matter how the regression vector \varphi = (\varphi_1, \dots, \varphi_d)^T was constructed. It is just a vector that lives in R^d.
It is natural to think of the parameterized function family as function expansions:
g(\varphi, \theta) = \sum_k \alpha_k g_k(\varphi)   (58)
We refer to g_k as basis functions, since the role they play in (58) is similar to that of a functional space basis. In some particular situations, they do constitute a functional basis. Typical examples are wavelet bases (see below). We are going to show that the expansion (58), with different basis functions, plays the role of a unified framework for investigating most known nonlinear black-box model structures.
Now, the key question is: How do we choose the basis functions g_k? The following facts are essential to understand the connections between most known nonlinear black-box model structures:
- All the g_k are formed from one "mother basis function", that we generically denote by \kappa(x).
- This function \kappa(x) is a function of a scalar variable x.
- Typically the g_k are dilated (scaled) and translated versions of \kappa. For the scalar case d = 1 we may write
g_k(\varphi) = g_k(\varphi, \beta_k, \gamma_k) = \kappa(\beta_k(\varphi - \gamma_k))   (59)
We thus use \beta_k to denote the dilation parameters and \gamma_k to denote the translation parameters.
A Scalar Example: Fourier Series. Take \kappa(x) = \cos(x). Then (58), (59) will be the Fourier series expansion, with \beta_k as the frequencies and \gamma_k as the phases.

Another Scalar Example: Piecewise Constant Functions. Take \kappa as the unit interval indicator function:
\kappa(x) = \begin{cases} 1 & \text{for } 0 \le x < 1 \\ 0 & \text{else} \end{cases}   (60)
and take, for example, \beta_k = \beta, \gamma_k = k/\beta and \alpha_k = f(\gamma_k). Then (58), (59) gives a piecewise constant approximation of any function f. Clearly we would have obtained a quite similar result by a smooth version of the indicator function, e.g., the Gaussian bell:
\kappa(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}   (61)

A Variant of the Piecewise Constant Case. Take \kappa to be the unit step function
\kappa(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 & \text{for } x \ge 0 \end{cases}   (62)
We then just have a variant of (60), since the indicator function can be obtained as the difference of two steps. A smooth version of the step, like the sigmoid function
\kappa(x) = \sigma(x) = \frac{1}{1 + e^{-x}}   (63)
will of course give quite similar results.
Classification of single-variable basis functions
Two classes of single-variable basis functions can be distinguished depending on their nature:
- Local basis functions are functions having their gradient with bounded support, or at least vanishing rapidly at infinity. Loosely speaking, their variations are concentrated to some interval.
- Global basis functions are functions having infinitely spreading (bounded or not) gradient.
Clearly the Fourier series is an example of a global basis function, while (60), (61), (62) and (63) are all local functions.

Construction of multi-variable basis functions

In the multi-dimensional case (d > 1), the g_k are multi-variable functions. In practice they are often constructed from the single-variable function in some simple manner. Let us recall the three most often used methods for constructing multi-variable basis functions from single-variable basis functions.

1. Tensor product. Given d single-variable functions h_1(\varphi_1), \dots, h_d(\varphi_d) of the different components \varphi_j of a d-dimensional vector \varphi (identical or not), the tensor product construction of the corresponding function from R^d is given by their product. In the present case this means that the basis functions are constructed from the scalar function \kappa as
g_k(\varphi) = \prod_{j=1}^{d} \kappa(\beta_{kj}(\varphi_j - \gamma_{kj}))   (64)
2. Radial construction. For any single-variable function \kappa the radial construction of a multi-variable basis function of \varphi \in R^d has the form
g_k(\varphi) = g_k(\varphi, \beta_k, \gamma_k) = \kappa(\|\varphi - \gamma_k\|_{\beta_k})   (65)
where \|\cdot\|_{\beta_k} denotes any chosen norm on the space of the regression vector \varphi. The norm could typically be a quadratic norm
\|\varphi\|_{\beta_k}^2 = \varphi^T \beta_k \varphi   (66)
with \beta_k as a possibly k-dependent positive definite matrix of dilation (scale) parameters. In simple cases \beta_k may be just a scaled version of the identity matrix.
3. Ridge construction. Let \kappa be any single-variable function. Then for all \beta_k \in R^d, \gamma_k \in R, a ridge function is given by
g_k(\varphi) = g_k(\varphi, \beta_k, \gamma_k) = \kappa(\beta_k^T \varphi + \gamma_k), \quad \varphi \in R^d   (67)
The ridge function is thus constant for all \varphi in the subspace \{\varphi \in R^d : \beta_k^T \varphi = \text{constant}\}. As a consequence, even if the mother basis function \kappa has local support, the basis functions g_k will have unbounded support in this subspace. The resulting basis could be said to be semi-global, but the term ridge function is more precise.

Approximation Issues
For any of the described choices the resulting model becomes
g(\varphi, \theta) = \sum_{k=1}^{n} \alpha_k \kappa(\beta_k(\varphi - \gamma_k))   (68)
with the different exact interpretations of the argument \beta_k(\varphi - \gamma_k) just discussed. The expansion is entirely determined by
- the scalar valued function \kappa(x) of a scalar variable x
- the way the basis functions are expanded to depend on a vector \varphi.
The parameterization in terms of \theta can be characterized by three types of parameters:
- The coordinates \alpha
- The scale or dilation parameters \beta
- The location parameters \gamma
A key issue is how well the function expansion is capable of approximating any possible "true system" g_0(\varphi). There is a rather extensive literature on this subject. For an identification oriented survey, see, e.g., [Juditsky et al., 1995].
The bottom line is easy: For almost any choice of \kappa(x), except \kappa being a polynomial, the expansion (68) can approximate any "reasonable" function g_0(\varphi) arbitrarily well for sufficiently large n.
It is not difficult to understand this. It is sufficient to check that the delta function, or the indicator function for arbitrarily small areas, can be arbitrarily well approximated within the expansion. Then clearly all reasonable functions can also be approximated. For local \kappa with radial construction this is immediate: indeed, by scaling and location an arbitrarily small indicator function can be placed anywhere. For the ridge construction one needs to show that a number of hyperplanes defined by \beta and \gamma can be placed and intersect so that any small area in R^d is cut out.
The question of how efficient the expansion is, i.e., how large n is required to achieve a certain degree of approximation, is more difficult, and has no general answer. We may point to the following aspects:
- If the scale and location parameters \beta and \gamma are allowed to depend on the function g_0 to be approximated, then the number of terms n required for a certain degree of approximation is much less than if \beta_k, \gamma_k, k = 1, \dots is an a priori fixed sequence.
- For the local, radial approach the number of terms required to achieve a certain degree of approximation \delta of a p times differentiable function is proportional to
n \sim \left( \frac{1}{\delta} \right)^{d/p}   (69)
It thus increases exponentially with the number of regressors. This is often referred to as the curse of dimensionality.
Connection to "Named Structures"
Here we briefly review some popular structures; other structures related to interpolation techniques are discussed in [Sjöberg et al., 1995, Juditsky et al., 1995].

Wavelets. The local approach corresponding to (58),(65) has direct connections to wavelet networks and wavelet transforms. The exact relationships are discussed in [Sjöberg et al., 1995]. Loosely, we note that via the dilation parameters in \beta_k we can work with different scales simultaneously to pick up both local and not-so-local variations. With appropriate translations and dilations of a single suitably chosen function \kappa (the "mother wavelet"), we can make the expansion (58) orthonormal. This is discussed extensively in [Juditsky et al., 1995].

Wavelet and Radial Basis Networks. The choice (61) without any orthogonalization is found in both wavelet networks [Zhang and Benveniste, 1992] and radial basis neural networks [Poggio and Girosi, 1990].

Neural Networks. The ridge choice (67) with \kappa given by (63) gives a much-used neural network structure, viz. the one hidden layer feedforward sigmoidal net.
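As a hedged sketch of the ridge expansion (67)-(68) with the sigmoid (63), the following NumPy function evaluates such a one hidden layer sigmoidal network; the parameter arrays alpha, beta, gamma correspond to the coordinates, dilations and locations discussed above and are given random (untrained) values purely for illustration.

import numpy as np

def sigmoid(x):
    """Mother basis function (63)."""
    return 1.0 / (1.0 + np.exp(-x))

def one_hidden_layer_net(phi, alpha, beta, gamma):
    """Ridge expansion (68): g(phi) = sum_k alpha_k * sigma(beta_k^T phi + gamma_k).
    phi: regression vector of dimension d; beta: (n, d); alpha, gamma: (n,)."""
    ridge = phi @ beta.T + gamma              # beta_k^T phi + gamma_k for all k
    return alpha @ sigmoid(ridge)

rng = np.random.default_rng(0)                # n = 3 nodes, d = 2 regressors
alpha = rng.standard_normal(3)
beta = rng.standard_normal((3, 2))
gamma = rng.standard_normal(3)
print(one_hidden_layer_net(np.array([0.5, -1.0]), alpha, beta, gamma))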

Hinging Hyperplanes. If instead of using the sigmoid \sigma function we choose "V-shaped" functions (in the form of a higher-dimensional "open book"), Breiman's hinging hyperplane structure is obtained [Breiman, 1993].

Nearest Neighbors or Interpolation. By selecting \kappa as in (60) and the location and scale vectors \gamma_k, \beta_k in the structure (65) such that exactly one observation falls into each "cube", the nearest neighbor model is obtained: just load the input-output record into a table and, for a given \varphi, pick the pair (\hat{y}, \hat{\varphi}) with \hat{\varphi} closest to the given \varphi; \hat{y} is then the desired output estimate. If one replaces (60) by a smoother function and allows some overlapping of the basis functions, we get interpolation type techniques such as kernel estimators.

Fuzzy Models. Also so called fuzzy models, based on fuzzy set membership, belong to the model structures of the class (58). The basis functions g_k are then constructed from the fuzzy set membership functions and the inference rules using the tensor approach (64). The exact relationship is described in [Sjöberg et al., 1995].

4.3 Estimating Non-linear Black Box Models


The model structure is determined by the following choices:
- The regression vector (typically built up from past inputs and outputs)
- The basic function \kappa (local) or \sigma (ridge)
- The number of elements (nodes) in the expansion (58).
Once these choices have been made, \hat{y}(t|\theta) = g(\varphi(t), \theta) is a well defined function of past data and the parameters \theta. The parameters are made up of coordinates in the expansion (58), and of location and scale parameters in the different basis functions.
All the algorithms and analytical results of Section 5 can thus be applied. For Neural Network applications these are also the typical estimation algorithms used, often complemented with regularization, which means that a term is added to the criterion (74) that penalizes the norm of \theta. This will reduce the variance of the model, in that "spurious" parameters are not allowed to take on large, and mostly random, values. See e.g. [Sjöberg et al., 1995].
For wavelet applications it is common to distinguish between those parameters that enter linearly in \hat{y}(t|\theta) (i.e. the coordinates in the function expansion) and those that enter non-linearly (i.e. the location and scale parameters). Often the latter are seeded to fixed values and the coordinates are estimated by the linear least squares method. Basis functions that give a small contribution to the fit (corresponding to non-useful values of the scale and location parameters) can then be trimmed away ("pruning" or "shrinking").
5 General Parameter Estimation Techniques
In this section we shall deal with issues that are independent of model structure. Principles and algorithms for fitting models to data, as well as the general properties of the estimated models, are all model-structure independent and equally well applicable to, say, ARMAX models and Neural Network models.
The section is organized as follows. In Section 5.1 the general principles for parameter estimation are outlined. Sections 5.2 and 5.3 deal with the asymptotic (in the number of observed data) properties of the models, while algorithms are described in Section 5.4.

5.1 Fitting Models to Data


In Section 2 we showed one way to parameterize descriptions of dynamical systems. There are many other possibilities and we shall spend a fair amount of this contribution to discuss the different choices and approaches. This is actually the key problem in system identification. No matter how the problem is approached, the bottom line is that such a model parameterization leads to a predictor
\hat{y}(t|\theta) = g(\theta, Z^{t-1})   (70)
that depends on the unknown parameter vector \theta and past data Z^{t-1} (see (8)). This predictor can be linear in y and u. This in turn contains several special cases both in terms of black-box models and physically parameterized ones, as was discussed in Sections 3 and 3.4, respectively. The predictor could also be of general, non-linear nature, as was discussed in Section 4.
In any case we now need a method to determine a good value of \theta, based on the information in an observed, sampled data set (8). It suggests itself that the basic least-squares like approach (9) through (11) still is a natural approach, even when the predictor \hat{y}(t|\theta) is a more general function of \theta.
A procedure with some more degrees of freedom is the following one:
1. From observed data and the predictor \hat{y}(t|\theta) form the sequence of prediction errors,
\varepsilon(t, \theta) = y(t) - \hat{y}(t|\theta), \quad t = 1, 2, \dots, N   (71)
2. Possibly filter the prediction errors through a linear filter L(q),
\varepsilon_F(t, \theta) = L(q)\varepsilon(t, \theta)   (72)
so as to enhance or depress interesting or unimportant frequency bands in the signals.
3. Choose a scalar valued, positive function \ell(\cdot) so as to measure the "size" or "norm" of the prediction error:
\ell(\varepsilon_F(t, \theta))   (73)
4. Minimize the sum of these norms:
\hat{\theta}_N = \arg\min_\theta V_N(\theta, Z^N)   (74)
where
V_N(\theta, Z^N) = \frac{1}{N} \sum_{t=1}^{N} \ell(\varepsilon_F(t, \theta))   (75)

This procedure is natural and pragmatic; we can still think of it as "curve-fitting" between y(t) and \hat{y}(t|\theta). It also has several statistical and information theoretic interpretations. Most importantly, if the noise source in the system (like in (38)) is supposed to be a sequence of independent random variables \{e(t)\}, each having a probability density function f_e(x), then (74) becomes the Maximum Likelihood estimate (MLE) if we choose
L(q) = 1 \quad \text{and} \quad \ell(\varepsilon) = -\log f_e(\varepsilon)   (76)
The MLE has several nice statistical features and thus gives a strong "moral support" for using the outlined method. Another pleasing aspect is that the method is independent of the particular model parameterization used (although this will affect the actual minimization procedure). For example, the method of "back propagation" often used in connection with neural network parameterizations amounts to computing \hat{\theta}_N in (74) by a recursive gradient method. We shall deal with these aspects in Section 5.4.
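The procedure (71)-(75) is independent of the model structure, which the hedged sketch below tries to convey (SciPy assumed): any predictor function can be plugged in, the errors may be prefiltered by L(q), and a quadratic norm is minimized numerically. All names here are illustrative placeholders rather than a fixed API.

import numpy as np
from scipy.signal import lfilter
from scipy.optimize import minimize

def pem_criterion(theta, y, u, predictor, L_num=(1.0,), L_den=(1.0,)):
    """V_N(theta) of (75) with quadratic norm l(eps) = eps**2 and prefilter L(q)."""
    y_hat = predictor(theta, y, u)        # the model predictor (70)
    eps = y - y_hat                       # prediction errors (71)
    eps_f = lfilter(L_num, L_den, eps)    # filtered errors (72)
    return np.mean(eps_f ** 2)            # criterion (75)

# Works with any predictor, e.g. an output-error simulator (Section 3.1) or an
# ARMAX one-step predictor (Section 3.2) wrapped as predictor(theta, y, u):
# result = minimize(pem_criterion, theta0, args=(y, u, my_predictor))
# theta_hat = result.x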

5.2 Model Quality


An essential question is, of course, what properties the estimate resulting from (74) will have. These will naturally depend on the properties of the data record Z^N defined by (8). It is in general a difficult problem to characterize the quality of \hat{\theta}_N exactly. One normally has to be content with the asymptotic properties of \hat{\theta}_N as the number of data, N, tends to infinity.
It is an important aspect of the general identification method (74) that the asymptotic properties of the resulting estimate can be expressed in general terms for arbitrary model parameterizations.
The first basic result is the following one:
\hat{\theta}_N \to \theta^* \quad \text{as } N \to \infty, \text{ where}   (77)
\theta^* = \arg\min_\theta E\, \ell(\varepsilon_F(t, \theta))   (78)

That is, as more and more data become available, the estimate converges to that value \theta^* that would minimize the expected value of the "norm" of the filtered prediction errors. This is in a sense the best possible approximation of the true system that is available within the model structure. The expectation E in (78) is taken with respect to all random disturbances that affect the data and it also includes averaging over the input properties. This means in particular that \theta^* will make \hat{y}(t|\theta^*) a good approximation of y(t) with respect to those aspects of the system that are enhanced by the input signal used.
The second basic result is the following one: If \{\varepsilon(t, \theta^*)\} is approximately white noise, then the covariance matrix of \hat{\theta}_N is approximately given by
E(\hat{\theta}_N - \theta^*)(\hat{\theta}_N - \theta^*)^T \approx \frac{\lambda}{N} \left[ E\, \psi(t)\psi^T(t) \right]^{-1}   (79)
where
\lambda = E\, \varepsilon^2(t, \theta^*)   (80)
\psi(t) = \frac{d}{d\theta} \hat{y}(t|\theta) \Big|_{\theta = \theta^*}   (81)
Think of \psi as the sensitivity derivative of the predictor with respect to the parameters. Then (79) says that the covariance matrix for \hat{\theta}_N is proportional to the inverse of the covariance matrix of this sensitivity derivative. This is a quite natural result.
Note: For all these results, the expectation operator E can, under most general conditions, be replaced by the limit of the sample mean, that is
E\, \psi(t)\psi^T(t) \;\leftrightarrow\; \lim_{N\to\infty} \frac{1}{N} \sum_{t=1}^{N} \psi(t)\psi^T(t)   (82)
The results (77) through (81) are general and hold for all model structures, both linear and non-linear ones, subject only to some regularity and smoothness conditions. They are also fairly natural, and will give the guidelines for all user choices involved in the process of identification. See [Ljung, 1987] for more details around this.

A Characterization of the Limiting Model in a General Class of Linear Models
Let us apply the general limit result (77)-(78) to the linear model structure (38). If we choose a quadratic criterion \ell(\varepsilon) = \varepsilon^2 (in the scalar output case), then this result tells us, in the time domain, that the limiting parameter estimate is the one that minimizes the filtered prediction error variance (for the input used during the experiment). Suppose that the data actually have been generated by
y(t) = G_0(q)u(t) + v(t)   (83)
Let \Phi_u(\omega) be the input spectrum and \Phi_v(\omega) be the spectrum for the additive disturbance v. Then the filtered prediction error can be written
\varepsilon_F(t, \theta) = \frac{L(q)}{H(q, \theta)} [y(t) - G(q, \theta)u(t)] = \frac{L(q)}{H(q, \theta)} [(G_0(q) - G(q, \theta))u(t) + v(t)]   (84)
By Parseval's relation, the prediction error variance can also be written as an integral over the spectrum of the prediction error. This spectrum, in turn, is directly obtained from (84), so the limit estimate \theta^* in (78) can also be defined as
\theta^* = \arg\min_\theta \left[ \int_{-\pi}^{\pi} |G_0(e^{i\omega}) - G(e^{i\omega}, \theta)|^2 \frac{\Phi_u(\omega) |L(e^{i\omega})|^2}{|H(e^{i\omega}, \theta)|^2} \, d\omega + \int_{-\pi}^{\pi} \frac{\Phi_v(\omega) |L(e^{i\omega})|^2}{|H(e^{i\omega}, \theta)|^2} \, d\omega \right]   (85)
If the noise model H(q, \theta) = H(q) does not depend on \theta (as in the output error model (42)), the expression (85) thus shows that the resulting model G(e^{i\omega}, \theta^*) will give that frequency function in the model set that is closest to the true one, in a quadratic frequency norm with weighting function
Q(\omega) = \Phi_u(\omega) |L(e^{i\omega})|^2 / |H(e^{i\omega})|^2   (86)
This shows clearly that the fit can be affected by the choice of prefilter L, the input spectrum \Phi_u and the noise model H.
5.3 Measures of Model Fit
Some quite general expressions for the expected model fit, that are independent of the model structure, can also be developed.
Let us measure the (average) fit between any model (70) and the true system as
\bar{V}(\theta) = E\, |y(t) - \hat{y}(t|\theta)|^2   (87)
Here expectation E is over the data properties (i.e. expectation over "Z^\infty" with the notation (8)). Recall that expectation also can be interpreted as sample means as in (82).
Before we continue, let us note the very important aspect that the fit \bar{V} will depend not only on the model and the true system, but also on data properties, like input spectra, possible feedback, etc. We shall say that the fit depends on the experimental conditions.
The estimated model parameter \hat{\theta}_N is a random variable, because it is constructed from observed data that can be described as random variables. To evaluate the model fit, we then take the expectation of \bar{V}(\hat{\theta}_N) with respect to the estimation data. That gives our measure
F_N = E\, \bar{V}(\hat{\theta}_N)   (88)
In general, the measure F_N depends on a number of things:
- The model structure used.
- The number of data points N.
- The data properties for which the fit \bar{V} is defined.
- The properties of the data used to estimate \hat{\theta}_N.

38
The rather remarkable fact is that if the two last data properties coincide,
then, asymptotically in N , (see, e.g., Ljung, 1987], Chapter 16)

FN V"N ( )(1 + dim


N ) (89)

Here  is the value that minimizes the expected criterion (78). The notation
dim means the number of estimated parameters. The result also assumes
that the criterion function `(") = k"k2, and that the model structure is
successful in the sense that "F (t) is approximately white noise.
Despite the reservations about the formal validity of (89), it carries a most
important conceptual message: If a model is evaluated on a data set with the
same properties as the estimation data, then the t will not depend on the
data properties, and it will depend on the model structure only in terms of
the number of parameters used and of the best t o ered within the structure.
The expression can be rewritten as follows. Let y^0(tjt ; 1) denote the \true"
one step ahead prediction of y(t), and let

W () = E jy^0(tjt ; 1) ; y^(tj)j2 (90)

and let

 = E jy(t) ; y^0(tjt ; 1)j2 (91)

Then  is the innovations variance, i.e., that part of y(t) that cannot be pre-
dicted from the past. Moreover W () is the bias error, i.e. the discrepancy
between the true predictor and the best one available in the model structure.
Under the same assumptions as above, (89) can be rewritten as

FN  + W () +  dim
N (92)

The three terms constituting the model error then have the following inter-
pretations

39
  is the unavoidable error, stemming from the fact that the output
cannot be exactly predicted, even with perfect system knowledge.
 W () is the bias error. It depends on the model structure, and on the
experimental conditions. It will typically decrease as dim increases.
 The last term is the variance error. It is proportional to the number
of estimated parameters and inversely proportional to the number of
data points. It does not depend on the particular model structure or
the experimental conditions.

5.4 Algorithmic Aspects


In this section we shall discuss how to achieve the best fit between observed data and the model, i.e. how to carry out the minimization of (74). For simplicity we here assume a quadratic criterion and set the prefilter L to unity:
V_N(\theta) = \frac{1}{2N} \sum_{t=1}^{N} |y(t) - \hat{y}(t|\theta)|^2   (93)
No analytic solution to this problem is possible unless the model \hat{y}(t|\theta) is linear in \theta, so the minimization has to be done by some numerical search procedure. A classical treatment of the problem of how to minimize the sum of squares is given in [Dennis and Schnabel, 1983].
Most efficient search routines are based on iterative local search in a "downhill" direction from the current point. We then have an iterative scheme of the following kind:
\hat{\theta}^{(i+1)} = \hat{\theta}^{(i)} - \mu_i R_i^{-1} \hat{g}_i   (94)
Here \hat{\theta}^{(i)} is the parameter estimate after iteration number i. The search scheme is thus made up of the three entities
- \mu_i: the step size
- \hat{g}_i: an estimate of the gradient V_N'(\hat{\theta}^{(i)})
- R_i: a matrix that modifies the search direction

Search directions
The basis for the local search is the gradient
V_N'(\theta) = \frac{d V_N(\theta)}{d\theta} = -\frac{1}{N} \sum_{t=1}^{N} (y(t) - \hat{y}(t|\theta))\, \psi(t, \theta)   (95)
where
\psi(t, \theta) = \frac{\partial \hat{y}(t|\theta)}{\partial \theta}   (96)
The gradient \psi is in the general case a matrix with \dim\theta rows and \dim y columns. It is well known that gradient search for the minimum is inefficient, especially close to the minimum. Then it is optimal to use the Newton search direction
R^{-1}(\theta) V_N'(\theta)   (97)
where
R(\theta) = V_N''(\theta) = \frac{d^2 V_N(\theta)}{d\theta^2} = \frac{1}{N} \sum_{t=1}^{N} \psi(t, \theta)\psi^T(t, \theta) - \frac{1}{N} \sum_{t=1}^{N} (y(t) - \hat{y}(t|\theta)) \frac{\partial^2 \hat{y}(t|\theta)}{\partial \theta^2}   (98)
The true Newton direction will thus require that the second derivative
\frac{\partial^2 \hat{y}(t|\theta)}{\partial \theta^2}
be computed. Also, far from the minimum, R(\theta) need not be positive semidefinite. Therefore alternative search directions are more common in practice:
- Gradient direction. Simply take
R_i = I   (99)
- Gauss-Newton direction. Use
R_i = H_i = \frac{1}{N} \sum_{t=1}^{N} \psi(t, \hat{\theta}^{(i)}) \psi^T(t, \hat{\theta}^{(i)})   (100)
- Levenberg-Marquardt direction. Use
R_i = H_i + \delta I   (101)
where H_i is defined by (100) and \delta is a regularization parameter.
- Conjugate gradient direction. Construct the Newton direction from a sequence of gradient estimates. Loosely, think of V_N'' as constructed by difference approximation of d gradients. The direction (97) is however constructed directly, without explicitly forming and inverting V''.

It is generally considered [Dennis and Schnabel, 1983] that the Gauss-Newton search direction is to be preferred. For ill-conditioned problems the Levenberg-Marquardt modification is recommended.
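A hedged sketch of the damped Gauss-Newton iteration (94) with the choice (100), assuming NumPy; the Jacobian of the predictor is approximated by finite differences, and the predictor function, data and starting point are placeholders.

import numpy as np

def gauss_newton(predictor, theta0, y, u, iters=20, mu=1.0, eps=1e-6):
    """Minimize V_N(theta) = (1/2N) sum (y - yhat)^2 by the iteration (94)
    with the Gauss-Newton choice R_i of (100)."""
    theta = np.array(theta0, dtype=float)
    N = len(y)
    for _ in range(iters):
        e = y - predictor(theta, y, u)                    # prediction errors
        # psi(t): derivative of yhat w.r.t. each parameter, by finite differences
        psi = np.column_stack([
            (predictor(theta + eps * np.eye(len(theta))[j], y, u)
             - predictor(theta, y, u)) / eps
            for j in range(len(theta))])
        grad = -psi.T @ e / N                             # V_N'(theta), eq. (95)
        R = psi.T @ psi / N                               # eq. (100)
        theta = theta - mu * np.linalg.solve(R, grad)     # update step (94)
    return theta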

Local Minima
A fundamental problem with minimization tasks like (9) is that V_N(\theta) may have several or many local (non-global) minima, where local search algorithms may get caught. There is no easy solution to this problem. It is usually well worth the effort to find a good initial value \theta^{(0)} where to start the iterations. Other than that, only various global search strategies are left, such as random search, random restarts, simulated annealing, and the genetic algorithm.

42
6 Special Estimation Techniques for Linear
Black Box Models
An important feature of a linear, time invariant system is that it is entirely
characterized by its impulse response. So if we know the system's response
to an impulse, we will also know its response to any input. Equivalently,
we could study the frequency response, which is the Fourier transform of the
impulse response.
In this section we shall consider estimation methods for linear systems that do not use particular model parameterizations. First, in Section 6.1, we shall consider direct methods to determine the impulse response and the frequency response, by simply applying the definitions of these concepts. In Section 6.2 spectral analysis for frequency function estimation will be discussed. Finally, in Section 6.3 a recent method to estimate general linear systems (of given order, but unspecified structure) will be described.

6.1 Transient and Frequency Analysis


Transient Analysis
The first step in modeling is to decide which quantities and variables are important to describe what happens in the system. A simple and common kind of experiment that shows how and in what time span various variables affect each other is called step-response analysis or transient analysis. In such experiments the inputs are varied (typically one at a time) as a step: u(t) = u_0 for t < t_0; u(t) = u_1 for t ≥ t_0. The other measurable variables in the system are recorded during this time. We thus study the step response of the system. An alternative would be to study the impulse response of the system by letting the input be a pulse of short duration. From such measurements, information of the following nature can be found:

1. The variables affected by the input in question. This makes it easier to draw block diagrams for the system and to decide which influences can be neglected.
2. The time constants of the system. This also allows us to decide which relationships in the model can be described as static (that is, they have significantly faster time constants than the time scale we are working with).
3. The character (oscillatory, poorly damped, monotone, and the like) of the step responses, as well as the levels of static gains. Such information is useful when studying the behavior of the final model in simulation. Good agreement with the measured step responses should give a certain confidence in the model.

Frequency Analysis
If a linear system has the transfer function G(q) and the input is

u(t) = u_0 cos ωkT,  (k - 1)T ≤ t ≤ kT   (102)

then the output after possible transients have faded away will be

y(t) = y_0 cos(ωt + φ),  for t = T, 2T, 3T, ...   (103)

where

y_0 = |G(e^{iωT})| u_0   (104)
φ = arg G(e^{iωT})   (105)

If the system is driven by the input (102) for a certain u_0 and ω_1 and we measure y_0 and φ from the output signal, it is possible to determine the complex number G(e^{iω_1 T}) using (104)-(105). By repeating this procedure for a number of different ω, we can get a good estimate of the frequency function G(e^{iωT}). This method is called frequency analysis. Sometimes it is possible to see or measure u_0, y_0, and φ directly from graphs of the input and output signals. Most of the time, however, there will be noise and irregularities that make it difficult to determine φ directly. A suitable procedure is then to correlate the output with cos ωt and sin ωt.
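As a sketch of this correlation procedure (assuming, for illustration, sampling interval T = 1, a single excitation frequency w of amplitude u0, and that the output record y of length N was collected after the transients had died out):

t  = (1:N)';
Ic = (2/(N*u0))*sum(y.*cos(w*t));    % in-phase correlation
Is = (2/(N*u0))*sum(y.*sin(w*t));    % quadrature correlation
Ghat  = Ic - 1i*Is;                  % estimate of G(e^{iw}), cf. (104)-(105)
gain  = abs(Ghat);
phase = angle(Ghat);

Repeating this for a grid of frequencies gives a frequency-function estimate point by point.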
6.2 Estimating the Frequency Response by Spectral
Analysis
Definitions
The cross spectrum between two (stationary) signals u(t) and y(t) is defined as the Fourier transform of their cross covariance function, provided this exists:

Φ_yu(ω) = Σ_{τ=-∞}^{∞} R_yu(τ) e^{-iωτ}   (106)

where R_yu(τ) is defined by

R_yu(τ) = E y(t) u(t - τ)   (107)

The (auto) spectrum Φ_u(ω) of a signal u is defined as Φ_uu(ω), i.e. as its cross spectrum with itself.
The spectrum describes the frequency contents of the signal. The connection to more explicit Fourier techniques is evident from the following relationship:

Φ_u(ω) = lim_{N→∞} (1/N) |U_N(ω)|²   (108)

where U_N is the discrete time Fourier transform

U_N(ω) = Σ_{t=1}^{N} u(t) e^{iωt}   (109)

The relationship (108) is shown, e.g., in [Ljung and Glad, 1994].


Consider now the general linear model:

y(t) = G(q)u(t) + v(t) (110)

It is straightforward to show that the relationships between the spectra and cross spectra of y and u (provided u and v are uncorrelated) are given by

Φ_yu(ω) = G(e^{iω}) Φ_u(ω)   (111)

Φ_y(ω) = |G(e^{iω})|² Φ_u(ω) + Φ_v(ω)   (112)

It is easy to see how the transfer function G(e^{iω}) and the noise spectrum Φ_v(ω) can be estimated using these expressions, if only we have a method to estimate cross spectra.

Estimation of Spectra
The spectrum is defined as the Fourier transform of the correlation function. A natural idea would then be to take the transform of the estimate

R̂_yu^N(τ) = (1/N) Σ_{t=1}^{N} y(t) u(t - τ)   (113)

That will not work in most cases, though. The reason could be described as follows: the estimate R̂_yu^N(τ) is not reliable for large τ, since it is based on only a few observations. These "bad" estimates are mixed with good ones in the Fourier transform, thus creating an overall bad estimate. It is better to introduce a weighting, so that correlation estimates for large lags carry a smaller weight:

Φ̂_yu^N(ω) = Σ_{ℓ=-γ}^{γ} R̂_yu^N(ℓ) w_γ(ℓ) e^{-iℓω}   (114)

This spectral estimation method is known as the Blackman-Tukey approach. Here w_γ(ℓ) is a window function that decreases with |ℓ|. This function controls the trade-off between frequency resolution and variance of the estimate. A function that gives significant weights to the correlation at large lags will be able to provide finer frequency details (a longer time span is covered). At the same time it will have to use "bad" estimates, so the statistical quality (the variance) is poorer. We shall return to this trade-off in a moment. How should we choose the shape of the window function w_γ(ℓ)? There is no optimal solution to this problem, but the most common window used in spectral analysis is the Hamming window:

w_γ(k) = (1/2)(1 + cos(πk/γ)),  |k| < γ
w_γ(k) = 0,  |k| ≥ γ   (115)
From the spectral estimates Φ̂_u, Φ̂_y and Φ̂_yu obtained in this way, we can now use (111) to obtain a natural estimate of the frequency function G(e^{iω}):

Ĝ_N(e^{iω}) = Φ̂_yu^N(ω) / Φ̂_u^N(ω)   (116)

Furthermore, the disturbance spectrum can be estimated from (112) as

Φ̂_v^N(ω) = Φ̂_y^N(ω) - |Φ̂_yu^N(ω)|² / Φ̂_u^N(ω)   (117)
To compute these estimates, the following steps are performed (a sketch of the computations in code is given after the list):

1. Collect data y(k), u(k), k = 1, ..., N.
2. Subtract the corresponding sample means from the data. This will avoid bad estimates at very low frequencies.
3. Choose the width γ of the lag window w_γ(k).
4. Compute R̂_y^N(k), R̂_u^N(k), and R̂_yu^N(k) for |k| ≤ γ according to (113).
5. Form the spectral estimates Φ̂_y^N(ω), Φ̂_u^N(ω), and Φ̂_yu^N(ω) according to (114) and analogous expressions.
6. Form (116) and possibly also (117).
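The following MATLAB-style sketch carries out steps 2-6 for scalar signals. It assumes column vectors y and u of equal length; the window width and the frequency grid are arbitrary illustrative choices.

y = y - mean(y);  u = u - mean(u);               % step 2
N = length(y);  gamma = 30;                      % step 3
lags = -gamma:gamma;
Ryu = zeros(size(lags)); Ru = Ryu; Ry = Ryu;
for k = lags
  j = k + gamma + 1;
  if k >= 0
    Ryu(j) = sum(y(1+k:N).*u(1:N-k))/N;          % step 4, cf. (113)
  else
    Ryu(j) = sum(y(1:N+k).*u(1-k:N))/N;
  end
  Ru(j) = sum(u(1+abs(k):N).*u(1:N-abs(k)))/N;   % auto-covariances
  Ry(j) = sum(y(1+abs(k):N).*y(1:N-abs(k)))/N;
end
wham = 0.5*(1 + cos(pi*lags/gamma));             % Hamming lag window (115)
W = linspace(0.01, pi, 200);                     % frequency grid
E = exp(-1i*lags'*W);                            % (2*gamma+1)-by-200 matrix
Phiyu = (wham.*Ryu)*E;                           % step 5, cf. (114)
Phiu  = real((wham.*Ru)*E);
Phiy  = real((wham.*Ry)*E);
Ghat  = Phiyu./Phiu;                             % step 6, (116)
Phiv  = Phiy - abs(Phiyu).^2./Phiu;              %         (117)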

Quality of the Estimates
The estimates Ĝ_N and Φ̂_v^N are formed entirely from estimates of spectra and cross spectra. Their properties will therefore be inherited from the properties of the spectral estimates. For the Hamming window with width γ, it can be shown that the frequency resolution will be about

π/(γ√2) radians/time unit   (118)
This means that details in the true frequency function that are finer than this expression will be smeared out in the estimate. It is also possible to show that the estimates' variances satisfy

Var Ĝ_N(e^{iω}) ≈ (0.7 γ/N) Φ_v(ω)/Φ_u(ω)   (119)

and

Var Φ̂_v^N(ω) ≈ (0.7 γ/N) Φ_v²(ω)   (120)

["Variance" here refers to taking expectation over the noise sequence v(t).]
Note that the relative variance in (119) typically increases dramatically as ω tends to the Nyquist frequency. The reason is that |G(e^{iω})| typically decays rapidly, while the noise-to-signal ratio Φ_v(ω)/Φ_u(ω) has a tendency to increase as ω increases. In a Bode diagram the estimates will thus show considerable fluctuations at high frequencies. Moreover, the constant frequency resolution (118) will look thinner and thinner at higher frequencies in a Bode diagram due to the logarithmic frequency scale.
See [Ljung and Glad, 1994] for a more detailed discussion.

Choice of Window Size


The choice of γ is a pure trade-off between frequency resolution and variance (variability). For a spectrum with narrow resonance peaks it is thus necessary to choose a large value of γ and accept a higher variance. For a flatter spectrum, smaller values of γ will do well. In practice a number of different values of γ are tried out. Often we start with a small value of γ and increase it successively until an estimate is found that balances the trade-off between frequency resolution (true details) and variance (random fluctuations). A typical value for spectra without narrow resonances is γ = 20-30.

6.3 Subspace Estimation Techniques for State Space Models
A linear system can always be represented in state space form:

x(t + 1) = A x(t) + B u(t) + w(t)
y(t) = C x(t) + D u(t) + e(t)   (121)

We assume that we have no insight into the particular structure, and we would just estimate any matrices A, B, C and D that give a good description of the input-output behavior of the system. This is not without problems, among other things because there are an infinite number of such matrices that describe the same system (the similarity transforms). The coordinate basis of the state-space realization thus needs to be fixed.
Let us for a moment assume that not only are u and y measured, but also the sequence of state vectors x. This would, by the way, fix the state-space realization coordinate basis. Now, with known u, y and x, the model (121) becomes a linear regression: the unknown parameters, all of the matrix entries in all the matrices, mix with measured signals in linear combinations. To see this clearly, let (with stacked block vectors and matrices)

Y(t) = [x(t + 1); y(t)],  Θ = [A B; C D],  Φ(t) = [x(t); u(t)],  E(t) = [w(t); e(t)]

Then, (121) can be rewritten as

Y(t) = Θ Φ(t) + E(t)   (122)

From this, all the matrix elements in Θ can be estimated by the simple least squares method, as described in Section 2. The covariance matrix for E(t) can also be estimated easily as the sample sum of the model residuals. That will give the covariance matrices for w and e, as well as the cross covariance matrix between w and e. These matrices will, among other things, allow us to compute the Kalman filter for (121). Note that all of the above holds without changes for multivariable systems, i.e., when the output and input signals are vectors.
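As an illustration of this step, suppose (hypothetically) that the state sequence x(t), t = 1, ..., N+1, were available together with u and y, all stored columnwise. The least-squares estimate of the matrices in (122) and of the noise covariances could then be computed as in the following sketch:

N   = size(u,2);  n = size(x,1);
Y   = [x(:,2:N+1); y(:,1:N)];          % left-hand side of (122)
Phi = [x(:,1:N);   u(:,1:N)];          % regression vectors Phi(t), stacked
Theta = (Y*Phi')/(Phi*Phi');           % least-squares estimate of [A B; C D]
A = Theta(1:n,1:n);       B = Theta(1:n,n+1:end);
C = Theta(n+1:end,1:n);   D = Theta(n+1:end,n+1:end);
Ehat   = Y - Theta*Phi;                % residuals [w; e]
Lambda = (Ehat*Ehat')/N;               % joint covariance of w and e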
The only remaining problem is where to get the state vector sequence x from. It has long been known, e.g., [Rissanen, 1974], [Akaike, 1974b], that all state vectors x(t) that can be reconstructed from input-output data in fact are linear combinations of the components of the n k-step ahead output predictors

ŷ(t + k|t),  k = 1, 2, ..., n   (123)

where n is the model order (the dimension of x). See also Appendix 4.A in [Ljung, 1987]. We could then form these predictors, and select a basis among their components:

x(t) = L [ŷ(t + 1|t); ...; ŷ(t + n|t)]   (124)

The choice of L will determine the basis for the state-space realization, and is done in such a way that it is well conditioned. The predictor ŷ(t + k|t) is a linear function of u(s), y(s), 1 ≤ s ≤ t, and can efficiently be determined by linear projections directly on the input-output data. (There is one complication in that u(t + 1), ..., u(t + k) should not be predicted, even if they affect y(t + k).)
What we have described now is the subspace projection approach to estimating the matrices of the state-space model (121), including the basis for the representation and the noise covariance matrices. There are a number of variants of this approach; see, among several references, [Overschee and DeMoor, 1994] and [Larimore, 1983].
The approach gives very useful algorithms for model estimation, and is particularly well suited for multivariable systems. The algorithms also allow numerically very reliable implementations. At present, the asymptotic properties of the methods are not fully investigated, and the general results quoted in Section 5.2 are not directly applicable. Experience has shown, however, that confidence intervals computed according to the general asymptotic theory are good approximations. One may also use the estimates obtained by a subspace method as initial conditions for minimizing the prediction error criterion (74).

7 Data Quality
It is desirable to affect the conditions under which the data are collected. The objective of such experiment design is to make the collected data set Z^N as informative as possible with respect to the models to be built using the data. A considerable amount of theory around this topic can be developed, and we shall here just review some basic points.
The first and most important point is the following one:

1. The input signal u must be such that it exposes all the relevant properties of the system. It must thus not be too "simple". For example, a pure sinusoid

u(t) = A cos ωt

will only give information about the system's frequency response at frequency ω. This can also be seen from (85). The rule is that

- the input must contain at least as many different frequencies as the order of the linear model to be built.

To be on the safe side, a good choice is to let the input be random (such as filtered white noise). It then contains all frequencies.
Another case where the input is too simple is when it is generated by feedback such as

u(t) = -K y(t)   (125)

If we would like to build a first order ARX model

y(t) + a y(t - 1) = b u(t - 1) + e(t)

we find that for any given γ all models such that

a + bK = γ

will give identical input-output data. We can thus not distinguish between these models using an experiment with (125). That is, we cannot distinguish between any combinations of a and b if they satisfy the above condition for a given γ. The rule is

- If closed-loop experiments have to be performed, the feedback law must not be too simple. It is to be preferred that a set-point in the regulator is changed in a random fashion.
The second main point in experimental design is

2. Allocate the input power to those frequency bands where a good model is particularly important.

This is also seen from the expression (85). If we let the input be filtered white noise, this gives information on how to choose the filter. In the time domain it is often useful to think like this:
- Use binary (two-level) inputs if linear models are to be built: this gives maximal variance for amplitude-constrained inputs.
- Check that the changes between the levels are such that the input occasionally stays on one level so long that a step response from the system has time, more or less, to settle. There is no need to let the input signal switch so quickly back and forth that no response in the output is clearly visible.
Note that the second point is really just a reformulation in the time
domain of the basic frequency domain advice: let the input energy be
concentrated in the important frequency bands.
A third basic piece of advice about experiment design concerns the
choice of sampling interval.
3. A typical good sampling frequency is 10 times the bandwidth of the system. That corresponds roughly to 5-7 samples along the rise time of a step response.
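As an illustration of the second point above, a random binary input with a controlled switching rate might be generated as in the following sketch; the length, the switching probability (which sets the mean dwell time) and the levels are arbitrary choices to be matched to the system's rise time:

N = 1000;  p = 0.1;  levels = [-1 1];   % mean dwell time about 1/p samples
u = zeros(N,1);  lev = 1;
for t = 1:N
  if rand < p, lev = 3 - lev; end       % toggle between the two levels
  u(t) = levels(lev);
end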

8 Model Validation and Model Selection


8.1 A Pragmatic Viewpoint
The system identification process has, as we have seen, these basic ingredients:

- The set of models
- The data
- The selection criterion

Once these have been decided upon, we have, at least implicitly, defined a model: the one in the set that best describes the data according to the criterion. It is thus, in a sense, the best available model in the chosen set. But is it good enough? It is the objective of model validation to answer that question. Often the answer turns out to be "no", and we then have to go back and review the choice of model set, or perhaps modify the data set. See Figure 4.
[Figure 4: The identification loop. The flow chart connects the steps: construct experiment and collect data; polish and present (possibly filter) the data; choose a model structure; fit the model to the data; validate the model; if the data or the model structure are not OK, revise and repeat; otherwise accept the model.]
How do we check the quality of a model? The prime method is to investigate how well it is capable of reproducing the behavior of a new set of data (the validation data) that was not used to fit the model. That is, we simulate the obtained model with a new input and compare the simulated output with the measured one. One may then use one's eyes or numerical measures of fit to decide if the fit in question is good enough. Suppose we have obtained several different models in different model structures (say a 4th order ARX model, a 2nd order BJ model, a physically parameterized one, and so on) and would like to know which one is best. The simplest and most pragmatic approach to this problem is then to simulate each one of them on validation data, evaluate their performance, and pick the one that shows the most favorable fit to measured data. (This could indeed be a subjective criterion!)

8.2 The Bias-Variance Trade-off


At the heart of the model structure selection process is the need to handle the trade-off between bias and variance, as formalized by (92). The "best" model structure is the one that minimizes F_N, the fit between the model and the data for a fresh data set, one that was not used for estimating the model. Most procedures for choosing the model structure are also aiming at finding this best choice.

Cross Validation
A very natural and pragmatic approach is cross validation. This means that the available data set is split into two parts: estimation data, Z_est^{N1}, that is used to estimate the models:

θ̂_{N1} = arg min_θ V_{N1}(θ, Z_est^{N1})   (126)

and validation data, Z_val^{N2}, for which the criterion is evaluated:

F̂_{N1} = V_{N2}(θ̂_{N1}, Z_val^{N2})   (127)

Here V_N is the criterion (75). Then F̂_{N1} will be an unbiased estimate of the measure F_N, defined by (88), which was discussed at length in the previous section. The procedure would then be to try out a number of model structures, and choose the one that minimizes F̂_{N1}.
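As a sketch of how this might be organized for a family of ARX structures, using plain least squares on the estimation part and one-step-ahead prediction errors on the validation part (column vectors y and u; the split point and the range of orders are arbitrary choices):

Nest = 200;  orders = 1:6;  F = zeros(size(orders));
for n = orders
  Phi = [];                                       % regressors for t = n+1..N
  for k = 1:n
    Phi = [Phi, -y(n+1-k:end-k), u(n+1-k:end-k)];
  end
  Yv  = y(n+1:end);
  th  = Phi(1:Nest-n,:)\Yv(1:Nest-n);             % estimate on Z_est, cf. (126)
  ev  = Yv(Nest-n+1:end) - Phi(Nest-n+1:end,:)*th;
  F(n) = mean(ev.^2);                             % validation fit, cf. (127)
end
[~, nbest] = min(F);                              % structure with the best fit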
Such cross validation techniques to find a good model structure have an immediate intuitive appeal. We simply check if the candidate model is capable of "reproducing" data it hasn't yet seen. If that works well, we have some confidence in the model, regardless of any probabilistic framework that might be imposed. Such techniques are also the most commonly used ones.

A few comments could be added. In the first place, one could use different splits of the original data into estimation and validation data. For example, in statistics, there is a common cross validation technique called "leave one out". This means that the validation data set consists of one data point "at a time", successively applied to the whole original set. In the second place, the test of the model on the validation data does not have to be in terms of the particular criterion (127). In system identification it is common practice to simulate (or predict several steps ahead) the model using the validation data, and then visually inspect the agreement between measured and simulated (predicted) output.

Estimating the Variance Contribution - Penalizing the Model Complexity
It is clear that the criterion (127) has to be evaluated on the validation data to be of any use; it would be strictly decreasing as a function of model flexibility if evaluated on the estimation data. In other words, the adverse effect of the dimension of θ shown in (92) would be missed. There are a number of criteria, often derived from entirely different viewpoints, that try to capture the influence of this variance error term. The two best known ones are Akaike's Information Theoretic Criterion, AIC, which has the form (for Gaussian disturbances)

Ṽ_N(θ, Z^N) = (1 + 2 dim θ / N) (1/N) Σ_{t=1}^{N} ε²(t, θ)   (128)

and Rissanen's Minimum Description Length Criterion, MDL, in which dim θ in the expression above is replaced by log N · dim θ. See [Akaike, 1974a] and [Rissanen, 1978].
The criterion Ṽ_N is then to be minimized both with respect to θ and to a family of model structures. The relation to the expression (89) for F_N is obvious.
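In terms of code, once the prediction errors eps_n of a candidate structure with d = dim θ parameters have been computed on the estimation data, the two criteria amount to the following sketch (following the formulas as stated above):

N   = length(eps_n);
Vn  = sum(eps_n.^2)/N;                 % quadratic fit on estimation data
AIC = (1 + 2*d/N)*Vn;                  % Akaike's criterion (128)
MDL = (1 + 2*d*log(N)/N)*Vn;           % MDL: dim replaced by log(N)*dim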

8.3 Residual Analysis
The second basic method for model validation is to examine the residuals ("the leftovers") from the identification process. These are the prediction errors

ε(t) = ε(t, θ̂_N) = y(t) - ŷ(t|θ̂_N)

i.e. what the model could not "explain". Ideally these should be independent of information that was at hand at time t - 1. For example, if ε(t) and u(t - τ) turn out to be correlated, then there are things in y(t) that originate from u(t - τ) but have not been properly accounted for by ŷ(t|θ̂_N). The model has then not squeezed out all relevant information about the system from the data.
It is good practice to always check the residuals for such (and other) dependencies. This is known as residual analysis. A basic reference for how to perform this is [Draper and Smith, 1981].
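A minimal sketch of such a test: estimate the cross covariance between the residuals e and delayed inputs u, and compare it with a rough 99% confidence band derived under the hypothesis that the residuals are white and independent of the input (a standard asymptotic approximation):

N = length(e);  M = 25;                        % number of lags examined
Reu = zeros(M,1);
for tau = 1:M
  Reu(tau) = sum(e(tau+1:N).*u(1:N-tau))/N;    % cross covariance at lag tau
end
band = 2.58*sqrt((sum(e.^2)/N)*(sum(u.^2)/N)/N);   % approximate 99% bound
stem(1:M, Reu); hold on
plot([1 M], [band band], 'r--', [1 M], [-band -band], 'r--')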

9 Back to Data: The Practical Side of Identification
9.1 Software for System Identification
In practice, System Identification is characterized by some quite heavy numerical calculations to determine the best model in each given class of models. This is mixed with several user choices: trying different model structures, filtering data, and so on. In practical applications we will thus need good software support. There are now many different commercial packages for identification available, such as MathWorks' System Identification Toolbox [Ljung, 1995], MATRIXx's System Identification Module [MATRIXx, 1991] and PIM [Landau, 1990]. They all have in common that they offer the following routines:
A Handling of data, plotting, etc.
Filtering of data, removal of drift, choice of data segments, etc.
B Non-parametric identification methods
Estimation of covariances, Fourier transforms, correlation and spectral analysis, etc.
C Parametric estimation methods
Calculation of parametric estimates in different model structures.
D Presentation of models
Simulation of models, estimation and plotting of poles and zeros, computation of frequency functions, and plotting of Bode diagrams, etc.
E Model validation
Computation and analysis of residuals ε(t, θ̂_N). Comparison between different models' properties, etc.
The existing program packages differ mainly in various user interfaces and by different options regarding the choice of model structure according to C above. For example, MATLAB's Identification Toolbox [Ljung, 1995] covers all linear model structures discussed here, including arbitrarily parameterized linear models in continuous time.
Regarding the user interface, there is now a clear trend to make it graphically oriented. This avoids syntax problems and relies more on "click and move", at the same time as tedious menu-labyrinths are avoided. More aspects of CAD tools for system identification are treated in [Ljung, 1993].

9.2 How to Get to a Good Model?


It follows from our discussion that the most essential element in the process of identification, once the data have been recorded, is to try out various model structures, compute the best model in the structures using (38), and then validate this model. Typically this has to be repeated with quite a few different structures before a satisfactory model can be found.
While one should not underestimate the difficulties of this process, the following simple procedure to get started and gain insight into the models could be suggested:

1. Find out a good value for the delay between input and output, e.g. by using correlation analysis.
2. Estimate a fourth order linear model with this delay using part of the data, and simulate this model with the input and compare the model's simulated output with the measured output over the whole data record. In MATLAB language this is simple:

z = [y u];
compare(z, arx(z(1:200,:), [4 4 1]))

If the model/system is unstable or has integrators, use prediction over a reasonably large time horizon instead of simulation.
Now, either of two things happens:

- The comparison "looks good". Then we can be confident that with some extra work, trying out different orders and various noise models, we can fine tune the model and have an acceptable model quite soon.
- The comparison "does not look good". Then we must do further work. There are three basic reasons for the failure:
1. A good description needs higher order linear dynamics. This is actually in practice the least likely reason, except for systems with mechanical resonances. One then obviously has to try higher order models or focus on certain frequency bands by band-pass filtering.
2. There are more signals that significantly affect the output. We must then look for what these signals might be, check if they can be measured and, if so, include them among the inputs. Signal sources that cannot be traced or measured are called "disturbances" and we simply have to live with the fact that they will have an adverse effect on the comparisons.

3. Some important non-linearities have been overlooked. We must then resort to semi-physical modeling to find out if some of the measured signals should be subjected to non-linear transformations. If no such transformations suggest themselves, one might have to try some non-linear black-box model, like a neural network.

Clearly, this advice does not cover all the art of identification, but it is a reasonable first approximation.

Example 6 Aircraft dynamics


Let us try the recipe on the aircraft data in figure 1. Picking the canard angle only as the input, and estimating a fourth order model based on the data points 90 to 180, gives figure 5. (We use 10-step ahead prediction in this example since the models are unstable, as they should be: JAS has unstable dynamics in this flight case.) It does not "look good". Let us try alternative 2: more inputs. We repeat the procedure using all three inputs in figure 1. That is, the model is computed as

arx([y u1 u2 u3], [4 4 4 4 1 1 1])

on the same data set. The comparison is shown in figure 6. It "looks good". By further fine-tuning, as well as using model structures from physical modeling, only slight improvements are obtained.

Example 7 Buffer vessel dynamics


Let us now consider the pulp process of figure 2. We use the κ-number before the vessel as input and the κ-number after the vessel as output. The delay is preliminarily estimated to 12 samples. Our recipe, where a fourth order linear model is estimated using the first 200 samples and then simulated over the whole record, gives figure 7. It does not look good.
Some reflection shows that this process indeed must be non-linear (or time-varying): the flow and the vessel level definitely affect the dynamics. For example, if the flow were a plug flow (no mixing), the vessel would have the dynamics of a pure delay equal to vessel volume divided by flow.

Figure 5: Dashed line: actual pitch rate. Solid line: 10-step ahead predicted pitch rate, based on the fourth order model from canard angle only.

Figure 6: As figure 5 but using all three inputs.

Figure 7: Dashed line: κ-number after the vessel, actual measurements. Solid line: simulated κ-number using the input only and a fourth order linear model with delay 12, estimated using the first 200 data points.

Let us thus resample the data accordingly, i.e. so that a new sample is taken (by interpolation from the original measurements) equidistantly in terms of integrated flow divided by volume. In MATLAB terms this will be

z = [y,u]; pf = flow./level;
t = 1:length(z);
newt = table1([cumsum(pf),t],[pf(1):sum(pf)]');
newz = table1([t,z],newt);

We now apply the same procedure to the resampled data. This gives figure 8. This "looks good". Somewhat better numbers can then be obtained by fine-tuning the orders.

References

[Akaike, 1974a] Akaike, H. (1974a). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19:716-723.

[Akaike, 1974b] Akaike, H. (1974b). Stochastic theory of minimal realization. IEEE Transactions on Automatic Control, AC-19:667-674.

[Åström and Bohlin, 1965] Åström, K. J. and Bohlin, T. (1965). Numerical identification of linear dynamic systems from normal operating records. In IFAC Symposium on Self-Adaptive Systems, Teddington, England.

[Box and Jenkins, 1970] Box, G. E. P. and Jenkins, G. M. (1970). Time Series Analysis, Forecasting and Control. Holden-Day, San Francisco.

[Breiman, 1993] Breiman, L. (1993). Hinging hyperplanes for regression, classification and function approximation. IEEE Trans. Info. Theory, 39:999-1013.

[Brillinger, 1981] Brillinger, D. (1981). Time Series: Data Analysis and Theory. Holden-Day, San Francisco.

[Dennis and Schnabel, 1983] Dennis, J. E. and Schnabel, R. B. (1983). Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall.

[Draper and Smith, 1981] Draper, N. and Smith, H. (1981). Applied Regression Analysis, 2nd ed. Wiley, New York.

[Juditsky et al., 1995] Juditsky, A., Hjalmarsson, H., Benveniste, A., Delyon, B., Ljung, L., Sjöberg, J., and Zhang, Q. (1995). Nonlinear black-box modeling in system identification: Mathematical foundations. Automatica, 31(12):1724-1750.

[Landau, 1990] Landau, I. D. (1990). System Identification and Control Design Using P.I.M. + Software. Prentice Hall, Englewood Cliffs.

[Larimore, 1983] Larimore, W. E. (1983). System identification, reduced order filtering and modelling via canonical variate analysis. In Proc. 1983 American Control Conference, San Francisco.

[Ljung, 1987] Ljung, L. (1987). System Identification - Theory for the User. Prentice-Hall, Englewood Cliffs, N.J.

[Ljung, 1993] Ljung, L. (1993). Identification of linear systems. In Linkens, D. A., editor, CAD for Control Systems, chapter 6, pages 147-165. Marcel Dekker, New York.

[Ljung, 1995] Ljung, L. (1995). The System Identification Toolbox: The Manual. The MathWorks Inc., Natick, MA. 1st edition 1986, 4th edition 1995.

[Ljung and Glad, 1994] Ljung, L. and Glad, T. (1994). Modeling of Dynamic Systems. Prentice Hall, Englewood Cliffs.

[Ljung and Söderström, 1983] Ljung, L. and Söderström, T. (1983). Theory and Practice of Recursive Identification. MIT Press, Cambridge, Mass.

[MATRIXx, 1991] MATRIXx (1991). MATRIXx Users Guide. Integrated Systems Inc., Santa Clara, CA.

[Overschee and DeMoor, 1994] Overschee, P. V. and DeMoor, B. (1994). N4SID: Subspace algorithms for the identification of combined deterministic-stochastic systems. Automatica, 30:75-93.

[Poggio and Girosi, 1990] Poggio, T. and Girosi, F. (1990). Networks for approximation and learning. Proc. of the IEEE, 78:1481-1497.

[Rissanen, 1974] Rissanen, J. (1974). Basis of invariants and canonical forms for linear dynamic systems. Automatica, 10:175-182.

[Rissanen, 1978] Rissanen, J. (1978). Modelling by shortest data description. Automatica, 14:465-471.

[Schoukens and Pintelon, 1991] Schoukens, J. and Pintelon, R. (1991). Identification of Linear Systems: A Practical Guideline to Accurate Modeling. Pergamon Press, London (U.K.).

[Sjöberg et al., 1995] Sjöberg, J., Zhang, Q., Ljung, L., Benveniste, A., Delyon, B., Glorennec, P., Hjalmarsson, H., and Juditsky, A. (1995). Nonlinear black-box modeling in system identification: A unified overview. Automatica, 31(12):1691-1724.

[Söderström and Stoica, 1989] Söderström, T. and Stoica, P. (1989). System Identification. Prentice-Hall Int., London.

[Zhang and Benveniste, 1992] Zhang, Q. and Benveniste, A. (1992). Wavelet networks. IEEE Trans. Neural Networks, 3:889-898.

