System Identification
Lennart Ljung
Division of Automatic Control
E-mail: [email protected]
Address:
Department of Electrical Engineering
Linköpings universitet
SE-581 83 Linköping, Sweden
WWW: http://www.control.isy.liu.se
Technical reports from the Automatic Control group in Linköping are available from
http://www.control.isy.liu.se/publications.
Abstract
This is a survey of System Identification.
Keywords: identification
System Identification
Lennart Ljung
Department of Electrical Engineering, Linköping University
S-581 83 Linköping, Sweden. E-mail: [email protected]
April 27, 1997
1 Introduction
The process of going from observed data to a mathematical model is fundamental in science and engineering. In the control area this process has been termed "System Identification" and the objective is then to find dynamical models (difference or differential equations) from observed input and output signals. Its basic features are, however, common with general model building processes in statistics and other sciences.
System Identification covers the problem of building models of systems both when insignificant prior information is available and when the system's properties are known up to a few parameters (physical constants). Accordingly, one talks about black box and gray box models. Among black box models there are familiar linear models such as ARX and ARMAX, and among non-linear black box models we have, e.g., Artificial Neural Networks (ANN).
Figure 1: Results from test flights of the new Swedish aircraft JAS-Gripen, developed by SAAB Military Aircraft AB, Sweden. From above: a) Pitch rate. b) Elevator angle. c) Canard angle. d) Leading edge flap.
So, the bottom line of these examples is that we have collected input-output data from a process or a plant, and we need to extract information from these data to find out (something about) the process's dynamical properties.
Figure 2: From the pulp factory at Skutskär, Sweden. The pulp flows continuously through the plant via several buffer tanks. From above: a) The κ-number of the pulp flowing into a buffer vessel. b) The κ-number of the pulp coming out from the buffer vessel. c) Flow out from the buffer vessel. d) Level in the buffer vessel.
on-line model estimation [Ljung and Soderstrom, 1983], non-parametric frequency domain methods [Brillinger, 1981], etc. To follow the development in the field, the IFAC series of Symposia on System Identification (Budapest, Hungary (1991), Copenhagen, Denmark (1994), Fukuoka, Japan (1997)) is also a good source.
1.3 Outline
The system identification procedure is characterized by four basic ingredients:
1. The observed data
2. A set of candidate models
3. A criterion of fit
4. Validation
The problem can be expressed as finding the model in the candidate set that best describes the data, according to the criterion, and then evaluating and validating that model's properties. To do this we need to penetrate a number of things:
1. First, in Section 2 we give a preview of the whole process, as applied to the simplest set of candidate models.
2. Then, at some length, in Sections 3 and 4 we display and discuss the most common sets of candidate models used in system identification. In general terms, a model will be a predictor of the next output y(t) from the process, given past observations Z^(t−1), and parameterized in terms of a finite-dimensional parameter vector θ:
ŷ(t|θ) = g(θ, Z^(t−1))   (1)
3. We then, in Section 5, discuss the criterion of fit for general model sets. This will have the character
V_N(θ) = Σ_t ‖y(t) − ŷ(t|θ)‖²   (2)
We also discuss how to find the best model (minimize the criterion) and how to assess its properties.
4. In Section 6 we shall describe special methods for linear black-box models. This includes frequency analysis, spectral analysis and so-called subspace methods for linear state-space models.
5. We then turn to the practical issues of system identification: how to assure good quality of the data by proper experiment design (Section 7), how to decide upon a good model structure (Section 8), and how to deal with the data (Section 9).
V_N(θ, Z^N) = (1/N) Σ_{t=1}^{N} (y(t) − ŷ(t|θ))² = (1/N) Σ_{t=1}^{N} (y(t) − φ^T(t)θ)²   (10)
We shall denote the value of θ that minimizes (9) by θ̂_N:
θ̂_N = arg min_θ V_N(θ, Z^N)   (11)
("arg min" means the minimizing argument, i.e., that value of θ which minimizes V_N.)
Since V_N is quadratic in θ, we can find the minimum value easily by setting the derivative to zero:
0 = (d/dθ) V_N(θ, Z^N) = (2/N) Σ_{t=1}^{N} φ(t)(y(t) − φ^T(t)θ)
which gives
Σ_{t=1}^{N} φ(t)y(t) = Σ_{t=1}^{N} φ(t)φ^T(t) θ   (12)
or
θ̂_N = [Σ_{t=1}^{N} φ(t)φ^T(t)]^{−1} Σ_{t=1}^{N} φ(t)y(t)   (13)
Once the vectors φ(t) are defined, the solution can easily be found by modern numerical software, such as MATLAB.
Example 3: First order difference equation
Consider the simple model
y(t) + ay(t − 1) = bu(t − 1).
This gives us the estimate according to (5), (6) and (13):

[ â_N ]   [  Σ y²(t−1)         −Σ y(t−1)u(t−1) ]⁻¹ [ −Σ y(t)y(t−1) ]
[ b̂_N ] = [ −Σ y(t−1)u(t−1)     Σ u²(t−1)      ]   [  Σ y(t)u(t−1) ]

All sums are from t = 1 to t = N. A typical convention is to take values outside the measured range to be zero. In this case we would thus take y(0) = 0.
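As an illustration, a minimal MATLAB sketch of the least squares solution (13) for the model of Example 3 could look as follows; the column vectors y and u of measured output and input are assumed to be available:

% Least-squares estimate of a and b in y(t) + a*y(t-1) = b*u(t-1)
N     = length(y);
Phi   = [-y(1:N-1), u(1:N-1)];   % rows are phi(t)' for t = 2,...,N
theta = Phi \ y(2:N);            % solves the normal equations (12)-(13)
a_hat = theta(1); b_hat = theta(2);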
The simple model (3) and the well known least squares method (13) form the archetype of System Identification. Not only that -- they also give the most commonly used parametric identification method and are much more versatile than perhaps perceived at first sight. In particular one should realize that (3) can directly be extended to several different inputs (this just calls for a redefinition of φ(t) in (6)) and that the inputs and outputs do not have to be the raw measurements. On the contrary -- it is often most important to think over the physics of the application and come up with suitable inputs and outputs for (3), formed from the actual measurements.
The change in temperature of the heater coil over one sample is proportional to the electrical power in it (the inflow power) minus the heat loss to the liquid.
The electrical power is proportional to v²(t).
The heat loss is proportional to y(t) − r(t).
This suggests the model
Model Quality and Experiment Design
Let us consider the simplest special case, that of a Finite Impulse Response (FIR) model. That is obtained from (3) by taking n = 0:
Suppose that the observed data really have been generated by a similar mechanism
where e(t) is a white noise sequence with variance λ, but otherwise unknown. (That is, e(t) can be described as a sequence of independent random variables with zero mean values and variances λ.) Analogous to (7), we can write this as
We can now replace y(t) in (13) by the above expression, and obtain
θ̂_N = [Σ_{t=1}^{N} φ(t)φ^T(t)]^{−1} Σ_{t=1}^{N} φ(t)y(t)
    = [Σ_{t=1}^{N} φ(t)φ^T(t)]^{−1} {Σ_{t=1}^{N} φ(t)φ^T(t) θ₀ + Σ_{t=1}^{N} φ(t)e(t)}
or
θ̃_N = θ̂_N − θ₀ = [Σ_{t=1}^{N} φ(t)φ^T(t)]^{−1} Σ_{t=1}^{N} φ(t)e(t)   (17)
Suppose that the input u is independent of the noise e. Then φ and e are independent in this expression, so it is easy to see that E θ̃_N = 0, since e has zero mean. The estimate is consequently unbiased. Here E denotes mathematical expectation.
We can also form the expectation of θ̃_N θ̃_N^T, i.e., the covariance matrix of the parameter error. Denote the matrix within brackets by R_N. Take expectation with respect to the white noise e. Then R_N is a deterministic matrix and we have
P_N = E θ̃_N θ̃_N^T = R_N^{−1} Σ_{t,s=1}^{N} φ(t)φ^T(s) E e(t)e(s) R_N^{−1} = λ R_N^{−1}   (18)
For the FIR model the regression vector φ(t) is made up of past inputs only, so for large N the matrix (1/N)R_N approaches R̄, the covariance matrix of the input, i.e., the i,j element of R̄ is R_uu(i − j) = E u(t + i)u(t + j).
If the matrix R̄ is non-singular, we find that the covariance matrix of the parameter estimate is approximately (and the approximation improves as N → ∞)
P_N = (λ/N) R̄^{−1}   (20)
A number of things follow from this. All of them are typical of the general properties to be described in Section 5.2:
The covariance decays like 1/N, so the parameters approach the limiting value at the rate 1/√N.
The covariance is proportional to the noise-to-signal ratio. That is, it is proportional to the noise variance and inversely proportional to the input power.
The covariance does not depend on the input's or noise's signal shapes, only on their variance/covariance properties.
Experiment design, i.e., the selection of the input u, aims at making the matrix R̄^{−1} "as small as possible". Note that the same R̄ can be obtained for many different signals u.
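These properties are easy to check numerically. The following MATLAB sketch (with purely illustrative, assumed values for the system, the noise variance λ and the number of experiments) simulates a two-tap FIR system repeatedly and compares the sample covariance of the estimates with (20):

% Empirical check of (20) for y(t) = b1*u(t-1) + b2*u(t-2) + e(t)
N = 1000; M = 500; lam = 0.1; b0 = [1; 0.5];
est = zeros(2, M);
for k = 1:M
    u = randn(N, 1); e = sqrt(lam)*randn(N, 1);
    y = filter([0; b0], 1, u) + e;                 % simulate the true system
    Phi = [[0; u(1:N-1)], [0; 0; u(1:N-2)]];       % regressors u(t-1), u(t-2)
    est(:, k) = Phi \ y;                           % least-squares estimate (13)
end
cov(est')     % should be close to (lam/N)*inv(Rbar); here Rbar = I for white u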
(assuming n ≥ m)
Here z is the z-transform variable, and one may simply think of the transfer function G as a shorthand notation for the difference equation (3).
We shall here use the shift operator q as an alternative to the variable z in (21). The shift operator q has the properties
qu(t) = u(t + 1),   q⁻¹u(t) = u(t − 1)   (22)
If we know the input {u(t), t = 1, ..., N}
we could then calculate the output for system (21) by running u as input to this system:
ŷ(t|θ) = G(q)u(t)   (23)
Notice the essential difference between (23) and (7)! In (7) we calculated ŷ(t|θ) using both past measured inputs and also past measured outputs y(t − k). In (23) ŷ(t|θ) is calculated from past inputs only. As soon as we use data from a real system (that does not exactly obey (21)) there will always be a difference between these two ways of obtaining the computed output.
Now, we could of course still say that a reasonable estimate of θ is obtained by minimizing the quadratic fit:
See (21)--(22). Based on (26) we can predict the next output from previous measurements either as in (23)
measurements either as in (23)
or as in (4), (7):
Which one shall we choose? We can make the discussion more general by
writing for (26)
to indicate that the transfer function depends on the (numerator and denominator) parameters θ (as in (5)). We can multiply both sides of (29) by an arbitrary stable filter W(q, θ), giving
W(q, θ)y(t) = W(q, θ)G(q, θ)u(t)   (30)
then we can add y(t) to both sides of the equation and rearrange to obtain
y(t) = (1 − W(q, θ))y(t) + W(q, θ)G(q, θ)u(t)   (31)
We assume that the filter W starts with a 1:
W(q, θ) = 1 + w₁q⁻¹ + w₂q⁻² + ...
so that 1 − W(q, θ) actually contains a delay. We thus obtain the predictor
ŷ(t|θ) = (1 − W(q, θ))y(t) + W(q, θ)G(q, θ)u(t)   (32)
Note that this formulation is now similar to that of (28).
We see that the method used in (27) corresponds to the choice W(q, θ) ≡ 1, while the procedure in (28) is obtained for W(q, θ) = A(q).
Now, does the predictor (32) depend on the filter W(q, θ)? Well, if the input-output data are exactly described by (29) and we know all relevant initial conditions, the predictor (32) produces identical predictions ŷ(t|θ), regardless of the choice of stable filter W(q, θ).
To bring out the relevant differences, we must accept the fact that there will always be disturbances and noise that affect the system, so instead of (29) we have a true system that relates the inputs and outputs by
y(t) = G₀(q)u(t) + v(t)   (33)
for some disturbance sequence {v(t)}. So (32) becomes
ŷ(t|θ) = {(1 − W(q, θ))G₀(q) + W(q, θ)G(q, θ)}u(t) + (1 − W(q, θ))v(t)
Now, assume that there exists a value θ₀ such that G(q, θ₀) = G₀(q). Then the error of the above prediction becomes
y(t) − ŷ(t|θ₀) = W(q, θ₀)v(t)
To make this error as small as possible we must thus match the choice of the filter W(q, θ₀) to the properties of the noise v(t). Suppose v(t) can be described as filtered white noise
3. Since the dynamics G(q) and the noise model H(q) are typically unknown, we will have to work with a parameterized description
y(t) = G(q, θ)u(t) + H(q, θ)e(t)   (38)
The corresponding predictor is then obtained from (37):
ŷ(t|θ) = [I − H⁻¹(q, θ)]y(t) + H⁻¹(q, θ)G(q, θ)u(t)   (39)
We may now return to the question we posed at the end of Section 3.1. What is the practical difference between minimizing (10) and (25)? Comparing (23) with (39) we see that this predictor corresponds to the assumption that H = 1, i.e., that white measurement noise is added to the output. This also means that minimizing the corresponding prediction error -- (25) -- will give a clearly better estimate, if this assumption is more or less correct.
G(q, θ) = B(q)/A(q),   H(q, θ) = 1/A(q)   (40)
That is, we assume the system (plant) dynamics and the noise model to have common poles, and no numerator dynamics for the noise. Its main feature is that the predictor ŷ(t|θ) will be linear in the parameters θ according to (11) or (7).
We can make (40) more general by also allowing numerator dynamics. We then obtain the parameterization
G(q, θ) = B(q)/A(q),   H(q, θ) = C(q)/A(q)   (41)
The effect of the numerator C is that the current predicted value of y will depend upon previous predicted values, not just measured values. This is known as an ARMAX model, since the C(q)-term makes the noise model a Moving Average of a white noise source. Also, (41) assumes that the dynamics and the noise model have common poles, and is therefore particularly suited for the case where the disturbances enter together with the input, "early in the process" so to speak.
The output error (OE) model we considered in (23) corresponds to the case
G(q, θ) = B(q)/F(q),   H(q, θ) = 1   (42)
(We use F in the denominator to distinguish the case from (40).) Its unique feature is that the prediction is based on past inputs only. It also concentrates on the model dynamics and does not bother about describing the noise.
We can also generalize this model by allowing a general noise model:
G(q, θ) = B(q)/F(q),   H(q, θ) = C(q)/D(q)   (43)
[Figure: Block diagrams of the ARX, OE, ARMAX and BJ model structures, showing how the input u enters through B/A, B/F, etc., and how the noise e enters through 1/A, C/A or C/D before being added to the output y.]
OE: Concentrates on the input-output dynamics.
BJ: Very flexible. Assumes no common characteristics between the noise and the input-output behavior.
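As an illustration, the four structures above can be fitted with the MATLAB System Identification Toolbox [Ljung, 1995] roughly as sketched below; the data matrix z = [y u] and all polynomial orders are illustrative choices, and the exact calling syntax depends on the toolbox version:

z  = [y u];
m1 = arx(z,   [2 2 1]);       % ARX:   A(q)y = B(q)u + e
m2 = armax(z, [2 2 2 1]);     % ARMAX: A(q)y = B(q)u + C(q)e
m3 = oe(z,    [2 2 1]);       % OE:    y = [B(q)/F(q)]u + e
m4 = bj(z,    [2 2 2 2 1]);   % BJ:    y = [B(q)/F(q)]u + [C(q)/D(q)]e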
Here x(t) is the state vector and typically consists of physical variables (such as positions and velocities, etc.). The state space matrices A, B and C are parameterized by the parameter vector θ, reflecting the physical insight we have into the process. The parameters could be physical constants (resistance, heat transfer coefficients, aerodynamical derivatives, etc.) whose values are not known. They could also reflect other types of insights into the system's properties.
Example 4: An electric motor
Consider an electric motor with the input u being the applied voltage and the output y being the angular position of the motor shaft.
A first, but reasonable, approximation of the motor's dynamics is as a first order system from voltage to angular velocity, followed by an integrator:
G(s) = b / (s(s + a))
If we select the state variables
x(t) = ( y(t), ẏ(t) )^T
where
Ā(θ) = e^{A(θ)T},   B̄(θ) = ∫₀^T e^{A(θ)τ} B(θ) dτ   (48)
This follows from solving (44) over one sampling period. We could also further model the added noise term v(kT) and represent the system in the innovations form
x̄((k + 1)T) = Ā(θ)x̄(kT) + B̄(θ)u(kT) + K̄(θ)e(kT)
y(kT) = C(θ)x̄(kT) + e(kT)   (49)
where {e(kT)} is white noise. The step from (47) to (49) is really a standard Kalman filter step: x̄ will be the one-step ahead predicted Kalman states. A pragmatic way to think about it is as follows: In (47) the term v(kT) may not be white noise. If it is colored we may separate out that part of v(kT) that cannot be predicted from past values. Denote this part by e(kT): it will be the innovation. The other part of v(kT) -- the one that can be predicted -- can then be described as a combination of earlier innovations, e(ℓT), ℓ < k. Its effect on y(kT) can then be described via the states, by changing them from x to x̄, where x̄ contains additional states associated with getting v(kT) from e(ℓT), ℓ ≤ k.
Now (49) can be written in input-output form as (let T = 1)
y(t) = G(q, θ)u(t) + H(q, θ)e(t)   (50)
with
G(q, θ) = C(θ)(qI − Ā(θ))⁻¹B̄(θ)
H(q, θ) = I + C(θ)(qI − Ā(θ))⁻¹K̄(θ)   (51)
We are thus back at the basic linear model (38). The parameterization of G and H in terms of θ is however more complicated than the ones we discussed in Section 3.3.
The general estimation techniques, model properties (including the characterization (85)), algorithms, etc., apply exactly as described in Section 5.
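For the motor in Example 4, a small MATLAB sketch of the sampling step (48) could look as follows; the parameter values θ = [a; b] and the sampling interval T are assumed here purely for illustration:

theta = [2; 5]; T = 0.1;                       % assumed values of a, b and T
A = [0 1; 0 -theta(1)]; B = [0; theta(2)]; C = [1 0];
M = expm([A B; zeros(1, 3)]*T);                % matrix-exponential trick
Abar = M(1:2, 1:2);  Bbar = M(1:2, 3);         % A_bar(theta), B_bar(theta) in (48)
% The noise-free part of G(q,theta) in (51) is then C*inv(q*eye(2) - Abar)*Bbar.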
From these examples it is also quite clear that non-linear models with unknown parameters can be approached in the same way. We would then typically arrive at a structure
ẋ(t) = f(x(t), u(t), θ)
y(t) = h(x(t), u(t), θ) + v(t)   (52)
In this model, all noise effects are collected as additive output disturbances v(t), which is a restriction, but also a very helpful simplification. If we define ŷ(t|θ) as the simulated output response of (52) for a given input, ignoring the noise v(t), everything that was said in Section 5 about parameter estimation, model properties, etc. is still applicable.
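A sketch of how ŷ(t|θ) can be computed for (52) by simulation is given below; the functions f and h, the initial state x0, the parameter vector theta, and the sampled input u at the time points t are all assumptions used only for illustration:

uf = @(tau) interp1(t, u, tau, 'previous', 'extrap');    % piecewise-constant input
[~, X] = ode45(@(tau, x) f(x, uf(tau), theta), t, x0);   % integrate x' = f(x,u,theta)
yhat = zeros(length(t), 1);
for k = 1:length(t)
    yhat(k) = h(X(k, :)', uf(t(k)), theta);              % y_hat(t_k|theta), noise ignored
end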
where
Let the dimension of φ be d. As before, we shall call this vector the regression vector and its components will be referred to as the regressors. We also allow the more general case that the formation of the regressors is itself parameterized:
which we for short write φ(t, θ). For simplicity, the extra argument θ will however be used explicitly only when essential for the discussion.
The choice of the non-linear mapping in (53) has thus been reduced to two partial problems for dynamical systems:
1. How to choose the non-linear mapping g(φ) from the regressor space to the output space (i.e., from R^d to R^p).
2. How to choose the regressors φ(t) from past inputs and outputs.
The second problem is the same for all dynamical systems, and it turns out that the most useful choices of regression vectors are to let them contain past inputs and outputs, and possibly also past predicted/simulated outputs. The regression vector will thus be of the character (6). We now turn to the first problem.
4.2 Non-Linear Mappings: Possibilities
Now let us turn to the nonlinear mapping
g(φ, θ)   (57)
which for any given θ maps from R^d to R^p. For most of the discussion we will use p = 1, i.e., the output is scalar-valued. At this point it does not matter how the regression vector φ = (φ₁ ... φ_d)^T was constructed. It is just a vector that lives in R^d.
It is natural to think of the parameterized function family as function expansions:
g(φ, θ) = Σ_k α_k g_k(φ)   (58)
We refer to g_k as basis functions, since the role they play in (58) is similar to that of a functional space basis. In some particular situations, they do constitute a functional basis. Typical examples are wavelet bases (see below). We are going to show that the expansion (58), with different basis functions, plays the role of a unified framework for investigating most known nonlinear black-box model structures.
Now, the key question is: How do we choose the basis functions g_k? The following facts are essential to understand the connections between most known nonlinear black-box model structures:
All the g_k are formed from one "mother basis function", which we generically denote by κ(x).
This function κ(x) is a function of a scalar variable x.
Typically the g_k are dilated (scaled) and translated versions of κ. For the scalar case d = 1 we may write
g_k(φ) = g_k(φ, β_k, γ_k) = κ(β_k(φ − γ_k))   (59)
We thus use β_k to denote the dilation parameters and γ_k to denote the translation parameters.
A Scalar Example: Fourier Series. Take κ(x) = cos(x). Then (58)-(59) will be the Fourier series expansion, with β_k as the frequencies and γ_k as the phases.
We then just have a variant of (60), since the indicator function can be obtained as the difference of two steps. A smooth version of the step, like the sigmoid function
Classification of single-variable basis functions
Two classes of single-variable basis functions can be distinguished, depending on their nature:
Local Basis Functions are functions having their gradient with bounded support, or at least vanishing rapidly at infinity. Loosely speaking, their variations are concentrated in some interval.
Global Basis Functions are functions having an infinitely spreading (bounded or not) gradient.
Clearly the Fourier series is an example of a global basis function, while (60), (61), (62) and (63) are all local functions.
Approximation Issues
For any of the described choices the resulting model becomes
g(φ, θ) = Σ_{k=1}^{n} α_k κ(β_k(φ − γ_k))   (68)
with the different exact interpretations of the argument β_k(φ − γ_k) just discussed. The expansion is entirely determined by
the scalar valued function κ(x) of a scalar variable x
the way the basis functions are expanded to depend on a vector φ.
The parameterization in terms of θ can be characterized by three types of parameters:
The coordinates α
The scale or dilation parameters β
The location parameters γ
A key issue is how well the function expansion is capable of approximating any possible "true system" g₀(φ). There is a rather extensive literature on this subject. For an identification-oriented survey, see, e.g., [Juditsky et al., 1995].
The bottom line is easy: For almost any choice of κ(x) -- except κ being a polynomial -- the expansion (68) can approximate any "reasonable" function g₀(φ) arbitrarily well for sufficiently large n.
It is not difficult to understand this. It is sufficient to check that the delta function -- or the indicator function for arbitrarily small areas -- can be arbitrarily well approximated within the expansion. Then clearly all reasonable functions can also be approximated. For local κ with radial construction this is immediate: indeed, by scaling and location an arbitrarily small indicator function can be placed anywhere. For the ridge construction one needs to show that a number of hyperplanes defined by β and γ can be placed and intersect so that any small area in R^d is cut out.
The question of how efficient the expansion is, i.e., how large n is required to achieve a certain degree of approximation, is more difficult, and has no general answer. We may point to the following aspects:
If the scale and location parameters β and γ are allowed to depend on the function g₀ to be approximated, then the number of terms n required for a certain degree of approximation is much smaller than if β_k, γ_k, k = 1, ..., is an a priori fixed sequence.
For the local, radial approach the number of terms required to achieve a certain degree of approximation δ of a p times differentiable function is proportional to
n ~ (1/δ)^{d/p}   (69)
Wavelet and Radial Basis Networks. The choice (61), without any orthogonalization, is found in both wavelet networks [Zhang and Benveniste, 1992] and radial basis neural networks [Poggio and Girosi, 1990].
Neural Networks. The ridge choice (67), with κ given by (63), gives a much-used neural network structure, viz. the one hidden layer feedforward sigmoidal net.
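As an illustration, a minimal MATLAB sketch of such a one-hidden-layer sigmoidal network, written directly as the function expansion (68) in its ridge form, is given below; all dimensions and parameter values are arbitrary placeholders:

sigma = @(x) 1./(1 + exp(-x));                 % the "mother" basis function
d = 3; n = 10;                                 % regressor dimension, number of nodes
beta  = randn(n, d);                           % dilation/direction parameters
gamma = randn(n, 1);                           % location parameters
alpha = randn(n, 1);                           % coordinates
g = @(phi) alpha' * sigma(beta*phi - gamma);   % g(phi,theta) for a d-by-1 regressor phi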
The regression vector (typically built up from past inputs and outputs)
The basic function κ and the way it is expanded (local/radial or ridge construction)
The number of elements (nodes) in the expansion (58).
Once these choices have been made, ŷ(t|θ) = g(φ(t), θ) is a well defined function of past data and the parameters θ. The parameters θ are made up of the coordinates α in the expansion (58), and of the location and scale parameters in the different basis functions.
All the algorithms and analytical results of Section 5 can thus be applied. For Neural Network applications these are also the typical estimation algorithms used, often complemented with regularization, which means that a term is added to the criterion (74) that penalizes the norm of θ. This will reduce the variance of the model, in that "spurious" parameters are not allowed to take on large, and mostly random, values. See, e.g., [Sjoberg et al., 1995].
For wavelet applications it is common to distinguish between those parameters that enter linearly in ŷ(t|θ) (i.e., the coordinates in the function expansion) and those that enter non-linearly (i.e., the location and scale parameters). Often the latter are seeded to fixed values and the coordinates are estimated by the linear least squares method. Basis functions that give a small contribution to the fit (corresponding to non-useful values of the scale and location parameters) can then be trimmed away ("pruning" or "shrinking").
5 General Parameter Estimation Techniques
In this section we shall deal with issues that are independent of model structure. Principles and algorithms for fitting models to data, as well as the general properties of the estimated models, are all model-structure independent and equally well applicable to, say, ARMAX models and Neural Network models.
The section is organized as follows. In Section 5.1 the general principles for parameter estimation are outlined. Sections 5.2 and 5.3 deal with the asymptotic (in the number of observed data) properties of the models, while algorithms are described in Section 5.4.
that depends on the unknown parameter vector θ and past data Z^(t−1) (see (8)). This predictor can be linear in y and u. This in turn contains several special cases both in terms of black-box models and physically parameterized ones, as was discussed in Sections 3 and 3.4, respectively. The predictor could also be of general, non-linear nature, as was discussed in Section 4.
In any case we now need a method to determine a good value of θ, based on the information in an observed, sampled data set (8). It suggests itself that the basic least-squares like approach (9) through (11) still is a natural approach, even when the predictor ŷ(t|θ) is a more general function of θ.
A procedure with some more degrees of freedom is the following one:
1. From observed data and the predictor ŷ(t|θ), form the sequence of prediction errors
ε(t, θ) = y(t) − ŷ(t|θ),   t = 1, 2, ..., N   (71)
2. Possibly filter the prediction errors through a linear filter L(q),
ε_F(t, θ) = L(q)ε(t, θ)   (72)
so as to enhance or depress interesting or unimportant frequency bands in the signals.
3. Choose a scalar valued, positive function ℓ(·) so as to measure the "size" or "norm" of the prediction error:
ℓ(ε_F(t, θ))   (73)
4. Minimize the sum of these norms:
θ̂_N = arg min_θ V_N(θ, Z^N)   (74)
where
V_N(θ, Z^N) = (1/N) Σ_{t=1}^{N} ℓ(ε_F(t, θ))   (75)
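As a concrete sketch of (74)-(75), the following MATLAB lines fit a second order output-error model by direct numerical minimization of the quadratic criterion; the data vectors y and u, the quadratic choice of ℓ, the prefilter L(q) = 1, and the initial guess are all assumptions made only for illustration:

VN = @(th) mean((y - filter([0 th(3) th(4)], [1 th(1) th(2)], u)).^2);   % (75)
th0 = [-1.5 0.7 0 1];                     % initial guess [f1 f2 b1 b2]
th_hat = fminsearch(VN, th0);             % numerical search for the minimizing argument (74)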
θ* = arg min_θ E ℓ(ε_F(t, θ))   (78)
That is, as more and more data become available, the estimate converges to the value θ* that would minimize the expected value of the "norm" of the filtered prediction errors. This is in a sense the best possible approximation of the true system that is available within the model structure. The expectation E in (78) is taken with respect to all random disturbances that affect the data and it also includes averaging over the input properties. This means in particular that θ* will make ŷ(t|θ*) a good approximation of y(t) with respect to those aspects of the system that are enhanced by the input signal used.
The second basic result is the following one: If {ε(t, θ*)} is approximately white noise, then the covariance matrix of θ̂_N is approximately given by
E(θ̂_N − θ*)(θ̂_N − θ*)^T ≈ (λ/N)[E ψ(t)ψ^T(t)]⁻¹   (79)
where
λ = E ε²(t, θ*)   (80)
ψ(t) = (d/dθ) ŷ(t|θ) |_{θ=θ*}   (81)
The results (77) through (81) are general and hold for all model structures, both linear and non-linear ones, subject only to some regularity and smoothness conditions. They are also fairly natural, and will give the guidelines for all user choices involved in the process of identification. See [Ljung, 1987] for more details around this.
Let Φ_u(ω) be the input spectrum and Φ_v(ω) be the spectrum of the additive disturbance v. Then the filtered prediction error can be written
If the noise model H(q, θ) = H*(q) does not depend on θ (as in the output error model (42)), the expression (85) thus shows that the resulting model G(e^{iω}, θ*) will give the frequency function in the model set that is closest to the true one, in a quadratic frequency norm with weighting function
This shows clearly that the fit can be affected by the choice of prefilter L, the input spectrum Φ_u and the noise model H*.
5.3 Measures of Model Fit
Some quite general expressions for the expected model fit, that are independent of the model structure, can also be developed.
Let us measure the (average) fit between any model (70) and the true system as
V̄(θ) = E ‖y(t) − ŷ(t|θ)‖²
Here the expectation E is over the data properties (i.e., expectation over "Z^∞" with the notation (8)). Recall that expectation also can be interpreted as sample means as in (82).
Before we continue, let us note the very important aspect that the fit V̄ will depend, not only on the model and the true system, but also on data properties, like input spectra, possible feedback, etc. We shall say that the fit depends on the experimental conditions.
The estimated model parameter θ̂_N is a random variable, because it is constructed from observed data that can be described as random variables. To evaluate the model fit, we then take the expectation of V̄(θ̂_N) with respect to the estimation data. That gives our measure
F_N = E V̄(θ̂_N)   (88)
The rather remarkable fact is that if the two last data properties coincide, then, asymptotically in N (see, e.g., [Ljung, 1987], Chapter 16),
F_N ≈ V̄(θ*)(1 + dim θ / N)   (89)
Here θ* is the value that minimizes the expected criterion (78). The notation dim θ means the number of estimated parameters. The result also assumes that the criterion function ℓ(ε) = ‖ε‖², and that the model structure is successful in the sense that ε_F(t) is approximately white noise.
Despite the reservations about the formal validity of (89), it carries a most important conceptual message: If a model is evaluated on a data set with the same properties as the estimation data, then the fit will not depend on the data properties, and it will depend on the model structure only in terms of the number of parameters used and of the best fit offered within the structure.
The expression can be rewritten as follows. Let ŷ₀(t|t − 1) denote the "true" one step ahead prediction of y(t), and let
W(θ) = E ‖ŷ₀(t|t − 1) − ŷ(t|θ)‖²   (90)
and let
λ = E ‖y(t) − ŷ₀(t|t − 1)‖²   (91)
Then λ is the innovations variance, i.e., that part of y(t) that cannot be predicted from the past. Moreover W(θ*) is the bias error, i.e., the discrepancy between the true predictor and the best one available in the model structure. Under the same assumptions as above, (89) can be rewritten as
F_N ≈ λ + W(θ*) + λ dim θ / N   (92)
The three terms constituting the model error then have the following interpretations:
λ is the unavoidable error, stemming from the fact that the output cannot be exactly predicted, even with perfect system knowledge.
W(θ*) is the bias error. It depends on the model structure, and on the experimental conditions. It will typically decrease as dim θ increases.
The last term is the variance error. It is proportional to the number of estimated parameters and inversely proportional to the number of data points. It does not depend on the particular model structure or the experimental conditions.
Search directions
The basis for the local search is the gradient
V_N'(θ) = dV_N(θ)/dθ = −(1/N) Σ_{t=1}^{N} (y(t) − ŷ(t|θ)) ψ(t, θ)   (95)
where
ψ(t, θ) = ∂ŷ(t|θ)/∂θ   (96)
The gradient ψ is in the general case a matrix with dim θ rows and dim y columns. It is well known that gradient search for the minimum is inefficient, especially close to the minimum. Then it is optimal to use the Newton search direction
R⁻¹(θ)V_N'(θ)   (97)
where
R(θ) = V_N''(θ) = d²V_N(θ)/dθ² = (1/N) Σ_{t=1}^{N} ψ(t, θ)ψ^T(t, θ) − (1/N) Σ_{t=1}^{N} (y(t) − ŷ(t|θ)) ∂²ŷ(t|θ)/∂θ²   (98)
The true Newton direction will thus require that the second derivative ∂²ŷ(t|θ)/∂θ² be computed. Also, far from the minimum, R(θ) need not be positive semidefinite. Therefore alternative search directions are more common in practice:
- Gradient direction. Simply take
R_i = I   (99)
- Gauss-Newton direction. Use
R_i = H_i = (1/N) Σ_{t=1}^{N} ψ(t, θ̂^(i)) ψ^T(t, θ̂^(i))   (100)
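A schematic MATLAB version of a damped Gauss-Newton iteration is sketched below; the functions predict(theta) (returning ŷ(t|θ) for all t) and psi(theta) (returning the N-by-dim θ matrix whose rows are ψ^T(t, θ)) are hypothetical and would have to be supplied by the chosen model structure:

theta = theta0;                                   % initial value, assumed given
N = length(y);
for i = 1:20
    e   = y - predict(theta);                     % prediction errors
    Psi = psi(theta);                             % rows are psi(t,theta)'
    g   = -(Psi'*e)/N;                            % gradient (95)
    H   = (Psi'*Psi)/N;                           % Gauss-Newton matrix (100)
    p   = -H\g;                                   % search direction (97) with R = H
    mu  = 1;                                      % step size, halved until the criterion decreases
    while mean((y - predict(theta + mu*p)).^2) > mean(e.^2) && mu > 1e-8
        mu = mu/2;
    end
    theta = theta + mu*p;
end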
Local Minima
A fundamental problem with minimization tasks like (9) is that V_N(θ) may have several or many local (non-global) minima, where local search algorithms may get caught. There is no easy solution to this problem. It is usually well worth the effort to find a good initial value θ^(0) at which to start the iterations. Other than that, only various global search strategies are left, such as random search, random restarts, simulated annealing, and genetic algorithms.
6 Special Estimation Techniques for Linear Black Box Models
An important feature of a linear, time invariant system is that it is entirely characterized by its impulse response. So if we know the system's response to an impulse, we will also know its response to any input. Equivalently, we could study the frequency response, which is the Fourier transform of the impulse response.
In this section we shall consider estimation methods for linear systems that do not use particular model parameterizations. First, in Section 6.1, we shall consider direct methods to determine the impulse response and the frequency response, by simply applying the definitions of these concepts.
In Section 6.2 spectral analysis for frequency function estimation will be discussed. Finally, in Section 6.3 a recent method to estimate general linear systems (of given order, but unspecified structure) will be described.
Frequency Analysis
If a linear system has the transfer function G(q) and the input is
u(t) = u₀ cos ωkT,   (k − 1)T < t ≤ kT   (102)
then the output after possible transients have faded away will be
y(t) = y₀ cos(ωt + φ),   for t = T, 2T, 3T, ...   (103)
where
y₀ = |G(e^{iωT})| u₀   (104)
φ = arg G(e^{iωT})   (105)
If the system is driven by the input (102) for a certain u₀ and ω₁, and we measure y₀ and φ from the output signal, it is possible to determine the complex number G(e^{iω₁T}) using (104)-(105). By repeating this procedure for a number of different ω, we can get a good estimate of the frequency function G(e^{iωT}). This method is called frequency analysis. Sometimes it is possible to see or measure u₀, y₀, and φ directly from graphs of the input and output signals. Most of the time, however, there will be noise and irregularities that make it difficult to determine φ directly. A suitable procedure is then to correlate the output with cos ωt and sin ωt.
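In MATLAB-like terms, this correlation procedure for a single frequency could be sketched as follows; y, u0, the frequency w and the sampling interval T are assumed given, and the sums are assumed to run over (roughly) an integer number of periods:

N  = length(y); tvec = (1:N)'*T;
Ic = (2/N)*sum(y .* cos(w*tvec));
Is = (2/N)*sum(y .* sin(w*tvec));
y0  = sqrt(Ic^2 + Is^2);              % estimated output amplitude, cf. (104)
phi = atan2(-Is, Ic);                 % estimated phase shift, cf. (105)
Ghat = (y0/u0)*exp(1i*phi);           % estimate of G(e^{i*w*T})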
6.2 Estimating the Frequency Response by Spectral Analysis
Definitions
The cross spectrum between two (stationary) signals u(t) and y(t) is defined as the Fourier transform of their cross covariance function, provided this exists:
Φ_yu(ω) = Σ_{τ=−∞}^{∞} R_yu(τ) e^{−iωτ}   (106)
The (auto) spectrum Φ_u(ω) of a signal u is defined as Φ_uu(ω), i.e., as its cross spectrum with itself.
The spectrum describes the frequency contents of the signal. The connection to more explicit Fourier techniques is evident by the following relationship
It is straightforward to show that the relationship between the spectra and cross spectra of y and u (provided u and v are uncorrelated) is given by
Φ_yu(ω) = G(e^{iω})Φ_u(ω)   (111)
Φ_y(ω) = |G(e^{iω})|² Φ_u(ω) + Φ_v(ω)   (112)
It is easy to see how the transfer function G(e^{iω}) and the noise spectrum Φ_v(ω) can be estimated using these expressions, if only we have a method to estimate cross spectra.
Estimation of Spectra
The spectrum is defined as the Fourier transform of the correlation function. A natural idea would then be to take the transform of the estimate
R̂_yu^N(τ) = (1/N) Σ_{t=1}^{N} y(t)u(t − τ)   (113)
That will not work in most cases, though. The reason could be described as follows: The estimate R̂_yu^N(τ) is not reliable for large τ, since it is based on only a few observations. These "bad" estimates are mixed with good ones in the Fourier transform, thus creating an overall bad estimate. It is better to introduce a weighting, so that correlation estimates for large lags τ carry a smaller weight:
covered). At the same time it will have to use "bad" estimates, so the statistical quality (the variance) is poorer. We shall return to this trade-off in a moment. How should we choose the shape of the window function w_γ(k)? There is no optimal solution to this problem, but the most common window used in spectral analysis is the Hamming window:
w_γ(k) = ½(1 + cos(πk/γ)),   |k| < γ
w_γ(k) = 0,   |k| ≥ γ   (115)
From the spectral estimates Φ̂_u, Φ̂_y and Φ̂_yu obtained in this way, we can now use (111) to obtain a natural estimate of the frequency function G(e^{iω}):
Ĝ_N(e^{iω}) = Φ̂_yu^N(ω) / Φ̂_u^N(ω)   (116)
Furthermore, the disturbance spectrum can be estimated from (112) as
Φ̂_v^N(ω) = Φ̂_y^N(ω) − |Φ̂_yu^N(ω)|² / Φ̂_u^N(ω)   (117)
To compute these estimates, the following steps are performed:
5. Form the spectral estimates Φ̂_y^N(ω), Φ̂_u^N(ω), and Φ̂_yu^N(ω) according to (114) and analogous expressions.
6. Form (116) and possibly also (117).
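A compact MATLAB sketch of these steps (Blackman-Tukey type estimates with a Hamming lag window; xcorr is from the Signal Processing Toolbox, y and u are assumed column vectors, and the width gam is an arbitrary choice) could be:

gam = 30; N = length(y);
w   = 0.5*(1 + cos(pi*(-gam:gam)'/gam));        % lag window (115)
Ryu = xcorr(y, u, gam, 'biased').*w;            % windowed covariance estimates (114)
Ru  = xcorr(u, u, gam, 'biased').*w;
Ry  = xcorr(y, y, gam, 'biased').*w;
om  = (0.01:0.01:pi)';                          % frequency grid
F   = exp(-1i*om*(-gam:gam));                   % Fourier transform of the lags
Phiyu = F*Ryu; Phiu = real(F*Ru); Phiy = real(F*Ry);
Ghat    = Phiyu./Phiu;                          % (116)
Phivhat = Phiy - abs(Phiyu).^2./Phiu;           % (117)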
Quality of the Estimates
The estimates Ĝ_N and Φ̂_v^N are formed entirely from estimates of spectra and cross spectra. Their properties will therefore be inherited from the properties of the spectral estimates. For the Hamming window with width γ, it can be shown that the frequency resolution will be about
π√2/γ  radians/time unit   (118)
This means that details in the true frequency function that are finer than this expression will be smeared out in the estimate. It is also possible to show that the estimates' variances satisfy
["Variance" here refers to taking expectation over the noise sequence v(t).]
Note that the relative variance in (119) typically increases dramatically as ω tends to the Nyquist frequency. The reason is that |G(e^{iω})| typically decays rapidly, while the noise-to-signal ratio Φ_v(ω)/Φ_u(ω) has a tendency to increase as ω increases. In a Bode diagram the estimates will thus show considerable fluctuations at high frequencies. Moreover, the constant frequency resolution (118) will look thinner and thinner at higher frequencies in a Bode diagram, due to the logarithmic frequency scale.
See [Ljung and Glad, 1994] for a more detailed discussion.
E(t) = [ w(t) ; e(t) ]
From this all the matrix elements in Θ can be estimated by the simple least squares method, as described in Section 2. The covariance matrix for E(t) can also be estimated easily as the sample sum of the model residuals. That will give the covariance matrices for w and e, as well as the cross covariance matrix between w and e. These matrices will, among other things, allow us to compute the Kalman filter for (121). Note that all of the above holds without changes for multivariable systems, i.e., when the output and input signals are vectors.
The only remaining problem is where to get the state vector sequence x from. It has long been known, e.g., [Rissanen, 1974], [Akaike, 1974b], that all state vectors x(t) that can be reconstructed from input-output data in fact are linear combinations of the components of the n k-step ahead output predictors
where n is the model order (the dimension of x). See also Appendix 4.A in [Ljung, 1987]. We could then form these predictors, and select a basis among their components:
x(t) = L ( ŷ(t + 1|t), ..., ŷ(t + n|t) )^T   (124)
The choice of L will determine the basis for the state-space realization, and is done in such a way that it is well conditioned. The predictor ŷ(t + k|t) is a linear function of u(s), y(s), 1 ≤ s ≤ t, and can efficiently be determined by linear projections directly on the input-output data. (There is one complication in that u(t + 1), ..., u(t + k) should not be predicted, even if they affect y(t + k).)
What we have described now is the subspace projection approach to estimating the matrices of the state-space model (121), including the basis for the representation and the noise covariance matrices. There are a number of variants of this approach. See, among several references, e.g., [Overschee and DeMoor, 1994], [Larimore, 1983].
The approach gives very useful algorithms for model estimation, and is particularly well suited for multivariable systems. The algorithms also allow numerically very reliable implementations. At present, the asymptotic properties of the methods are not fully investigated, and the general results quoted in Section 5.2 are not directly applicable. Experience has shown, however, that confidence intervals computed according to the general asymptotic theory are good approximations. One may also use the estimates obtained by a subspace method as initial conditions for minimizing the prediction error criterion (74).
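In the MATLAB System Identification Toolbox this would, schematically, amount to something like the following; the model order 4 is just an example, and the exact calling syntax depends on the toolbox version:

m_sub = n4sid([y u], 4);        % subspace estimate of a 4th order state-space model
m_pem = pem([y u], m_sub);      % refined by minimizing the prediction error criterion (74)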
7 Data Quality
It is desirable to affect the conditions under which the data are collected. The objective with such experiment design is to make the collected data set Z^N as informative as possible with respect to the models to be built using the data. A considerable amount of theory around this topic can be developed and we shall here just review some basic points.
The first and most important point is the following one:
1. The input signal u must be such that it exposes all the relevant properties of the system. It must thus not be too "simple". For example, a pure sinusoid
u(t) = A cos ωt
will only give information about the system's frequency response at frequency ω. This can also be seen from (85). The rule is that
the input must contain at least as many different frequencies as the order of the linear model to be built.
To be on the safe side, a good choice is to let the input be random (such as filtered white noise). It then contains all frequencies.
Another case where the input is too simple is when it is generated by feedback such as
u(t) = −Ky(t)   (125)
If we would like to build a first order ARX model
y(t) + ay(t − 1) = bu(t − 1) + e(t)
we find that for any given γ all models such that
a + bK = γ
will give identical input-output data. We can thus not distinguish between these models using an experiment with (125). That is, we can not distinguish between any combinations of "a" and "b" if they satisfy the above condition for a given "γ". The rule is:
If closed-loop experiments have to be performed, the feedback law must not be too simple. It is to be preferred that a set-point in the regulator is changed in a random fashion.
The second main point in experimental design is:
2. Allocate the input power to those frequency bands where a good model is particularly important.
This is also seen from the expression (85).
If we let the input be filtered white noise, this gives information about how to choose the filter. In the time domain it is often useful to think like this:
Use binary (two-level) inputs if linear models are to be built: this gives maximal variance for amplitude-constrained inputs.
Check that the changes between the levels are such that the input occasionally stays on one level so long that a step response from the system has time, more or less, to settle. There is no need to let the input signal switch so quickly back and forth that no response in the output is clearly visible.
Note that the second point is really just a reformulation in the time domain of the basic frequency domain advice: let the input energy be concentrated in the important frequency bands.
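A common way of generating such an input in MATLAB is sketched below; idinput is in the System Identification Toolbox, and the band and the levels are illustrative choices only:

N = 1000;
u = idinput(N, 'prbs', [0 0.1], [-1 1]);   % binary signal with its energy in the low band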
A third basic piece of advice about experiment design concerns the
choice of sampling interval.
3. A typical good sampling frequency is 10 times the bandwidth of the
system. That corresponds roughly to 5-7 samples along the rise time
of a step response.
back and review the choice of model set, or perhaps modify the data set. See Figure 4!
How do we check the quality of a model? The prime method is to investigate how well it is capable of reproducing the behavior of a new set of data (the validation data) that was not used to fit the model. That is, we simulate the obtained model with a new input and compare this simulated output with the measured output. One may then use one's eyes or numerical measures of fit to decide if the fit in question is good enough. Suppose we have obtained several different models in different model structures (say a 4th order ARX model, a 2nd order BJ model, a physically parameterized one, and so on) and would like to know which one is best. The simplest and most pragmatic approach to this problem is then to simulate each one of them on validation data, evaluate their performance, and pick the one that shows the most favorable fit to measured data. (This could indeed be a subjective criterion!)
Cross Validation
A very natural and pragmatic approach is Cross Validation. This means that the available data set is split into two parts: estimation data, Z^{N1}_est, that is used to estimate the models, and validation data, Z^{N2}_val, on which the criterion is evaluated:
F̂_N = V_{N2}(θ̂_{N1}, Z^{N2}_val)   (127)
Here V_N is the criterion (75). Then F̂_N will be an unbiased estimate of the measure F_N, defined by (88), which was discussed at length in the previous section. The procedure would then be to try out a number of model structures, and choose the one that minimizes F̂_N.
Such cross validation techniques to find a good model structure have an immediate intuitive appeal. We simply check if the candidate model is capable of "reproducing" data it hasn't yet seen. If that works well, we have some confidence in the model, regardless of any probabilistic framework that might be imposed. Such techniques are also the most commonly used ones.
A few comments could be added. In the first place, one could use different splits of the original data into estimation and validation data. For example, in statistics, there is a common cross validation technique called "leave one out". This means that the validation data set consists of one data point "at a time", but successively applied to the whole original set. In the second place, the test of the model on the validation data does not have to be in terms of the particular criterion (127). In system identification it is common practice to simulate (or predict several steps ahead) the model using the validation data, and then visually inspect the agreement between measured and simulated (predicted) output.
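A sketch of this procedure in MATLAB terms, comparing a few candidate structures on data not used for the fit, could look as follows; the data matrix z = [y u], the split point and the orders are all arbitrary choices:

ze = z(1:200, :); zv = z(201:end, :);            % estimation and validation parts
models = {arx(ze, [4 4 1]), armax(ze, [2 2 2 1]), oe(ze, [2 2 1])};
for k = 1:length(models)
    compare(zv, models{k});                      % simulated vs. measured output on validation data
end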
8.3 Residual Analysis
The second basic method for model validation is to examine the residuals ("the leftovers") from the identification process. These are the prediction errors
ε(t) = ε(t, θ̂_N) = y(t) − ŷ(t|θ̂_N)
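In the MATLAB System Identification Toolbox, residual analysis is carried out by the routine resid, which plots the autocorrelation of ε(t) and its cross-correlation with the input, together with confidence bounds; zv and m below are assumed to be validation data and an estimated model, as in the cross-validation sketch above:

e = resid(zv, m);     % residuals of model m on the data zv, with correlation plots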
A Handling of data, plotting, etc.
Filtering of data, removal of drift, choice of data segments, etc.
B Non-parametric identification methods
Estimation of covariances, Fourier transforms, correlation and spectral analysis, etc.
C Parametric estimation methods
Calculation of parametric estimates in different model structures.
D Presentation of models
Simulation of models, estimation and plotting of poles and zeros, computation of frequency functions, and plotting Bode diagrams, etc.
E Model validation
Computation and analysis of residuals (ε(t, θ̂_N)). Comparison between different models' properties, etc.
The existing program packages differ mainly in various user interfaces and by different options regarding the choice of model structure according to C above. For example, MATLAB's Identification Toolbox [Ljung, 1995] covers all linear model structures discussed here, including arbitrarily parameterized linear models in continuous time.
Regarding the user interface, there is now a clear trend to make it graphically oriented. This avoids syntax problems and relies more on "click and move", at the same time as tedious menu-labyrinths are avoided. More aspects of CAD tools for system identification are treated in [Ljung, 1993].
1. Find out a good value for the delay between input and output, e.g. by using correlation analysis.
2. Estimate a fourth order linear model with this delay using part of the data, and simulate this model with the input and compare the model's simulated output with the measured output over the whole data record. In MATLAB language this is simple:
z = [y u];
compare(z, arx(z(1:200,:), [4 4 1]))
3. Some important non-linearities have been overlooked. We must then resort to semi-physical modeling to find out if some of the measured signals should be subjected to non-linear transformations. If no such transformations suggest themselves, one might have to try some non-linear black-box model, like a neural network.
Clearly, this advice does not cover all the art of identification, but it is a reasonable first approximation.
Figure 5: Dashed line: actual pitch rate. Solid line: 10 step ahead predicted pitch rate, based on the fourth order model from canard angle only.
Figure 7: Dashed line: κ-number after the vessel, actual measurements. Solid line: simulated κ-number using the input only and a fourth order linear model with delay 12, estimated using the first 200 data points.
Let us thus resample the data accordingly, i.e., so that a new sample is taken (by interpolation from the original measurements) equidistantly in terms of integrated flow divided by volume. In MATLAB terms this will be
z = [y, u]; pf = flow./level;
t = 1:length(z);
newt = table1([cumsum(pf), t'], [pf(1):sum(pf)]');
newz = table1([t', z], newt);
We now apply the same procedure to the resampled data. This gives Figure 8. This "looks good". Somewhat better numbers can then be obtained by fine-tuning the orders.
References
[Akaike, 1974a] Akaike, H. (1974a). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19:716-723.
[Juditsky et al., 1995] Juditsky, A., Hjalmarsson, H., Benveniste, A., Delyon, B., Ljung, L., Sjoberg, J., and Zhang, Q. (1995). Nonlinear black-box modeling in system identification: Mathematical foundations. Automatica, 31(12):1724-1750.
[Landau, 1990] Landau, I. D. (1990). System Identification and Control Design Using P.I.M. + Software. Prentice Hall, Englewood Cliffs.
[Larimore, 1983] Larimore, W. E. (1983). System identification, reduced order filtering and modelling via canonical variate analysis. In Proc. 1983 American Control Conference, San Francisco.
[Ljung, 1987] Ljung, L. (1987). System Identification - Theory for the User. Prentice-Hall, Englewood Cliffs, N.J.
[Ljung, 1993] Ljung, L. (1993). Identification of linear systems. In Linkens, D. A., editor, CAD for Control Systems, chapter 6, pages 147-165. Marcel Dekker, New York.
[Ljung, 1995] Ljung, L. (1995). The System Identification Toolbox: The Manual. The MathWorks Inc. 1st edition 1986, 4th edition 1995, Natick, MA.
[Ljung and Glad, 1994] Ljung, L. and Glad, T. (1994). Modeling of Dynamic Systems. Prentice Hall, Englewood Cliffs.
[Ljung and Soderstrom, 1983] Ljung, L. and Soderstrom, T. (1983). Theory and Practice of Recursive Identification. MIT Press, Cambridge, Mass.
[MATRIXx, 1991] MATRIXx (1991). MATRIXx users guide. Integrated Systems Inc., Santa Clara, CA.
[Overschee and DeMoor, 1994] Overschee, P. V. and DeMoor, B. (1994). N4sid: Subspace algorithms for the identification of combined deterministic-stochastic systems. Automatica, 30:75-93.
[Poggio and Girosi, 1990] Poggio, T. and Girosi, F. (1990). Networks for approximation and learning. Proc. of the IEEE, 78:1481-1497.
[Rissanen, 1974] Rissanen, J. (1974). Basis of invariants and canonical forms for linear dynamic systems. Automatica, 10:175-182.
[Rissanen, 1978] Rissanen, J. (1978). Modelling by shortest data description. Automatica, 14:465-471.
[Schoukens and Pintelon, 1991] Schoukens, J. and Pintelon, R. (1991). Identification of Linear Systems: A Practical Guideline to Accurate Modeling. Pergamon Press, London (U.K.).
[Sjoberg et al., 1995] Sjoberg, J., Zhang, Q., Ljung, L., Benveniste, A., Delyon, B., Glorennec, P., Hjalmarsson, H., and Juditsky, A. (1995). Nonlinear black-box modeling in system identification: A unified overview. Automatica, 31(12):1691-1724.
[Soderstrom and Stoica, 1989] Soderstrom, T. and Stoica, P. (1989). System Identification. Prentice-Hall Int., London.
[Zhang and Benveniste, 1992] Zhang, Q. and Benveniste, A. (1992). Wavelet networks. IEEE Trans. Neural Networks, 3:889-898.