System Identification
Torsten Soderstrom
Petre Stoica
Since its publication in 1989, System Identification has become a standard reference source for researchers and a frequently used textbook for classroom and self
studies. The book has also appeared in paperback in 1994, and a Polish translation has been published as well in 1997.
The original publisher has now declared the book out of print. Because we have
got very positive feedback from many colleagues who are missing our book, we
have decided to arrange for a reprinting, and here it is.
We have chosen to let the text appear in the same form as the 1989 version. Over
the years we have found only very few typing errors, and they are listed on the
next page. We hope that you, our readers, will enjoy the book.
BANKS, S.P., Mathematical Theories of Nonlinear Systems
BENNETT, S., Real-time Computer Control: an introduction
CEGRELL, T., Power Systems Control
COOK, P.A., Nonlinear Dynamical Systems
LUNZE, J., Robust Multivariable Feedback Control
PATTON, R., CLARK, R.N., FRANK, P.M. (editors), Fault Diagnosis in Dynamic Systems
SODERSTROM, T., STOICA, P., System Identification
WARWICK, K., Control Systems: an introduction
SYSTEM
IDENTIFICATION
TORSTEN SODERSTROM
Automatic Control and Systems Analysis Group
Department of Technology, Uppsala University
Uppsala, Sweden
PETRE STOICA
Department of Automatic Control
Polytechnic Institute of Bucharest
Bucharest, Romania
PRENTICE HALL
NEW YORK
LONDON
TORONTO
SYDNEY
TOKYO
ISBN 0-13-881236-5
CONTENTS
1 INTRODUCTION
2 INTRODUCTORY EXAMPLES
3 NONPARAMETRIC METHODS
3.1 Introduction
3.2 Transient analysis
3.3 Frequency analysis
3.4 Correlation analysis
3.5 Spectral analysis
Summary
Problems
Bibliographical notes
Appendices
A3.1 Covariance functions, spectral densities and linear filtering
A3.2 Accuracy of correlation analysis
4 LINEAR REGRESSION
C4.1 Best linear unbiased estimation under linear constraints
C4.2 Updating the parameter estimates in linear regression models
C4.3 Best linear unbiased estimates for linear regression models with possibly
singular residual covariance matrix
C4.4 Asymptotically best consistent estimation of certain nonlinear
regression parameters
5 INPUT SIGNALS
5.1 Some commonly used input signals
5.2 Spectral characteristics
5.3 Lowpass filtering
5.4 Persistent excitation
Summary
Problems
Bibliographical notes
Appendix
A5.1 Spectral properties of periodic signals
Complements
C5.1 Difference equation models with persistently exciting inputs
C5.2 Condition number of the covariance matrix of filtered white noise
C5.3 Pseudorandom binary sequences of maximum length
6 MODEL PARAMETRIZATIONS
6.1 Model classifications
6.2 A general model structure
6.3 Uniqueness properties
6.4 Identifiability
Summary
Problems
Bibliographical notes
Appendix
A6.1 Spectral factorization
Complements
C6.1 Uniqueness of the full polynomial form model
C6.2 Uniqueness of the parametrization and the positive definiteness of the
input-output covariance matrix
Complements
C7.1 Approximation models depend on the loss function used in estimation
C7.2 Multistep prediction of ARMA processes
C7.3 Least squares estimation of the parameters of full polynomial form models
C7.4 The generalized least squares method
C7.5 The output error method
C7.6 Unimodality of the PEM loss function for ARMA processes
C7.7 Exact maximum likelihood estimation of AR and ARMA parameters
C7.8 ML estimation from noisy input-output data
11 MODEL VALIDATION AND MODEL STRUCTURE DETERMINATION
11.1 Introduction
11.2 Is a model flexible enough?
11.3 Is a model too complex?
11.4 The parsimony principle
11.5 Comparison of model structures
Summary
Problems
Bibliographical notes
Appendices
A11.1 Analysis of tests on covariance functions
A11.2 Asymptotic distribution of the relative decrease in the criterion function
Complement
C11.1 A general form of the parsimony principle
APPENDIX A
A.3 The QR method
A.4 Matrix norms and numerical accuracy
A.5 Idempotent matrices
A.6 Sylvester matrices
A.7 Kronecker products
A.8 An optimization result for positive definite matrices
Bibliographical notes
REFERENCES
ANSWERS AND FURTHER HINTS TO THE PROBLEMS
AUTHOR INDEX
SUBJECT INDEX
PREFACE AND
ACKNOWLEDGMENTS
System identification is the field of mathematical modeling of systems from experimental
data. It has acquired widespread applications in many areas. In control and systems
engineering, system identification methods are used to get appropriate models for
synthesis of a regulator, design of a prediction algorithm, or simulation. In signal
processing applications (such as in communications, geophysical engineering and
mechanical engineering), models obtained by system identification are used for spectral
analysis, fault detection, pattern recognition, adaptive filtering, linear prediction and
other purposes. System identification techniques are also successfully used in nontechnical fields such as biology, environmental sciences and econometrics to develop
models for increasing scientific knowledge on the identified object, or for prediction and
control.
This book is intended for senior undergraduate and graduate level courses on
system identification. It will provide the reader with a profound understanding of the subject
matter as well as the necessary background for performing research in the field. The
book is primarily designed for classroom studies but can be used equally well for self-studies.
To reach its twofold goal of being both a basic and an advanced text on system identification, which addresses both the student and the researcher, the book is organized as
follows. The chapters contain a main text that should fit the needs for graduate and
advanced undergraduate courses. For most of the chapters some additional (often more
detailed or more advanced) results are presented in extra sections called complements.
In a short or undergraduate course many of the complements may be skipped. In other
courses, such material can be included at the instructor's choice to provide a more
profound treatment of specific methods or algorithmic aspects of implementation.
Throughout the book, the important general results are included in solid boxes. In a few
places, intermediate results that are essential to later developments, are included in
dashed boxes. More complicated derivations or calculations are placed in chapter
appendices that follow immediately the chapter text. Several general background results
from linear algebra, matrix theory, probability theory and statistics are collected in the
general appendices A and B at the end of the book. All chapters, except the first one,
include problems to be dealt with as exercises for the reader. Some problems are
illustrations of the results derived in the chapter and are rather simple, while others are
aimed to give new results and insight and are often more complicated. The problem
sections can thus provide appropriate homework exercises as well as challenges for more
advanced readers. For each chapter, the simple problems are given before the more
advanced ones. A separate solutions manual has been prepared which contains solutions
to all the problems.
The book does not contain computer exercises. However, we find it very important
that the students really apply some identification methods, preferably on real data. This
will give a deeper understanding of the practical value of identification techniques that is
hard to obtain from just reading a book. As we mention in Chapter 12, there are several
good program packages available that are convenient to use.
Concerning the references in the text, our purpose has been to give some key
references and hints for a further reading. Any attempt to cover the whole range of
references would be an enormous, and perhaps not particularly useful, task.
We assume that the reader has a background corresponding to at least a senior-level
academic experience in electrical engineering. This would include a basic knowledge of
introductory probability theory and statistical estimation, time series analysis (or
stochastic processes in discrete time), and models for dynamic systems. However, in the
text and the appendices we include many of the necessary background results.
The text has been used, in a preliminary form, in several different ways. These include
regular graduate and undergraduate courses, intensive courses for graduate students and
for people working in industry, as well as for extra reading in graduate courses and
for independent studies. The text has been tested in such various ways at Uppsala
University, Polytechnic Institute of Bucharest, Lund Institute of Technology, Royal
Institute of Technology, Stockholm, Yale University, and INTEC, Santa Fe, Argentina.
The experience gained has been very useful when preparing the final text.
In writing the text we have been helped in various ways by several persons, whom we
would like to sincerely thank.
We acknowledge the influence on our research work of our colleagues Professor Karl
Johan Astrom, Professor Pieter Eykhoff, Dr Ben Friedlander, Professor Lennart Ljung,
Professor Arye Nehorai and Professor Mihai Tertito who, directly or indirectly, have
had a considerable impact on our writing.
The text has been read by a number of persons who have given many useful suggestions for improvements. In particular we would like to sincerely thank Professor Randy
Moses, Professor Arye Nehorai, and Dr John Norton for many useful comments. We are
also grateful to a number of students at Uppsala University, Polytechnic Institute of
Bucharest, INTEC at Santa Fe, and Yale University, for several valuable proposals.
The first inspiration for writing this book is due to Dr Greg Meira, who invited the first
author to give a short graduate course at INTEC, Santa Fe, in 1983. The material
produced for that course has since then been extended and revised by us jointly before
reaching its present form.
The preparation of the text has been a task extended over a considerable period of
time. The often cumbersome job of typing and correcting the text has been done with
patience and perseverance by Ylva Johansson, Ingrid Ringard, Maria Dahlin, Helena
Jansson, Ann-Cristin Lundquist and Lis Timner. We are most grateful to them for their
excellent work carried out over the years with great skill.
Several of the figures were originally prepared by using the packages ID PAC
(developed at Lund Institute of Technology) for some parameter estimations and
BLAISE (developed at INRIA, France) for some of the general figures.
We have enjoyed the very pleasant collaboration with Prentice Hall International.
We would like to thank Professor Mike Grimble, Andrew Binnie, Glen Murray and
Ruth Freestone for their permanent encouragement and support. Richard Shaw
deserves special thanks for the many useful comments made on the presentation. We
acknowledge his help with gratitude.
Torsten Soderstrom
Uppsala
Petre Stoica
Bucharest
GLOSSARY
Notations
E            expectation operator
e(t)         white noise (a sequence of independent, zero-mean random variables)
F(q⁻¹), G(q⁻¹), H(q⁻¹)   filters (polynomials) in the backward shift operator q⁻¹
ℐ            identification method
I, Iₙ        identity matrix (of dimension n)
log          logarithm
ℳ, ℳ(θ)      model structure; model corresponding to the parameter vector θ
𝒩(m, P)      normal (Gaussian) distribution with mean m and covariance matrix P
N            number of data points
n            model order
nu, ny       number of inputs; number of outputs
nθ           dimension of the parameter vector θ
O(x)         quantity of the same order of magnitude as x
p(x|y)       conditional probability density function of x given y
ℛ, ℛⁿ        the real numbers; n-dimensional Euclidean space
𝒮            the (true) system
Aᵀ           transpose of the matrix A
tr           trace of a matrix
t            discrete time variable
u(t)         input signal
V            loss (criterion) function
vec(A)       vector formed by stacking the columns of the matrix A
y(t)         output signal
ŷ(t|t−1)     prediction of y(t) based on data up to and including time t − 1
z(t)         vector of instrumental variables
γ(t)         gain sequence
δ_{s,t}      Kronecker delta (= 1 if s = t, else = 0)
δ(τ)         Dirac delta function
ε(t, θ)      prediction error corresponding to the parameter vector θ
θ            parameter vector
θ̂            estimate of the parameter vector
θ₀           true value of the parameter vector
Λ            covariance matrix of the innovations
λ²           variance of white noise
λ            forgetting factor
σ², σ        variance or standard deviation of white noise
φ(ω)         spectral density
φ_u(ω)       spectral density of the signal u(t)
φ_yu(ω)      cross-spectral density between the signals y(t) and u(t)
φ(t)         vector formed by lagged input and output data
Φ            regressor matrix
χ²(n)        χ² distribution with n degrees of freedom
ψ(t)         negative gradient of the prediction error ε(t, θ) with respect to θ
ω            angular frequency
Abbreviations
ABCE         asymptotically best consistent estimate
adj          adjoint (adjugate) of a matrix
AIC          Akaike's information criterion
AR           autoregressive
AR(n)        autoregressive process of order n
ARIMA        autoregressive integrated moving average
ARMA         autoregressive moving average
ARMA(n1, n2) ARMA process of orders n1 and n2
ARMAX        autoregressive moving average with exogenous input
ARX          autoregressive with exogenous input
BLUE         best linear unbiased estimate
CARIMA       controlled autoregressive integrated moving average
cov          covariance
dim          dimension
deg          degree
ELS          extended least squares
FIR          finite impulse response
FFT          fast Fourier transform
FPE          final prediction error
GLS          generalized least squares
iid          independent and identically distributed
IV           instrumental variable
LDA
LIP
LMS          least mean squares
LS           least squares
MA           moving average
MA(n)        moving average process of order n
MAP          maximum a posteriori
MFD          matrix fraction description
mgf          moment generating function
MIMO         multi-input, multi-output
ML           maximum likelihood
mse          mean square error
MVE
ODE          ordinary differential equation
OEM          output error method
pdf          probability density function
pe           persistently exciting (persistent excitation)
PEM          prediction error method
PI
PLR          pseudolinear regression
PRBS         pseudorandom binary sequence
RIV          recursive instrumental variable
RLS          recursive least squares
RPEM         recursive prediction error method
SA           stochastic approximation
Sf
SISO         single-input, single-output
SVD          singular value decomposition
var          variance
w.p.1        with probability one
w.r.t.       with respect to
WWRA
YW           Yule-Walker
Notational conventions
[H(q⁻¹)]ᵀ
[φ(t)]ᵀ
[A⁻¹]ᵀ
→ (dist)     convergence in distribution
A ≥ B, A > B     ordering of symmetric matrices
~
⊕
∈
V′, V″       first and second derivatives of V
EXAMPLES
1.1 A stirred tank
1.2 An industrial robot
1.3 Aircraft dynamics
1.4 Effect of a drug
1.5 Modeling a stirred tank

2.1 Transient analysis
2.2 Correlation analysis
2.3 A PRBS as input
2.4 A step function as input
2.5 Prediction accuracy
2.6 An impulse as input
2.7 A feedback signal as input
2.8 A feedback signal and an additional setpoint as input

3.1-3.5

4.1 A polynomial trend
4.2 A weighted sum of exponentials
4.3 Truncated weighting function
4.4 Estimation of a constant
4.5 Estimation of a constant (continued from Example 4.4)
4.6 Sensitivity of the normal equations

5.1 A step function
5.2 A pseudorandom binary sequence
5.3 An autoregressive moving average sequence
5.4 A sum of sinusoids
5.5 Characterization of PRBS
5.6 Characterization of an ARMA process
5.7 Characterization of a sum of sinusoids
5.8 Comparison of a filtered PRBS and a filtered white noise
5.9 Standard filtering
5.10 Increasing the clock period
5.11 Decreasing the probability of level change
5.12 Spectral density interpretations
5.13 Order of persistent excitation
5.14 A step function
5.15 A PRBS
5.16 An ARMA process
5.17
C5.3.1

6.1 An ARMAX model
6.2 A general SISO model structure
6.3 The full polynomial form
6.4 Diagonal form of a multivariable system
6.5 A state space model
6.6 A Hammerstein model
6.7 Uniqueness properties for an ARMAX model
6.8 Nonuniqueness of a multivariable model
6.9 Nonuniqueness of general state space models
A6.1.1 Spectral factorization for an ARMA process
A6.1.2 Spectral factorization for a state space model
A6.1.3 Another spectral factorization for a state space model
A6.1.4 Spectral factorization for a nonstationary process

7.1 Criterion functions
7.2 The least squares method as a prediction error method
7.3 Prediction for a first-order ARMAX model
7.4 Prediction for a first-order ARMAX model, continued
7.5 The exact likelihood function for a first-order autoregressive process
7.6 Accuracy for a linear regression
7.7 Accuracy for a first-order ARMA process
7.8 Gradient calculation for an ARMAX model
7.9 Initial estimate for an ARMAX model
7.10 Initial estimates for increasing model structures

8.1-8.8, C8.3.1

9.1-9.8

10.1-10.4
10.5 No external input
10.6 External input
10.7 Joint input-output identification of a first-order system
10.8 Accuracy for a first-order system
C10.1.1 Identifiability properties of an nth-order system with an mth-order regulator
C10.1.2 Identifiability properties of a minimum-phase system under minimum variance control

11.1 An autocorrelation test
11.2 A cross-correlation test
11.3 Testing changes of sign
11.4 Use of statistical tests on residuals
11.5 Statistical tests and common sense
11.6 Effects of overparametrizing an ARMAX system
11.7 The FPE criterion
11.8 Significance levels for FPE and AIC
11.9 Numerical comparisons of the F-test, FPE and AIC
11.10 Consistent model structure determination
A11.1.1 A singular P matrix due to overparametrization

12.1-12.9

A.1 Decomposition of vectors
A.2 An orthogonal projector

B.1-B.4
PROBLEMS
2.1-2.8

3.1-3.11

4.1-4.12

5.1 Nonnegative definiteness of the sample covariance matrix
5.2 A rapid method for generating sinusoidal signals on a computer
5.3 Spectral density of the sum of two sinusoids
5.4 Admissible domain for ρ1 and ρ2 of a stationary process
5.5 Admissible domain for ρ1 and ρ2 for MA(2) and AR(2) processes
5.6 Spectral properties of a random wave
5.7 Spectral properties of a square wave
5.8 Simple conditions on the covariance of a moving average process
5.9 Weighting function estimation with PRBS
5.10 The cross-covariance matrix of two autoregressive processes obeys a Lyapunov equation

6.1 Stability boundary for a second-order system
6.2 Spectral factorization
6.3 Further comments on the nonuniqueness of stochastic state space models
6.4 A state space representation of autoregressive processes
6.5 Uniqueness properties of ARARX models
6.6 Uniqueness properties of a state space model
6.7 Sampling a simple continuous time system

7.1-7.17

9.1-9.16

11.1 On the use of the cross-correlation test for the least squares method
11.2 Identifiability results for ARX models
11.3 Variance of the prediction error when future and past experimental conditions differ
11.4 An assessment criterion for closed loop operation
11.5 Variance of the multi-step prediction error as an assessment criterion
11.6 Misspecification of the model structure and prediction ability
11.7 An illustration of the parsimony principle
11.8 The parsimony principle does not necessarily apply to nonhierarchical model structures
11.9 On testing cross-correlations between residuals and input
11.10 Extension of the prediction error formula (11.40) to the multivariable case
Chapter 1
INTRODUCTION
The need for modeling dynamic systems
System identification is the field of modeling dynamic systems from experimental data. A
dynamic system can be conceptually described as in Figure 1.1. The system is driven by
input variables u(t) and disturbances v(t). The user can control u(t) but not v(t). In some
signal processing applications the inputs may be missing. The output signals are variables
which provide useful information about the system. For a dynamic system the control
action at time t will influence the output at time instants s > t.
FIGURE 1.1 A dynamic system with input u(t), output y(t) and disturbance v(t), where t denotes time.
The following examples of dynamic systems illustrate the need for mathematical
models.
Example 1.1 A stirred tank
Consider the stirred tank shown in Figure 1.2, where two flows are mixed. The
concentration in each of the flows can vary. The flows F1 and F2 can be controlled with
valves. The signals F1(t) and F2(t) are the inputs to the system. The output flow F(t) and
the concentration c(t) in the tank constitute the output variables. The input concentrations c1(t) and c2(t) cannot be controlled and are viewed as disturbances.
Suppose we want to design a regulator which acts on the flows F1(t) and F2(t) using
the measurements of F(t) and c(t). The purpose of the regulator is to ensure that F(t) or
c(t) remain as constant as possible even if the concentrations c1(t) and c2(t) vary
considerably. For such a design we need some form of mathematical model which
describes how the input, the output and the disturbances are related. ■
Example 1.2 An industrial robot
An industrial robot can be seen as an advanced servo system. The robot arm has to
perform certain movements, for example for welding at specific positions. It is then
natural to regard the position of the robot arm as an output. The robot arm is controlled
by electrical motors. The currents to these motors can be regarded as the control inputs.
The movement of the robot can also be influenced by varying the load on the arm and by
friction. Such variables are the disturbances. It is very important that the robot will move
in a fast and reliable way to the desired positions without violating various geometrical
constraints. In order to design an appropriate servo system it is of course necessary to
have some model of how the behavior of the robot is influenced by the input and the
disturbances.
■
Example 1.3 Aircraft dynamics
An aircraft can be viewed as a complex dynamic system. Consider the problem of maintaining constant altitude and speed - these are the output variables. Elevator position
and engine thrust are the inputs. The behavior of the airplane is also influenced by its
load and by the atmospheric conditions. Such variables can be viewed as disturbances.
In order to design an autopilot for keeping constant speed and course we need a model
of how the aircraft's behavior is influenced by inputs and disturbances. The dynamic
properties of an aircraft vary considerably, for example with speed and altitude, so
identification methods will need to track these variations. ■
Example 1.4 Effect of a drug
A medicine is generally required to produce an effect in a certain part of the body. If the
drug is swallowed it will take some time before the drug passes the stomach and is
absorbed in the intestines, and then some further time until it reaches the target organ,
for example the liver or the heart. After some metabolism the concentration of the drug
decreases and the waste products are secreted from the body. In order to understand
what effect (and when) the drug has on the targeted organ and to design an appropriate
schedule for taking the drug it is necessary to have some model that describes the
properties of the drug dynamics.
■
The above examples demonstrate the need for modeling dynamic systems both in
technical and non-technical areas.
Introduction
Many industrial processes, for example for production of paper, iron, glass or chemical
compounds, must be controlled in order to run safely and efficiently. To design regulators,
some type of model of the process is needed. The models can be of various types and
degrees of sophistication. Sometimes it is sufficient to know the crossover frequency and
the phase margin in a Bode plot. In other cases, such as the design of an optimal
controller, the designer will need a much more detailed model which also describes the
properties of the disturbances acting on the process.
In most applications of signal processing in forecasting, data communication, speech
processing, radar, sonar and electrocardiogram analysis, the recorded data are filtered in
some way and a good design of the filter should reflect the properties (such as high-pass
characteristics, low-pass characteristics, existence of resonance frequencies, etc.) of the
signal. To describe such spectral properties, a model of the signal is needed.
In many cases the primary aim of modeling is to aid in design. In other cases the
knowledge of a model can itself be the purpose, as for example when describing the
effect of a drug, as in Example 1.4. Relatively simple models for describing certain
phenomena in ecology (like the interplay between prey and predator) have been postulated. If the models can explain measured data satisfactorily then they might also be
used to explain and understand the observed phenomena. In a more general sense
modeling is used in many branches of science as an aid to describe and understand
reality.
Sometimes it is interesting to model a technical system that does not exist, but may be
constructed at some time in the future. Also in such a case the purpose of modeling is to
gain insight into and knowledge of the dynamic behavior of the system. An example is a
large space structure, where the dynamic behavior cannot be deduced by studying
structures on earth, because of gravitational and atmospheric effects. Needless to say, for
examples like this, the modeling must be based on theory and a priori knowledge, since
experimental data are not available.
Types of model
It should be stressed that although we speak generally about systems with inputs and
outputs, the discussion here is to a large extent applicable also to time series analysis. We
may then regard a signal as a discrete time stochastic process. In the latter case the
system does not include any input signal. Signal models for times series can be useful in
the design of spectral estimators, predictors, or filters that adapt to the signal properties.
Mathematical modeling. This is an analytic approach. Basic laws from physics (such as
Newton's laws and balance equations) are used to describe the dynamic behavior of a
phenomenon or a process.
System identification. This is an experimental approach. Some experiments are performed on the system; a model is then fitted to the recorded data by assigning suitable
numerical values to its parameters.
Example 1.5 Modeling a stirred tank
Consider the stirred tank in Example 1.1. Assume that the liquid is incompressible, so
that the density is constant; assume also that the mixing of the flows in the tank is very
fast so that a homogeneous concentration c exists in the tank. To derive a mathematical
model we will use balance equations of the form
net change = flow in - flow out
Applying this idea to the volume V in the tank,
dV/dt = F1 + F2 - F   (1.1)

and to the amount of dissolved substance Vc,

d(Vc)/dt = c1F1 + c2F2 - cF   (1.2)
The model can be completed in several ways. The flow F may depend on the tank level
h. This is certainly true if this flow is not controlled, for example by a valve. Ideally the
flow in such a case is given by Torricelli's law:
F = a√(2gh)   (1.3)

where a is the effective area of the flow and g = 10 m/s². Equation (1.3) is an
idealization and may not be accurate or even applicable. Finally, if the tank area A does
not depend on the tank level h, then by simple geometry
V = Ah   (1.4)
To summarize, equations (1.1) to (1.4) constitute a simple model of the stirred tank. The
degree of validity of (1.3) is not obvious. The geometry of the tank is easy to measure,
but the constant a in (1.3) is more difficult to determine.
■
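To get a feel for the behaviour of the model (1.1)-(1.4), the equations can be simulated. The following sketch uses a simple Euler discretization; all numerical values (tank area A, outlet constant a, flows and inlet concentrations) are invented for the illustration and are not taken from the example.

```python
# Minimal sketch: Euler simulation of the stirred-tank model (1.1)-(1.4).
# All numerical values (A, a, flows, concentrations) are illustrative only.

def simulate_tank(T=200.0, dt=0.1, A=1.0, a=0.05, g=10.0):
    V, c = 0.5, 0.0                      # initial volume and concentration
    F1, F2 = 0.02, 0.01                  # inlet flows (inputs)
    c1, c2 = 1.0, 0.2                    # inlet concentrations (disturbances)
    t, hist = 0.0, []
    while t < T:
        h = V / A                        # tank level, from (1.4)
        F = a * (2.0 * g * h) ** 0.5     # outlet flow, Torricelli's law (1.3)
        dV = F1 + F2 - F                 # volume balance (1.1)
        dVc = c1 * F1 + c2 * F2 - c * F  # mass balance for the solute (1.2)
        V_new = V + dt * dV
        c = (V * c + dt * dVc) / V_new   # update concentration from d(Vc)/dt
        V = V_new
        hist.append((t, V, c, F))
        t += dt
    return hist

if __name__ == "__main__":
    for t, V, c, F in simulate_tank()[::200]:
        print(f"t={t:6.1f}  V={V:.3f}  c={c:.3f}  F={F:.4f}")
```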
A comparison can be made of the two modeling approaches: mathematical modeling and
system identification. In many cases the processes are so complex that it is not possible
to obtain reasonable models using only physical insight (using first principles, e.g.
balance equations). In such cases one is forced to use identification techniques. It often
happens that a model based on physical insight contains a number of unknown parameters even if the structure is derived from physical laws. Identification methods can be
applied to estimate the unknown parameters.
The models obtained by system identification have the following properties, in
contrast to models based solely on mathematical modeling (i.e. physical insight):
• They have limited validity (they are valid for a certain working point, a certain type of input, a certain process, etc.).
• They give little physical insight, since in most cases the parameters of the model have no direct physical meaning. The parameters are used only as tools to give a good description of the system's overall behavior.
• They are relatively easy to construct and use.
Identification is not a foolproof methodology that can be used without interaction from
the user. The reasons for this include:
• An appropriate model structure must be found. This can be a difficult problem, in particular if the dynamics of the system are nonlinear.
• There are certainly no 'perfect' data in real life. The fact that the recorded data are disturbed by noise must be taken into consideration.
• The process may vary with time, which can cause problems if an attempt is made to describe it with a time-invariant model.
• It may be difficult or impossible to measure some variables/signals that are of central importance for the model.
(Flow chart: the identification loop. Starting from a priori knowledge, the experiment is designed and performed and data are collected; a model structure is determined/chosen; an identification method is chosen and the parameters are estimated; the model is then validated. If the model is not accepted, one returns to an earlier step; otherwise the procedure ends.)
Some related topics are treated only briefly or not at all:
• Identification of distributed parameter systems is not treated at all. For some survey papers in this field, see Polis (1982), Polis and Goodson (1976), Kubrusly (1977), Chavent (1979) and Banks et al. (1983).
• Identification of nonlinear systems is only marginally treated; see for example Example 6.6, where so-called Hammerstein models are treated. Some surveys of black box modeling of nonlinear systems have been given by Billings (1980), Mehra (1979), and Haber and Keviczky (1976).
• Identification and model approximation. When the model structure used is not flexible enough to describe the true system dynamics, identification can be viewed as a form of model approximation or model reduction. Out of many available methods for model approximation, those based on partial realizations have given rise to much interest in the current literature. See, for example, Glover (1984) for a survey of such methods. Techniques for model approximation have close links to system identification, as described, for example, by Wahlberg (1985, 1986, 1987).
• Estimation of parameters in continuous time models is only discussed in parts of Chapters 2 and 3, and indirectly in the other chapters (see Example 6.5). In many cases, such as in the design of digital controllers, simulation, prediction, etc., it is sufficient to have a discrete time model. However, the parameters in a discrete time model most often have less physical meaning than parameters of a continuous time model.
• Frequency domain aspects are only touched upon in the book (see Section 3.3 and Examples 10.1, 10.2 and 12.1). There are several new results on how estimated models, even approximate ones, can be characterized and evaluated in the frequency domain; see Ljung (1985b, 1987).
Bibliographical notes
There are several books and other publications of general interest in the field of system
identification. An old but still excellent survey paper was written by Astrom and Eykhoff
(1971). For further reading see the books by Ljung (1987), Norton (1986), Eykhoff
(1974), Goodwin and Payne (1977), Kashyap and Rao (1976), Davis and Vinter (1985),
Unbehauen and Rao (1987), Isermann (1987), Caines (1988), and Hannan and Deistler
(1988). Advanced texts have been edited by Mehra and Lainiotis (1976), Eykhoff (1981),
Hannan et al. (1985) and Leondes (1987). The text by Ljung and Soderstrom (1983) gives
a comprehensive treatment of recursive identification methods, while the books by
Soderstrom and Stoica (1983) and Young (1984) focus on the so-called instrumental
variable identification methods. Other texts on system identification have been published
by Mendel (1973), Hsia (1977) and Isermann (1974).
Several books have been published in the field of mathematical modeling. For some
interesting treatments, see Aris (1978), Nicholson (1980), Wellstead (1979), Marcus-Roberts and Thompson (1976), Burghes and Borrie (1981), Burghes et al. (1982) and
Blundell (1982).
The International Federation of Automatic Control (IFAC) has held a number of
symposia on identification and system parameter estimation (Prague 1967, Prague 1970,
the Hague 1973, Tbilisi 1976, Darmstadt 1979, Washington DC 1982, York 1985, Beijing
1988). In the proceedings of these conferences many papers have appeared. There are
papers dealing with theory and others treating applications. The IFAC journal Automatica has published a special tutorial section on identification (Vol. 16, Sept. 1980) and
a special issue on system identification (Vol. 18, Jan. 1981); and a new special issue is
scheduled for Nov. 1989. Also, IEEE Transactions on Automatic Control has had a
special issue on system identification and time-series analysis (Vol. 19, Dec. 1974).
Chapter 2
INTRODUCTORY EXAMPLES
The system 𝒮. The physical reality that provides the experimental data will generally
be referred to as the process. In order to perform a theoretical analysis of an identification it is necessary to introduce assumptions on the data. In such cases we will use the
word system to denote a mathematical description of the process. In practice, where
real data are used, the system is unknown and can even be an idealization. For simulated data, however, it is not only known but also used directly for the data generation
in the computer. Note that to apply identification techniques it is not necessary to
know the system. We will use the system concept only for investigation of how different identification methods behave under various circumstances. For that purpose the
concept of a 'system' will be most useful.
The model structure ℳ. Sometimes we will use nonparametric models. Such a model is
described by a curve, function or table. A step response is an example. It is a curve
that carries some information about the characteristic properties of a system. Impulse
responses and frequency diagrams (Bode plots) are other examples of nonparametric
models. However, in many cases it is relevant to deal with parametric models. Such
models are characterized by a parameter vector, which we will denote by θ. The
corresponding model will be denoted ℳ(θ). When θ is varied over some set of feasible
values we obtain a model set (a set of models) or a model structure ℳ.
The identification method ℐ. A large variety of identification methods have been
proposed in the literature. Some important ones will be discussed later, especially in
Chapters 7 and 8 and their complements. It is worth noting here that several proposed
methods could and should be regarded as versions of the same approach, which are
tied to different model structures, even if they are known under different names.
The experimental condition 𝒳. In general terms 𝒳 is a description of how the identification experiment is carried out. This includes the selection and generation of the input
signal, possible feedback loops in the process, the sampling interval, prefiltering of the
data prior to estimation of the parameters, etc.
Before turning to some examples it should be mentioned that of the four concepts 𝒮,
ℳ, ℐ, 𝒳, the system 𝒮 must be regarded as fixed. It is 'given' in the sense that its properties cannot be changed by the user. The experimental condition 𝒳 is determined
when the data are collected from the process. It can often be influenced to some degree
by the user. However, there may be restrictions - such as safety considerations or requirements of 'nearly normal' operations - that prevent a free choice of the experimental
condition 𝒳. Once the data are collected the user can still choose the identification
method ℐ and the model structure ℳ. Several choices of ℐ and ℳ can be tried on the
same set of data until a satisfactory result is obtained.
In the examples of this chapter, data are generated by simulating one of the two systems

𝒮: y(t) + a₀y(t − 1) = b₀u(t − 1) + e(t) + c₀e(t − 1)   (2.1)

where e(t) is white noise of zero mean and variance λ². The two systems correspond to the parameter values

𝒮₁: a₀ = −0.8   b₀ = 1.0   c₀ = 0
𝒮₂: a₀ = −0.8   b₀ = 1.0   c₀ = −0.8   (2.2)

They can equivalently be written as

𝒮₁: y(t) − 0.8y(t − 1) = 1.0u(t − 1) + e(t)   (2.3)

𝒮₂: y(t) = x(t) + e(t),   x(t) − 0.8x(t − 1) = 1.0u(t − 1)   (2.4)

The white noise e(t) thus enters into the systems in different ways. For the system 𝒮₁ it
appears as an equation disturbance, while for 𝒮₂ it is added to the output (cf. Figure
2.1). Note that for 𝒮₂ the signal x(t) can be interpreted as the deterministic or noise-free output.
A typical example of transient analysis is to let the input be a step and record the step
response. This response will by itself give some characteristic properties (dominating time constants, for example) of the system.

FIGURE 2.1 The systems 𝒮₁ and 𝒮₂. The symbol q⁻¹ denotes the backward shift operator.

FIGURE 2.2 Step response of the system (jerky line). For comparison the true step response of the undisturbed system (smooth line) is shown. The step is applied at time t = 10.
Example 2.2 Correlation analysis
A weighting function can be used to model the process:
y(t) = Σ_{k=0}^{∞} h(k)u(t − k) + v(t)   (2.5)

where {h(k)} is the weighting function sequence (or weighting sequence, for short) and
v(t) is a disturbance term. Now let u(t) be white noise, of zero mean and variance σ²,
which is independent of the disturbances. Multiplying (2.5) by u(t − τ) (τ ≥ 0) and taking
expectation will then give

r_yu(τ) ≜ Ey(t)u(t − τ) = Σ_{k=0}^{∞} h(k)Eu(t − k)u(t − τ) = σ²h(τ)   (2.6)

Based on this relation the weighting function coefficients {h(k)} are estimated as

ĥ(k) = [ (1/N) Σ_{t=k+1}^{N} y(t)u(t − k) ] / [ (1/N) Σ_{t=1}^{N} u²(t) ]   (2.7)
where N denotes the number of data points. Figure 2.3 illustrates the result of simulating
the system 𝒮₁ with u(t) as white noise with σ = 1, and the weighting function estimated
as in (2.7).
The true weighting sequence of the system can be found to be

h(k) = 0.8^(k−1),  k ≥ 1;   h(0) = 0
The results obtained in Figure 2.3 would indicate some exponential decrease, although
it is not easy to determine the parameter from the figure. To facilitate a comparison
between the model and the true system the corresponding step responses are given in
Figure 2.4.
It is clear from Figure 2.4 that the model obtained is not very accurate. In particular,
its static gain (the stationary value of the response to a unit step) differs considerably
from that of the true system. Nevertheless, it gives the correct magnitude of the time
constant (or rise time) of the system.
■
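The estimate (2.7) is straightforward to try out numerically. The sketch below mimics Example 2.2 under the assumptions that the data come from the system 𝒮₁, that the input is zero-mean white noise with σ = 1 and that N = 1000; it is only an illustration of (2.7), not the program used to produce Figure 2.3.

```python
# Sketch of correlation analysis (2.5)-(2.7) for the system S1:
#   y(t) - 0.8 y(t-1) = 1.0 u(t-1) + e(t),  u(t), e(t) zero-mean white noise.
import numpy as np

rng = np.random.default_rng(0)
N = 1000
u = rng.standard_normal(N)              # white-noise input, sigma = 1
e = rng.standard_normal(N)              # equation disturbance
y = np.zeros(N)
for t in range(1, N):
    y[t] = 0.8 * y[t - 1] + 1.0 * u[t - 1] + e[t]

def h_hat(k):
    """Weighting-function estimate (2.7): normalized cross-correlation."""
    num = np.sum(y[k:] * u[:N - k]) / N          # (1/N) sum y(t) u(t-k)
    den = np.sum(u ** 2) / N                     # (1/N) sum u^2(t)
    return num / den

for k in range(6):
    true_h = 0.0 if k == 0 else 0.8 ** (k - 1)   # true weighting sequence of S1
    print(f"k={k}:  h_hat={h_hat(k):6.3f}   true={true_h:6.3f}")
```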
FIGURE 2.3 The estimated weighting function ĥ(k) for Example 2.2.

FIGURE 2.4 Step responses for the model obtained by correlation analysis and for the true (undisturbed) system.

2.4 A parametric method

Consider the model structure

y(t) + ay(t − 1) = bu(t − 1) + ε(t)   (2.8)
This model structure is a set of first-order linear discrete time models. The parameter
vector is defined as

θ = (a  b)ᵀ   (2.9)
In (2.8), y(t) is the output signal at time t, u(t) the input signal and ε(t) a disturbance
term. We will often call ε(t) an equation error or a residual. The reason for including ε(t)
in the model (2.8) is that we can hardly hope that the model with ε(t) ≡ 0 (for all t) can
give a perfect description of a real process. Therefore ε(t) will describe the deviation in
the data from a perfect (deterministic) first-order linear system. Using (2.8), (2.9), for
a given data set {u(1), y(1), u(2), y(2), ..., u(N), y(N)}, {ε(t)} can be regarded as
functions of the parameter vector θ. To see this, simply rewrite (2.8) as

ε(t) = y(t) + ay(t − 1) − bu(t − 1)   (2.10)
In the following chapters we will consider various generalizations of the simple model
structure (2.8). Note that it is straightforward to extend it to an nth-order linear model
simply by adding terms a_i y(t − i), b_i u(t − i) for i = 2, ..., n.
The identification method ℐ should now be specified. In this chapter we will confine
ourselves to the least squares method. Then the parameter vector is determined as the
vector that minimizes the sum of squared equation errors. This means that we define the
estimate by

θ̂ = arg min_θ V(θ)   (2.11)

where V(θ) is given by

V(θ) = Σ_{t=1}^{N} ε²(t)   (2.12)

As can be seen from (2.10) the residual ε(t) is a linear function of θ and thus V(θ) will be
well defined for any value of θ.
For the simple model structure (2.8) it is easy to obtain an explicit expression of V(θ)
as a function of θ. Denoting Σ_{t=1}^{N} by Σ for short,

V(θ) = Σ[y(t) + ay(t − 1) − bu(t − 1)]²   (2.13)

which is a quadratic function of a and b. The estimate is then obtained, according to (2.11), by minimizing (2.13). The minimum
point is determined by setting the gradient of V(θ) to zero. This gives

0 = ∂V(θ)/∂a = 2Σ[y(t) + ây(t − 1) − b̂u(t − 1)]y(t − 1)
0 = ∂V(θ)/∂b = −2Σ[y(t) + ây(t − 1) − b̂u(t − 1)]u(t − 1)   (2.14)
or in matrix form
( Σ y²(t−1)          −Σ y(t−1)u(t−1) ) ( â )     ( −Σ y(t)y(t−1) )
( −Σ y(t−1)u(t−1)     Σ u²(t−1)      ) ( b̂ )  =  (  Σ y(t)u(t−1) )        (2.15)
Note that (2.15) is a system of linear equations in the two unknowns â and b̂. As we will
see, there exists a unique solution under quite mild conditions.
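The normal equations (2.15) can be formed and solved directly from the data. The sketch below is a direct transcription of (2.15); the data-generating part (system 𝒮₁ driven by white noise) is an assumption made here purely for the illustration.

```python
# Forming and solving the least squares normal equations (2.15)
# for the model y(t) + a y(t-1) = b u(t-1) + eps(t).
import numpy as np

def ls_estimate(y, u):
    """Return (a_hat, b_hat) solving (2.15)."""
    y1, u1, y0 = y[:-1], u[:-1], y[1:]          # y(t-1), u(t-1), y(t)
    M = np.array([[np.sum(y1 * y1), -np.sum(y1 * u1)],
                  [-np.sum(y1 * u1), np.sum(u1 * u1)]])
    rhs = np.array([-np.sum(y0 * y1), np.sum(y0 * u1)])
    return np.linalg.solve(M, rhs)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    N = 1000
    u = rng.standard_normal(N)
    e = rng.standard_normal(N)
    y = np.zeros(N)
    for t in range(1, N):                       # system S1: a0 = -0.8, b0 = 1.0
        y[t] = 0.8 * y[t - 1] + u[t - 1] + e[t]
    a_hat, b_hat = ls_estimate(y, u)
    print(f"a_hat = {a_hat:.3f}  (true -0.8),  b_hat = {b_hat:.3f}  (true 1.0)")
```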
In the following parameter estimates will be computed according to (2.15) for a
number of cases. We will use simulated data generated in a computer but also theoretical
analysis as a complement. In the analysis we will assume that the number of data points,
N, is large. Then the following approximation can be made:
(1/N) Σ_{t=1}^{N} y²(t − 1) ≈ Ey²(t − 1)   (2.16)
where E is the expectation operator for stochastic variables, and similarly for the other
sums. It can be shown (see Lemma B.2 in Appendix B at the end of the book) that for all
the cases treated here the left-hand side of (2.16) will converge, as N tends to infinity, to
the right-hand side. The advantage of using expectations instead of sums is that in this
way the analysis can deal with a deterministic problem, or more exactly a problem which
does not depend on a particular realization of the data. For a deterministic signal the
notation E will be used to mean

E x(t) = lim_{N→∞} (1/N) Σ_{t=1}^{N} x(t)

In the figures below, the model output x_m(t) is also shown; it is defined by

x_m(t) + â x_m(t − 1) = b̂ u(t − 1)

where u(t) is the input used in the identification. One would expect that x_m(t) should be
close to the true noise-free output x(t). The latter signal is, however, not known to the
TABLE 2.1 Parameter estimates for Example 2.3

Parameter   True value   Estimate (system 𝒮₁)   Estimate (system 𝒮₂)
a           −0.8         −0.795                 −0.580
b            1.0          0.941                  0.959
FIGURE 2.5 (a) Input (upper part), output (1, lower part) and model output (2, lower part) for Example 2.3, system 𝒮₁. (b) Similarly for system 𝒮₂.
user in the general case. Instead one can compare x_m(t) to the 'measured' output signal
y(t). For a good model the signal x_m(t) should explain all patterns in y(t) that are due to
the input. However, x_m(t) should not be identical to y(t) since the output also has a
component that is caused by the disturbances. This stochastic component should not
appear (or possibly only in an indirect fashion) in the model output x_m(t).
Figure 2.6 shows the step responses of the true system and the estimated models. It is
easily seen, in particular from Table 2.1 but also from Figure 2.6, that a good description
is obtained of the system 𝒮₁ while the system 𝒮₂ is quite badly modeled. Let us now
see if this result can be explained by a theoretical analysis. From (2.15) and (2.16) we
obtain the following equations for the limiting estimates:
obtain the following equations for the limiting estimates:
(
Ey2(t)
-Ey(t)u(t)
-Ey(t)U(t)(~)
Eu 2 (t)
b
(2.17)
1)
1)
Here we have used the fact that, by the stationarity assumption, Ey²(t − 1) = Ey²(t),
etc. Now let u(t) be a white noise process of zero mean and variance σ². This will be an
accurate approximation of the PRBS used as far as the first- and second-order moments
are concerned. See Section 5.2 for a discussion of this point. Then for the system (2.1),
after some straightforward calculations,
Ey²(t) = [b₀²σ² + (1 + c₀² − 2a₀c₀)λ²] / (1 − a₀²)
Ey(t)u(t) = 0
Eu²(t) = σ²
Ey(t)u(t − 1) = b₀σ²   (2.18a)

FIGURE 2.6 Step responses of the true system (𝒮), the estimated model of 𝒮₁ (ℳ₁) and the estimated model of 𝒮₂ (ℳ₂), Example 2.3.
Using these results in (2.17) the following expressions are obtained for the limiting
parameter estimates:
â = a₀ + [−c₀(1 − a₀²)λ²] / [b₀²σ² + (1 + c₀² − 2a₀c₀)λ²],   b̂ = b₀   (2.18b)

For the system 𝒮₁ (where c₀ = 0) this gives

â = a₀ = −0.8,   b̂ = b₀ = 1.0   (2.19)
which means that asymptotically (i.e. for large values of N) the parameter estimates will
be close to the 'true values' ao, boo This is well in accordance with what was observed in
the simulations.
For the system 𝒮₂ (where c₀ = a₀) the corresponding result is

â = −0.8σ² / (σ² + 0.36λ²) = −0.588   (2.20)
Thus in this case the estimate â will deviate from the true value. This was also obvious
from the simulations; see Table 2.1 and Figure 2.6. (Compare (2.20) with the estimated
values for 𝒮₂ in Table 2.1.) The theoretical analysis shows that this was not due to 'bad
luck' in the simulation nor to the data series being too short. No matter how many data
pairs are used, there will, according to the theoretical expression (2.20), be a systematic
deviation in the parameter estimate â. ■
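The asymptotic expressions (2.18)-(2.20) are easy to check by simulation. The following sketch re-creates the flavour of Example 2.3 under the assumptions σ = λ = 1, a white-noise input standing in for the PRBS, and a long data record; it is not the simulation reported in Table 2.1.

```python
# Monte Carlo check of the bias result in Example 2.3:
# LS is consistent for S1 (c0 = 0) but biased for S2 (c0 = a0 = -0.8).
# Assumptions made here: white-noise input of unit variance, lambda = 1.
import numpy as np

def simulate(c0, N, rng):
    a0, b0 = -0.8, 1.0
    u = rng.standard_normal(N)
    e = rng.standard_normal(N)
    y = np.zeros(N)
    for t in range(1, N):
        y[t] = -a0 * y[t - 1] + b0 * u[t - 1] + e[t] + c0 * e[t - 1]
    return y, u

def ls(y, u):
    y1, u1, y0 = y[:-1], u[:-1], y[1:]
    M = np.array([[np.sum(y1 * y1), -np.sum(y1 * u1)],
                  [-np.sum(y1 * u1), np.sum(u1 * u1)]])
    return np.linalg.solve(M, np.array([-np.sum(y0 * y1), np.sum(y0 * u1)]))

rng = np.random.default_rng(2)
for name, c0 in [("S1", 0.0), ("S2", -0.8)]:
    a_hat, b_hat = ls(*simulate(c0, 100_000, rng))
    print(f"{name}: a_hat = {a_hat:.3f}, b_hat = {b_hat:.3f}")
# According to (2.20), a_hat for S2 should be close to -0.8/(1 + 0.36) = -0.588.
```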
An estimate θ̂ is said to be biased if

Eθ̂ ≠ θ₀   (2.21)

The difference Eθ̂ − θ₀ is called the bias. If instead equality applies in (2.21), θ̂ is said to
be unbiased.
In Example 2.3 it seems reasonable to believe that for large N the estimate may be
unbiased for the system 𝒮₁ but that it is surely biased for the system 𝒮₂.
The estimate θ̂ is consistent if

θ̂ → θ₀ as N → ∞   (2.22)

Since θ̂ is a stochastic variable we must define in what sense the limit in (2.22) should
be taken. A reasonable choice is 'limit with probability one'. We will generally use this
alternative. Some convergence concepts for stochastic variables are reviewed in
Appendix B (Section B.1).
The analysis carried out in Example 2.3 indicates that θ̂ is consistent for the system
𝒮₁ but not for the system 𝒮₂.
Loosely speaking, we say that a system is identifiable (in a given model set) if the
corresponding parameter estimates are consistent. A more formal definition of identifiability is given in Section 6.4. Let us note that the identifiability properties of a given
system will depend on the model structure ℳ, the identification method ℐ and the
experimental condition 𝒳.
The following example demonstrates how the experimental condition can influence
the result of an identification.
Example 2.4 A step function as input

TABLE 2.2 Parameter estimates for Example 2.4

Parameter   True value   Estimate (system 𝒮₁)   Estimate (system 𝒮₂)
a           −0.8         −0.788                 −0.058
b            1.0          1.059                  4.693
Again, a good model is obtained for system 𝒮₁. For system 𝒮₂ there is a considerable deviation from the true parameters. The estimates are also quite different from
those in Example 2.3. In particular, now there is also a considerable deviation in the
estimate b̂.
For a theoretical analysis of the facts observed, equation (2.17) must be solved. For
this purpose it is necessary to evaluate the different covariance elements which occur in
(2.17). Let u(t) be a step of size σ and introduce S = b₀/(1 + a₀) as a notation for the
static gain of the system. Then

Ey(t)u(t) = Sσ²
Eu²(t) = σ²   (2.23a)
FIGURE 2.7 (a) Input (upper part), output (1, lower part) and model output (2, lower part) for Example 2.4, system 𝒮₁. (b) Similarly for system 𝒮₂.
Ey(t)y(t − 1) = S²σ² + [(c₀ − a₀)(1 − a₀c₀) / (1 − a₀²)]λ²
Ey(t)u(t − 1) = Sσ²
Using these results in (2.17) the following expressions for the parameter estimates are
obtained:
â = −(c₀ − a₀)(1 − a₀c₀) / (1 + c₀² − 2a₀c₀)
b̂ = b₀ − b₀c₀(1 − a₀) / (1 + c₀² − 2a₀c₀)   (2.23b)

Note that now both parameter estimates will in general deviate from the true parameter
values. The deviations are independent of the size of the input step, σ, and they vanish if
c₀ = 0. For system 𝒮₁,

â = −0.8,   b̂ = 1.0   (2.24)
and for system 𝒮₂ (where c₀ = a₀),

â = 0.0,   b̂ = b₀/(1 + a₀) = 5.0   (2.25)

which clearly deviates considerably from the true values. Note, though, that the static
gain is estimated correctly, since from (2.23b)

b̂/(1 + â) = b₀/(1 + a₀)   (2.26)
The theoretical results (2.24), (2.25) are quite similar to those based on the simulation;
see Table 2.2. Note that in the noise-free case (when λ² = 0) a problem will be
encountered. Then the matrix in (2.17) becomes

σ² (  S²   −S )
   ( −S     1 )

Clearly this matrix is singular. Hence the system of equations (2.17) does not have
a unique solution. In fact, the solutions in the noise-free case can be precisely characterized by

b̂ = (1 + â)S   (2.27)
We have seen in two examples how to obtain consistent estimates for system 𝒮₁ while
there is a systematic error in the parameter estimates for 𝒮₂. The reason for this difference between 𝒮₁ and 𝒮₂ is that, even if both systems fit into the model structure ℳ
given by (2.8), only 𝒮₁ will correspond to ε(t) being white noise. A more detailed
explanation for this behavior will be given in Chapter 7. (See also (2.34) below.) The
models obtained for 𝒮₂ can be seen as approximations of the true system. The approximation is clearly dependent on the experimental condition used. The following example
presents some detailed calculations.
Example 2.5 Prediction accuracy
Suppose the model is used for prediction. Without knowing the
distribution or the covariance structure of ε(t), a reasonable prediction of y(t) given data
up to (and including) time t − 1 is given (see (2.8)) by

ŷ(t) = −ay(t − 1) + bu(t − 1)   (2.28)

This predictor will be justified in Section 7.3. The prediction error will, according to
(2.1), satisfy

ỹ(t) = y(t) − ŷ(t) = (a − a₀)y(t − 1) − (b − b₀)u(t − 1) + e(t) + c₀e(t − 1)   (2.29)

The variance W = Eỹ²(t) of the prediction error will be evaluated in a number of cases.
For system 𝒮₂, c₀ = a₀. First consider the true parameter values, i.e. a = a₀, b = b₀.
Then the prediction error of 𝒮₂ becomes ỹ(t) = e(t) + a₀e(t − 1), and

W₁ = λ²(1 + a₀²)   (2.30)
Note that the variance (2.30) is independent of the experimental condition. In the
following we will assume that the prediction is used with u(t) a step of size σ. Using the
estimates (2.23), (2.25), in the stationary phase for system 𝒮₂ the prediction error is

ỹ(t) = y(t) + 0·y(t − 1) − [b₀/(1 + a₀)]σ = e(t)

and thus

W₂ = λ² < W₁   (2.31)

Note that here a better result is obtained than if the true values a₀ and b₀ were used! This
is not unexpected since â and b̂ are determined to make the sample variance Σỹ²(t)/N
as small as possible. In other words, the identification method uses a and b as vehicles to
get a good prediction.
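The variances W₁ and W₂ can also be checked numerically. The sketch below simulates 𝒮₂ with a step input (assuming σ = 1, λ = 1 and a long record, values chosen only for this illustration) and compares the sample variance of the one-step prediction error for the two sets of parameter values discussed above.

```python
# Numerical check of W1 and W2 in Example 2.5: one-step prediction of S2
# (y(t) = x(t) + e(t), x(t) - 0.8 x(t-1) = u(t-1)) with a step input.
# Assumptions here: step size sigma = 1, lambda = 1, N = 100000 samples.
import numpy as np

rng = np.random.default_rng(3)
N, sigma = 100_000, 1.0
u = sigma * np.ones(N)                    # step input
e = rng.standard_normal(N)
x = np.zeros(N)
for t in range(1, N):
    x[t] = 0.8 * x[t - 1] + u[t - 1]
y = x + e                                 # system S2

def pred_err_var(a, b):
    """Sample variance of y(t) - yhat(t) with yhat(t) = -a y(t-1) + b u(t-1)."""
    err = y[1:] - (-a * y[:-1] + b * u[:-1])
    return np.var(err)

print("true parameters (a0, b0):      W =", round(pred_err_var(-0.8, 1.0), 3))  # about 1.64
print("step-experiment limits (0, 5): W =", round(pred_err_var(0.0, 5.0), 3))   # about 1.0
```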
Note that it is crucial in the above calculation that the same experimental condition
(u(t) a step of size σ) is used in the identification and when evaluating the prediction. To
see the importance of this fact, assume that the parameter estimates are determined from
an experiment with u(t) as white noise of variance σ². Then (2.18), (2.20) with c₀ = a₀
give the parameter estimates. Now assume that the estimated model is used as a predictor for a step of size σ as input. Let S be the static gain, S = b₀/(1 + a₀). From (2.29)
we have, in the stationary phase,

ỹ(t) = (â − a₀)[Sσ + e(t − 1)] + e(t) + a₀e(t − 1)
With (2.18) and (2.20) (c₀ = a₀) one finds

â − a₀ = −a₀/(r + 1),   â = a₀r/(r + 1)

where r = b₀²σ² / [(1 − a₀²)λ²]. The expected value of ỹ²(t) becomes

W₃ = λ²(1 + â²) + (â − a₀)²S²σ²   (2.32)
   = λ²[1 + (a₀r/(r + 1))²] + (a₀/(r + 1))²S²σ²
> 1,2(2r + 1)
(2.33)
In the following the discussion will be confined to system 𝒮₁ only. In particular we will
analyze the properties of the matrix appearing in (2.15). Assume that this matrix is 'well
behaved'. Then there exists a unique solution. For system 𝒮₁ this solution is asymptotically (N → ∞) given by the true values a₀, b₀, since

(1/N) [ Σ y²(t−1)   −Σ y(t−1)u(t−1) ; −Σ y(t−1)u(t−1)   Σ u²(t−1) ] (a₀  b₀)ᵀ − (1/N) ( −Σ y(t)y(t−1) ;  Σ y(t)u(t−1) )
   = (1/N) Σ ( y(t−1) ;  −u(t−1) ) [ y(t) + a₀y(t−1) − b₀u(t−1) ]
   = (1/N) Σ ( y(t−1) ;  −u(t−1) ) e(t)  →  E ( y(t−1) ;  −u(t−1) ) e(t) = 0        (2.34)
The last equality follows since e(t) is white noise and is hence independent of all past
data.
Example 2.6 An impulse as input

TABLE 2.3 Parameter estimates for Example 2.6

Parameter   True value   Estimated value
a           −0.8         −0.796
b            1.0          2.950

This time it can be seen that the parameter a₀ is accurately estimated while the estimate
of b₀ is poor. This inaccurate estimation of b₀ was expected since the input gives little
contribution to the output. It is only through the influence of u(·) on y(·) that information about b₀ can be obtained. On the other hand, the parameter a₀ will also describe
the effect of the noise on the output. Since the noise is present in the data all the time, it
is natural that a₀ is estimated much more accurately than b₀.
For a theoretical analysis consider (2.15) in the case where u(t) is an impulse of
magnitude σ at time t = 1. Using the notation

R₀ = (1/N) Σ y²(t − 1),   R₁ = (1/N) Σ y(t)y(t − 1)

the least squares estimate becomes

( â  b̂ )ᵀ = [ NR₀   −y(1)σ ; −y(1)σ   σ² ]⁻¹ ( −NR₁ ;  y(2)σ )
          = [1/(R₀ − y²(1)/N)] ( −R₁ + y(1)y(2)/N ;  (−y(1)R₁ + y(2)R₀)/σ )        (2.35)
FIGURE 2.8 Input (upper part), output (1, lower part) and model output (2, lower part) for Example 2.6, system 𝒮₁.
Letting N → ∞ in (2.35), and noting that R₁/R₀ → −a₀ (the input is zero for t > 1, so y(t) behaves asymptotically as an autoregressive process), the limiting estimate of b becomes

b̂ = (a₀y(1) + y(2))/σ = b₀ + e(2)/σ   (2.36)
We see that b̂ has a second term that makes it differ from the true value b₀. This
deviation depends on the realization, so it will have quite different values depending
on the actual data. In the present realization e(2) = 1.957, which according to (2.36)
should give (asymptotically) b̂ = 2.957. This value agrees very well with the result given
in Table 2.3.
The behavior of the least squares estimate observed above can be explained as follows.
The case analyzed in this example is degenerate in two respects. First, the matrix in
(2.35) multiplied by 1/N tends to a singular matrix as N → ∞. However, in spite of this
fact, it can be seen from (2.35) that the least squares estimate uniquely exists and can be
computed for any value of N (possibly also for N → ∞). Second, and more important, in this
case (2.34) is not valid since the sums involving the input signal do not tend to
expectations (u(t) being equal to zero almost all the time). Due to this fact, the least
squares estimate converges to a stochastic variable rather than to a constant value (see
the limiting estimate b̂ in (2.36)). In particular, due to the combination of the two types
of degeneracy discussed above, the least squares method failed to provide consistent
estimates. ■
Examples 2.3 and 2.4 have shown that for system 𝒮₁ consistent parameter estimates are
obtained if the input is white noise or a step function (in the latter case it must be
assumed that there is noise acting on the system so that λ² > 0; otherwise the system of
equations (2.17) does not have a unique solution; see the discussion at the end of
Example 2.4). If u(t) is an impulse, however, the least squares method fails to give
consistent estimates. The reason, in loose terms, is that an impulse function 'is equal to
zero too often'. To guarantee consistency an input must be used that excites the process
sufficiently. In technical terms the requirement is that the input signal be persistently
exciting of order 1 (see Chapter 5 for a definition and discussion of this property).
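The excitation requirement can be made concrete by inspecting the matrix in (2.15) for different inputs: for an impulse the (1/N)-scaled matrix approaches a singular matrix, while for a white-noise input it stays well conditioned. The sketch below illustrates this; the signal parameters and the use of the condition number as a diagnostic are choices made here, not part of the original example.

```python
# Conditioning of the matrix in (2.15) for different inputs to S1.
# An impulse excites the system 'too little': the (1/N)-scaled matrix
# tends to a singular matrix, whereas white noise keeps it well conditioned.
import numpy as np

def lhs_matrix(y, u):
    y1, u1 = y[:-1], u[:-1]
    N = len(y1)
    return np.array([[np.sum(y1 * y1), -np.sum(y1 * u1)],
                     [-np.sum(y1 * u1), np.sum(u1 * u1)]]) / N

def simulate_s1(u, rng):
    e = rng.standard_normal(len(u))
    y = np.zeros(len(u))
    for t in range(1, len(u)):
        y[t] = 0.8 * y[t - 1] + u[t - 1] + e[t]
    return y

rng = np.random.default_rng(4)
N = 10_000
impulse = np.zeros(N); impulse[0] = 1.0
white = rng.standard_normal(N)
for name, u in [("impulse", impulse), ("white noise", white)]:
    M = lhs_matrix(simulate_s1(u, rng), u)
    print(f"{name:12s} condition number of (1/N)-scaled matrix: {np.linalg.cond(M):.1e}")
```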
Example 2.7 A feedback signal as input
Assume now that the input to the system 𝒮₁ is determined through output feedback,

u(t) = −ky(t)   (2.37)

With this input Σ y(t − 1)u(t − 1) = −kΣ y²(t − 1) and Σ u²(t − 1) = k²Σ y²(t − 1), so the matrix in (2.15) becomes

Σ y²(t − 1) ( 1   k  )
            ( k   k² )

which is singular. As a consequence the least squares method cannot be applied. It can
be seen in other ways that the system cannot be identified using the input (2.37). Assume
that k is known (otherwise it is easily determined from measurements of u(t) and y(t)
using (2.37)). Thus, only {y(t)} carries information about the dynamics of the system.
Using {u(t)} cannot provide any more information. From (2.8) and (2.37),

ε(t) = y(t) + (a + bk)y(t − 1)   (2.38)
This expression shows that only the linear combination a + bk can be estimated from the
data. All values of a and b that give the same value of a + bk will also give the same
sequence of residuals {ε(t)} and the same value of the loss function. In particular, the
loss function (2.12) will not have a unique minimum. It is minimized along a straight
line. In the asymptotic case (N → ∞) this line is simply given by

{θ | a + bk = a₀ + b₀k}

Since there is a valley of minima, the Hessian (the matrix of second-order derivatives)
must be singular. This matrix is precisely twice the matrix appearing in (2.15). This
brings us back to the earlier observation that we cannot identify the parameters a and b
using the input as given by (2.37). ■
Based on Example 2.7 one may be led to believe that there is no chance of obtaining
consistent parameter estimates if feedback must be used during the identification
experiment. Fortunately, the situation is not so hopeless, as the following example
demonstrates.
Example 2.8 A feedback signal and an additional setpoint as input
The system 𝒮₁ was simulated generating 1000 data points. The input was a feedback from
the output plus an additional signal,

u(t) = −ky(t) + r(t)   (2.39)

The signal r(t) was generated as a PRBS of magnitude 0.5, while the feedback gain was
chosen as k = 0.5. The least squares estimates were computed; the numerical values are
given in Table 2.4. Figure 2.9 depicts the input, the output and the model output.
FIGURE 2.9 Input (upper part), output (1, lower part) and model output (2, lower part) for Example 2.8.
TABLE 2.4 Parameter estimates for Example 2.8
Parameter
True value
Estimated value
-0.8
1.0
-0.754
0.885
It can be seen that this time reasonable parameter estimates were obtained even
though the data were generated using feedback.
To analyze this situation, first note that the consistency calculations (2.34) are still
valid. It is thus sufficient to show that the matrix appearing in (2.15) is well behaved (in
particular, nonsingular) for large N. For this purpose assume that ret) is white noise of
zero mean and variance 0 2 , which is independent of e(s) for all t and s. (As already
explained for 0 = 0.5 this is a dose approximation to the PRBS used.) Then from (2.1)
with Co = 0 and (2.39) (for convenience, we omit the index 0 of a and b), the following
equations are obtained:
yet) + (a + bk)y(t - 1)
u(t) + (a
br(t - 1) + e(t)
Ey2(t)
-Ey(t)u(t)
-Ey(t)U(t)
1
Eu 2 (t)
- 1 - (a + bk)2
which is positive definite. Here we have assumed that the closed loop system is
asymptotically stable so that la + bkl < 1.
IlIII
28
Introductory examples
Chapter 2
Summary and outlook
The experiences gained so far and the conclusions that can be drawn may be summarized
as follows.
1. When a nonparametric method such as transient analysis or correlation analysis is
used, the result is easy to obtain but the derived model will be rather inaccurate. A
parametric method was shown to give more accurate results.
2. When the least squares method is used, the consistency properties depend critically
on how the noise enters into the system. This means that the inherent model structure
may not be suitable unless the system has certain properties. This gives a requirement
on the system J. When that requirement is not satisfied, the estimated model will not
be an exact representation of ./ even asymptotically (for N - 7 (0). The model will
only be an approximation of J. The sense in which the model approximates the
system is determined by the identification method used. Furthermore, the parameter
values of the approximation model depend on the input signal (or, in more general
terms, on the experimental condition).
3. As far as the experimental conditionf!f'is concerned it is important that the input signal
is persistently exciting. Roughly speaking, this implies that all modes of the system
should be excited during the identification experiment.
4. When the experimental condition includes feedback from yet) to u(t) it may not be
possible to identify the system parameters even if the input is persistently exciting.
On the other hand, when a persistently exciting reference signal is added to the
system, the parameters may be estimated with reasonable accuracy.
Needless to say, the statements above have not been strictly proven. They merely rely on
some simple first-order examples. However, the subsequent chapters will show how
these conclusions can be proven to hold under much more general conditions.
The remaining chapters are organized in the following way.
Nonparametric methods are described in Chapter 3, where some methods other than
those briefly introduced in Section 2.3 are analyzed. Nonparametric methods are often
sensitive to noise and do not give very accurate results. However, as they are easy to
apply they are often useful means of deriving preliminary or crude models.
Chapter 4 treats linear regression models, confined to static models, that is models
which do not include any dynamics. The extension to dynamic models is straightforward
from a purely algorithmic point of view. The statistical properties of the parameter
estimates in that case will be different, however, except in the special case of weighting
function models. In particular, the analysis presented in Chapter 4 is crucially dependent
on the assumption of static models. The extension to dynamic models is imbedded in a
more general problem discussed in Chapter 7.
Chapter 5 is devoted to discussions of input signals and their properties relevant to
system identification. In particular, the concept of persistent excitation is treated in some
detail.
We have seen by studying the two simple systems./I and./2 that the choice of model
Problems
29
structure can be very important and that the model must match the system in a certain
way. Otherwise it may not be possible to get consistent parameter estimates. Chapter 6
presents some general model structures, which describe general linear systems.
Chapters 7 and 8 discuss two different classes of important identification methods.
Example 2.3 shows that the least squares method has some relation to minimization of
the prediction error variance. In Chapter 7 this idea is extended to general model sets.
The class of so-called prediction error identification methods is obtained in this way.
Another class of identification methods is obtained by using instrumental variables
techniques. We saw in (2.34) that the least squares method gives consistent estimates
only for a certain type of disturbance acting on the system. The instrumental variable
methods discussed in Chapter 8 can be seen as rather simple modifications of the least
squares method (a linear system of equations similar to (2.15) is solved), but consistency
can be guaranteed for a very general class of disturbances. The analysis carried out in
Chapters 7 and 8 will deal with both the consistency and the asymptotic distribution of
the parameter estimates.
In Chapter 9 recursive identification methods are introduced. For such methods the
parameter estimates are modified every time a new input-output data pair becomes
available. They are therefore perfectly suited to on-line or real-time applications. In
particular, it is of interest to combine them with time-varying regulators or filters that
depend on the current parameter vectors. In such a way one can design adaptive systems
for control and filtering. It will be shown how the off-line identification methods
introduced in previous chapters can be modified to recursive algorithms.
The role of the experimental condition in system identification is very important. A
detailed discussion of this aspect is presented in Chapter 10. In particular, we investigate
the conditions under which identifiability can be achieved when the system operates
under feedback control during the identification experiment.
A very important phase in system identification is model validation. By this we mean
different methods of determining if a model obtained by identification should be accepted
as an appropriate description of the process or not. This is certainly a difficult problem.
Chapter 11 provides some hints on how it can be tackled in practice. In particular, we
discuss how to select between two or more competing model structures (which may, for
example, correspond to different model orders).
It is sometimes claimed that system identification is more art than science. There are
no foolproof methods that always and directly lead to a correct result. Instead, there are
a number of theoretical results which are useful from a practical point of view. Even so,
the user must combine the application of such a theory with common sense and intuition
to get the most appropriate result. Chapter 12 should be seen in this light. There we
discuss a number of practical issues and how the previously developed theory can help
when dealing with system identification in practice.
Problems
Problem 2.1 Bias, variance and mean square error
Let Eli, i = 1, 2 denote two estimates of the same scalar parameter B. Assume, with N
30
Introductory examples
Chapter 2
bias (8;)
E(8;) - 8
{~/N ;~~ ~ :
=
2
{liN for i = 1
3/N for i = 2
where var (-) is the abbreviation for variance (.). The mean square error (mse) is defined
as E(B; - 8? Which one of 81 , B2 is the best estimate in terms of mse? Comment on
the result.
Problem 2.2 Convergence rates for consistent estimators
For most consistent estimators of the parameters of stationary processes, the variance of
the estimation error tends to zero as 11 N when N ~ 00 (N = the number of data points)
(see, for example, Chapters 4, 7 and 8). For nonstationary processes, faster convergence
rates may be expected. To see this, derive the variance of the least squares estimate of a
in
y(t)
= at +
e(t)
= 1,2, ... ,
11
= -N LXi
i=1
1
A
-j:j
L (X; N
0:
~
Ii)~
A
i= I
N
'"
2, = N 1_ 1,L,
(Xi - Ii')2
;=1
Determine the means and variances of the estimates 11, 0, and 02' Discuss their
(un)biasedness and consistency properties. Compare 01 and 02 in terms of their mse's.
Hint. Lemma B.9 will be useful for calculating var(oi)'
Remark. A generalization of the problem above is treated in Section B.9.
Problems
31
Problem 2.6 Least squares estimates with a step function as input, continued
(a) Verify the expression (2.26).
(b) Assume that the data are noise-free. Show that all solutions to the system of
equations (2.17) are given by (2.27).
Problem 2.7 Conditions for a minimum
Show that the solution to (2.15) gives a minimum point of the loss function, and not
another type of stationary point.
Problem 2.8 Weighting sequence and step response
Assume that the weighting sequence {h(k) h"'=o of a system is given. Let yet) be the step
response of the system. Show that yet) can be obtained by integrating the weighting
sequence, in the following sense:
yet) -yet - 1) = h(t)
y(-1) = 0
Bibliographical notes
The concepts of system, model structure, identification method and experimental
condition have turned out to be valuable ways of describing the items that influence an
identification result. These concepts have been described in Ljung (1976) and Gustavsson
et al. (1977, 1981). A classical discussion along similar lines has been given by Zadeh
(1962).
Chapter 3
NONPARAMETRIC
METHODS
3.1 Introduction
This chapter describes some nonparametric methods for system identification. Such
identification methods are characterized by the property that the resulting models are
curves or functions, which are not necessarily parametrized by a finite-dimensional
parameter vector. Two nonparametric methods were considered in Examples 2.1-2.2.
The following methods will be discussed here:
Transient analysis. The input is taken as a step or an impulse and the recorded
output constitutes the model.
Frequency analysis. The input is a sinusoid. For a linear system in steady state the
output will also be sinusoidal. The change in amplitude and phase will give the
frequency response for the used frequency.
Correlation analysis. The input is white noise. A normalized cross-covariance
function between output and input will provide an estimate of the weighting function.
Spectral analysis. The frequency response can be estimated for arbitrary inputs by
dividing the cross-spectrum between output and input to the input spectrum.
Step response
Sometimes it is of interest to fit a simple low-order model to a step response. This is
illustrated in the following examples for some classes of first- and second-order systems,
which are described using the transfer function model
32
Section 3.2
Transient analysis 33
(3.1)
where Yes) is the Laplace transform of the output signal yet), U(s) is the Laplace
transform of the input u(t), and G(s) is the transfer function of the system.
Example 3.1 Step response of a first-order system
Consider a system given by the transfer function
G(s) =
K
_
e s,
+ sT
(3.2a)
This means that the system is described by the first-order differential equation
T~ (t) + y(t)
= Ku(t - 't)
(3.2b)
Note that a time delay 1 is included in the model. The step response of such a system is
illustrated in Figure 3.1.
Figure 3.1 demonstrates a graphical method for determining the parameters K, T
and 1 from the step response. The gain K is given by the final value. By fitting the
steepest tangent, T and 1 can be obtained. The slope of this tangent is KIT, where Tis
the time constant. The tangent crosses the t axis at t = 1, the time delay.
II
K+---------------------,-------------------------------~
Chapter 3
34 Nonparametric methods
Example 3.2 Step response of a damped oscillator
Consider a second-order system given by
G( )
s =
KW6
S2
+ 2~wos +
(3.3a)
w6
dt2
)-
dy
(3.3b)
Physically this equation describes a damped oscillator. After some calculations the step
response is found to be
yet) =
L =
K[l -- V(l 1-
arccos
~2)
~2)t +
L)]
(3.3c)
2K+-----------------?-~----------------------------------
__~~
Section 3.2
Transient analysis 35
(3.3d)
and that
(3.3e)
where the overshoot M is given by
M
(3.3f)
The relation (3.3f) between the overshoot M and the relative damping is illustrated in
Figure 3.3.
The parameters K, ~ and Wo can be determined as follows (see Figure 3.4). The gain
K is easily obtained as the final value (after convergence). The overshoot M can be
36 Nonparametric methods
Chapter 3
K(l + M) -+------.,.,......
FIGURE 3.4 Determination of the parameters of a damped oscillator from the step
response.
determined in several ways. One possibility is to use the first maximum. An alternative
is to use several extrema and the fact (see (3.3e)) that the amplitude of the oscillations
is reduced by a factor M for every half-period. Once M is determined, ~ can be derived
from (3.3f):
-log M
(3.3g)
From the step response the period T of the oscillations can also be determined. From
(3.3d),
(3.3h)
Then
Wo
is given by
(3.3i)
II
Frequency analysis
Section 3.3
37
Impulse response
Theoretically, for an impulse response a Dirac function Nt) is needed as input. Then
the output will be equal to the weighting function h(t) of the system. However, an ideal
impulse cannot be realized in practice, and an approximate impulse must be used; for
example
()_{lla
o
u t .-
0 ~ t < a
a ~ t
(3.4)
This input satisfies Ju(t)dt = 1 as the idealized impulse and should resemble it for
sufficiently small values of the impulse length a.
Use of the approximate impulse (3.4) will give a distortion of the output, as can be
seen from the following simple calculation:
y(t) =
J
oo
1
h(s)u(t - s)ds = ~
It
h(s)ds = h(t)
(3.5)
max(O, I-a)
If the duration a of the impulse (3.4) is short compared to the time constants of interest,
then the distortion introduced may be negligible. This fact is illustrated in the following
example.
Example 3.3 Nonideal impulse response
Consider a damped oscillator with transfer function
1
G(s) = f)2 + O.4f) + 1
(3.6)
Figure 3.5 shows the weighting function and the responses to the approximate impulse
(3.4) for various values of the impulse duration a. It can be seen that the (nonideal)
impulse response deviates very little from the weighting function if a is small compared
to the oscillation period.
III
Problem 3.11 and Complement C7.S contain a discussion of how a parametric model
can be fitted to an estimated impulse response.
To summarize this section, we note that transient analysis is often simple to apply.
It gives at least a first model which can be used to obtain rough estimates of the relative damping, the dominating time constants and a possible time delay. Therefore,
transient analysis is a convenient way of deriving crude models. However, it is quite
sensitive to noise.
Chapter 3
38 Nonparametric methods
O~--------~Hr.-------~++--------~
10
FIGURE 3.5 Weighting function and impulse responses for the damped oscillator (3.6)
excited by the approximate impulses (3.4).
u(t) = a sin(wt)
(3.7)
and the system is asymptotically stable, then in the steady state the output will become
(3.8)
where
b = aIG(iw)1
(3.9a)
cp = arg[G(iw)]
(3.9b)
This can be proved as follows. Assume for convenience that the system is initially at
rest. (The initial values will only give a transient effect, due to the assumption of
stability.) Then the system can be represented using a weighting function h(t) as
follows:
yet) = {
h(L)U(t - L)dL
(3.10)
Section 3.3
Frequency analysis 39
where h(t) is the function whose Laplace transform equals G(s). Now set
GtCs) = {h('t)e-S1:d-r;
(3.11)
Since
sin wt =
zi1(ewt -
.
e -loot)
yet)
= ~ J~ h(-r;)[eiw(t-.,) - e-iw(t-T)]d-r;
(3.12)
When t tends to infinity, Gliw) will tend to G(iw). With this observation, the proof of
(3.8), (3.9) is completed.
Note that normally the phase cp will be negative. By measuring the amplitudes a
and b as well as the phase difference cp, the complex variable G(iw) can be found from
(3.9). If such a procedure is repeated for a number of frequencies then one can obtain a
graphical representation of G(iw) as a function of Ul. Such Bode plots (or Nyquist plots
or other equivalent representations) are wen suited as tools for classical design of
control systems.
The procedure outlined above is rather sensitive to disturbances. In practice it can
seldom be used in such a simple form. This is not difficult to understand: assume that
the true system can be described by
(3.13)
where E(s) is the Laplace transform of some disturbance e(t). Then instead of (3.8) we
will have
(3.14)
and due to the presence of the noise it will be difficult to obtain an accurate estimate of
the amplitude b and the phase difference cp.
40
Chapter 3
N onparametric methods
sin wt
yet)
1 - - - yJT)
cos wt
=
=
J:
J:
= bT cos qJ -
Yc(T) =
qJ)
sin wtdt
J:
(3.1Sa)
IT b sinewt +
o
bT sin
-2
qJ)
cos wtdt
(3.15b)
qJ -
If the measurements are noise-free (e(t) = 0) and the integration time T is a multiple of
the sinusoid period, say T = k2rr'/w, then
y\.(T) =
2bT cos qJ
(3.16)
Yc(T)
= b; sin
qJ
From these relations it is easy to determine band qJ; then IG(iw)1 is calculated according
to (3.9a). Note that (3.9) and (3.16) imply
Section 3.3
Ys(T) =
YcCT) =
Frequency analysis 41
a;
a;
Re G(iw)
(3.17)
1m G(iw)
Sensitivity to noise
. Intuitively, the approach (3.15)-(3.17) has better noise suppression properties than the
basic frequency analysis method. The reason is that the effect of the noise is suppressed
by the averaging inherent in (3.15).
A simplified analysis of the sensitivity to noise can be made as follows. Assume that
e(t) is a stationary process with covariance function re('t). The variance of the output is
then re(O). The amplitude b can be difficult to estimate by inspection of the signals unless
bZ is much larger than re(O). The variance of the term Ys(T) can be analyzed as follows
(Ye(T) can be analyzed similarly). We have
f: f:
f: f:
= E
(3.18)
Assume that the noise covariance function re(-t) is bounded by an exponential function
(a> 0)
(3.19)
(For a stationary disturbance this is a weak assumption.) Then an upper bound for
var[YsCT)] is derived as follows:
var[Ys(T)] ::;:;
f: f:
IreCt l
JT[fT-t2
: ;:; f: [f~oo
J0
=
-t
tz)ldtldtz
Ire(t)ld't dtz
oo
::;:; 2T
(3.20)
:0
2T
42 Nonparametric methods
Chapter 3
re(O)
b 2 sin 2 ( w!
(3.22)
+ <p)
yet)
2:
h(k)u(t - k)
+ vet)
(3.23)
k=O
or its continuous time counterpart. In (3.23) {h(k)} is the weighting sequence and vet) is
a disturbance term. Af'mme that the input is a stationary stochastic process which is
independent of the disturbance. Then the following relation (called the Wiener- Hopf
equation) holds for the covariance functions:
ryuCr)
2:
(3.24)
h(k)rJt - k)
k=O
where ryuCr) = Ey(t + .)u(t) and ru (') = Eu(t + .)u(t) (see equation (A3.1.11) in
Appendix A3.1 at the end of this chapter). The covariance functions in (3.24) can be
estimated from the data as
1
Tyu(') = N
N-maX(T,O)
2:
yet + .)u(t)
= 0, 1, 2, ...
t=l-min("O)
Tu {')
=N
(3.25)
N--1:
2:
u(t + ')U(t)
0, 1,2, ...
1=1
Then an estimate {h(k)} ofthe weighting function {h(k)} can be determined in principle
by solving
Section 3.5
fyuC L ) =
Spectral analysis
2:
h(k)fu(r; - k)
43
(3.26)
k=O
This will in general give a linear system of infinite dimension. The problem is greatly
simplified if we use white noise as input. Then it is known a priori that ru(L) = 0 for T, O.
For this case (3.24) gives
*'
h(k) = ryuCk)/ru(O)
(3.27)
which is easy to estimate from the data using (3.25). In Appendix A3.2 the accuracy of
this estimate is analyzed. Equipment exists for performing correlation analysis automatically in this way.
Another approach to the simplification of (3.26) is to consider a truncated weighting
function. This will lead to a linear system of finite order. Assume that
k
h(k) = 0
;?:
(3.28)
Such a model is often called a finite impulse re::.ponse (FIR) model in signal processing
applications. The integer M should be chosen to be large in comparison with the
dominant time constants of the system. Then (3.28) will be a good approximation. Using
(3.28), equation (3.26) becomes
M-l
fyuCL)
2::
h(k)fu(T, - k)
(3.29)
k=O
) =
(;:i~~
fuCM _ 1)
fiM
~ 1)) (. h(O)
fu(O)
)
(3.30)
heM -1)
Equation (3.29) can also be applied with more than M different values of T, giving rise to
an overdetermined linear system of equations. The method of determining {h(k)} from
(3.30) is discussed in further detail in Chapter 4. The condition required in order
to ensure that the system of equations (3.30) has a unique solution is derived in
Section 5.4.
44
Nonparametric methods
<Pyu( w)
Chapter 3
B (e -iw)<puC w)
(3.31)
where
1
<Pu( w) = 2n
2:
00
(3.32)
ru('t)e -nw
"(=-00
B(e- iw ) =
2:
h(k)e- ikw
k=O
To use (3.33) we must find a reasonable method for estimating the spectral densities. A
straightforward approach would be to take
.+,. ( )
't'yu
W
1 L.,
~
= 2n
r yu 't e
A
()
-hw
(3.34)
,=-N
(d. (3.25), (3.32), and similarly for <l>u(w). The computations in (3.34) can be organized
as follows. Using (3.25),
1
N
N-max(t,O)
,j,
(w) = 2nN L.,
'"
'"
yet + 't)u(t)e -hw
't'yu
L.,
A
,=-N t=1-min("O)
'to
~------------~~--------------~-+
FIGURE 3.7 Change of summation variables. Summation is over the shaded area.
Section 3.5
Spectral analysis
<pyu( (0)
1
2JtN
45
.
ltW
(3.35)
where
N
YN(oo) =
2: y(s)e-
isW
s=J
(3.36)
UN(oo) =
2: u(s)e-isw
8=1
are the discrete Fourier transforms of the sequences {yet)} and {u(t)}, padded with
zeros, respectively. For 00 = 0, 2JtIN, 4JtIN, ... , Jt they can be computed efficiently using
FFT (fast Fourier transform) algorithms.
In a similar way,
(3.37)
This estimate of the spectral density is called the periodogram. From (3.33), (3.35),
(3.37), the estimate for the transfer function is
(3.38)
This quantity is sometimes called the empirical transfer function estimate. See Ljung
(1985b) for a discussion.
The foregoing approach to estimating the spectral densities and hence the transfer
function will give poor results. For example, if u(t) is a stochastic process, then the
estimates (3.35), (3.37) do not converge (in the mean square sense) to the true spectrum
as N, the number of data points, tends to infinity. In particular, the estimate 4>u( (0) will
on average behave like <Pu( (0), but its variance does not tend to zero as N tends to
infinity. See BriUinger (1981) for a discussion of this point. One of the main reasons for
this behavior is that the estimate fyu('t) will be quite inaccurate for large values of L (in
which case only a few terms are used in (3.25)), but all covariance elements fyuCt) are
given the same weight in (3.34) regardless of their accuracy. Another more subtle
reason may be explained as follows. In (3.34) 2N + 1 terms are summed. Even if the
estimation error of each term tended to zero as N - 7 00, there is no guarantee that the
global estimation error of the sum also tends to zero. These difficulties may be
overcome if the terms of (3.34) corresponding to large values of L are weighted out. (The
above discussion of the estimates fyuCt) and 4>yu(oo) applies also to fueL) and 4>,,(00).)
Thus, instead of (3.34) the following improved estimate of the cross-spectrum (and
analogously for the auto-spectrum) is used:
1:::'
,h ( )
't'yu
00 = 2Jt
L.,;
T;=--N
ryu
A
()
()
L W L
e -i'tw
(3.39)
46 Nonparametric methods
Chapter 3
where Wet) is a so-called lag window. It should be equal to 1 for T = 0, decrease with
increasing T, and should be equal to 0 for large values of T. ('Large values' refer to a
certain proportion such as 5 or 10 percent of the number of data points, N). Several
different forms of lag windows have been proposed in the literature; see Brillinger
(1981).
Some simple windows are presented in the following example.
Example 3.4 Some lag windows
The following lag windows are often referred to in the literature:
Iii ~
ITI >
ITI ~
ITI >
M
M
(3.40a)
(3.40b)
M
(3.40c)
The window Wl(T) is caned rectangular, W2(T) is attributed to Bartlett, and W3("t) to
Hamming and Tukey. These windows are depicted in Figure 3.8.
Note that all the windows vanish for ITI > M. If the parameter M is chosen to be
sufficiently large, the periodogram will not be smoothed very much. On the other hand,
a small M may mean that essential parts of the spectrum are smoothed out. It is not
o +-________________~--__--------~~
o
M
FIGURE 3.8 The lag windows w](-c), W2(l:) and W3(l:) given by (3.40).
Section 3.5
Spectral analysis 47
The use of a lag window is necessary to obtain a reasonable accuracy. On the other
hand, the sharp peaks in the spectrum may be smeared out. It may therefore not be
possible to separate adjacent peaks. Thus, use of a lag window will give a limited
frequency resolution. The effect of a lag window on frequency resolution is illustrated
in the following simple example.
Example 3.5 Effect of lag window on frequency resolution
Assume for simplicity that the input signal is a sinusoid, u(t) = V2 sin wot, and that a
rectangular window (3.40a) is used. (Note that when a sinusoid is used as input, lag
windows are not necessary since UN(w) will behave as the tire spectrum, i.e. will be
much larger for the sinusoidal frequency and small for all other arguments. However,
we consider here such a case since it provides a simple illustration of the frequency
resolution.) To emphasize the effect of the lag window, assume that the true covariance
function is available. As shown in Example 5.7,
ruel:) = cos
(3.41a)
WOL
Hence
1 M
"
2:rr L;
cos
(3.41b)
WOle -ieW
T;=--M
1
<Pu(W) = 2[6(w -
Wo)
+ 6(w +
<PuCw)
A
4i1
O(
_
[e' WO-W)T
L;
(3.41c)
WO)]
= wo.
Evaluating (3.41b),
O(
e-10)0+m),]
T=-M
ei(Olo-m)(M+ 112)
4:rr
ei (wo-w)/2
e -i(wo-w)12
ei(wo+w)(M+ 1/2) _
+
=
ei (wo+w)l2
e -i(WO+W)(M+1I2)
e -i(wo+0))/2
(3.41d)
10~
____________________________________
M
o r--
w
(a)
10T-__________________________________
M=lO
</>u(w)
w
(b)
FIGURE 3.9 The effect of the width of a rectangular lag window on the windowed
spectrum (3.41d), (00 = 1.5.
Spectral analysis
St;ction 3.5
49
l0r-__________________________________- .
M
= 50
(c)
The windowed spectrum (3.41d) is depicted in Figure 3.9 for three different values of M.
It can be seen that the peak is spread out and that the width increases as M decreases.
When M is large,
<Pu( wo)
A
IM+1I2
= 4n
112
n )
= 2n
sin(n/2)
11M
= n2
= 2n2M
=-
(3.41e)
50
Nonparametric methods
Chapter 3
Summary
As shown in Chapter 2, nonparametric methods are easy to apply but give only
moderately accurate models. If high accuracy is needed a parametric method has to be
used. In such cases nonparametric methods can be used to get a first crude model,
which may give useful information on how to apply the parametric method.
Ka ~------------------r-----------~--T-"
Ka
cos <p
a sin c:p
a cos <p
-Ka~~--~------------~----------------~
-a
Problems
51
Problems
Problem 3.1 Determination of time constant T from step response
Prove the rule for determining T given in Figure 3.1.
Problem 3.2 Analysis of the step response of a second-order damped oscillator
Prove the relationships (3.3d)-(3.3f).
Problem 3.3 Determining amplitude and phase
One method of determining amplitude and phase changes, as needed for frequency
analysis, is to plot yet) versus u(t). If u(t) = a sin wt, yet) = Ka sin(wt + <p), show that
the relationship depicted in the Figure 3.10 applies.
Problem 3.4 The covariance function of a simple process
Let e(t) denote white noise of zero mean and variance 'I,? Consider the filtered white
noise process
1
yet) = 1 + aq
e(t)
lal
< 1
where q-l denotes the unit delay operator. Calculate the covariances ry(k)
Ey(t)y(t - k), k = 0, 1, 2, ... , of y(t).
Problem 3.5 Some properties of the spectral density function
Let <Pu(w) denote the spectral density function of a stationary signal u(t) (see (3.32:
1
<Pu( w) = 2n
00
2:
ruCt)e -ton
WE
[-n, n]
't"=-oo
Assume that
which guarantees the existence of <Pu(w). Show that <Pu(w) has the following properties:
(a) <Pu(w) is real valued and <Pu(-w) = <Pu(w)
(b) <Pu(w) ~ 0 for all w
Hint. Set <pet) = (u(t - 1)
that
= -;;
2:
(n - 1"tI)ru(t)eiW~
"(:=-n
52 Nonparametric methods
Chapter 3
2:
Hkq-k
(min)
Gkq-k
(pin)
k=O
G(q-l)
2:
k=O
be two stable matrix transfer functions, and let e(t) be a white noise of zero mean and
covariance matrix A (nln). Show that
2~ f~J[ H(ei<D)AGT(e-i<D)dw
~o HkAGI
= E[H(q-l)e(t)][G(q-l)e(t)]T
The first equality above is called Parseval's formula. The second equality provides an
'interpretation' of the terms occurring in the first equality.
Problem 3.7 Correlation analysis with truncated weighting function
Consider equation (3.30) as a means of applying correlation analysis. Assume that the
covariance estimates f yu (') and rue-) are without errors.
(a) Let the input be zero-mean white noise. Show that, regardless of the choice of M,
the weighting function estimate is exact in the sense that
= 0, ... , M
h(k) = h(k)
- 1
(b) To show that the result of (a) does not apply for an arbitrary input, consider the
input u(t) given by
u(t) - au(t - 1)
vet)
lal
< 1
+ ay(t - 1)
= bu(t - 1)
lal
< ]
Show that
h(O) = 0
h(k) = b( _a)k-l
h(k) = h(k)
heM _
1) =
heM (1
k ~ 1
k = 0, ... , M - 2
1)
+ aa)
(52,
Problems
53
Hint
. a M-l
a
1
1
a
-1
1
-a
-a
1 + a2
-a
(This result can be verified by direct calculations. It can be derived, for example, from
(C7.7.18).)
Problem 3.8 Accuracy of correlation analysis
Consider the noise-free first-order system
yet)
ay(t - 1)
lal <
= hu(t - 1)
Assume that correlation analysis is applied in the case where u(t) is zero-mean white
noise. Evaluate the variances of the estimates h(k), k = 1, 2, ... using the results of
Appendix A3.2.
Problem 3.9 Improved frequency analysis as a special case of spectral analysis
Show that if the input is a sinusoid of the form
u(t)
= a sin
wt
w=
(N
2n
N n, n
[0, N - 1]
the spectral analysis (3.33)-(3.36) reduces to the discrete time counterpart of the
improved frequency analysis (3.15)-(3.17). More specifically, show that
A._
ReH(e Wl )
2 1
= --;; N
2: yet) sin wt
t=1
A._
2 1
h(k)q-k.
L.J'
k=O
k = 0, ... , N
and show that it is approximately equal to the estimate provided by the spectral
analysis.
54
Chapter 3
Nonparametric methods
(i)
B(q-I)
One possibility to determine {ai, hj } from {h(k)} is to require that the model (i) has a
weighting function that coincides with the given sequence for k = 1, ... , 2n.
(a) Set H(q~l) = r,k=lh(k)q~k. Show that the above procedure can be described in a
polynomial formulation as
(ii)
and that Oi) is equivalent to the following linear system of equations:
I
0:
o
-~h(l)
-h(2)
-h(l)
I
I
-h(l)
hen)
h(l)
o
o
1
(iii)
~~------------
hI
I
I
I
-h(2n - 1)
:
I
b"
h(2n)
Also derive (iii) directly from the difference equation (i), using the fact that {h(k)}
is the impulse response of the system.
(b) Assume that {h(k)} is the noise-free impulse response of an nth-order system
where Ao, 8 0 are coprime. Show that the above procedure gives a perfect model in
the sense that A(q-l) = AO(q--l), B(q-I) = BO(q-l).
Bibliographical notes
Eykhoff (1974) and the tutorial papers by Rake (1980, 1987) and Glover (1987) give
some general and more detailed results on nonparametric identification methods.
Some different ways of determining a parametric model from a step response have
been given by Schwarze (1964).
Frequency analysis has been analyzed thoroughly by Astrom (1975), while Davies
(1970) gives a further treatment of correlation analysis.
The book by Jenkins and Watts (1969) is still a standard text on spectral analysis. For
Appendix A3.1 55
more recent references in this area, see Brillinger (1981), Priestley (1982), Ljung
(1985b, 1987), Hannan (1970), Wellstead (1981), and Bendat and Piersol (1980). Kay
(1988) also presents many parametric methods for spectral analysis. The FFT algorithm
for efficiently computing the discrete Fourier transforms is due to Cooley and Tukey
(1965). See also Bergland (1969) for a tutorial description.
Appendix A3.1
Covariance junctions, spectral densities and linear filtering
Let u(t) be an nu-dimensional stationary stochastic process. Assume that its mean
value is mu and its covariance function is
(A3.1.1)
The inverse relation to (A3.1.2) describes how the covariance function can be found
from the spectral density. This relation is given by
ru(T) =
J~", <puCw)ei1;wdw
(A3.1.3)
As a verification, the right-hand side of (A3.1.3) can be evaluated using (A3.1.2), giving
1: ' = - 0 0
yet) =
2:
h(k)u(t - k)
(A3.1.4)
k=O
56 Nonparametric methods
H(q-l) =
Chapter 3
h(k)q-k
(A3.1.5)
k=O
where q-l is the backward shift operator. Using H(q-l), the filtering (A3.1.4) can be
rewritten as
yet) = H(q-l)U(t)
(A3.1.6)
my = Ey(t) = E
h(k)u(t - k) =
h(k)m" = H(l)mu
(A3.1.7)
k=O
k=O
Note that H(l) can be interpreted as the static (de) gain of the filter.
Now consider how the deviations from the mean values yet) ~ yet) - my, aCt)
u(t) - m" are related. From (A3.1.4) and (A3.1.7),
yet)
h(k)u(t - k) -
k=O
h(k)mu
h(k)a(t - k)
h(k)(u(t - k) - mu]
k=O
k=O
f'.,
(A3.1.8)
= H(q-l)a(t)
k=O
Thus (a(t), yet) are related in the same way as (u(t) , y(t. When analyzing the
covariance functions, strictly speaking we should deal with aCt), y(t). For simplicity we
drop the - notation. This means formally that u(t) is assumed to have zero mean. Note,
however, that the following results are true also for mu *- O.
Consider first the covariance function of y(t). Some straightforward calculations give
L L
j=O
k=O
L 2:
(A3.1.9)
h(j)ru (. - j + k)hT(k)
j=O k=O
In most situations this relation is not very useful, but its counterpart for the spectral
densities has an attractive form. Applying the definition (A3.1.2),
100.
<j:>
"
r y(.)e-n:w
y(0) = -2n L.
,[=-00
= 2~
i: i: i: h(j)e-ijwr
1:=-00 j=O
u (. -
k)e-i(r-j+k)whT(k)eikW
k=O
2~ j~ ~O h(f)e--ijWL,~oo ruC.')e-iT'W]hT(k)eikW
Appendix A3.1
57
or
(A3.1.1O)
This is a useful relation. It describes how the frequency content of the output depends on
the input spectral density 4>u(w) and on the transfer function H(e iW ). For example,
suppose the system has a weakly damped resonance frequency W00 Then IH(eiWo)1 will be
large and so will 4>y(wo) (assuming 4>,,(wo) =I=- 0).
Next consider the cross-covariance function. For this case
2: h(j)Eu(t +
L -
j)uT(t)
(A3.1.11)
j=O
2: h(j)ru(L -
j)
j=O
()
't'yu W
1 L..i
~
2n
( ) e -hw
ryu L
1'=-00
= 2n
0000
2: 2: h(j)e
T=
-00
-ijw
ru( L
j)e -i(T-j)O)
j=O
or
(A3.1.12)
The results of this appendix were derived for stationary processes. Ljung (1985c) has
shown that they remain valid, with appropriate interpretations, for quasi-stationary
signals. Such signals are stochastic processes with deterministic components. In analogy
with (A3 .1.1), mean and covariance functions are then defined as
mu = lim
N----'J>oo
~b1'1 L.J
!!:.
(A3.1.13a)
Eu(t)
t=l
ru(L)
= N--+oo
lim -N"
L.J
E(u(t
L) - mu)(u(t) - mSI
(A3.1.13b)
t=l
assuming the limits above exist. Once the covariance function is defined, the spectral
density can be introduced as in (A3.1.2). As mentioned above, the general results
(A3.1.1O) and (A3.1.12) for linear filtering also hold for quasi-stationary signals.
Chapter 3
58 Nonparametric methods
Appendix A3.2
Accuracy of correlation analysis
Consider the system
yet) =
2:
h(k)u(t - k) + vet)
(A3.2.1)
k=O
where vet) is a stationary stochastic process, independent of the input signal. In Section
3.4 the following estimate for the weighting function was derived:
1 N
N yet + k)u(t)
ryu(k)
1=1
h(k) = Yu(O) = --1-N- - - k = 0, 1, ...
(A3.2.2)
2:
2:
u2 (t)
1=1
assuming that u(t) is white noise of zero mean and variance 1)2.
The accuracy of the estimate (A3.2.2) can be determined using ergodic theory
(see Section B.I of Appendix B). We find Yyu(k) - ? ryu(k) , Y,,(O) -~ ru(O) and hence
h(k) - ? h(k) as N tends to infinity. To examine the deviation h(k) -- h(k) for a finite but
large N it is necessary to find the accuracy of the covariance estimates fyu(k) and Yu(O).
This can be done as in Section B.S, where results on the accuracy of sample
auto covariance functions are derived. However, here we choose a more direct way. First
note that
h(k) - h(k)
l[N
1 N ~ {y(t + k) - h(k)u(t)}u(t)
= ;:-u(O)
IlN{'"
= 0 2 N ~ ~ h(i)u(t
+k -
i)
(A3.2.3)
+ vet + k) u(t)
i*k
The covariance P IAV between {h([.l) - h([.l)} and {heY) - hey)} is calculated as follows:
i) +
V(I + ~) JU(I)]
Z*IA
N;04
i: i:
(A3.2.4)
h(i)h(j)E[u(t + fl - i) u(t)u(s +
1
N204
2: 2:
1=1 s=1
Ev(t +
fl)V(S
+ Y)Eu(t)u(s)
Y -
j)u(s)]
Appendix A3.2 59
where use has been made of the fact that u and v are uncorrelated. The second term in
(A3.2.4) is easily found to be
No 2 rv(1!
- v)
To evaluate the first term, note that for a white noise process
= 0 for i <
rv(~;Z v) + ~ i: i: h(i)h(f)6!"-i,v-j + ~2
i=O j=O
he[.t tFv
= rv(~;Z
v) +
i:
h(i)h(i -,
(N - IT!)h(T
+ I!)h(v - T)
T=-N
'1:'FO
+ v)
i=O
~2
i*!"
't=-[.t
,,*0
+N
2
h(T + I!)h(v - L) - N h(l!)h(v)
(A3.2.5)
T=-!"
Note that the covariance element P [.tv will not vanish in the noise-free case (v( t) == 0). In
contrast, the variances-covariances of the estimation errors associated with the
parametric methods studied in Chapters 7 and 8, vanish in the noise-free case. Further
note that the covariance elements P [.tv approach zero when N tends to infinity. This is in
contrast to spectral analysis (the counterpart of correlation analysis in the frequency
domain), which does not give consistent estimates of H.
Chapter 4
LINEAR REGRESSION
y(t) = cpT(t)8
(4.1)
yet)
<I>T(t)8
(4.2)
where yet) is a p-vector, <I>(t) an (n Ip) matrix and 8 an n-vector. Least squares
estimation of the parameters in multivariable models of the form (4.2) will be treated in
some detail in Complement C7.3.
Example 4.1 A polynomial trend
Suppose the model is of the form
yet)
ao
+ alt + ... + a/
with unknown coefficients ao, ... , ar' This can be written in the form (4.1) by defining
cp(t) = (1
Such a model can be used to describe a trend in a time series. The integer r is typically
taken as 0 or 1. When r = 0 only the mean value is described by the model.
II
Example 4.2 A weighted sum of exponentials
In the analysis of transient signals a suitable model is
yet)
ble- klt
Section 4.1
It is assumed here that kI, ... , k n (the inverse time constants) are known but that the
weights b I , ... , bn are unknown. Then a model of the form (4.1) is obtained by setting
cp(t) = (e-k]t
...
8 = (b l
e-knt)T
bn)T
III
hIU(t - 1)
+ ... +
hM-1U(t - M
1)
The input signal u(t) is recorded during the experiment and can hence be considered as
known. In this case
cp(t) = (u(t)
8
u(t - 1)
u(t - M
+ l)f
(h o
This type of model will often require many parameters in order to give an accurate
description of the dynamics (M typically being of the order 20-50; in certain signal
processing applications it may be several hundreds or even thousands). Nevertheless, it
is quite simple conceptually and fits directly into the framework discussed in this
III
chapter.
The problem is to find an estimate 8 of the parameter vector 8 from measurements y(l),
cp(l), ... , yeN), cp(N). Given these measurements, a system of linear equations is
obtained, namely
y(l) = cpT(1)8
y(2) = cpT(2)8
Y = <1>8
(4.3)
where
an (Nil) vector
<1> = (<:PT;(l),
an (NI n) matrix
(4.4a)
(4.4b)
cpT(N)
One way to find 8 from (4.3) would of course be to choose the number of measurements,
N, to be equal to n. Then <1> becomes a square matrix. If this matrix is nonsingular the
62
Linear regression
Chapter 4
linear system of equations, (4.3), could easily be solved for 8. In practice, however,
noise, disturbances and model misfit are good reasons for using a number of data
greater than 11. With the additional data it should be possible to get an improved
estimate. When N > n the linear system of equations, (4.3), becomes overdetermined.
An exact solution will then in general not exist.
Now introduce the equation errors as
(4.5)
and stack these in a vector
In the statistical literature the equation errors are often called residuals. The least
squares estimate of 8 is defined as the vector that minimizes the loss function
(4.6)
where 1111 denotes the Euclidean vector norm. According to (4.5) the equation error
E(t) is a linear function of the parameter vector 8.
The solution of the optimization problem stated above is given in the following
lemma.
Lemma 4.1
Consider the loss function V(8) given by (4.5), (4.6). Suppose that the matrix <1>T<1> is
positive definite. Then Vee) has a unique minimum point given by
_________________ (4.7)J
The corresponding minimal value of
vee) is
(4.8)
Proof. Using (4.3), (4.5) and (4.6) an explicit expression for the loss function V(8) can
be obtained. The point is to see that Vee) as a function of e has quadratic, linear and
constant terms. Therefore it is possible to use the technique of completing the squares.
We have
E =--::
and
Y .- <1>8
Section 4.1
V(8) =
~ [Y -
(J;>8f[Y - <P8]
Hence
V(8) =
+ 2:[yTy _ yT$J;>T(J;-l(J;>Ty]
2
The second term does not depend on 8. Since (J;>T$ by assumption is positive definite, the
first term is always greater than or equal to zero. Thus V(8) can be minimized by setting
the first term to zero. This gives (4.7), as required. The minimal value (4.8) of the loss
function then follows directly.
II
Remark 2. Some basic results from linear algebra (on overdetermined linear systems of
equations) are given in Appendix A (see Lemmas A.7 to A.IS). In particular, the least
squares solutions are characterized and the so-called pseudoinverse is introduced; this
replaces the usual inverse when the matrix is rectangular or does not have full rank. In
particular, when (J;> is (Nln) and of rank n then the matrix (cpT(J;-l(J;>T which appears in
(4.7) is the pseudoinverse of (J;> (see Lemma A.ll).
III
Remark 3. The form (4.7) of the least squares estimate can be rewritten in the
equivalent form
(4.10)
In many cases cp(t) is known as a function of t. Then (4.10) might be easier to
implement than (4.7) since the matrix (J;> of large dimension is not needed in (4.10). Also
64
Linear regression
Chapter 4
the form (4.10) is the starting point in deriving several recursive estimates which will be
presented in Chapter 9. Note, however, that for a sound numerical implementation,
neither (4.7) nor (4.10) should be used, as will be explained in Section 4.5. Both these
forms are quite sensitive to rounding errors.
IIIlIII
i = 1, ... , n
y* =
<D/lj
j=l
<Pl
k::._
_+'<:.-...-----+ <1>2
FIGURE 4.1 Geometrical illustration of the least squares solution for the case N = 3,
n = 2.
Section 4.2
Example 4.4 Estimation of a constant
Assume that the model (4.1) is
yet) = b
This means that a constant is to be estimated from a number of noisy measurements. In
this case
qJ(t) = 1
1
N[y(l) + ... + yeN)]
yet)
= qJ T (t)8 0
e(t)
(4.l1a)
where 80 is called the true parameter vector. Assume further that eel) is a stochastic
variable with zero mean and variance A? In matrix form equation (4.11a) is written as
<pO o
+e
(4.l1b)
where
Lemma 4.2
Consider the estimate (4.7). Assume that the data obey (4.11) with e(t) white noise of
zero mean and variance ",.2. Then the following properties hold:
66
Linear regression
S2
Chapter 4
2V(e)/(N - n)
(4.13)
-----------------------------------------------~
8=
($T$)-l$T{$OO
+ e}
00
+ ($T$)-I$Te
and hence
EO
= 00 +
($T$)-l$TEe = 80
yr =
E[($T$)-l$Te][($T$)-l$Tep'
Vee)
~ [<1>80 +
= !eT[1 - <I>(<I>T<I-l<1>T]e
Consider the estimate S2 given by (4.13). Its mean value can be evaluated as
ES2
- n)
= [trIN - tr{<I>(<I>T<I-I<1>T}]"A.2/(N - n)
= [trlN - tr{(<I>T<I-l(<I>T<I}]A.2/(N - n)
= [trIN - trlnJ"A.2/(N - n) = [N - n]"A.2/(N - n)
"A.2
In the calculations above h denotes the identity matrix of dimension k. We used the
fact that for matrices A and B of compatible dimensions tr(AB) = tr(BA). The
calculations show that S2 is an unbiased estimate of "A.2.
II
Remark 1. Note that it is essential in the proof that <I> is a deterministic matrix. When
taking expectations it is then only necessary to take e into consideration.
II
Remark 2. In Appendix B (Lemma B.15) it is shown that for every unbiased estimate
is a lower bound, PeR, on the covariance matrix of O. This means that
8 there
Section 4.3
Example B.1 analyzes the least squares estimate (4.7) as well as the estimate S2, (4.13),
under the assumption that the data are Gaussian distributed and satisfy (4.11). It is
shown there that cov(O) attains the lower bound, while var(s2) does so only
II1II
asymptotically, i.e. for a very large number of data points.
~_C_O_V_(O_)_=__(~_T_~_)_-_l~_T_R_~_(_~_T~__)-_l__________________________(4_.GiJ
Next we will extend the class of identification methods and consider general linear
estimates of 80 , By a linear estimate we mean that 0 is a linear function of the data
vector Y. Such estimates can be written as
(4.17)
where Z is an (Nln) matrix which does not depend on Y. The least squares estimate
(4.7) is a special case obtained by taking
Z =
~(~T~)-l
We shall see how to choose Z so that the corresponding estimate is unbiased and has a
minimal covariance matrix. The result is known as the best linear unbiased estimate
(BLUE) and also as the Markov estimate. It is given in the following lemma.
Lemma 4.3
Consider the estimate (4.17). Assume that the data satisfy (4.11), (4.15). Let
(4.18)
Then the estimate given by (4.17), (4.18) is an unbiased estimate of 8 0 , Furthermore, its
covariance matrix is minimal in the sense that the inequality
68
Linear regression
covz*(8) = (CPTR-1cp)-J
Chapter 4
covz(8)
(4.19)
~
P 2 means that P2
P j is nonnegative
Elo = EEl'=T
EZ (CPEl n + e)
= Z T CPEl o
(4.20)
In particular, note that the choice (4.18) of Z satisfies this condition.
The covariance matrix of the estimate (4.17) is given in the general case by
covz(8)
In particular, if Z
covz*(8)
(4.21)
= (CPTR-l<I-lcpTR-lRR-lcp(cpTR-lcp)-l
(CPTR-Jcp)-l
(4.22)
which proves the equality in (4.19). It remains to show the inequality in (4.19). For
illustration, we will give four different proofs based on different methodologies.
Proof A. Let 0 denote the estimation error 8 - 80 for a general estimate (subject to
(4.20)); and let 0* denote the error which corresponds to the choice (4.18) of Z. Then
covz(8)
(4.23)
E8*8*T
= (CPTR-1cp)-1
E88*T
Note that we have used the constraint (4.20) in these calculations. Now (4.23) gives
covz(8)
~ (CP'1'K-1cp)-1
Section 4.3
ZTRZ - (Z TRZ*)(Z*TRZ*)-l(Z*TRZ)
;:?:
(4.24)
Since this holds for all Z (subject to (4.20)), it is also true in particular that
Z*TRZ*
(<I>TR-1<I-1
ZTRZ
;:?:
covz*(8)
(4.25)
= Z'IlR -- <I>(<I>TR-1<I-1<I>T]Z
However, the matrix in square brackets in this last expression can be written as
R - <I>(<I>TR- 1<I-1<I>T = [R - <I>(<I>TR- 1<I-1<I>T]R- 1 [R - <I>(<I>TR-1<I-1<I>T]
and it is hence nonnegative definite. It then follows easily from (4.25) that
covz(8) - covz*(8)
;:?:
+ tr{A(ZT<I> - I)}
where the (nln) matrix A represents the multipliers. Using the definition
for the derivative of a scalar function with respect to a matrix, the following result can
be obtained:
o=
aL
az
= 2RZaaT
<I>A
(4.26)
This equation is to be solved together with (4.20). Multiplying (4.26) on the left by
<I>TR- 1 ,
70
Linear regression
o=
2<l>TZaaT
Chapter 4
+ (<l>TR-1<lA
-lR-1<l>A
Thus
r-
Remark I. Suppose that R = A? I. Then Z* = <l>( <l> T<l> 1, which leads to the simple
least squares estimate (4.7). In the case where e(t) is white noise the least squares
IIIlI
estimate is therefore the best linear unbiased estimate.
Remark 2. One may ask whether there are nonlinear estimates with better accuracy
than the BLUE. It is shown in Example B.2 that if the disturbances e(t) have a
Gaussian distribution, then the linear estimate given by (4.17), (4.18) is the best one
among all nonlinear unbiased estimates. If e(t) is not Gaussian distributed then there
II
may exist nonlinear estimates which are more accurate than the BLUE.
Remark 3. The result of the lemma can be slightly generalized as follows. Let 8 be the
BLUE of 8 0 , and let A be a constant matrix. Then the BLUE of A8 0 is A8. This can be
shown by calculations analogous to those of the above proof. Note that equation (4.20)
will have to be modified to ZT<l> = A.
II
Remark 4. In the complements to this chapter we give several extensions to the above
lemma. In Complement C4.1 we consider the BLUE when a linear constraint of the
form A8() = b is imposed. The case when the residual covariance matrix R may be
singular is dealt with in Complement C4.3. Complement C4.4 contains some extensions
to an important class of nonlinear models.
II
y(t)
Section 4.4
71
Assume that the measurement errors are independent but that they may have different
variances, so
yet) = bo + eel)
Then
Ai
A~
R=
<1>=
1
1
N
2:
(lIA})
2:
(liAr) y(i)
;=1
j=l
This is a weighted arithmetic mean of the observations. Note that the weight of y(i), i.e.
1
----liAr
N
2: (lII-.J)
j=l
II
The treatment so far has investigated some statistical properties of the least squares
and other linear estimates of the regression parameters. The discussion now turns to the
choice of an appropriate model dimension. Consider the situation where there is a
sequence of model structures of increasing dimension. For Example 4.1 this would
simply mean that r is increased. With more free parameters in a model structure, a
better fit will be obtained to the observed data. The important thing here is to
investigate whether or not the improvement in the fit is significant. Consider first an
ideal case. Assume that the data are noise-free or N = 00, and that there is a model
structure, say uf(*, such that, with suitable parameter values, it can describe the
system exactly. Then the relationship of loss function to the model structure will be as
shown in Figure 4.2.
In this ideal case the loss
will remain constant as long as ,At is 'at least as
large as' uf(*. Note that for N ~ 00 and uf( :::l uf(* we have 2V(8)!N - ? Ee 2 (t). In the
Vee)
72
Linear regression
Chapter 4
V(S) r - - - : : : - - - - - - - - - - - ,
.At'
FIGURE 4.2 Minimal value of the loss function versus the model structure. Ideal case
(noise-free data or N ~ 00). The model structure .It * corresponds to the true system.
practical case, however, when the data are noisy and N < 00, the situation is somewhat
will decrease slowly with increasing .At, as
different. Then the minimal loss
illustrated in Figure 4.3.
The problem is thus to decide whether the decrease Ll V = VI - V z is 'small' or not
(see Figure 4.3). If it is small then the model structure.At 1 should be chosen; otherwise
the larger model .Atz should be preferred. For a normalized test quantity it seems
reasonable to consider Ll V = (VI - Vz)IVz. Further, it can be argued that if the true
system can be described within the smaller model structure .At 1, then the decrease Ll V
should tend to zero as the number of data points, N, tends to infinity. One would
therefore expect N(VI - Vz)IVz to be an appropriate test quantity. For 'small' values of
this quantity, the structure .At 1 should be selected, while otherwise .-4tz should be
chosen.
Vee)
Statistical analysis
To get a more precise picture of how to apply the test introduced above, a statistical
analysis is needed. This is given in the following lemma.
Lemma 4.4
Consider two model structures .At1 and .Atz given by
'-.....
-,..
FIGURE 4.3 Minimal value of the loss function versus the model structure. Realistic
case (noisy data and N < 00).
Section 4.4
.At1 : yet) = cpT(t)8 1
(4.27)
(4.28)
CP2() -
(CPI(t)
<P2(t)
(4.29)
where {e(t)} is a sequence of independent f i (0, A?) distributed random variables. Let
Vi denote the minimal value of the loss function as given by (4.8) for the model structure .Ati , i = 1, 2. Then the following results are true:
2V2
(1.) --:;::-
X2( N - n2 )
Proof. The proof relies heavily on Lemma A.29, which is rather technical. As in the
proof of Lemma 4.2 we have
where
P2 = I - <l>2(<I>i<l>2)-l<1>i
We now apply Lemma A.29 (where F} corresponds to <1>1 and F corresponds to <1>2)' Set
e = Uel"A, where U is a specific orthogonal matrix which simultaneously diagonalizes
P2 and PI - P2 Since U is orthogonal we have e - fiCO, I).
Next note that
2V2/"A2 = eTP2el"A2
= eT(IN-n2
O)e
On2
ON-n z
P2)el"A2 = eT (
o
The results now follow by invoking Lemma B.13.
74
Linear regression
Chapter 4
Remark. The x2 (n) distribution and some of its properties are described in Section B.2
of Appendix B.
II
Corollary. The quantity
(4.30)
Proof. The result is immediate by definition of the F-distribution; see Section B.2.
II
The variable [in (4.30) can be used to construct a statistical test for comparing the model
structures .Atl and .At2. For each model structure the parameter estimates and
the minimal value of the loss function can be determined. In this way the test quantity [
can easily be computed. If N is large compared to n2 the x2(n2 - nl) distribution can be
used instead of F(n2 - nb N - n2) (see Lemma B.14). Thus, if [ is smaller than
x~(n2 - nl) then we can accept .At1 at a significance level of a. Here x~(n2 - nl) is the
test threshold defined as follows. If x is a random variable which is x2(n2 - nl) distributed and a E (0, 1) is a given number, then by definition P(x > X~( n2 - nl = a. As a
rough rule of thumb, .Atl should be accepted when [< (n2 - nl) + V[8(n2 - nl)] and
rejected otherwise (see the end of Section B.2). This corresponds approximately to a =
0.05. Thus, if [~ (n2 - nl) + V[8(n2 - nl)], the larger model structure .At2 should be
regarded as more appropriate.
It is not easy to make a strict analysis of how to select a, for the following reason.
Note that
Computational aspects 75
Section 4.5
Solution of the normal equations.
Orthogonal triangularization.
A recursive algorithm.
Normal equations
The first approach is to compute I> TI> and I>Ty and then solve the so-called normal
equations (cf. (4.9:
[
(I>T<<IS = (<<I>TY)
(4.31)
(4.32)
(4.33)
Suppose now that the orthogonal matrix Q can be chosen so that QI> has a 'convenient'
form. In Appendix A, Lemmas A.16-A.20, it is shown how Q can be constructed to
make QI> upper triangular. This means that
QI> =
(~)
QY
(~~)
(4.34)
where R is a square, upper triangular matrix. The loss function then becomes
V(8) = IIQI>S -- QYI1 2 = IIR8 - zl11 2
IIz2112
(4.35)
Assuming that R is nonsingular (which is equivalent to I> being of fun rank, and also to
I>TI> being nonsingular), it is easy to see that V(8) is minimized for 8 given by
(4.36)
The minimum value is
min V(8)
e
IIz2112 =
ZiZ2
(4.37)
76
Linear regression
Chapter 4
It is an easy task to solve the linear system (4.36) since R is a triangular matrix.
The QR method requires approximately twice as many computations as a direct
solution to the normal equations. Its advantage is that it is much less sensitive to
rounding errors. Assume that the relative errors in the data are of magnitude 6 and
that the precision in the computation is 'rI. Then in order to avoid unreasonable errors
in the result we should require 'rI < if for the normal equations, whereas 'rI < 6 is
sufficient for the QR method. A further discussion of this point is given in Appendix A
(see Section A.4).
The following example illustrates that the normal equations are more sensitive to
rounding errors than the QR method.
Example 4.6 Sensitivity of the normal equations
Consider the system
4 4
(4.38)
where 6 is a small number. Since the column vectors of the matrix in (4.38) are almost
parallel, one can expect that the solution is sensitive to small changes in the coefficients.
The exact solution is
(-1/6)
1/6
(4.39)
25
25 + 6
25+6)
25 + 26 + 26 2 x =
(1)
1 + 26
(4.40)
If a QR method is applied to (4.38) with Q constructed as a Householder transformation (see Lemma A.18), then
Q=
~G ~3)
( 5 5 + 6/5)
o -76/5 x
( 115 )
-7/5
(4.41)
Assume now that the equations are solved on a computer using finite arithmetic and
that due to truncation errors 6 2 = O. The 'QR equations' (4.41) are then not affected
and the solution is still given by (4.39). However, after truncation the normal equations
(4.40) become
( 25 25 + 6) x (1)
25 + 3
25 + 23
1 + 23
(4.42)
Problems 77
x
(49/0 +
-49/0
2)
(4.43)
Note that this solution differs considerably from the true solution (4.39) to the original
II1II
problem.
Recursive algorithm
A third approach is to use a recursive algorithm. The mathematical description and
analysis of such algorithms are given in Chapter 9. The idea is to rewrite the estimate
+ K(t)[y(t) - cpT(t)e(t -
1)]
(4.44)
Here e(t) denotes the estimate based on t equations (t rows in <1. The term yet) cpT(t)e(t - 1) describes how well the 'measurement' yet) can be explained by using
the parameter vector e(t - 1) obtained from the previous data. Further, in (4.44), K(t) is
a gain vector. Complement C4.2 shows among other things how a linear regression
model can be updated when new information becomes available. This is conceptually
and algebraically very close to a recursive algorithm.
Summary
Linear regression models have been defined and the least squares estimates of their
unknown parameters were derived (Lemma 4.1). The statistical properties of the least
squares parameter estimates were examined (Lemma 4.2). The estimates were shown to
be unbiased and an expression for their covariance matrix was given. It was a crucial
assumption that the regression variables were deterministic and known functions. Next
the analysis was extended to general linear unbiased estimates. In particular Lemma 4.3
derived the estimate in this class with the smallest covariance matrix of the parameter
estimates. Finally, Lemma 4.4 provided a systematic way to determine which one of a
number of competitive model structures should be chosen. It was also pointed out that
orthogonal triangularization (the QR method) is a numerically sound way to compute a
least squares estimate.
Problems
Problem 4.1 A linear trend model I
Consider the linear regression model
yet) = a + bt + E(t)
Find the least squares estimates of a and b. Treat the following cases.
78
Linear regression
Chapter 4
So
2: yet)
2:
Sl =
1=1
ty(t)
1=1
Oi) The data are y(-N), y(-N + 1), ... , yeN). Set
N
So
2:
yet)
I=-N
Hint. r.~l t
2:
Sl
ty(t)
I=-N
= N(N +
r.~1 t 2
1)/2;
= N(N + 1)(2N +
1)/6.
1 S
=N
0
(case (i
Ii
= 2N 1+
1 So
(case (ii
Then b is estimated (by the least squares method) in the model structure
yet) - Ii
= bt + E(t)
yet) = a + bt + e(t)
where eel) is white noise of variance A2.
ec)
Find the variance of the quantity set) = Ii + ht. What is the value of this variance for
t = 1 and for t = N? What is its minimal value with respect to t?
Cd) Write the covariance matrix of 8 = (Ii h?' in the form
p
cov(8)
(OI QOIO?)
2 02
QO I02
Find the asymptotic value of the correlation coefficient Q when N tends to infinity.
(e) Introduce the concentration ellipsoid
(8 - 8)Tp-l(8 - 8)
= t
(Roughly, vectors inside the ellipsoid are likely while vectors outside are unlikely,
provided ~ is given a value ~n. In fact,
E(8 - 8?,p-l(8 - 8)
8 small enough.
x2 (n), see
t = 2 and
Problems 79
Problem 4.2 A linear trend model II
Consider the linear regression model
vitI: yet) = a
bt
I(t)
Assume that the aim is to estimate the parameter b. One alternative is of course to use
linear regression to estimate both a and b.
Another alternative is to work with differenced data. For this purpose introduce
vlt 2: z(t)
+ 2(t)
(where 2(t) == l(t) - fl(t - 1). Linear regression can be applied to vlt2 for estimating b only (treating 2(t) as the equation error).
Compare the variances of the estimate of b obtained in the two cases. Assume that the
data are collected at times t = 1, ... , N and obey
Y - <PO
= <PO + e
where
Y =
(~~)
<P =
(:~)
(:~)
and
Assume that Yb <PI and <P2 are available but Y2 is missing. Derive the least squares
estimates of 0 and Y2 defined as
0, h
<P8)T(y - <PO)
80
Linear regression
Chapter 4
cond(<I>T<I ~ O(N2rl(2r
+ 1))
for N large
e=
(al
<pet) = (cos
WIt
2:rr .
W' = - l
~ [~J
2:
yet) =
(ak
cos
wkt
+ b k sin
Wk t )
t = 1, ... , N
k=l
Show that the least squares estimates of {ad and {bd are equal to the Fourier
coefficients
2 N
ak
=N
1=1
(k = 1, ... , n)
1=1
Problem 4.7 Minimum points and normal equations when <I> does not have full rank
As a simple illustration of the case when <I> does not have full rank consider
Problems
81
Find all the minimum points of the loss function V(8) = Ily - <1>8112. Compare them with
the set of solutions to the normal equations
(<1>T<18
<1>Ty
Also calculate the pseudoinverse of <l>(see Section A.2) and the minimum norm solution
to the normal equations.
Problem 4.8 Optimal input design for gain estimation
Consider the following straight-line (or static gain) equation:
Ee(t)e(s) = j.,.2o t ,s
which is the simplest case of a regression. The parameter 8 is estimated from N data
points {yet), u(t)} using the least squares (LS) method. The variance ofthe LS estimate 8
of e is given (see Lemma 4.2) by
02
~ var(8)
= A2 /
~ u (t)
2
02
under the
Ax
=b
(i)
M)x
(ii)
is solved.
Assume that the matrix A is symmetric and positive semidefinite (and hence singular).
Then it can be factorized as
where B is (nip),
rankB=p<n
(iii)
(iv)
82 Linear regression
Chapter 4
(b) Writing the equations as B(BTx - d) = 0 we may choose to drop B and get the
system BTx = d, which is underdetermined. Find the minimum norm least squares
solution of this system, X2 = (BT)td.
(c) The regularized solution is X3 = (A + 6I)-lb. What happens when 0 tends to zero?
Problem 4.10 Conditions for least squares estimates to be BLUE
For the linear regression model
y
<PEl
+e
Ee = 0
EeeT = R > 0
R<P = <p F
=0
(i)
(ii)
<p(<pT<p)-l<pT]
Provide a simple direct proof of this inequality. Use that proof to obtain condition (i) of
Problem 4.10 which is necessary and sufficient for the least squares estimate (LSE) to be
BLUE.
Hint. Use some calculations similar to proof B of Lemma 4.3.
Problem 4.12 The least squares estimate of the mean is BLUE asymptotically
The least squares estimate of a in
yet)
= a
+ eel)
t = 1, .. "' N
1]
(see (4.16. The variance of the BLUE is
P BLUE = (<PTR-1<l-1
Let the process {e(t)} be stationary with spectral density <Piw) > 0 for w E (-Jr, Jr) and
covariance function r(k) = Ee(t)e(t - k) decaying exponentially to zero as k --,> 00
Then show that
Complement C4.1
lim NP LS
N...-.07CO
lim NPBLUE
N--'}>oo
83
= 2n<piO)
,
Hint. For evaluating lim NPBLUE , first use the fact that Amin(R) > 0 (see e.g.
N->oo
IIR-111
oo
I(R-1hpl
m:x ~
~ CYN
p=l
Bibliographical notes
There are many books in the statistical literature which treat linear regression. One
reason for this is that linear regressions constitute a 'simple' yet fundamental class of
models. See Rao (1973), Dhrymes (1978), Draper and Smith (1981), Weisberg (1985),
Astrom (1968), and Wetherill et al. (1986) for further reading. Nonlinear regression is
treated by Jennrich (1969) and Ratkowsky (1983), for example. The testing of statistical
hypotheses is treated in depth by Lehmann (1986).
For the differentiation (4.26), see for example Rogers (1980), who also gives a number
of related results on differentiation with respect to a matrix.
Complement C4.1
Best linear unbiased estimation under linear constraints
Consider the regression equation
y = <PO
Ee
EeeT
>
A of dimension (mlnO)
rank A = m < nO
The problem is to find the BLUE of O. We present two approaches to solve this problem.
Approach A
84
Linear regression
Chapter 4
+ aT(AS - b)
(see Problem 4.3) where a denotes the vector of Lagrange multipliers. By equating the
derivatives of F with respect to S and a to zero, we get
-<1>TR- 1(y - <1>S)
+ ATa
= 0
(C4.1.1)
AS = b
(C4.1.2)
= AS
- b
where S = Q<1>TR- 1Yis the unconstrained BLUE (see (4.17), (4.18. Thus, sinceAQA T
is positive definite,
a = (AQAT)-l(AS - b)
which inserted into (C4.1.1) gives the constrained BLUE
() = Q<1>TR- 1 y -
QATa
or
(C4.1.3)
ApproachB
The problem of determining () can be stated in the following way: find the BLUE of S in
the following regression equation with singular residual covariance matrix
(f)
(~)s + (~)
(C4.1.4)
with = O. First take > 0 (which makes the matrix nonsingular) and then apply the
standard BLUE result (4.18). Next suppose that tends to zero. Note that it would also
be possible to apply the results for a BLUE with singular residual covariance matrix as
developed in Complement C4.3. The BLUE for > 0 is given by
Complement C4.1
85
Be
8-
~QATb
- .!.QAT(1 + .!.AQAT)-lAQ.!.ATb
c
c
c
=
8 . ,.
+
~QAT(1 + ~AQAT)
-l (1
+ -iAQAT) -
~AQAT Jb
- b)
b)
as c ~ O.
e- B =
=
(8 -
B) - QAT(AQAT)-lA(8 - B)
[I - QA T(AQA T)-lAJ(8 - 8)
covCS - B) = Q - QAT(AQAT)-lAQ
Note that
cov(S - B) ~ cov(8 - B)
which can be seen as an illustration of the 'parsimony principle' (see Complement
CIl.1). That is to say, taking into account the constraints on B leads to more accurate
estimates. Finally note that B obeys the constraints
AS
86
Linear regression
Chapter 4
Complement C4.2
Updating the parameter estimates in linear regression models
New measurements
cPO +
(C4.2.I)
where
= ( ~)}ny
o }nO
cP=
<I}ny
I
}nf)
and
EE = 0
EEET =
(S o)}ny >
P }nO
= (<I>TS-l<I>
=
8+
(C4.2.2)
+ p- 1y-l(cpTS-ly + P- 18)
(<I>TS-'<I>
p- I )-l<I>TS-l[y - <I>8]
p-I)-l = P _ p<I>T(S
<I>p<I>T)-J<I>p
p--l)-l<I>TS-l
p<I>T(S
<I>p<I>T)-l[(S
<I>p<I>T)
- <I>p<I>T]S-l
= p<I>T(S
(C4.2.3)
<I>p<I>T)-l
(_C_4~
L-_O_'_=_O_-_+_P_<I>
__
T(_S_+__
<I>_P_<I>_T_)-_1_Cy
__
-_<I>
__
8)____________________
Complement C4.2
The covariance matrix of
87
1~
__p_A_=_P_____P_~_T_(S_+__~_P_~_1_)_-1_~_P________________________(_C_4_.2_.5_)~
e
Note that it !ollows explicitly that P ~ P. This means that the accuracy of is better
than that of 8, which is very natural. The use of the additional information carried by Y
should indeed improve the accuracy.
Decreasing dimension
~e
(C4.2.6)
EE = 0
e = Q~TR-Iy
~ (~TR-l~)-l
has been computed. In some situations it may be required to compute theA BLUE of the
parameters of a reduced-dimension regression, by 'updating' or modifying e. Specifically,
assume that after computing it was realized that the regressor corresponding to the last
(say) column of ~ does not influence Y. The problem is to determine the BLUE of the
parameters in a regression where the last column of ~ was eliminated. Furthermore this
should be done in a numerically efficient manner, presumably by 'ur~dating'
This problem can be stated equivalently as follows: find the BLUE 8 of e in (C4.2.6)
under the constraint
e.
A8
=b
A = (0
1)
b = 0
(3 = [J - QAT(AQA1)-lA]e
(C4.2.7)
tjJnS) T denote the last column of Q, and let !\ be the ith component of
Let (tjJl
Then, from (C4.2.7),
,
o ...
8 =
-tjJl/tjJnS
-tjJnS-1/tjJnS
81
e=
tjJl '
nS-l -
e.
~8ns
nS
tjJnS-l
nS
nS
(C4.2.8)
88 Linear regression
Chapter 4
The BLUE of the reduced-order regression parameters are given by the first (n8 - 1)
components of the above vector.
Note that the result (C4.2.8) is closely related to the Levinson-Durbin recursion
discussed in Complement C8.2. That recursion corresponds to a special structure of the
model (more exactly, of the <I> matrix), which leads to very simple expressions for {1V;}.
If more components of than the last one should be constrained to zero, the above
procedure can be repeated. The variables \jJj must of course then be redefined for each
reduction in the dimension. One can alternatively proceed by taking
A = [0
J]
=0
Q\2 Q22 }m
8, -_
(131)}n8 - m
A
8 2 }m
or
(C4.2.9~
Complement C4.3
Best linear unbiased estimates for linear regression models with possibly
singular residual covariance matrix
Consider the regression model
Y = <I>I:J
EeeT = R
(C4.3.1)
n8
(C4.3.2)
Complement C4.3
89
Lemma C4.3.1
Under the assumptions introduced above the BLUE of 0 is given by
(C4.3.3)
where At denotes the Moore- Penrose pseudoinverse of the matrix A. (see Section A.2
or Rao (1973) for definition, properties and computation of the pseudoinverse).
Proof. First it has to be shown that the inverse appearing in (C4.3.3) exists. Let
(C4.3.4)
r3
(C4.3.5)
= q,a
(C4.3.6)
q,q,T)~
0 ~RtR~
0 ~ q,Tf3 = 0 ~ (q,Tq,)a
0 ~ R~
0 ~ a
which proves that only a = 0 satisfies (C4.3.6) and thus that q,TRtq, is a positive definite
matrix. Here we have repeatedly used the properties of pseudoinverses given in Lemma
A.lS.
Next observe that 8* is an unbiased estimate of 8. It remains to prove that 8* has the
smallest covariance matrix in the class of linear unbiased estimators of 8:
where by unbiasedness ZTq, = I (see (4.17), (4.20. The covariance matrix of 8* is given
by
cov(8*) = (q,TRtq,)-lq,TRtRRtq,(q,TRtq,)-l
= (q,TRtq,)-lq,TRt(R - q,q,1)Rtq,(q,TRtq,)-1
= (q,Tktq,)-l - I
Since cov(S)
ZTRZ
+ I - (q,'l'Rtq,)-l ;::;:
(C4.3.7)
90
Linear regression
Chapter 4
R - <})(<})TRt<}-l<})T = [R - <})(<})TRt<}-l<})T]Rt[R -
<})(C:PTRt<}-lc:pT]
- (J - RRt)<})(<})TRt<}-l<})T
-
<})(C:PTRt<}-l<})T(I -
(C4.3.9)
RtR)
RRt)C:P = 0
(I -
(C4.3.1O)
13 = (J - RRt)<})a
(J - RtR)<})a
(C4.3.11)
RI3
(C4.3.12)
= 0
I3T <})
= 0
(C4.3.13)
13TRRt<})]a
=0
ZT[R -
<})(<})TRtc:pr-l<})T]Z =
ZT[R - <})(<})TRt<}-l<})T]Rt
(C4.3.14)
x [R -
<})(<})TR'c:p)-l<})T]Z
II1II
<})T(R +
= (I
=
(J +
C:PTR-lc:p -
<})TR- 1 c:p)-1<})TR- 1 ]
<})TR-l<}(J
<})TR-1<}-1<})TR- 1
<})TR- 1 <})-1<})TR- 1
Thus
8* ==
(<})TK-l<})"-l(J
<})TR-l<}(J
<l>TR- 1<}-1<})TR- 1 y
= (<})TR-1<l-1<})TR-1y
Complement C4.4
91
Complement C4.4
Asymptotically best consistent estimation of certain nonlinear
regression parameters
Earlier in the chapter several results for linear regressions were given. In this complement some of these results will be generalized to certain nonlinear models. Consider the
following special nonlinear regression model:
(C4.4.I)
where Y N and eN are M-vectors, g(.) is a vector-valued differentiable function, and 80 is
the vector of unknown parameters to be estimated. Both Y and e depend on the integer
N. This dependence will be illustrated below by means of an example. It is assumed that
the function g(.) admits a left inverse. The covariance matrix of eN is not known for
N < 00. However, it is known that
(C4.4.2)
Note that R is allowed to depend on 80 , It follows from (C4.4.1) and (C4.4.2) that the
difference [YN - g(8 0 )] tends to zero as lIVN when N ---? 00. Thus YN is a consistent
estimate of g(8 0 ); it is sometimes called a root-N consistent estimate. Note that it is
assumption (C4.4.2) which makes the nonlinear regression model (C4.4.I) 'special'.
Let
(C4.4.3)
where f() is a differentiable function, denote a general consistent (for N ---? 00) estimate
of 80 , The objective is to find a consistent estimate of 80 , which is asymptotically best in
the following sense. LAet Sdenote the asymptotically best consistent estimate (ABCE) of
80 , assuming such a 8 exists, and let
PM(8)
:;?:
(C4.4.4)
PM(O)
An example
Yk =
N-k
2:
t=J
x(t)x(t + k)
0, ... , M - 1
92
Linear ref.?ression
Chapter 4
e,
N-.....",oo
Since
e = f(g(Oo
f(g(Oo)) = So
(C4.4.5)
As this must hold for an arbitrary 80 , it follows that f (-) is a left inverse of g(.). Moreover
(C4.4.5) implies
afl
ag I
= I
ag g=g(8,,) as 8=80
(C4.4.6)
With
F - af(g)
og
G - og
- as
an (n8IM) matrix
g=g(8,,)
(C4.4.7)
an (Mln8) matrix
8=8"
FG = I
(C4.4.8)
Next we derive the asymptotic covariance matrix (for N ~ (0) of 8 in (C4.4.3). A Taylor
series expansion of 0 = fCYN ) around 80 = f(g(So gives
e= 8
which gives
Complement C4.4
YN(8 - 80 ) = PYN eN
93
+ O(l/YN)
Therefore
PM(8)
= N-'>oo
lim NE(8 - 80 )(8 -
eo?' = PRpT
(C4.4.9)
Then
(C4.4.1O)
To prove (C4.4.l0) recall that P and G are related by (C4.4.8). Thus (C4.4.1O) is
equivalent to
PRpT ;:::;: PG(GTR-1G)-lOTpT
However, this is exactly the type of inequality encountered in (4.25). In the following it is
shown that the lower bound P~ is achievable. This means that there exists a consistent
estimate of 80 , whose asymptotic covariance matrix is equal to PR-r.
An ABC estimate
Define
(C4.4.11a)
where
(C4.4.11b)
The asymptotic properties of
Taylor series expansion gives
eN
V'(Soyr = -GTR-leN
+!
aR-1(S)
aS 1 elV
(C4.4.13a)
94
Linear regression
Chapter 4
and
V"(8 0 ) = GTR-1G
+ O(lIYN)
(C4.4.13b)
(0 -
80 )
O(l/YN)
and that
YN(8 - 80 )
(GTR-1G)-lGTR-1YNeN
+ O(lIYN)
(C4.4.14)
(GTR-1G)-1
P~
(C4.4.1S)
]
where R - R(8 o) = O(lIYN), is asympto!ically equal to
similar to those made above when analyzing 8 show that
8.
Indeed, calculations
Some properties of P~
Under weak conditions the matrices {P~} form a sequence of nonincreasing positive
definite matrices, as M increases. To be more specific, assume that the model (C4.4.1) is
extended by adding one new equation. The ABCE of 80 in the extended regression has
the following asymptotic (for N ~ 00) covariance matrix (see (C4.4.15):
P~+l = (G1+1RM~1 G M + 1)-1
Complement C4.4 95
where
(~;)
(~~ !)
showing explicitly the dependence of G, R and pO on M. The exact expressions for the
scalar a and the vectors 'IjJ and ~ (which correspond to the new equation added to
(C4.4.1 are of no interest for the present discussion.
Using the nested structure of G and R and Lemma A.2, the following relationship can
be derived:
(P~+l)-l
= (G1 'ljJT){
G)
Rii(I
0)
+ (-R1iiB)(-BTRAl 1)/Y}
= (p~)-l
+ ['ljJT -
(:M)
G1Ril~]['ljJT
(C4.4.16)
G1R;l~]T/y
where
y = a - BTRil~
Since R M + 1 > 0 implies y > 0 (see Lemma A.2), it readily follows from (C4.4.16) that
(C4.4.17)
which completes the proof of the claim introduced at the beginning of this subsection.
Thus by adding new equations to (C4.4.1) the asymptotic accuracy of the ABCE
increases, which is an intuitively pleasing property. Furthermore, it follows from
(C4.4.17) that P~ must have a limit as M -l- 00. It would be interesting to be able to
evaluate the limiting matrix
p~ =
lim P~
M-,>oo
However, this seems possible only in specific cases where more structure is introduced
into the problem.
The problem of estimating the parameters of a finite-order model of a stationary
process from sample covariances was touched on previously. For this case it was proved
in Porat and Friedlander (1985, 1986), Stoica et al. (1985c) that P~ = PCR; where PCR
denotes the Cramer-Rao lower bound on the covariance matrix of any consistent
estimate of 8 0 (see Section BA). Thus, in this case the estimate is not only asymptotically (for N -l- 00) the best estimate of 80 based on a fixed number M of sample
covariances, but when both Nand M tend to infinity it is also the most accurate possible
estimate.
Chapter 5
INPUT SIGNALS
Step function.
Pseudorandom binary sequence.
Autoregressive moving average process.
Sum of sinusoids.
Examples of these input signals are given in this section. Their spectral properties are
described in Section 5.2. In Section 5.3 it is shown how the inputs can be modified in
various ways in order to give them a low-frequency character. Section 5.4 demonstrates
that the input spectrum must satisfy certain properties in order to guarantee that the
system can be identified. This will lead to the concept of persistent excitation.
Some practical aspects on the choice of input are discussed in Section 12.2. In some
situations the choice of input is imposed by the type of identification method employed.
For instance, transient analysis requires a step or an impulse as input, while correlation
analysis generally uses a pseudorandom binary sequence as input signal. In other
situations, however, the input may be chosen in many different ways and the problem of
choosing it becomes an important aspect of designing system identification experiments.
We generally assume that the system to be identified is a sampled data system. This
implies that the input and output data are recorded in discrete time. In most cases we will
use discrete time models to describe the system. In reality the input will be a continuous
time signal. During the sampling intervals it may be held constant by sending it through a
sample and hold circuit. Note that this is the normal way of generating inputs in digital
control. In other situations, however, the system may operate with a continuous time
controller or the input signal may not be under the investigator's control (the so-called
'normal operation' mode). In such situations the input signal cannot in general be
restored between the sampling instants. For the forthcoming analysis it will be sufficient
to describe the input and its properties in discrete time. We will not be concerned with
the behavior of the input during the sampling intervals.
With a few exceptions we will deal with linear models only. It will then be sufficient
to characterize signals in terms of first- and second-order moments (mean value and
96
Section 5.1
covariance function). Note however, that two signals can have equal mean and covariance function but still have drastically different realizations. As an illustration think of a
white random variable distributed as ..%(0, 1), and another white random variable
that equals 1 with probability 0.5 and -1 with probability 0.5. Both variables will have
zero mean and unit variance, although their realizations (outcomes) will look quite
different.
There follow some examples of typical input signals.
Example 5.1 A step function
A step function is given by
u(t) =
{OUo
t <
t ~ 0
(5.1)
The user has to choose only the amplitude uo. For systems with a large signal-to-noise
ratio, an input step can give valuable information about the dynamics. Rise time,
overshoot and static gain are directly related to the step response. Also the major time
constants and a possible resonance can be at least crudely estimated from a step
response.
III
Example 5.2 A pseudorandom binary sequence
A pseudorandom binary sequence (PRBS) is a signal that shifts between two levels in a
certain fashion. It can be generated by using shift registers for realizing a finite state
system, (see Complement C5.3; Davies, 1970; or Eykhoff, 1974), and is a periodic signal.
In most cases the period is chosen to be of the same order as the number of samples in
the experiment, or larger. PRBS was used in Example 2.3 and is illustrated in Figure 2.5.
When applying a PRBS, the user must select the two levels, the period and the clock
period. The clock period is the minimal number of sampling intervals after which the
sequence is allowed to shift. In Example 2.3 the clock period was equal to one sampling
interval.
III
Example 5.3 An autoregressive moving average sequence
There are many ways of generating pseudorandom numbers on a computer (see, for
example, Rubinstein, 1981; or Morgan, 1984, for a description). Let {e(l)} be a pseudorandom sequence which is similar to white noise in the sense that
1 N
N
e(t)e(t +
1:) ->
as N
->
00
(1:
=1=
0)
(5.2)
t=1
This relation is to hold for 1: at least as large as the dominating time constant of the
unknown system. From the sequence {e(t)} a rather general input u(t) can be obtained
by linear filtering as follows:
u(t) +
C1U(t -
+ dme(t - m)
(5.3)
Signals such as u(t) given by (5.3) are often called ARMA (autoregressive moving
average) processes. When all Ci = 0 it is called an MA (moving average) process, while
98
Input signals
Chapter 5
for an AR (autoregressive) process all d i = O. Occasionally the notation ARMA (mb m2)
is used, where ml and m2 denote the number of Cc and d;-coefficients, respectively.
ARMA models are discussed in some detail in Chapter 6 as a way of modeling time
series.
With this approach the user has to select the filter parameters m, {Cj}, {dj } and the
random generator for e(t). The latter includes the distribution of e(t) which often is taken
as Gaussian or rectangular, but other choices are possible.
The filtering (S.3) can be written as
C(q-I)U(t) = D(q-l)e(t)
(S.4a)
or
u(t)
D( -1)
= C(qq 1) e(t)
(S.4b)
where q-I is the backward shift operator (q-Ie(t) = e(t - 1), etc.) and
(S.4c)
The filter parameters should be chosen so that the polynomials C(z) and D(z) have all
zeros outside the unit circle. The requirement on C(z) guarantees that u(t) is a stationary
signal. It follows from spectral factorization theory (see Appendix A6.1) that the requirement on D(z) does not impose any constraints. There will always exist an
equivalent representation such that D(z) has all zeros on or outside the unit circle, as
long as only the spectral density of the signal is being considered. The above requirement
on D(z) will be most useful in a somewhat different context, when deriving optimal
predictors (Section 7.3).
Different choices of the filter parameters {Ci' dj } lead to input signals with various
frequency contents and various shapes of time realizations. Simulations of three different
ARMA processes are illustrated in Figure S.1. (The continuous curves shown are
obtained by linear interpolation.)
The curves (a) and (b) show a rather periodic pattern. The 'resonance' is more
pronounced in (b), which is explained by the fact that in (b) C(z) has zeros closer to
the unit circle than in (a). The curve for (c) has quite a different pattern. It is rather
irregular and different values seem little correlated unless they lie very close together.
u(t)
2: aj sineWjt + CPJ
(S.Sa)
j=1
10~
________________________________--,
u(t)
100
50
(a)
10~--------~~---__--------------~
u(t)
II
v
-10+-________________r - _ _. ____________~
50
(b)
100
Chapter 5
____________________________________,
u(t)
-5+-________________~--------------~~
o
50
100
(c)
am sin(wm(t + 1) + CPm)
Wm
n will
Figure S.2 illustrates a sum of two sinusoids. Both continuous time and sampled signals
III
passed through a zero-order holding device are shown.
Ey(t)
(S.6a)
3~
__________________, ,____________________________,
u(t)
-3~
__________________________________________
~~~
__
50
(a)
3~
__________________
~~
__________________________- ,
u(t)
-3~'
____________________________________________
~=-
____
50
(b)
W2
Chapter 5
1
m ~ N-.oo
lim N
2: yet)
(=1
(S.6b)
1 N
.L..J
(=
(5.7) ]
1:=-00
(5.8)
(cf. (A3.1.3)).
The spectral density will have drastically different structures for periodic signals than
for aperiodic signals. (See Appendix AS.1 for an analysis of periodic signals.)
The examples that follow examine the covariance functions and the spectral densities
of some of the signals introduced at the beginning of this chapter.
r (1:) = {a
u
2
1: = 0, M, 2M, ...
-a2/M elsewhere
(S.9a)
The spectral density of the signal can be computed using the formulas derived in
Appendix AS.1; the result is
Section 5.2
Spectral characteristics
103
(S.9b)
1
Co = M
2:
M-l
1( 2
a2 )
a2
ruCt) = M a - (M - 1) M = M2
(S.9c)
,=0
(S.9d)
a2
= M2 M -
a- k - a- M )
1 _ a-k
a2
M2(M + 1)
+ (M +
1)
{;1 0
M-l
k)]
w - 2n M
(S.ge)
ru("t)
'A,zo"o
(S.lOa)
where 0,,0 is Kronecker's delta (0,,0 = 1 if T = 0, (\,0 = 0 if T =F- 0). The spectral density is
easily found from the definition (S.7). It turns out to be constant
<Pu( w)
".1
= 2n
(S.lOb)
Next consider a filtered signal u(t) as in (S.3). Applying (A3. 1. 10), the spectral density in
this case is
(S.lOc)
From this expression it is obvious that the use of the polynomial coefficients {Ci' dj } will
considerably affect the frequency content of the signal u(t). If the polynomial C(z) has
zeros close to the unit circle, say close to e iwo and e -iwo, then <Pu( w) will have large
resonance peaks close to the frequency w = woo This means that the frequency w = Wo
is strongly emphasized in the signal. Similarly, if the polynomial D(z) has zeros close to
Chapter 5
eiUlO and e- iUlo then <Pu(wo) will be small and the frequency component of u(t) corresponding to W = Wo will be negligible. Figures 5.3 and 5.4 illustrate how the polynomial
coefficients {c;, dj } affect the covariance function and the spectral density (5.lOc),
respectively.
Figures 5.3 and 5.4 illustrate clearly that the signals in cases (a) and (b) are resonant
(with an angular frequency of about 0.43 and 0.66 respectively). The resonance is
sharper in (b). Note how these frequencies also show up in the simulations (see Figure
5.la, b). Figures 5.3c, S.4c illustrate the high-frequency character of the signal in
case (c).
Example 5.7 Characterization of a sum of sinusoids
Consider a sum of sinusoids as in Example 5.4,
m
u( t) =
2: aj sine
Wjt
+ CPj)
(S.l1a)
j=l
with
(S.l1b)
To analyze the spectral properties of u(t) consider first the quantity
10~
____________________________________- .
ru(t)
01-------~----------~~----------~--~
-10+-__________________r -________________
10
20
(a)
FIGURE 5.3 The covariance function of some ARMA processes. (a) )..2 = 1,
C(z) = 1 - 1.5z + 0.7Z2, D(z) == 1. (b)..2 = 1, C(z) = 1 - 1.5z + 0.9z 2, D(z) = l.
(c)
= 1, C(z) = 1, D(z) == 1 - 1.5z + 0.7z 2.
,,2
20~
______________________.____________- .
O+---~--------r-------~------~~--~
-20+-________________- r_________________~
10
20
(b)
5,-__________________________________-,
O~~~~-------------------~
-5+-________________- r_________________
10
(c)
20
10
o+-____________
-=~=_
______________.________
(a)
40
cj>,,(w)
o+-____________
~~
__________________ ________
~
o
(b)
FIGURE 5.4 The spectral densities of some ARMA processes. (a) )...2 = 1,
C(z) = 1 - 1.5z + 0.7z 2 , D(z) = 1. (b)...2 = 1, C(z) = 1 - 1.5z + 0.9z 2 , D(z) = 1.
(c) )...2 = 1, C(z) = 1, D(z) = 1 - 1.5z + 0.7z 2 .
Spectral characteristics
Section 5.2
o+-__________
~~
____________________________________
107
o
(c)
SN
sin(wt
(S.12a)
<p)
t=1
If
is a multiple of 2n then SN
S
= -1
1
{.. . 1 __ e iWN}
1m e''Pe 'W
.
N
1 _ e'W
N..
1m '\:'
e,(wt-Hp) = -
LA
1=1
and
ISNI
1 11 - eiwNI
11 - eiwi
~N
1
2
1
1
N ie iw/2 - e-- iwI2 1 - N Isin 0)/21
Hence
S
N""""
{Sin <p
0
if 0) = 2nk (k an integer)
elsewhere
as N ......,.
OQ
(S.12b)
Since u(t) is a deterministic function the definitions (S.6b) of m and ret) must be applied.
From (S.12a, b), the mean value is
= {~1
sin
<PI
if WI = 0
if 0)1 ={ 0
(S.12c)
In the case 0)1 = 0 this is precisely the first term in (S.l1a). Thus the signal u(t) - m
contains only strictly positive frequencies. To simplify the notation in the following
calculations u(t) - m will be replaced by u(t) and it is assumed that WI > O. Then
Chapter 5
1 N
lim -N ""
L.J u(t
N-4OO
+ T)U(t)
1=1
~J
j=1 k=l
LL
1
a;akJ~oo 2N
j=1 k=J
Wkt - CPk)
1=1
1=1
If
Wm
Sin(2!.
2 - (OT
J
III
't'J
1 N
lim N .L.,;
"" cos( w;t
N-4OO
1=1
while for
= n, j = k = m,
Wm
.
1
J~Too N
cos(Wjt
I=J
lim 1
~ sin(-2nt
N L.J
N-">oo
+ 2!.
2 -
T - 2m
)
't'm
1=1
reT) =
2:
j=1
a2
c=-L
J
2
Cj COS(Wj"C)
(5.12d)
Section 5.2
Spectral characteristics
109
If Wm = n the weight Cm should be modified. The contribution of the last sinusoid to ret)
a2
=
=
a2' [cos(nl')
a2
cos(2<Pm)]
(S.12e)
Note that (S.12e) is fairly natural: considering only one sinusoid with a = 1, W = n we
get y(l) = -sin <Pm, y(2) = sin <Pm, and in general yet) = (-lYsin <Pm' This gives
ry(l') = sin2 <Pm(-l)" which is perfectly consistent with (S.12d, e).
It remains to find the spectral density corresponding to the covariance function
(S.12d). The procedure is similar to the calculations in Appendix AS.l. An expression
for <p(w) is postulated and then it is verified by proving that (S.8) holds. We therefore
claim that the spectral density corresponding to (S.12d) is
m
<p(w)
C.
2: ; [6(w -
wJ + 6(w + wJ]
(S.12f)
j=l
.L C"21 [e
j=l
lTW,
+ e-n:wi]
2: C
cos(Wjl')
j=l
In the informal analysis of Chapter 2 a PRBS was approximated by white noise. The
validity of such an approximation is now examined.
First note that only second-order properties (the covariance functions) will be
compared here. The distribution functions can be quite different. A PRBS has a two-
Chapter 5
________________________________________________,
r,,(r)
-34-------------------------------------------------~
50
FIGURE 5.5 Covariance function for the sum of two sinusoids u(t)
sin O.4t + 2 sin 0.7t, Example 5.4.
point distribution (since it can take two distinct values only) while the noise in most cases
has a continuous probability density function.
A comparison of the covariance functions, which are given by (S.9a) and (S.lOa),
shows that they are very similar for moderate values of T provided A2 = a2 and M is large.
It is therefore justifiable to say that a PRBS has similar properties to a white noi.se. Since
a PRBS is periodic, the spectral densities, which are given by (S.ge) and (S.lOb), look
quite different. The density (S.ge) is a weighted sum of Dirac functions whereas (S.lOb)
is a constant function. However, it is not so relevant to compare spectral densities as
such: what matters are the covariances between various filtered input signals (cf.
Chapters 2, 7 and 8). Let
Xl(t) = G 1(q-l)U(t)
X2(t)
G 2(q-l)U(t)
where G1(q-l) and Giq-l) are two asymptotically stable filters. Then the covariance
between Xl(t) and X2(t) can be calculated to be
EXl(t)X2(t) =
f~n Gj(e-i"')G2(eiW)<t>u(w)dw
(cf. Problem 3.6; equation (5.8); and Appendix A3.1). Hence it is relevant to consider
integrals of the form
J=
f~1t f(w)<t>(w)dw
Section 5.2
Z
11 = 2Jt
a
Spectral characteristics
ot
f(w)dw
Z
2Jt
a
Zot
111
(5.13a)
f(w)dw
-ot
aZ [ f(O) + (M + 1) M-l
(
k)]
= MZ
~t f 2Jt M
(5.13b)
Now assume that the integral in (S.13a) is approximated by a Riemann sum. Then the
interval (0, 2Jt) is divided into M subintervals, each of length 2Jt/M, and the integral
is replaced by a sum in the following way:
aZ 2Jt
M-t
2: f
k=O
11 = 2Jt M
k) ~ h
(5.13c)
2Jt M
The approximation error tends to zero as M tends to infinity. It is easy to see that I z and
13 are very similar. In fact,
Iz -
k)]
aZ [
M-l
(
MZ (1 - M)f(O) + ~l f 2Jt M
(S.13d)
(S.14a)
(S.14b)
rz(T) =
I:rt <py,(w)e
Re
'Az
M2
I-rt
ot
i 1: W
dw
<p U 2 (W)I 1 - 1.
aem /Zei'[{Odw
k)]
(S.lSb)
2Jt M
x 1
'A2 [
1
(1 - a)2
M2
M-l
k)]
Chapter 5
r(t)
r2(t), M = 500
M
200
100
O~--------------~~~------------~~M=
50
~----IM=20
-A2+-________________.-________________~
t
o
10
5
FIGURE 5.6 Plots of ,\(t) and '2(t) versus t. The filter parameter is a = 0.9; '2(t) is
plotted for M = 500, 200, 100, 50 and 20.
Figure 5.6 shows the plots of the covariance functions flCt) and r2et) versus t, for
different values of M. It can be seen from the figure that fl(t) and fit) are very close for
large values of M. This is in accordance with the previous general discussion showing that
PRBS and white noise may have similar properties. Note also that for small values of M
there are significant differences between fl(t) and f2(t}.
III
Section 5.3
Lowpass filtering
113
.. Standard filtering.
" Increasing the clock period .
.. Decreasing the probability of level change.
Example 5.9 Standard filtering
This approach was explained in Examples 5.3 and 5.6. The user has to choose the filter
parameters {Ci' dj }. It was seen in Example 5.6 how these parameters can be chosen to
obtain various frequency properties. In particular, u(t) will be a low-frequency signal if
the polynomial C(z) has zeros close to z = 1. Then, for small ill, IC(eiw ) I will be small and
<p( ill) large (see (5.lOc.
III
Example 5.10 Increasing the clock period
Let e(t) be a white noise process. Then the input u(t) is obtained from e(t) by keeping the
same value for N steps. In this case N is called the increased dock period. This means
that
u(t) =
e([t~/l
1)
t = 1,2, ...
(5.16a)
1 M
r(L) = M~oo
lim M""
u(t + L)u(t)
L.J
t=1
Note that this definition of the covariance function is relevant to the asymptotic behavior of most identification methods. This is why (5.6b) is used here even if M(t) is not
a deterministic signal.
Now set M = pN and
1 N
v,(s) = N
2:
u(sN - N + k + L)u(sN - N + k)
k=l
Then
r(L)
)i~ pN
2: 2:
u(sN - N + k + "t)u(sN - N + k)
s=l k=l
= lim -
2:
p-"oo
s=l
vis)
Now V,,(Sl) and V'[(S2) are uncorrelated for lSI - s21 > 2, since they then are formed from
the input of disjoint intervals. Hence Lemma B.1 can be applied, and
3~
______________________
______________________- .
e(k)
-3
k
50
(a)
3
u(k)
-3+-______________________________________________
50
(b)
FIGURE 5.7 The effect of increasing the clock period. (a) {e(k)}. (b) {u(k)};
N = 3.
Section 5.3
Lowpass filtering
1
r(t) = Ev,;{s) = N
2:
115
Eu(sN - N + k + L)u(sN - N + k)
k=l
2:
= N
Eu(k + L)u(k)
k=l
1 [N-T:
~1 Eu(k + L)u(k) +
= N
=N
[N-"
f;1 Ee(1)e(l) + N
N -
+ L)u(k)
k=t:T+l Eu(k
k=t:,+l
Ee(2)e(1)
=-;::;
(5.16b)
The covariance function (5.16b) can also be obtained by filtering the white noise e(t)
through the filter
1.
H(q-l) = - ( 1 + q-l
YN
+ ... +
.
q-N+l)
1-
= -- .
--N
q -1
YN 1 - q
(5.16c)
This can be shown using (A3.1.9). For the present case we have
h( ') = {llYN j
= 0, ... , N
- 1
elsewhere
riL) =
2: 2:
h(j)h(k)re(L - j + k)
j=O k=O
N-l N-j
2: 2:
1
N
O"'-j+k,O
j=O k=O
:=-
N-2:1-,
k=O
N -
=:--
III
Remark. There are some advantages of using a signal with increased clock period over
a white noise filtered by (5.16c). In the first case the input will be constant over long
periods of time. This means that in the recorded data, prior to any parameter estimation,
we may directly see the beginning of the step response. This can be valuable information
per se. The measured data will thus approximately contain transient analysis. Secondly, a
II
continuously varying input signal will cause more wearing of the actuators.
Example 5.11 Decreasing the probability of level change
Let e(t) be a white noise process with zero mean and variance
",z.
Define u(t) as
116
Input signals
u
Chapter 5
( ) =
t
(S.17a)
The probability described by (S.17a) is independent of the state in previous time steps.
It is intuitively clear that if a is chosen close to 1, the input will remain constant over
long intervals and hence be of a low-frequency character. Figure S.8 illustrates such a
signaL
When evaluating the covariance function ruel:) note that there are two random
sequences involved in the generation of u(t). The white noise {e(t)} is one sequence,
while the other sequence accounts for occurrence of changes. These two sequences are
independent. We have
u(t +
Thus for
T) =
u(t)
T ~
0,
e(ll)
for some t1
e(t2)
for some t2
E[e(tl)e(t2) It1
a~)
(S.17b)
Ee(t1)e(t2)
'* t2]P(at least one change has occurred in the interval [t,
+ E[e(tl)e(t2)jtl
= x (1 -
+ T]
TD
+ A2 a'
(S.17c)
u(t)
.--
I--
'--
-1
50
FIGURE 5.8 A signal u(t) generated as in (5.17a). The white noise {e(t)} has a
two-point distribution (P(e(t) = 1) = P(e(t} = -1) = 0.5). The parameter a = O.S.
Persistent excitation
Section 5.4
117
This covariance function would also be obtained by filtering e(t) with a first-order filter
( 1 - a 2 )1/2
H( q -1) - 1aq l
(5.17d)
To see this, note that the weighting sequence of this filter is given by
j ~ 0
h(j) = (1 - a 2)1/2a j
Then from (A3.1.8),
ruCt:) =
2: 2: h(j)h(k)reCT -
+ k)
j=O k=O
)..2
2: h(k)h(k + T)
k=O
= J,.2(l - a 2 )
2:
a 2k + T = j.,?aT
k=O
III
These examples have shown different methods to increase the low-frequency content of a
signal. For the methods based on an increased clock period and a decreased probability
of change we have also given 'equivalent filter interpretations'. These methods will give
signals whose spectral properties could also be obtained by standard lowpass filtering
using the 'equivalent filters'.
The next example illustrates how the spectral densities are modified.
Example 5.12 Spectral density interpretations
Consider first the covariance function (5 .16b). Since it can be associated with the filter
(5.16c), the corresponding spectral density is readily found to be
1 11 1 - e- iroN l2
1 1 2 - 2 cos Nw
1 1 1 - cos Nw
<l>l(m) = 2n YN 1 - e iro
= 2n N 2 - 2 cos w = 2n N 1 - cos m
Next consider the covariance function (5.17c) with 1..2 = 1. From (5.17d) the spectral
density is found to be
1
1 - a2
1
1 - a2
2
<l>2( m) = 2n 11 - ~eirol2 = 2n 1 + a - 2a cos m
These spectral densities are illustrated in Figure 5.9, where it is clearly seen how the
frequency content is concentrated at low frequencies. It can also be seen how the lowfrequency character is emphasized when N is increased or a approaches 1.
IIilI
o+-________________
~~
______
~=_
__
o
(a)
o+-______ __
~
~====~~
__
Ul
o
(b)
FIGURE 5.9 (a) Illustration of <p,(w), N = 3, Example 5.12. (b) Similarly for N = 5.
(c) Similarly for N = 10. (d) Illustration of <P2(W), for ,,2 = 1, a = 0.3,0.7
and 0.9, Example 5.12.
(c)
3
<j>z(w)
a'" 0.9
(d)
Chapter 5
(5.18)
For a unique solution to exist the matrix appearing in (5.18) must be nonsingular. This
leads to the concept of persistent excitation.
Definition 5.1
A signal u(t) is said to be persistently exciting (pe) of order n if:
(i) the following limit exists:
1 N
ruCl:) = N~oo
lim N"'" u(t + 't)uT(t);
.LJ
(5.19)
t=l
and
(ii) the matrix
ru(O)
ru(-l)
ru(l)
ru(O)
(5.20)
is positive definite.
Remark 1. As noted after (5.6), many stationary stochastic processes are ergodic. In
such cases one can substitute
1
lim N L.t
"'"
N~oo
t=l
in (5.19) by the expectation operator E. Then the matrix Ru(n) is the usual covariance
matrix (supposing for simplicity that u(t) has zero mean).
II1II
(JII>
(5.21)
k=t
To see the dose relation between Definition 5.1 and (5.21), note that the matrix (5.20)
can be written as
Persistent excitation
Section 5.4
121
(5.22)
II
As illustration of the concept of persistent excitation, the following example considers
some simple input signals.
Example 5.13 Order of persistent excitation
Let u(t) be white noise, say of zero mean and variance a2. Then rU(L) = a26o,~ and
Ru(n) = a 2 I,,, which always is positive definite. Thus white noise is persistently exciting
of all orders.
Next, let u(t) be a step function of magnitude a. Then ru(L) = a2 for all"t. Hence Ru(n)
is nonsingular if and only if n = 1. Thus a step function is pe of order 1 only.
Finally, let u(t) be an impulse: u(t) = 1 for t = 0, and 0 otherwise. This gives ru(L) = 0
for all Land RuCn) = O. This signal is not pe of any order.
II
mu
J~Too N
2: u(t)
1=1
rueL)
= "~i-Too
1
N
(5.23)
2: [u(t + L) -
mu][uT(t) - m~]
1=1
In this way the order of persistent excitation can be decreased by at most one. This can
be seen to be the case since
RuCn)new
mT
= Ru(n)old - mm T
= (m~
m~)
and therefore rank Ru(n)new ?-: rank Ru(n)old - 1. With this convention a white noise is
still pe of any order. However, for a step function we now get ru("l:) = 0 all L. Then Ru(n)
becomes singular for aU n. Thus a step function is not pe of any order with this alternative
II
definition.
Remark 2. The concept of persistent excitation was introduced here using the
truncated weighting function model. However, the use of this concept is not limited to
the weighting function estimation problem. As will be seen in subsequent chapters, a
necessary condition for consistent estimation of an nth-order linear system is that the
input signal is persistently exciting of order 2n. Some detailed calculations can be found
in Example 11.6. In some cases, notably when the least squares method is applied, it is
sufficient to use an input that is pe of order n. This result explains why consistent
parameter estimates were not obtained in Example 2.6 when the input was an impulse.
II
(See also Complement C5.L)
Chapter 5
II
In what follows some properties of persistently exciting signals are presented. Such an
analysis was originally undertaken by Ljung (1971), while an extension to multivariable
signals was made by Stoica (1981a). To simplify the proofs we restrict them to stationary
ergodic, stochastic processes, but similar results hold for deterministic signals; see Ljung
(1971) for more details.
Property 1
Let u(t) be a multivariable ergodic process of dimension nu. Assume that its spectral
density matrix is positive definite in at least n distinct frequencies (within the interval
(-rt, rt. Then u(t) is persistently exciting of order n.
Proof. Let g = (gT
gDT be an arbitrary n x nu-vector and set G(q-l)
E7=lgiq-i. Consider the equation
0= gTRu(n)g = E[GT(q-l)U(t)][GT(q-l)U(t)y
=
f~J[ GT(e-iW)<l>u(w)G(eiW)dw
where <l>u(w) is the spectral density matrix of u(t). Since <l>u(w) is nonnegative definite,
GT(e-iw)<l>u(w)G(eiW) == 0
Thus G(e- iw ) is equal to zero in n distinct frequencies. However, since G(z) is a (vector)
polynomial of degree n - 1 only, this implies g = O. Thus the matrix Ru(n) is positive
III
definite and u(t) is persistently exciting of order n.
Property 2
An ARMA process is persistently exciting of any finite order.
Proof. The assertion follows immediately from Property 1 since the spectral density
matrix of an ARMA process is positive definite for almost all frequencies in (-rt, rt). III
Persistent excitation
Section 5.4
123
For scalar processes the condition of Property 1 is also necessary for u(t) to be pe of order
n (see Property 3 below). This is not true in the multivariable case, as shown by Stoica
(1981a).
Property 3
Let u(t) be a scalar signal that is persistently exciting of order n. Then its spectral density
is nonzero in at least n frequencies.
Proof. Since
II
Proof. We have
<Piw)
Now <Pu(w) is nonzero in n points; IH(e im )i2 is zero in precisely k points; hence <j>y(m)
is nonzero in m points, where n - k ~ m ~ n. The result now follows from
Property 1.
II
Note that the exact value of m depends on the location of the zeros of IH(e im )12 and
124
Chapter 5
Input signals
whether <PuC (j) is nonzero for these frequencies. If in particular H (q -1) has no zeros
on the unit circle then u(t) and H(q-l)U(t) are pe of the same order.
Property 6
Let u(t) be a stationary process and let
n
"L
z(t) =
Hiu(t - i)
i=1
Then EZ(t)ZT(t)
of order n.
implies Hi
Proof. Set
H
Then z(t)
(HI
=
...
Hnf
<pet)
(uT(t - 1)
uT(t - nT
HT<p(t) and
0= EZ(t)ZT(t)
EHT<p(t)<pT(t)H
HT[E<p(t)<pT(t)]H
= HTRu(n)H
Thus Ez(t)z T(t) = 0 implies H = 0 if and only if Ru(n) is positive definite. However,
this condition is the same as u(t) being pe of order n.
III
The following examples analyze the input signals introduced in Section 5.1 with respect
to their persistent excitation properties.
Example 5.14 A step function
As already discussed in Example 5.13, a step function is pe of order 1, but of no greater
order.
II
Example 5.15 A PRES
We consider a PRBS of period M. Let h be an arbitrary non-zero n-vector, where n:S; M.
Set
e = (1
(nil)
hTRu(n)h = h T
hT [
a2
-a 2/M
-a2/M
-a 2/M
a2
(a + ~)I - ~eeTJh
2
=1=
a2( 1 +
~)hTh - ~(hTe)2
a2hTh
[1 + ~ - ~J > 0
Summary
125
a2
-a2/M
-a2/M
a2
RuCM + 1) =
-a 2/M
a2
-a2/M
Since the first and the last row are equal, the matrix Ru(M + 1) is singular. Hence a
PRBS with period M is pe of order equal to but not greater than M.
III
u(t)
2: aj sin(wjt + CPj)
j=l
o :::s WI
<
W2 ...
<
Wm
:::s n
which was introduced in Example 5.4. The spectral density was given in (5.12f) as
<Pu(w)
C.
j=I
2m
n
= { 2m -
WI, Wm
WI
Wj
<
Jt
or Wm = Jt
and Wm = Jt
(5.24)
It then follows from Properties 1 and 3 that u(t) is pe of order n, as given by (5.24),
III
but not of any greater order.
Summary
Section 5.1 introduced some typical input signals that often are used in identification
experiments. These included PRBS and ARMA processes. In Section 5.2 they were
characterized in terms of the covariance function and spectral density.
Chapter 5
Section 5.3 described several ways of implementing lowpass filtering. This is of interest
for shaping low-frequency inputs. Such inputs are useful when the low frequencies in a
model to be estimated are of particular interest.
Section 5.4 introduced the concept of persistent excitation, which is fundamental when
analyzing parameter identifiability. A signal is persistently exciting (pe) of order n if its
covariance matrix of order n is positjve definite. In the frequency domain this condition is
equivalent to requiring that the spectral density of the signal is nonzero in at least n
points. It was shown that an ARMA process is pe of infinite order, while a sum of
sinusoids is pe only of a finite order (in most cases equal to twice the number of
sinusoids). A PRBS with period M is pe of order M, while a step function is pe of order
one only.
Problems
Problem 5.1 Nonnegative definiteness of the sample covariance matrix
The following are two commonly used estimates of the covariance function of a
stationary process:
1
Rk = N
N-k
2:
y(t)yT(t + k)
1=1
and
1
Rk = Iv _ k
N-k
2:
y(t)yT(t + k)
k~O
1=1
a sine <ilt
+ <fJ)
1,2, ...
+ u(t -- 2) = 0
(i)
Show this in two ways: (a) using simple trigonometric equalities; and (b) using the
spectral properties of sinusoidal signals and the formula for the transfer of spectral
densities through linear systems. Use the property above to conceive a computationally
efficient method for generating sinusoidal signals on a computer. Implement this method
and study empirically its computational efficiency and the propagation of numerical
errors compared to standard procedures.
Problems
127
a2
sin
W2t
where Wj = 2nk/ M (j = 1,2) and kl' k2 are integers in the interval [1, M - 1]. Derive the
spectral density of u(t) using (5.12d), (A5.l.8), (A5.l.15). Compare with (5.12f).
Problem 5.4 Admissible domain for !:h and Q2 ora stationary process
Let {rd denote the covariance at lag k of a stationary process and let Qk = rk/rO be the
kth correlation coefficient. Derive the admissible domain for Q1, Q2'
Hint. The correlation matrix [Qi-j] must be nonnegative definite.
Problem 5.5 Admissible domain of Q1 and Q2 for MA(2) and AR(2) processes
Which is the domain spanned by the first two correlations Q1, Q2 of a MA(2) process?
What is the corresponding domain for an AR(2) process? Which one of these two
domains is the largest?
Problem 5.6 Spectral properties of a random wave
Consider a random wave u(t) generated as follows:
u(t) =
u(t)
= { u(t -
1)
-u(t - 1)
with probability 1 - a
with probability a
where 0 < a < 1. The probability of change at time t is independent of what happened at
previous time instants.
(a) Derive the covariance function.
Hint. Use the ideas of Example S.ll.
(b) Determine the spectral density. Also show that the signal has low-frequency
character (cr(w) decreases with increasing w) if and only if a !S; 0.5.
Problem 5.7 Spectral properties of a square wave
Consider a square wave of period 2n, defined by
u(t) = 1
u(t + n)
t = 0, ... , n - 1
= -u(t)
all t
cruCw)
2: n22
j=l
1 - cos n - n
n-l.
jzl-l
Hint. ~
~
(Jt
d n-l
dz ~
~
.
Zl =
2j _ lOW - -;;(2j .- 1)
zn-1
-n
(1 - z)
1 -
zn
(1 -. zf
Chapter 5
2:
k=l
2: hi ~ 2 ro
(necessary condition)
k=l
[M-l
~ ryu(i) + (N - M + 2)ryu(k)
h(k) = (N + l)(N _ M + 1)
N=k
If N is much larger than M show that this can be simplified to h(k) ~ ryu(k). Next observe
that for N = M, the formula above reduces to h(k) = ryu(k) + E~Ol ryu(i). This might
appear to be a contradiction of the fact that for large N the covariance matrix of a
PRBS with unit amplitude is approximately equal to the identity matrix. Resolve this
contradiction.
e(t)
Ee(t)e(s)
= Of,S
Let
R = E(
y(t :
1)
. ..
(x(t - 1)
yet - n)
-an
-an-l
A=
1
-bm -
-bm
0
B=
0
1
x(t - m))
(cross-covariance matrix)
Appendix AS.1
129
Show that
R - ARBT
!In!l~
O)T.
Bibliographical notes
Brillinger (1981), Hannan (1970), Anderson (1971), Oppenheim and Willsky (1983), and
Aoki (1987) are good general sources on time series analysis.
PRBS and other pseudorandom signals are described and analyzed in detail by Davies
(1970). See also Verbruggen (197S).
The concept of persistent excitation originates from Astrom and Bohlin (196S). A
further analysis of this concept was carried out by Ljung (1971) and Stoica (1981a).
Digitallowpass filtering is treated, for example, by Oppenheim and Schafer (197S),
and Rabiner and Gold (197S).
Appendix AS.1
Spectral properties of periodic signals
Let u(t) be a deterministic signal that is periodic with period M, i.e.
u(t) = u(t - M)
(AS.I.1)
all t
The mean value m and the covariance function r(L) are then defined as
m ~ I~oo
N 2:
u(t)
t=1
2:
u(t)
1=1
1 N
r(L) ~ N~oo
lim N'"
fu(t + L) - m][u(t) - m]T
,L.,,;
(A5.I.2)
1=1
1 M
M
[u(t
2:
+ T) -
mJ[u(t) - mY
1=1
The expressions for the limits in the definitions of m and reT) can be readily established.
For general signals (deterministic or stochastic) it holds that
(AS.I.3)
In addition, for periodic signals we evidently have
reM + L) = reT)
(AS.1.4)
and hence
(AS. 1.5)
Chapter 5
The general relations (A3.L2), (A3.L3) between a covariance function and the
associated spectral density apply also in this case. However, when r(.) is a periodic
function, the spectral density will no longer be a smooth function of w. Instead it will
consist of a number of weighted Dirac pulses. This means that only certain frequencies
are present in the signal.
An alternative to the definition of spectral density for periodic signals in (A3.L2)
= ~ ~ r(.)e-j~w
<p(w)
2n
L.J
T=-OO
M-l
r(.)e-i2JtwIM
n = 0, ... , M - 1
(A5.L6)
-.:=0
r(.)
= J:Jt
ehw<p(w)dw
we have
(A5.L7)
We will, however, keep the original definition of a spectral density. For a treatment of
the OFT, see Oppenheim and Schafer (1975), and Cizek (1986). The relation between
<p(w) and {~n} will become clear later (see (A5.1.8), (A5.Ll5) and the subsequent
discussion) .
For periodic signals it will be convenient to define the spectral density over the interval
[0, 2n] rather than [-n, n]. This does not introduce any restriction since <p( w) is by
definition a periodic (even) function with period 2n.
For periodic signals it is not very convenient to evaluate the spectral density from
(A3.L2). An alternative approach is to try the structure
L-_<P_(_W_)_=_;S_~_:_C_kO_(_w_-
5.~
_ _2_n_!_)________________________(_A__
where Ck are matrix coefficients to be determined. It will be shown that with certain
values of {Cd the equation (A3.L3) is satisfied by (A5.L8).
First note that substitution of (A5.L8) in the relation (A3.L3) gives
r(.)
(
~o Cko
J2Jt M-l
0
W -
k)
2n M e i1;"'dw
(A5.1.9)
Appendix A5.1
131
where
(A5.1.lO)
Note that
=1=
1,
aM =
(A5.1.ll)
It follows directly from (A5.1.ll) that the right-hand side of (A5.1.9) is periodic with
period M. It is therefore sufficient to verify (A5.1.9) for the values 't = 0, ... , M - 1. For
this purpose introduce the (MIM) matrix
U=_l_
YM
1
1
a2
a4
a M- 1
a 2 (M-l)
a M '- l
(A5.1.12)
a(M-l)2
Then note that (A5.1.9) for't = 0, ... , M - 1 can be written in matrix form as
reO)
r(l)
(A5.1.13)
reM - 1)
where @ denotes Kronecker product (see Section A.7 of Appendix A). The matrix U
can be viewed as a vanderMonde matrix and it is therefore easy to prove that it is
nonsingular. This means that {Cdf':Ol can be derived uniquely from {r(t)}~ol. In fact
V turns out to be unitary, i.e.
V-I = U*
(A5.1.14)
1
= M
p=o
M-I
2:
aP{j-k)
p=o
Hence
1
(VV*)jj = M
M-J
2:
P=()
and for j
=1=
(UU*)jk
k
=
1 1 -- aM(j-k)
M 1 _ aj k
132
Input signals
Chapter 5
(U 1)--1 = U* I
Thus
Ck = _1-[U%or(O)
YM
or
(A5.1.15)
Hence the weights of the Dirac pulses in the spectral density are equal to the normalized
DFT of the covariance function.
In the remaining part of this appendix it is shown that the matrix Ck is Hermitian and
nonnegative definite. Introduce the matrix
reO)
r( -1)
R=
r(1)
reM - 1)
reO)
r(1 - M)
=
lim
N-->oo
lN
N {(
.L
t=1
reO)
u(t - 1) - m)
(u T (t -- 1) - mT
"ret - M) -
m T) }
u(t - M) - m
The quadratic form akRaZ is a Hermitian and nonnegative definite matrix by construction. Then (A5.l.5), (A5.1.1O), (A5.1.11) give
Complement C5.1
akRa%
= r(O)M +
M-l
2:
M-l
(M - 1:)r(t)a- h +
-.:=1
r(O)M +
2:
1:=1
M-l
2:
133
M-l
(M - 1:)r(t)a- h +
2:
lr(l)ak(M-l)
This calculation shows that the Cb for k = 0, ... , M - 1, are Hermitian and nonnegative definite matrices. Finally observe that if u(t) is a persistently exciting signal of
order M, then R is positive definite. Since the matrices ak have full row rank, it follows
that in such a case {Ck } are positive definite matrices.
Complement CS.l
Difference equation models with persistently exciting inputs
The persistent excitation property has been introduced in relation to the LS estimation of
weighting function (or all-zeros) models. It is shown in this complement that the pe
property is relevant to difference equation (or poles-zeros) models as well. The output
y(t) of a linear, asymptotically stable, rational filter B(q-l)IA(q-l) with input u(t) can be
written as
(CS.I.1)
where
A(q--l) = 1
cp(t)
(-y(t - 1)
-yet - na)
u(t - 1)
u(t - nb)f
As shown below, the condition R > 0 is intimately related to the persistent excitation
property of u(t). The cases E(t) == 0 and E(t)
0 will be considered separately.
Chapter 5
Noise-free output
na{
cp(t) =
1
A(q 1) U(t - 1)
-b] .. , -b nb
1al
nb{
~
-b nb
...
ana
1
A(q-l) u(t - na - nb)
J(-B, A)cp(t)
Thus
R>
O}
Now J( - B, A) is the Sylvester matrix associated with the polynomials - B(z) and A(z)
(see Definition A.8). It is nonsingular if and only if A(z) and B(z) are coprime (see
Lemma A.30).
Regarding the matrix R, it is positive definite if and only if u(t) is pe of order
(na + nb); see Property 5 in Section 5.4. Thus
L __
nb),
(cs.u)]
Noisy output
Let
x(t)
B( -1)
= A(q-1)
q
u(t)
x(t - 1)
R=E
x(t - na)
u(t - 1)
u(t - nb)
(x(t - 1)
...
x(t - na)
u(t - 1)
...
u(t - nb))
Complement C5.2
135
E(t - 1)
E(t - na)
0
E(l - na)
(E(t - 1)
(C5.1.4)
0)
~ (:T ~) + (; ~)
Clearly the condition C> is necessary for R > O. Under the assumption that
it is also sufficient. Indeed, it follows from Lemma A.3 that if C > 0 then
A - BC-lBT
;?;
rank (A
A> 0,
and
rank R
nb
+A -
BC-lBT )
A > 0,
R = na + nb
Thus, assuming
rank
The conclusion is that under weak conditions on the noise E(t) (more precisely A> 0),
the following equivalence holds:
{R > O}
(C5.1.5ij
Complement C5.2
Condition number o/the covariance matrix o/filtered white noise
As shown in Properties 2 and 5 of Section 5.4, filtered white noise is a persistently
exciting signal of any order. Thus, the matrix
Rm
= E(U U
~ l)(U(t _ 1)
u(t - m)
u(t)
H(q-l)e(t)
(C5.2.1)
u(t - m)
where Ee(t)e(s) = Of,S' and H(q-l) is an asymptotically stable filter, is positive definite
for any m. From a practical standpoint it would be more useful to have information on
the condition number of Rm (cond (Rm rather than to just know that Rm > O. In the
following a simple upper bound on
cond(Rm) ~ AmaxCRm)/Amin(Rm)
is derived (see Stoica et al., 1985c; Grenander and Szego, 1958).
First recall that for a symmetric matrix R the following relationships hold:
(C5.2.2)
_.
I'.min -
In
Chapter 5
f xTRx
-T-
Amax = sup
x
X X
xTRx
-T-
X X
it follows that
Am in(Rm+ 1) ~ Amin(Rm)
Amax(Rm+l) ?: AmaxCRm)
Thus
(CS.2.3)
(CS.2.4)
(CS.2.S)
Qf.
(CS.2.6)
If Q satisfies
IH(eiwW - Q ?:
for
W E
(-Jt, Jt)
(CS.2.7)
then it follows from (CS.2.6) that Rm - Of is the covariance matrix of a process with the
spectral density function equal to the left-hand side of (CS.2.7). Thus
for all m
(CS.2.8)
Gmin
is uniquely defined by
for all m
and
some
Gmin
Complement C5.3
cond(Rm) ::::; sup IH(e iw )12/infIH(e iw )12
OJ
137
(CS.2.9)
with equality holding in the limit as m - ? 00. Thus, if H(q-l) has zeros on or close to the
unit circle, then the matrix Rm is expected to be ill-conditioned for large m.
Complement C5.3
Pseudorandom binary sequences of maximum length
Pseudorandom binary sequences (PRBS) are two-state signals which may be generated,
for example, by using a shift register of order n as depicted in Figure C5. 3.1.
The register state variables are fed with 1 or O. Every initial state vector is allowed
except the all-zero state. When the clock pulse is applied, the value of the kth state is
transferred to the (k + l)th state and a new value is introduced into the first state through
the feedback path.
The feedback coefficients, ah ... , a'l> are either 0 or 1. The modulo-two addition,
denoted by ED in the figure, is defined in Table CS.3.1. The system operates in discrete
time. The clock period is equal to the sampling time.
u(t)
a"
U2
0
1
0
1
0
0
1
1
Ul
EB Uz
0
1
1
0
Chapter 5
The system shown in Figure CS.3.1 can be represented in state space form as
x(t + 1) =
al
1
an
(CS.3.1)
0
u(t) = (0
x(t)
1
0
l)x(t)
where all additions must be carried out modulo-two. Note the similarity with 'standard'
state space models. In fact, with appropriate interpretations, it is possible to use several
'standard' results for state space models for the model (CS.3.1). Models of the above
form where the state vector can take only a finite number of values are often called
finite state systems. For a general treatment of such systems, refer to Zadeh and Polak
(1969), Golomb (1967), Peterson and Weldon (1972).
It follows from the foregoing discussion that the shift register of Figure CS. 3.1 will
generate a sequence of ones and zeros. This is called a pseudorandom binary sequence.
The name may be explained as follows. A PRBS is a purely deterministic signal: given
the initial state of the register, its future states can be computed exactly. However, its
correlation function resembles the correlation function of white random noise. This is
true at least for certain types of PRBS called maximum length PRBS (ml PRBS). For
this reason, this type of sequence is called 'pseudorandom'. Of course, the sequence is
called 'binary' since it contains two states only.
The PRBS introduced above takes the values 0 and 1. To generate a signal that shifts
between the values a and b, simply take
yet) = a + (b - a)u(t)
(CS.3.2)
yet) = -1 + 2u(t)
(CS.3.3)
This complement shows how to generate an m.l. PRBS by using the shift register of
Figure CS.3.1. We also establish some important properties of ml PRBS and in
particular justify the claim introduced above that the correlation properties of ml PRBS
resemble those of white noise.
Maximum length PRBS
For a shift register with n states there is a possible total of 2n different state vectors
composed of ones and zeros. Thus the maximum period of a sequence generated using
such a system is 2n. In fact, 2n is an upper bound which cannot be attained. The reason is
that occurrence of the all-zero state must be prevented. If such a state occurred then the
state vector would remain zero for all future times. It follows that the maximum possible
period is
(CS.3.4)
Complement C5.3
139
where
n
A PRBS of period equal to M is called a maximum length PRBS. Whether or not an nthorder system will generate an m.l. PRBS depends on its feedback path. This is illustrated
in the following example.
Example CS.3.1 Influence of the feedback path on the period of a PRBS
Let the initial state of a three-stage (n = 3) shift register be (1 0 O)T. For the following
feedback paths the corresponding sequences of generated states will be determined.
(a) Feedback from states 1 and 2, i.e. al = 1, a2
sequence of state vectors
al =
1,
a2 =
1, a3 = 0. The corresponding
0,
a3 =
1. The corresponding
III
There are at least two reasons for concentrating on maximum length (ml) PRBS:
II
II
The correlation function of an ml PRBS resembles that of a white noise (see below).
This property is not guaranteed for nonmaximum length PRBSs.
As shown in Section 5.4, a periodic signal is persistently exciting of an order which
cannot exceed the value of its period. Since persistent excitation is a vital condition for
identifiability, a long period will give more flexibility in this respect.
Chapter 5
The problem of choosing the feedback path of a shift register to give an ml PRBS is
discussed in the following.
Let
(C5.3.5)
where q-l is the unit delay operator. The PRBS u(t) generated using (C5.3.1) obeys the
following homogeneous equation:
A(q--l)U(t)
(C5.3.6)
= 1, ... ,
(n - 1)
and
A(q-l)U(t) = xAt) EB alx,,(t - 1) EB ... EB anxn(t - n)
=
The problem to study is the choice of the feedback coefficients {ai} such that the
equation (C5.3.6) has no solution u(t) with period smaller than 2" - 1. A necessary and
sufficient condition on {ai} for this property to hold is provided by the following lemma.
Lemma CS.3.1
The homogeneous recursive equation (C5.3.6) has only solutions of period 2n
only if the following two conditions are satisfied:
1, if and
(i) The binary polynomial A(q-I) is irreducible [i.e. there do not exist any two
polynomials Aj(q-l) and A 2(q-l) with binary coefficients such that A(q-I) =
A l(q-I)A 2(q-l)].
(ii) A(q-I) is a factor of 1 EB q-M but is not a factor of 1 EB q-P for any p < M = 2n - 1.
Proof. We first show that condition (i) is necessary. For this purpose assume that
A(q--l) is not irreducible. Then there are binary polynomials Aj(q-l) and A 2(q-l) with
deg A2 = n2
> 0
and nl
n2 = n
Let
UI
= A 1(q-l)A 2(q-l)
Aj(q-I)UI(t)
A 2(q-l)U2(t)
=
=
(C5.3.7)
0
Then
U(t) =
Uj (t)
EB U2(t)
(C5.3.8)
is a solution to (C5.3.6). Now, according to the above discussion, the maximum period of
uJt) is 2ni - 1 (i = 1,2). Thus, the maximum possible period of u(t), (C5.3.8), is at most
Complement C5.3
141
equal to (2n J - 1)(2"2 - 1), which is strictly less than 2n - 1. This establishes the
necessity of condition (i).
We next show that if A(q--l) is irreducible then the equation (C5.3.6) has solutions of
period p if and only if A(q-l) is a factor of 1 E8 q--p.
The 'if' part is immediate. If there exists a binary polynomial p(q-l) such that
A(q-l)p(q-l) = 1 E8 q-P
then it follows that
(1 E8 q-P)u(t)
= P(q-I)[A(q-l)U(t)]
= ]i
U(z)
(C5.3.9)
u(t)zt
t=O
where
(C5.3.1O)
U(z) =
! [~
aiu(t - i) Jzt =
~ aizi
]i
u(t - i)zt-i
Q(z) E8
[~ aizi] U(z)
(C5.3.11)
where
n
Q(z)
i=1
-1
ai
]i
u(t)zt+i
(C5.3.12)
t=-i
is a binary polynomial of degree (n - 1), the coefficients of which depend on the initial
values u( -1), ... , u( -n) and the feedback coefficients of the shift register. It follows
that
and hence
142
Input signals
Chapter 5
Q(z) = A(z)U(z)
(CS.3.13)
(CS.3.14)
3
4
S
6
7
8
9
10
1 EEl z EEl Z3
1 EEl z EEl Z4
1 EEl Z2EEl ZS
1 EEl z EEl Z6
1 EEl z EEl Z7
1 EEl z EEl Z2 EEl Z7 EEl Z8
1 EEl Z4 EEl Z9
1 EEl Z3 EEl ZIO
1 EElz2EEl Z3
1 E9 Z3 EEl Z4
1 EEl Z3 EEl Z5
1 EEl Z5 EEl Z6
1 EEl Z3 EEl Z7
1 EEl Z EEl Z6 EEl z'7 EEl Z8
1 EEl Z5 EEl Z9
1 EEl Z7 EEl ZlO
Remark. In the light of Lemma CS.3.1, the period lengths encountered in Example
CS.3.1 can be understood more clearly. Case (a) corresponds to the feedback polynomial
Complement C5.3
143
= 1 EB
Property PI
If u(t) is an ml PRBS of period M = 2n - 1, then within one period it contains
(M + 1)/2 = 2n - 1 ones and (M - 1)/2 = 2,,-1 - 1 zeros.
III
Property PI is fairly obvious. During one period the state vector, x(t), will take all
possible values, except the zero vector. Out of the 2n possible state vectors, half of them,
i.e. 2"--\ will contain a one in the last position (giving u(t) = 1).
Property P2
Let u(t) be an ml PRBS of period M. Then for k = 1, ... , M - 1,
u(t) EB u(t - k)
= u(t - I)
(CS.3.1S)
III
Hence u(t) EB u(t - k) is an ml PRBS, corresponding to some initial values. Since all
possible state vectors appear during one period of u(t), the relation (CS.3.1S) follows.
One further simple property is required, which is valid for binary variables.
Property P3
If x and yare binary variables, then
1
xy = 2[x + y - (x EB y)]
(CS.3.16)
III
Property P3 is easy to verify by direct computation of all possible cases; see Table CS.3.3.
Chapter 5
TABLE CS.3.3 Verification of property P3
x
xEF>y
o
o
1
1
1
1
2: [x + y -
(x EF> y) 1
xy
o
o
o
o
o
o
The mean and the covariance function of an ml PRBS can now be evaluated. Let the
period be M = 2n - 1. According to Appendix AS.l,
1
m = M
2: u(t)
1=1
1
ret) = M
(CS.3.17)
2:
[u(l
+ T) -
m][u(t) - m]
t=1
~(M;
1)
~ + 2~
(CS.3.18)
=M
2:
2:
~1
~1
2: u(t) -
(CS.3.19)
m 2 = m - m 2 = m(l - m)
t=l
+ 1M
- 1
=~~=
M2 - 1
4M2
The variance is thus slightly less than 0.2S. Next note that for
properties P1-P3 give
1 M
r('t) = M
[u(t + 't) - mJ[u(t) - m]
't =
1, ... , M - 1,
2:
(=1
u(t
+ 't)u(t) - m 2
t=1
= 2M
2:
+ 't) +
[u(t
u(t) - {u(t
+ 't)
t=l
=m
- -
"" u(t +
2M L..J
't -
I) - m 2
t=1
M+1
=: -
EF> u(t)}] - m 2
(CS.3.20)
Complement CS.3
Considering the signal yet)
obtained:
m y = -1
ry(O)
+ 2m
= 4ru(O)
= -M =
= 1 -
-1
145
(CS.3.21a)
1
M2 = 1
(CS.3.21b)
1
1
ry(t) = 4ri,;) = - M - M2 =
1
M
(CS.3.21c)
The approximations in (CS.3.21) are obtained by neglecting the slight difference of the
mean from zero.
It can be seen that for M sufficiently large the covariance function of yet) resembles
that of white noise of unit variance. The analogy between ml PRBS and white noise is
further discussed in Section S.2.
Due to their easy generation and their convenient properties the ml PRBSs have
been used widely in system identification. The PRBS resembles white noise as far as
the spectral properties are concerned. Furthermore, by various forms of linear filtering
(increasing the clock period to several sampling intervals being one way of lowpass
filtering; see Section S.3), it is possible to generate a signal with prescribed spectral
properties. These facts have made the PRBS a convenient probing signal for many
identification methods.
The PRBS has also some specific properties that are advantageous for nonparametric
methods. The covariance matrix [rei - j)] corresponding to an ml PRBS can be inverted
analytically (see Problem S.9). Also, calculation of the cross-correlation of a PRBS with
another signal needs only addition operations (multiplications are not needed). These
facts make the ml PRBS very convenient for nonparametric identification of weighting
function models (see, for example, Section 3.4 and Problem S.9).
Chapter 6
MODEL
PARAMETRIZATIONS
6.1 Model classifications
This chapter examines the role played by the model structure vi{ in system identification, and presents both a general description of linear models and a number of examples.
In this section, some general comments are made on classification of models. A distinction can be made between:
Analog models, which are based on analogies between processes in different areas. For
example, a mechanical and an electrical oscillator can be described by the same
second-order linear differential equation, but the coefficients will have different
physical interpretations. Analog computers are based on such principles: differential
equations constituting a model of some system are solved by using an 'analog' or
'equivalent' electrical network. The voltages at various points in this network are
recorded as functions of time and give the solution to the differential equations.
Physical models, which are mostly laboratory-scale units that have the same essential
characteristics as the (full-scale) processes they model.
In system science mathematical models are useful because they can provide a description of a physical phenomenon or a process, and can be used as a tool for the design
of a regulator or a filter.
Mathematical models can be derived in two ways:
Modeling, which refers to derivation of models from basic laws in physics, economics,
etc. One often uses fundamental balance equations, for example net accumulation =
input flow - output flow, which can be applied to a range of variables, such as
energy, mass, money, etc.
Identification, which refers to the determination of dynamic models from experimental
data. It includes the set-up of the identification experiment, i.e. data acquisition, the
determination of a suitable form of the model as well as of its parameters, and a
validation of the model.
These methods have already been discussed briefly in Chapter 1.
146
Section 6.1
Model classifications
147
Classification
Mathematical models of dynamic systems can be classified in various ways. Such models
describe how the effect of an input signal will influence the system behavior at subsequent times. In contrast, for static models, which were examplified in Chapter 4, there
is no 'memory'. Hence the effect of an 'input variable' is only instantaneous. The ways of
classifying dynamic models include the following:
Single input, single output (SISO) models-multivariable models. SISO models refer to
processes where a description is given of the influence of one input on one output.
When more variables are involved a multivariable model results. Most of the theory in
this book will hold for multivariable models, although mostly SISO models will be
used for illustration. It should be noted that multi-input, single output (MISO) models
are in most cases as easy to derive as SISO models, whereas multi-input, multi-output
(MIMO) models are more difficult to determine.
Linear models-nonlinear models. A model is linear if the output depends linearly on
the input and possible disturbances; otherwise it is nonlinear. With a few exceptions,
only linear models will be discussed here.
Parametric models-nonparametric models. A parametric model is described by a set of
parameters. Some simple parametric models were introduced in Chapter 2, and we
will concentrate on such models in the following. Chapter 3 provided some examples
of nonparametric models, which can consist of a function or a graph.
Time invariant models-time varying models. Time invariant models are certainly the
more common. For time varying models special identification methods are needed. In
such cases where a model has parameters that change with time, one often speaks
about tracking or real-time identification when estimating the parameters.
Time domain models-frequency domain models. Typical examples of time domain
models are differential and difference equations, while a spectral density and a Bode
plot are examples of frequency domain models. The major part of this book deals with
time domain models.
Discrete time models-continuous time models. A discrete time model describes the
relation between inputs and outputs at discrete time points. It will be assumed that
these points are equidistant and the time between two points will be used as time unit.
Therefore the time t will take values 1, 2, 3 ... for discrete time models, which is the
dominating class discussed here. Note that a continuous time model, such as a
differential equation, can very well be fitted to discrete time data!
Lumped models-distributed parameter models. Lumped models are described by or
based on afinite number of ordinary differential or difference equations. If the number
of equations is infinite or the model is based on partial differential equations, then it is
caned a distributed parameter model. The treatment here is confined to lumped
models .
.. Deterministic models-stochastic models. For a deterministic model the output can be
exactly calculated as soon as the input signal is known. In contrast, a stochastic model
contains random terms that make such an exact calculation impossible. The random
terms can be seen as a description of disturbances. This book concentrates mainly on
stochastic models.
Chapter 6
Note that the term 'linear models' above refers to the way in which yet) depends on
past data. Another concept concerns models that are linear in the parameters 8 to be
estimated (sometimes abbreviated to LIP). Then y(t) depends linearly on 8. The
identification methods to be discussed in Chapter 8 are restricted to such models. The
linear regression model (4.1),
yet)
q:?(t)8
is both linear (since yet) depends linearly on cp( t and linear in the parameters (since yet)
depends linearly on 8). Note however that we can allow cp(t) to depend on the measured
data in a nonlinear fashion. Example 6.6 will illustrate such a case.
The choice of which type of model to use is highly problem-dependent. Chapter 11
discusses some means of choosing an appropriate model within a given type.
(6.1)
= A(8)Ot,s
In (6.1), y(t) is the ny-dimensional output at time t and u(t) the nu-dimensional input.
Further, e(t) is a sequence of independent and identically distributed (iid) random
variables with zero mean. Such a sequence is referred to as white noise. The reason for
this name is that the corresponding spectral density is constant over the whole frequency
range (see Chapter 5). The analogy can be made with white light, which contains 'all'
frequencies. In (6.1) G(q-l; 8) is an (ny Inu)-dimensional filter and H(q-l; 8) an (nylny)dimensional filter. The argument q-l denotes the backward shift operator, so q-1u(t) =
u(t - 1), etc. In most cases the filters G(q-l; 8) and H(q-\ 8) will be of finite order.
Then they are rational functions of q-l. The model (6.1) is depicted in Figure 6.1.
The filters G(q-l; 8) and H(q-l; 8) as well as the noise covariance matrix A(8) are
functions of the parameter vector 8. Often 8 (which we assume to be n8-dimensional) is
restricted to lie in a subset of [lIl n8. This set is given by
u(t)
y(t)
Section 6.2
qz;
149
(6.2)
The reasons for these restrictions in the definition of qz; will become clear in the next
chapter, where it will be shown that when 8 belongs to qz; there is a simple form for the
optimal prediction of yet) given y(t - 1), u(t - 1), yet - 2), u(t - 2), ...
Note that for the moment H(q-l; 8) is not restricted to be asymptotically stable. In
later chapters it will be necessary to do so occasionally in order to impose stationarity of
the data. Models with unstable H(q-l; 8) can be useful for describing drift in the data as
will be illustrated in Example 12.2. Since the disturbance term in such cases is not
stationary, it is assumed to start at t = O. Stationary disturbances, which correspond to an
asymptotically stable H(q-l; 8) filter can be assumed to start at t = -00.
For stationary disturbances with rational spectral densities it is a consequence of the
spectral factorization theorem (see for example Appendix A6.1; Anderson and Moore,
1979; Astrom, 1970) that they can be modeled within the restrictions given by (6.2).
Spectral factorization can also be applied to nonstationary disturbances if H- 1(q-l; 8) is
a stable filter. This is illustrated in Example A6.1.4.
Equation (6.1) describes a general linear model. The following examples describe
typical model structures by specifying the parametrization. That is to say, they specify
how O(q-l; 8), H(q-l; 8) and A(8) depend on the parameter vector 8.
Example 6.1 An ARMAX model
Let yet) and u(t) be scalar signals and consider the model structure
~)y(t)
= B(q-l)U(t)
(6.ij]
+ C(q-l)e(t)
where
A(q-l) = 1
B(q-l) = b1q-l
+ '" + bnbq-nb
(6.4)
+ aly(t -
1)
bluet - 1)
+ bnbu(t -
nb)
(6.6)
Chapter 6
,,? = Ee 2 (t)
(6.7)
so that
G(q-l. 8)
,
B( -1)
q
A(q 1)
H(q-l. 8)
,
C( -1)
q
A(q 1)
A(8)
1..2
The set
q)
q)
(6.8)
is given by
= {81
The polynomial C(z) has all zeros outside the unit circle}
eel)
(a)
u(t)
!!.~q'l)
A(q-l)
(b)
u(t)
yet)
Section 6.2
151
A(q-l)y(t) = e(t)
11
(6.9)
yet)
C(q-l)e(t)
(Cl . . .
nb ::;::: O. Then
(6.10)
Cnc)T
Cl'"
Cnc>T
When A(q-l) is constrained to contain the factor 1 - q-l the model is called
autoregressive integrated moving average (ARIMA). Such models are useful for
describing drifting disturbances and are further discussed in Section 12.3.
A finite impulse response (FIR) model is obtained when na = nc ::::: O. It can also be
caned a truncated weighting function model. Then
yet)
= B(q-l)U(t) + eel)
8 = (b i
...
(6.12)
bnb)T
bl
...
bnb)T
(6.13b)
where
<:pet) ::::: (-y(t - 1) ... -y(t - na)
(6.13c)
Note though that here the regressors (the elements of <:p(t are not deterministic
functions. This means that the analysis carried out in Chapter 4 for linear regression
..
models cannot be applied to (6.13).
Chapter 6
'-
B( -1)
F(q-l)
_A_(_q_-_l)_y(_t)_=
__
q_ _
C( -1)
(6.14)
(6.15)
e=
(al ...
ana
bl
...
bnb
CI'"
Cnc
d l ... d nd
It ...
fnf)T
(6.16)
Block diagrams of the general model (6.14) are given in Figure 6.3.
It should be stressed that it is seldom necessary to use the model structure (6.14) in its
general form. On the contrary, for practical use one often restricts it by setting one or
more polynomials to unity. For example, choosing nd = nf = 0 (i.e. D(q--l) == F(q-l) ==
1) produces the ARMAX model (6.3). The value of the form (6.14) lies in its generality.
It includes a number of important forms as special cases. This makes it possible to
describe and analyze several cases simultaneously.
Clearly, (6.14) gives
(a)
u(t)
(b)
u(t)
yet)
FIGURE 6.3 Equivalent block diagrams of the general SISO model (6.14).
Section 6.2
153
(6.17)
= {81
q;
The polynomials C(z) and F(z) have aU zeros outside the unit circle}
= nf = 0 the
yet) =
B( -1)
q
u(t)
F(q 1)
+ e(t)
(6.19)
B(q-I)
C(q-I)
(6.21)
Chapter 6
be common factors in A(q-I) and B(q-1) as well as in A(q-l) and C(q-I). For
example, if the disturbances enter as white measurement noise then H(q-\ 8) = 1,
which gives A(q-I) = C(q-1). In general A(q-l) must be chosen as the (least)
common denominator of O(q-\ 8) and H(q-\ 8) .
The model (6.21) implies that A(q-I) = 1. Hence the filters 0(q-1; 8) and H(q-I; 8)
are parametrized with different parameters. This model might require fewer
parameters than an ARMAX model. However, if 0(q-1) and H(q-l) do have
common poles, it is an advantage to use this fact in the model description. Such poles
will also be estimated more accurately if an ARMAX model is used .
The general form (6.14) of the model makes it possible to allow O(q-\ 8) and
H(q-I; 8) to have partly the same poles.
Another structure that has gained some popularity is the so-called ARARX structure, for which C(q-I) = F(q-1) = 1. The name ARARX refers to the fact that the
disturbance is modeled as an AR process and the system dynamics as an ARX model.
This model is not as general as those above, since in this case
This identity may not have any exact solution. Instead a polynomial D(q-l) of high
degree may be needed to get a good approximate solution. The advantage of this
structure is that it can given simple estimation algorithms, such as the generalized least
squares method (see Complement C7.4). It is also well suited for the optimal instrul1li
mental variable method (see Example 8.7 and the subsequent discussion).
The next two examples describe ways of generalizing the structure (6.13) to the multivariable case.
Example 6.3 The full polynomial form
For a linear multivariable system consider the model
A(q-l)y(t)
B(q-l)U(t) + e(t)
(6.22)
where now A(q-I) and B(q-I) are the following matrix polynomials of dimension
(nylny) and (nylnu) respectively:
(6.23)
Assume that all elements of the matrices AJ, ... , Ana, B I , . . . , Bnb are unknown. They
will thus be included in the parameter vector. The model (6.22) can alternatively be
written as
yet)
(6.24)
Section 6.2
155
(6.25a)
(6.25b)
(6.25c)
(e?T)
((euy)T
= (AI' .. Ana
B1
...
Bub)
(6.25d)
The set q; associated with (6.22) is the whole space 9'l ne.
For an illustration of (6.25) consider the case ny = 2, nu = 3, na
= 1, nb = 2 and set
Then
<pT(t) = (-Yl(t - 1) -Yz(t - 1)
U2(t - 2)
U3(t -
e1 =
e2 =
(all
a 12
(a 21
a 22
UI(t - 1)
U2(t - 1)
U3(t - 1)
Ul(t -- 2)
iii
The full polynomial form (6.24) gives a natural and straightforward extension of the
scalar model (6.13). Identification methods for estimating the parameters in the full
polynomial form will look simple. However, this structure also has some drawbacks.
The full polynomial form is not a canonical parametrization. Roughly speaking, this
means that there are linear systems of the form (6.22) whose transfer function matrix
A -l(q-l)B(q-l) cannot be uniquely parametrized bye. Such systems cannot be
identified using the full polynomial model, as should be intuitively clear. Note, however,
that these 'problematic' systems are rare, so that the full polynomial form can be used to
represent uniquely almost all linear systems of the form (6.22). The important problem
of the uniqueness of the parametrization is discussed in the next section. A detailed
analysis of the full polynomial model is provided in Complement C6.1.
Example 6.4 Diagonal form of a multivariable system
This model can be seen as another generalization of (6.13) to the multivariable case.
Here the (nylny) matrix A(q-I) is assumed to be diagonal. More specifically,
Chapter 6
(6.26a)
a1(Z; 8)
A(z; 8)
(6.26b)
o
where
ai(Z; 0) = 1
are scalar polynomials. Further,
Bl(Z;
B(z; 0)
=(
0)
(6.26c)
Bny(z; 8)
The integer-valued parameters nai and nbi, i = 1, ... , ny are the structural indices of
the parametrization. The model can be written as
yet)
</> T (t)8
+ e(t)
(6.27a)
where the parameter vector 8 and the matrix </>(t) are given by
0:
<PI (t)
o"y
</>(t) = (
(6.27b)
o
(6.27c)
(6.27d)
The diagonal form model is a canonical parametrization of a linear model of the type of
(6.22) (see, for example, Kashyap and Rao, 1976). However, compared to the full
polynomial model, it has some drawbacks. First, estimation of the parameters in full
polynomial form models may lead to simpler algorithms than those associated with
diagonal models (see Chapter 8). Second, the structure of the diagonal model (6.27)
clearly is more complicated than that of the full polynomial model (6.25). For the model
(6.27) 2ny structure indices have to be determined while the model (6.25) needs
determination of only two structure indices (the degrees na and nb). Determination of a
large number of structural parameters, which in practical applications is most often done
by scanning all the combinations of the structure parameters that are thought to be
possible, may lead to a prohibitive computational burden. Third, the model (6.27) may
often contain more unknown parameters than (6.25), as the following simple example
demonstrates: consider a linear system given by (6.22) with ny = nu and na = nb = 1;
then the model (6.24), (6.25) will contain 2nl parameters, while the diagonal model
III
(6.26), (6.27) will most often contain about nl parameters.
Section 6.2
157
For both the full polynomial form and the diagonal form models the number of
parameters used can be larger than necessary. If the number of parameters in a model is
reduced, the user may gain two things. One is that the numerical computation required
to obtain the parameter estimates will in general terms be simpler since the problem is of
a lower dimension. The other advantage, which will be demonstrated later, is that using
fewer free parameters will in a certain sense lead to a more accurate model (see (11.40)).
The following example presents a model for multivariable systems where the internal
structure is of importance.
Example 6.S A state space model
Consider a linear stochastic model in the state space form
x(t
1) = A(8)x(t)
+ B(8)u(t) + v(t)
(6.28)
Here vet) and e(t) are (multivariate) white noise sequences with zero means and the
following covariances:
EV(t)VT(S)
= R 1(8)b(,s
Ev(t)eT(s) = Rd8)b(,s
(6.29)
Ee(t)eT(s) = R 2(8)O(,s
The matrices A(8), B(8), C(8), R I (8), R12(8) , R2(8) can depend on the parameter
vector in different ways. A typical case is when a model of the form (6.28) is partially
known but some of its matrix elements remain to be estimated. No particular
parametrization will be specified here.
Next consider how the model (6.28) should be transformed into the general form (6.1)
introduced at the beginning of this chapter. The transfer function G(q-\ 8) is easily
found: it can be seen that for the model (6.28) the influence of the input u(t) on the
output yet) is characterized by the transfer function
(6.30)
To find the filter H(q-\ 8) and the covariance matrix A(8) is a bit more complicated.
Equation (6.28) must be transformed into the so-called innovation form. Note that in
(6.1) the only noise source is e(t), which has the same dimension as the output yet).
However, in (6.28) there are two noise sources acting on yet): the 'process noise' vet) and
the 'measurement noise' e(t). This problem can be solved using spectral factorization, as
explained in Appendix A6.1. We must first solve the Riccati equation
P(8)
x [C(8)P(8)CT (8)
+ R2(8)]-1[C(8)P(8)AT(8) + RI2(8)]
(6.31)
taking the symmetric positive definite solution, and then compute the Kalman gain
(6.32)
158
Model parametrizations
Chapter 6
It is obvious that both P and K computed from (6.31) and (6.32) will depend on the
parameter vector 8. The system (6.28) can then be described, using the one-step
predictions xCt + lit) as state variables, as follows:
C(8)x(tlt - 1) + yet)
(6.33)
where
y(t) = yet) - C(8)x(tlt - 1)
(6.34)
is the output innovation at time t. The innovation has a covariance matrix equal to
cov[y(t)]
C(8)P(8)Cr(8) + R2(8)
(6.35)
The innovations yU) play the role of e(t) in (6.1). Hence from (6.33), (6.35), H(q-l; 8)
and A(8) are given by
H(q-l; 8)
J + q8)[qJ - A(8)rlK(8)
+ R2(8)
(6.36)
Now consider the set gzJ, in (6.2). For this case (6.33), (6.34) give
yet)
-C(8)x(tlt - 1) + yet)
H-1(q-\ 8)
H-\q--l; 8)G(q-l; 8)
The poles of H--1(q-l; 8) and H- 1(q-l; 8)G(q-l; 8) are therefore given by the
eigenvalues of the matrix A(8) - K(8)C(8). Since the positive definite solution of the
Riccati equation (6.31) was selected, all these eigenvalues lie inside the unit circle (see
Appendix A6.1; and Anderson and Moore, 1979). Hence the restriction 8 E gzJ, (6.2),
does not introduce any further constraints here: this restriction is automatically handled
as a by-product.
Section 6.2
dx(t)
F(8)x(t)dt
Edw(t)dw(t?'
159
+ G(8)u(t)dt + dw(t)
(6.37a)
R(8)dt
x(t) = F(e)x(t)
EW(t)lVT(S)
G(8)u(t)
wet)
(6.37b)
R(8)o(t - s)
where OCt) is the Dirac delta-function. Equation (6.37b) can be complemented with
(6.38)
showing how certain linear combinations of the state vector are measured in discrete
time.
Note that model parametrizations based on physical insight can often be more easily
described by the continuous time model (6.37) than by the discrete time model (6.28).
The reason is that many basic physical laws which can be used in modeling are given in
the form of differential equations. As a simple illustration let Xl(t) be the position and
X2(t) the velocity of some moving object. Then trivially Xl(t) = X2(t), which must be one
of the equations of (6.37). Therefore physical laws will allow certain 'structural information' to be included in the continuous time model.
The solution of (6.37) is given by (see, for example, Kailath, 1980; Astrom, 1970)
x(t)
= eFCfl)(I-to)x(to) +
eF(fl)(t-s)G(8)u(s)ds
to
eFCfl)(t-s)w(s)ds
(6.39a)
to
where x(to) denotes the initial condition. The last term should be written more formally
as
eF(fl)(t-s)dw(s)
(6.39b)
to
Next, assume that the input signal is kept constant over the sampling intervals. Then
the sampling of (6.37), (6.38) will give a discrete time model of the form (6.28), (6.29)
with
A(O)
B(e) =
RlCO)
Rd8)
eF(fl)h
J:
J:
0
e F (fl)sG(8)ds
(6.40)
eF(fl)sR(O)eFT(fl)sds
Chapter 6
where h denotes the sampling interval and it is assumed that the process noise wet) (or
wet~ and the measurement noise e(s) are independent for all t and s.
To verify (6.40) set to = t and t = t + h in (6.39a). Then
The sequence {vet)} is uncorrelated (white noise). The covariance matrix of vet) is given
by
t+h Jt+h F
t e (8)(f+h-s')w(s') WT(s")e FT(8)(t+h-s")ds' ds"
Ev(t)vT(t) = E J t
= J tt+h Jt+h
t e F (8)(t+h--s') R(8)6(s' - s")e FT (8)(t+h-s")ds'ds"
= r+ h e F (8)(t+h-s) R(8)e FT (8)(t+h-s)ds
= J: e F (8)sR(8)e FT (8)sds
III
So far only linear models have been considered. Nonlinear dynamics can of course
appear in many ways. Sometimes the nonlinearity appears only as a nonlinear transformation of the signals involved. Such cases can be incorporated directly into the previous framework simply by redefining the signals. This is illustrated in the following
example.
Example 6.6 A Hammerstein model
Consider the scalar model
(6.41)
The relationship between u and y is clearly nonlinear. By defining a new artificial input
u(t) =
u(t)
u2 (t)
and setting
B(q-l) = (Bl(q-l)
B 2 (q-l)
A(q-l)y(t) = B(q-l)U(t)
e(t)
(6.42)
Uniqueness properties
Section 6.3
161
However, this is a standard linear model if uCt) is now regarded as the input signal.
One can then use all the powerful tools applicable to linear models to estimate the
II
parameters, evaluate the properties of the estimates, etc.
In general terms it is first necessary to choose a class of model structures and then make
an appropriate choice within this class. The general SISO model (6.14) is one example of
a class of model structures; the state space models (6.28) and the nonlinear difference
equations (6.41) correspond to other classes. For the class (6.14), the choice of model
structure consists of selecting the polynomial degrees na, nb, nc, nd and nf. For a
multivariable model such as the state space equation (6.28), both the structure parameter(s) (in the case of (6.28), the dimension of the state vector) and the parametrization
itself (i.e. the way in which e enters into the matrices A (e), B(e), etc.) should be chosen.
The choice of a class of model structures to a large extent should be made according to
the aim of the modeling (that is to say, the model set should be chosen that best fits the
final purpose). At this stage it is sufficient to note that there are indeed many other
factors that influence the selection of a model structure class. Four of the most important factors are:
.. Flexibility. It should be possible to use the model structure to describe most of the
different system dynamics that can be expected in the application. Both the number of
free parameters and the way they enter into the model are important .
.. Parsimony. The .model structure should be parsimonious. This means that the model
should contain the smallest number of free parameters required to represent the true
system adequately .
.. Algorithm complexity. Some identification methods such as the prediction error
method (PEM) (see Chapter 7) can be applied to a variety of model structures.
However, the form of structure selected can considerably influence the amount of
computation needed .
.. Properties of the criterion function. The asymptotic properties of PEM estimates
depend crucially on the criterion function. The existence of local (i.e. nonglobal)
minima as well as non unique global minima is very much dependent on the model
structure used.
Some detailed discussion of the factors above can be found in Chapters 7, 8 and II.
Specifically, parsimony is discussed in Complement CIL1; computational complexity in
Chapter 6
Sections 7.6, 8.3 and Complements C7.3, C7.4, C8.2 and CS.6; properties of the
criterion function in Complements C7.3, C7.S and C7.6; and the subject of flexibility is
touched on in several examples in Chapters 7 and 11. See also Ljung and Soderstrom
(1983) for a further discussion of these factors and their role in system identification.
General uniqueness considerations
There is one aspect related to parsimony that should be analyzed here. This concerns the
problem of adequately and uniquely describing a given system within a certain model
structure. To formalize such a problem one must of course introduce some assumptions
on the true system (i.e. the mechanism that produces the data u(l), y(l), u(2), y(2), ... ).
It should be stressed that such assumptions are needed for the analysis only. The application of identification techniques is not dependent on the validity of such assumptions.
Assume that the true system J is linear, discrete time, and that its disturbances have
rational spectral density. Then it can be described as
Eelt)e;(t')
= Ai>(,!'
DT(J, .At)
As
A(8)}
The set DT(J, .At) consists of those parameter vectors for which the model structure
.At gives a perfect description of the true system J. Three situations can occur:
The set DT(J,.At) may be empty. Then no perfect description of the system can be
obtained in .At, no matter how the parameter vector is chosen. One can say that the
model structure has too few parameters to describe the system adequately. This is
called underparametrization .
The set DT(J, .At) may consist of one point. This will then be denoted by 80 , This is
the ideal case; 80 is called the true parameter vector.
The set DT(J,.At) may consist of several points. Then there are several models within
the model set that give a perfect description of the system. This situation is sometimes
referred to as overparametrization. In such a case one can expect that numerical
problems may occur when the parameter estimates are sought. This is certainly the
case when the points of DT(J, .At) are not isolated. In many cases, such as those
illustrated in Example 6.7, the set DT(J, .At) will in fact be a connected subset (or
even a linear subspace).
ARMAX models
Example 6.7 Uniqueness properties for an ARMAX model
Consider a (scalar) ARMAX model, (6.3)-(6.4),
Section 6.3
Uniqueness properties
A(q-1)y(t) = B(q-1)U(t)
C(q-1)e(t)
Ee 2 (t)
= )..2
163
(6.45)
= B,(q-1)U(t) +
Cs(q-1)e,(t)
(6.46)
(6.47)
with
A/q-l)
B,(q-l)
Cs(q-l)
Assume that the polynomials As> Bs and Cs are coprime, i.e. that there is no common
factor to all three polynomials.
The identities in (6.44) defining the set DT(./' Jt) now become
BS(q-l) _ B(q-1)
As(q 1) = A(q 1)
Cs(q-l) _ C(q-l)
AsCq 1) = A(q 1)
(6.48)
(6.50)
One must now find the solution of (6.48) with respect to the coefficients of A, B, C and
).,2. Assume that (6.50) holds. Trivially, 1-.2 = A;. To continue, let us first discuss the more
simple case of a pure ARMA process. Then nb = nbs = 0 and the first identity in (6.48)
can be dispensed with. In the second identity note that As and Cs have no common factor
and further that both sides must have the same poles and zeros. These observations
imply
A(q-l)
As(q--l)D(q-l)
(6.51)
where
D(q-l) = 1
d 1q-1
+ ... +
dndq-nd
= deg
Thus if n* > 0 there are infinitely many solutions to (6.48) (obtained by varying d 1 ...
d n*). On the other hand, if n* = 0 then D(q-l) == 1 and (6.48) has a unique solution. The
condition n* = 0 means that at least one of the polynomials A(q-1) and C(q-I) has the
same degree as the corresponding polynomial of the true system.
Next examine the somewhat more complicated case of a general ARMAX structure.
Assume that As(q-l) and C,(q-l) have exactly nl (nl ;:: 0) common zeros. Then there
exist unique polynomials AO(q-l), CO(q--l) and L(q-l), such that
Chapter 6
As(q-I)
== AO(q-l)L(q-l)
Cs(q-l)
== Co(q-l)L(q-l)
(6.52)
== AO(q-l)M(q-l)
C(q-I)
== Co(q-I)M(q-l)
(6.53)
where
M(q-l)
deg Co)
(6.54)
(6.55)
Now cancel the factor AO(q-l). Since Bs(q-l) and L(q-I) are coprime (cf. (6.52)), it
follows from (6.55) that
== Bs(q-l)D(q-l)
B(q--l)
(6.56)
where
(6.57a)
has arbitrary coefficients. Its degree is given by
deg D
nd
n*
(6.57b)
Further (6.52), (6.53) and (6.56) together give the general solution to (6.48):
A(q-l)
As(q-l)D(q-l)
B(q--l)
Bs(q-l)D(q-l)
C(q-I)
Cs(q-l)D(q-l)
(6.58)
Uniqueness properties
Section 6.3
165
where D(q-l) must have all zeros outside the unit circle (d. (6.2 but is otherwise
arbitrary.
To summarize, for an ARMAX model:
.. If n* < 0
.. If n* = 0
.. If n* > 0
The following paragraphs present some results on the uniqueness properties of other
model structures.
(6.59a)
i = 1, ... , ny
nt
(6.59b)
i = 1, ... , ny
= 0
The full polynomial form (6.22) will often but not always give uniqueness. Exact
conditions for uniqueness are derived in Complement C6.1. The following example
demonstrates that nonuniqueness may easily occur for multivariable models (not
necessarily of the full polynomial form).
Example 6.8 Nonuniqueness of a multivariable model
Consider a deterministic multivariable system with ny = 2, nu
yet) -
1) =
(1)1 u(t -
= 1 given by
(6.60a)
1)
G(q
(q -a- a
-2 + a
q - 2 + a
)-1(1)1
q-l
(1)
= 1 - 2q 1 1
(6.60b)
166
Model parametrizations
Chapter 6
which is independent of a. Hence all models of the form (6.60a), for any value of a, will
give the same transfer function operator G(q-I). In particular, this means that the
system cannot be uniquely represented by a first-order full polynomial form model. Note
that this conclusion follows immediately from the result (C6.1.3) in Complement C6.1.
III
+ 1)
Ax(t)
+ Bu(t) + vet)
(6.61a)
where {v(t)} and {e(t)} are mutually independent white noise sequences with zero means
and covariance matrices RI and R 2 , respectively. Assume that all the matrix elements are
free to vary. Consider also a second model
= Ei(t) + e(t)
EV(t)VT(s) = RI O"s
y(t)
(6,61b)
Ev(t)eT(s) = 0
where
= QAQ-I
R=
Ii = QB
QRIQ T
and Q is an arbitrary nonsingular matrix, The models above are equivalent in the sense
that they have the same transfer function from u to y and their outputs have the same
second-order properties. To see this, first calculate the transfer function operator from
u(t) to yet) of the model (6,61b),
= C[Q-l{qJ - QAQ-l}Qr1B
= C[qJ - A]-IB
This transfer function is independent of the matrix Q. To analyze the influence of the
stochastic terms on yet), it is more convenient in this case to examine the spectral density
Section 6.4
Identifiability
167
<j>y(w) rather than to explicitly derive H(q-l) and A in (6.1). Once <j>y(w) is known,
H(q-l) and A are uniquely given, by the spectral factorization theorem. To evaluate
<j>y(w) for the model (6.61b), set u(t) == O. Then
R2
R2
R2
R2
Thus the spectral density (and hence H(q-l) and A) are also independent of Q. Since Q
can be chosen as an arbitrary nonsingular matrix, it follows that the model is not unique
in the sense that DT(J, vii) consists of an infinity of points. To get a unique model it is
necessary to impose restrictions of some form on the matrix elements. This is a topic that
belongs to the field of canonical forms. See the bibliographical notes at the end of this
chapter for some references.
6.4 Identifiability
The concept of identifiability can be introduced in a number of ways, but the following is
convenient for the present purposes.
When an identification method f is applied to a parametric model structure vii the
resulting estimate is denoted by SeN; J, vii, f, ff). Clearly the estimate will depend not
only on f and vii but also on the number of data points N, the true system J, and the
experimental condition ff.
The system J is said to be system identifiable under .jt, f and ff, abbreviated
SI (vii , f, ff), if
SeN;
J, vii, f, ff)
--7
DT(J, v-ft)
as N
00
(6.62)
(with probability one). For SI(vIi, f, &) it is in particular required that the set
DT(J, vii) (introduced in (6.44) is nonempty. If it contains more than one point then
(6.62) must be interpreted as
lim
inf
N->oo OEDT(j,Jt)
I SeN; J, vii,
f, &) -
ell = 0
(6.63)
The meaning of (6.63) is that the shortest distance between the estimate (3 and the set
DT(J, vii) of all parameter vectors describing G(q-l) and H(q-l) exactly, tends to zero
as the number of data points tends to infinity.
We say that the system J is parameter identifiable under jt, f and ff, abbreviated
PI (vii , f, ff), if it is SI(vIi, .:7, ff) and DT(J, <~) consists of exactly one point. This is
the ideal case. If the system is PI (vii , f, ff) then the parameter estimate S will be unique
for large values of N and also consistent (i.e. {) converges to the true value, as given by
the definition of DT(J, vii)).
Here the concept of identifiability has been separated into two parts. The convergence
of the parameter estimate S to the set DT(J, vii) (i.e. the system identifiability) is a
Chapter 6
property that basically depends on the identification method J. This is a most desirable
property and should hold for as general experimental conditions &l'as possible. It is then
'only' the model parametrization or model structure ..d that determines whether the
system is also parameter identifiable. It is of course desirable to choose the model
structure so that the set DT(./" ..d) has precisely one point. Some practical aspects on
this problem are given in Chapter 11.
Summary
In this chapter various aspects of the choice of model structure have been discussed.
Sections 6.1 and 6.2 described various ways of classifying model structures, and a general
form of model structure for linear multivariable systems was given. Some restrictions on
the model parameters were noted. Section 6.2 gave examples of the way in which some
well-known model structures can be seen as special cases of the general form (6.1).
Section 6.3 discussed the important problem of model structure uniqueness (i.e. the
property of a model set to represent uniquely an arbitrary given system). The case of
ARMAX models was analyzed in detail. Finally, identifiability concepts were introduced
in Section 6.4.
Problems
Problem 6.1 Stability boundary for a second-order system
Consider a second-order AR model
y(t)
+ aly(t - 1) + a2y(t - 2)
= e(t)
Derive and plot the area in the (ab a2)-plane for which the model is asymptotically
stable.
Problem 6.2 Spectral factorization
Consider a stationary stochastic process yet) with spectral density
1 5 - 4 cos w
<py(w) = 2n 8.2 - 8 cos w
Shew that this process can be represented as a first-order ARMA process and derive its
parameters.
Problem 6.3 Further comments on the non uniqueness of stochastic state space models
Consider the following stochastic state space model (cf. (6.61)):
x(t + 1) = Ax(t) + Bv(t)
0)
Problems
E
(~gD(VT(S) eT(s
169
ROt,s
It was shown in Example 6.9 that the representation (i) of the stochastic process y(t) is
not unique unless the matrices A, Band C are canonically parametrized.
There is also another type of non uniqueness of (i) as a second-order representation of
y(t), induced by an 'inappropriate' parametrization of R (the matrices A, Band C being
canonically parametrized or even given). To see this, consider (i) with
A = 112
R=
C = 1
B = 1
3
1
10 - -Q 2Q
4
1
2
-Q
(ii)
Q E (0, 10)
Note that the matrix R is positive definite as required. Determine the spectral density
function of yet) corresponding to (i), (ii). Since this function does not depend on Q,
conclude that yet) is not uniquely identifiable in the representation (i), (ii) from secondorder data.
Remark. This exercise is patterned after Anderson and Moore (1979), where more
details on representing stationary second-order processes by state space models may be
found.
n) = e(t)
for
A(z) = 1
alz
Izl : ;: ;
+ 1) =
Ax(t)
Be(t
+ 1)
(i)
yet) = Cx(t)
with the matrix A in the following companion form:
Chapter 6
As(q-I) = 1
Bs(q-I) =
Cs(q-I) = 1
+ clq-I +
A(q-I) = 1
+ alq-l +
+ anaq-na
B(q-I) = bIq-l
C(q-I) = 1
c~c,q-nc,
+ ... + bnbq-nb
CIq-l
+ ... + cncq-nc
(a) Derive sufficient conditions for Dr(J?, j{) to consist of exactly one point.
(b) Give an example where Dr(J', .(4) contains more than one point.
Remark. Note that in the case of overparametrization the set Dr consists of a finite
number of isolated points. In contrast, for overparameterized ARMAX models Dr is a
connected set (in fact a subspace) (see Example 6.7). A detailed study of the topic of this
problem is contained in Stoica and Soderstrom (1982e).
Problem 6.6 Uniqueness properties of a state space model
Consider the state space model
xCt + 1)
e=
a12
a22
(b)0 u(t)
b)T
Ev(t)v(s)
y = (1
O)x
ro(t - s)
Bibliographical notes
171
Yes)
K
K
= '--;U(s) + S2V(S).)
Assume that the gain K is unknown and is to be estimated from discrete time measurements
y(t) = (1 O)x(t)
t =
1, ... , N
Find the discrete time description (6.1) of the system, assuming the input is constant over
the sampling intervals.
Bibliographical notes
Various ways of parametrizing models are described in the survey paper by Hajdasinski
et ai. (1982). Ljung and Soderstrom (1983) give general comments on the choice of
parametrization.
The ARMAX model (6.3) has been used frequently in identification since the seminal
paper by Astrom and Bohlin (1965) appeared. The 'X' in 'ARMAX' refers to the
exogenous (control) variable u(t), according to the terminology used in econometrics.
The general SISO model (6.14) is discussed, e.g., by Ljung and Soderstrom (1983).
Diagonal right matrix fraction description models are sometimes of interest. Identification algorithms for such models were considered by Nehorai and Morf (1984).
For a discussion of state space models such as (6.28), (6.29) see, e.g., Ljung and
Soderstrom (1983). For the role of spectral factorization, the Riccati equation and the
Kalman filter in this context, see Anderson and Moore (1979) or Astrom (1970). The
latter reference also describes continuous time stochastic models of the form (6.37) and
sampling thereof. Efficient numerical algorithms for performing spectral factorization in
polynomial form have been given by Wilson (1969) and Kucera (1979).
Many special model structures for multivariable systems have been proposed in the
literature. They can be regarded as ways of parametrizing (6.1) or (6.28) viewed as blackbox models (i.e. without using a priori knowledge (physical insight. For some examples
of this kind see Rissanen (1974), Kailath (1980), Guidorzi (1975,1981), Hannan (1976),
Ljung and Rissanen (1976), van Overbeek and Ljung (1982), Gevers and Wertz (1984,
1987a, 1987b), Janssen (1987a, b), Correa and Glover (1984a, 1984b), Stoica (1983).
Gevers (1986), Hannan and Kavalieris (1984), and Deistler (1986) give rather detailed
descriptions for ARMA models, while Correa and Glover (1987) discuss specific parametrizations for instrumental variable estimation.
Specific compartmental models, which are frequently used in biomedical applications,
have been analyzed for example by Godfrey (1983), Walter (1982), Godfrey and
DiStefano (1985), and Walter (1987).
Leontaritis and Billings (1985) as well as Carrol and Ruppert (1984) deal with various
forms of nonlinear models.
Nguyen and Wood (1982) give a survey of various results on identifiability.
172
Model parametrizations
Chapter 6
Appendix A6.1
Spectral factorization
This appendix examines the so-called spectral factorization problem. The following
lemma is required.
Lemma A6.1
Consider the function
n
2:
fez) =
fk
fkzk
= f-k (real-valued)
(A6.1.1a)
k=-n
Assume that
f(e ilO ) ~ 0
all
(A6.1.1b)
ill
g(z) = zn
+ glzn-l + ... + gn
with real-valued coefficients and all zeros inside or on the unit circle, and a (realvalued) positive constant C, such that
fez) = Cg(Z)g(Z-l)
all z
(A6.1.2)
(iii) Iff(e iW ) > 0 for all ill, then the polynomial g(z) can be chosen with all zeros strictly
inside the unit circle.
feZ-I)
2:
!ki- k
k=-n
2:
f_k i - k
k=-n
2:
!ki k = f(i)
=0
k=-n
where h(z) has all zeros strictly inside the unit circle (and hence h(Z-l) has all zeros
strictly outside the unit circle) and k(z) all zeros on the unit circle. Hence
f( e iw )
From (A6.1.1b), k(e iw ) ~ 0 for all ill. Therefore all zeros of k(z) must have an even
multiplicity. (If i = ei'l' were a zero with odd mUltiplicity then k(e iW ) would shift sign at
ill = cp.) As a consequence, the zeros off(z) can be written as i1> i 2 , . . . , in, ill, ... ,
Z;;l, where 0 < li;1 ~ 1 for i = 1, ... , n. Set
n
g(z) =
TI (z
i=1
- i;)
Appendix A6.1
173
which gives
I1 Zj- ;=1
k=l
j=l
I1 i
j ;=1
j=l
n
in I1
(z - i;)z-n
;=1
I1
(z - i/;1) = fez)
k=1
which proves the relation (A6.1.2). To complete part (ii), it remains to prove that g; is
real-valued and that C is positive. First note that any complex-valued i; with liil ~ 1
appears in a complex-conjugated pair. Hence g(z) has real-valued coefficients. That the
constant C is positive follows easily from
which precludes that g(z) can have any zero on the unit circle.
The lemma can be applied to rational spectral densities. Such a spectral density is a
rational function (a ratio between two polynomials) of cos w (or equivalently of eim ). It
can hence be written as
m
2:
~keikw
k=-m
<j>( w) = . - - - n
2:
(A6.1.3)
uje ijW
j=-n
By varying the integers m and n and the coefficients {Uj}, Wd a large set of spectral
densities can be obtained. This was illustrated to some extent in Example 5.6. According to a theorem by Weierstrass (see Pearson, 1974), any continuous function can be
approximated arbitrarily closely by a polynomial if the polynomial degree is chosen large
enough. This gives a further justification of using (A6.1.3) for describing a spectral
density. By applying the lemma to the rational density (A6.1.3) twice (to the numerator
and the denominator) it is found that <j>( w) can be factorized as
(A6.1.4)
where '}, is a real-valued number, and A(z), C(z) are polynomials
1+
ajZ
Chapter 6
C(z) = 1 + c,Z +
with all zeros outside, or for C(z) possibly on the unit circle. The polynomial A (z) cannot
have zeros on the unit circle since that would make <p( (j) infinite. If <p( (j) > 0 for all (j)
then C(z) can be chosen with all zeros strictly outside the unit circle.
The consequence of (A6.1.4) is that as far as the second-order properties are
concerned (i.e. the spectral density), the signal can be described as an ARMA process
(A6.1.5)
Indeed the process (A6.1.5) has the spectral density given by (A6.1.4) (cf. (A3.1.1O. In
practice the signal whose spectral density is <p( (j)) may be caused by a number of interacting noise sources. However it is convenient to use the ARMA model (A6.1.5) to
describe its spectral density (or equivalently its covariance function). For many identification problems it is relevant to characterize the involved signals by their second-order
properties only.
Note that another C(z) with some zeros inside the unit circle could have been chosen.
However, the choice made above will turn out to be the most suitable one when deriving
II
optimal predictors (see Section 7.3).
The result (A6.1.4) was derived for a scalar signal. For the multivariable case the
following result holds. Let the rational spectral density <p( (j) be nonsingular (det <p( (j)
0) for all frequencies. Then there are a unique rational filter H(q--l) and a positive
definite matrix A such that
*'
<p( (j))
~
2n
H (q- J) and
H ( e - iUJ)AHT( e iw )
}f-l (q -1)
(A6.1.6a)
=I
H(O)
(A6.1.6b)
(A6.1.6c)
(see, e.g., Anderson and Moore, 1979). The consequence of this result is that the signal,
whose spectral density is <p( (j)), can be described by the model
yet)
H(q-l)e(t)
Ee(t)eT(s)
AO t s
(A6.1.7)
(A6.1.8)
where vet) and eel) are mutually independent white noise sequences with zero means and
covariance matrices Rl and R 2 , respectively. The matrix A is assumed to have all
eigenvalues inside the unit circle. Let P be a positive definite matrix that satisfies the
algebraic Riccati equation
Appendix A6.1
175
(A6.1.9)
Set
K == APCT(CPCT
+ R 2)-1
+ C[ql - Ar l K
CPC T + R2
H(q-l) = 1
A =
(A6. 1. 10)
This has almost achieved the factorization (A6.1.6). It is obvious that H(O) = / and that
H(q-l) is asymptotically stable. To show (A6.1.6a), note from (A3.1.1O) that
+ CPAT(e- iW 1 - AT)-lCT
+ C(e iw 1
=
- A)-l(APA T
+ RJ
P)(e- iw / - AT)-l CT
2:n::Ij>(w)
Hence (A6.1.6a) holds for H(q-l) and A given by (A6. 1. 10). It remains to examine the
stability properties of H-1(q-l). Using the matrix inversion lemma (Lemma A.l in
Appendix A), equation (A6.1.1O) gives
H-\q"-l)
= [I +
= 1 -
=1
C(q/ - A)-lKr"l
C[(q/ - A)
+ KC]--lK
(A6. 1. 11)
- C[ql - (A - KC)]-lK
Hence the stability properties of H- 1(q-l) are completely determined by the location of
the eigenvalues of A - KC. To examine the location of these eigenvalues one can study
the stability of the system
x(t
+ 1) = (A - KC)TX(t)
(A6. 1. 12)
Vex) = XTpX
Chapter 6
+ 1) - V(x(t
(A6.1.13)
+ K(CPCT + R2)KT]
=[-Rl
- 2K(CPCT
= -R 1
+ R 2)KT + KCPCTKT
KR2KT
which is nonpositive definite. Hence H-1(q-l) is at least stable, and possibly asymptotically stable, d. Remark 2 below.
Remark 1. The foregoing results have a very close connection to the (stationary)
optimal state predictor. This predictor is given by the Kalman filter (see Section B.7),
and has the structure
x(t
+ lit)
= AX(llt - 1)
y(t)= Cx(tlt - 1)
where
yet)
+ Ky(t)
(A6.1.14)
+ yet)
cov[y(t)] = CPcT + R2
From (A6.1.14) it follows that
yet)
as output. In
yet) - Cx(tlt - 1)
or
x(t + lit)
= (A - KC)x(tlt - 1) + Ky(t)
yet) = -ci(tlt - 1) + yet)
(A6.1.15)
from which the expression (A6.1.11) for H-1(q-l) easily follows. Thus by solving the
spectral factorization problem associated with (A6.1.8), the innovation form (or equivalently the stationary Kalman filter) ofthe system (A6.1.8) is obtained implicitly.
I11III
Remark 2. H-1(q-l) computed as (A6.1.11) will be asymptotically stable under weak
conditions. In case H-1(q-l) is only stable (having poles on the unit circle), then
A - KC has eigenvalues on the unit circle. This implies that det H(q-l) has a zero on
the unit circle. To see this, note that by Corollary 1 of Lemma A.5,
Appendix A6.1
det H(e- iw )
177
- A)-I]
Hence by (A6.1.6a) the spectral density matrix <1>( w) is singular at some w. Conversely,
if <1>( w) is singular for some frequencies, the matrix A - KC will have some eigenvalues
on the unit circle and H-\q-I) will not be asymptotically stable.
III
Remark 3. The Riccati equation (A6.1.9) has been treated extensively in the literature.
See for example Kucera (1972), Anderson and Moore (1979), Goodwin and Sin (1984)
and de Sousa et ai. (1986). The equation has at most one symmetric positive definite
solution. In some 'degenerate' cases there may be no positive definite solution, but a
positive semidefinite solution exists and is the one of interest (see Example A6.1.2 for an
illustration). Some general sufficient conditions for existence of a positive definite
solution are as follows. Factorize Rl as RI = BBT. Then it is sufficient to require:
(i) R2 > 0;
(ii) (A, B) controllable (i.e. rank (B AB ... An-IB) = n = dim A);
(iii) (C, A) observable (i.e. rank (C T ATCT ... (ATt-1CT ) = n = dim A).
Some weaker but more technical necessary and sufficient conditions are given by de
III
Sousa et ai. (1986).
Remark 4. If vet) and eet) are correlated so that
Ev(t)eT (s) = R 120t ,s
the results remain valid if the Riccati equation (A6.1.9) and the gain vector K (A6. 1. 10)
are changed to, see Anderson and Moore (1979),
P
= APA T +
Rl - (APC T
R 12 )(CPCT
R 2)-\CPAT
RT2)
(A6.1.16)
lal <
1, c
"* 0
(A6.1.17a)
0t,s
(A6.1.17b)
(cf. A3.1.10). If lei ~ 1 the representation (A6.1.17a) is the one described in (A6.1.6).
For completeness consider also the case lei> 1. The numerator of the spectral density is
equal to fez) = cz + (1 + ~) + ez- 1 = (1 + ez)(1 + ez-- 1). Its zeros are i1 = -lie and
Chapter 6
lei:::::;
Case 0):
H( -1) = 1 + cq -1
q
1 + aq 1
A = 1
(A6.1.17c)
lei
Case (ii):
H( -1) = 1 + (lIC)q-l
q
1 + aq 1
~ 1
A = c2
II
1) =
(~a
b) x(t) + G) e(t + 1)
(A6.1.18a)
yet) = (1 O)x(t)
Here
Since the second row of A is zero it follows that any solution of the algebraic Riccati
equation must have the structure
p =
(1 + a c)
c
(A6.1.18b)
c2
(A6.1.18c)
which is rewritten as
(a + l)[a(l - a2) - (c - a)2] + [-aa + (c - a)f = 0
a 2 + a[ -(c -- af + (1 - a2) ._- 2a(c -- a)] = 0
a(a
1 - c2 )
This equation has two solutions, al = 0 and a2 = c2 - 1. We will discuss the choice of
solution shortly. First note that the gain K is given by
K =
(1)o c -1 a+ -a aa
(A6.1. 18d)
Appendix A6.1
H(q-l)
1 + (1
O)(q; a -1)-1
q
1+C:a
q - 1)
K---'----,,--'-
+ aq
A=1+a
179
(A6.1.18e)
(A6.1. 18f)
The remaining discussion considers separately the two cases introduced in Example
A6.1.1.
fel : :; :;
Icl
1, for which a1
a2)' Hence
(A6.1. 18g)
A=1
Case (ii). Assume
fel
~ 1. Then
with P2 positive definite, whereas PI is only positive semidefinite. There are two ways to
conclude that az is the required solution in this case. First, it is P 2 that is the positive
definite solution to (A6.1.9). Note also that P 2 is the largest nonnegative definite
solution in the sense P2 - PI ~ O. Second, the filter H--I(q-I) will be stable for a = a2
but unstable for a = al' Inserting a2 = c2 - 1 into (A6.1.18e, f) gives
H(
-1-
q)
1 + (lIc)q-l
1 + aq-l
(A6.1.18h)
As expected, the results (A6.1.18g, h) coincide with the previously derived (A6.1.17c).
= -ax(t) + (c
yet) = x(t) + e(t)
x(t + 1)
- a)e(t)
(A6.1.19a)
In this case
R12
(c - a)
p = a2p
Chapter 6
(c _ a)2 _ (-ap
+ c -- a)2
+1
(A6.1.19b)
Note that this equation is equivalent to (A6.1.18c). The solutions are hence Pl = 0
(applicable for lei ~ 1) and P2 = c
1 (applicable for lei > 1). The gain K becomes,
according to (A6.1.16),
2-
-ap
K =
+ c - --a
+1
Then
H( -1) _ 1 + (K + a)q-l _
q
+ aq
1 + _c_ -1
p + 1q
+ aq
A=p+1
(A6.1.19c)
x(t + 1)
= x(t) + vet)
Ev(t)v(s)
= A~(\,s
(A6.1.20a)
Assume that x(t) is measured with some observation noise which is independent of vet)
y(t)
= x(t) + e(t)
(A6.1.20b)
Since the system (A6.1.20a) has a pole on the unit circle, y(t) is not a stationary process.
However, after differentiating it becomes stationary. Introduce
z(t)
(1 - q-l)y(t) = vet - 1)
+ e(t) - e(t - 1)
(A6.1.20c)
The right-hand side of (A6.1.20c) is certainly stationary and has a covariance function
that vanishes for lags larger than 1. Hence z(t) can be described by an equivalent MA(l)
model:
(A6.1.20d)
Comparison of the covariance functions of (A6.1.20c) and (A6.1.20d) gives
(1 +
C2
)A; = A~ + 2A~
(A6.1.20e)
A;
r ~ ( ~2)1I2J
A~ L1 + 2 + ~ + 4
lei <
(A6.1.20f)
Appendix A6.1
181
yet) = H(q-l)c(t)
where H(q-l) = (1 + cq-l)/(1 - q-l) has an asymptotically stable inverse.
II1II
Wilson (1969) and Kucera (1979) have given efficient algorithms for performing spectral
factorization. To describe such an algorithm it is sufficient to consider (A6.1.1a),
(A6.1.2). Rescale the problem by setting C = 1 and let go be free. Then the problem is to
solve
fez) = g(Z)g(Z-l)
(A6.1.21a)
where
for the unknowns {gi}' Note that to factorize an ARMA spectrum, two problems of the
above type must be solved. The solution is required for which g(z) has all zeros inside or
on the unit circle. Identifying the coefficients in (A6.1.21a),
n-k
fk =
k = 0, ... , n
gigi+k
(A6.1.21c)
i=O
This is a nonlinear system of equations with {gi} as unknowns, which can be rewritten in
a more compact form as follows:
go gl ... gn
go
( fe
gl
1 ~ )n
gl
go
gn-l
gn
=
go
... gn-}
T(g)g
gn
gn
go
gl
(A6.1.22)
~
H(g)g
gn
1
2[T(g) + H(g)]g
Observe that the matrices T and H introduced in (A6.1.22) are Toeplitz and Hankel,
respectively. (T is Toeplitz because Tij depends only on i - j. H is Hankel because Hij
depends only on i + j.) The nonlinear equation (A6.1.22) can be solved using the
Newton-Raphson algorithm. The derivative of the right-hand side of (A6.1.22), with
respect to g, can readily be seen to be [T(g) + H(g)]. Thus the basic iteration of the
Newton-Raphson algorithm for solving (A6.1.22) is as follows:
182
Model parametrizations
gk+!
gk + [T(gk) + H(gk)]-l{J _
Chapter 6
~[T(gk) + H(gk)]gk}
(A6.1.23)
!gk + [T(gk) + H(gk)]-lJ
2
where the superscript k denotes the iteration number.
Merchant and Parks (1982) give some fast algorithms for the inversion of the Toeplitzplus-Hankel matrix in (A6.1.23), whose use may be indicated whenever n is large.
Kucera (1979) describes a version of the algorithm (A6.1.23) in which explicit matrix
inversion is not needed. The Newton-Raphson recursion (A6.1.23) was first derived in
Wilson (1969), where it was also shown that it has the following properties:
If gk(z) has all zeros inside the unit circle, then so does gk+\z) .
The recursion is globally convergent to the solution of (A6.1.21a) with g(z) having all
zeros inside the unit circle, provided gO(z) satisfies this stability condition (which can
be easily realized). Furthermore, close to the solution, the rate of convergence is
quadratic.
=
Complement C6.1
Uniqueness of the full polynomial form model
Let O(z) denote a transfer function matrix. The fun polynomial form model (6.22),
(6.23) corresponds to the following parametrization of O(z):
O(z)
= A -1(z)8(z)
A(z)
1+ A1z
+ ... + Anazna
(nylnu)
(nylny)
(C6.1.1)
(nylnu)
where all the elements of the matrix coefficients {Ai, B j } are assumed to be unknown.
The parametrization (C6.1.1) is the so-called full matrix fraction description (FMFD),
studied by Hannan (1969, 1970, 1976), Stoica (1983), S()derstrom and Stoica (1983).
Here we analyze the uniqueness properties of the FMFD. That is to say, we delineate the
transfer function matrices O(z) which, for properly chosen (na, nb), can be uniquely
represented by a FMFD.
Lemma C6.1.1
Let the strictly proper transfer function matrix O(z) be represented in the form (C6.1.1)
(for any such G(z) there exists an infinity of representations of the form (C6.1.1) having
various degrees (na, nb); see Kashyap and Rao (1976), Kailath (1980)). Then (C6.1.1) is
unique in the class of matrix fraction descriptions (MFD) of degrees (na, nb) if and only
if
A(z), B(z) are left coprime
(C6.1.2)
rank (Ana
(C6.1.3)
B nb ) = ny
Complement C6.2
183
Proof. Assume that the conditions (C6.1.2), (C6.1.3) hold. Let A-l(Z)B(z) be
another MFD of degrees (na, nb) of G(z). Then
A(z) = L(z)A(z)
B(z) = L(z)B(z)
(C6.1.4)
But this implies Ll = 0 in view of (C6.1.3). Thus the sufficiency of (C6.1.2), (C6.1.3) is
proved.
The necessity of (C6.1.2) is obvious. For if A(z) and B(z) are not left coprime then
there exists a polynomial K(z) of degree nk ;?: 1 and polynomials A(z), B(z) such that
A(z) = K(z)A(z)
(C6.1.6)
B(z) = K(z)B(z)
Replacing K(z) in (C6.1.6) by any other polynomial of degree not greater than nk leads
to another MFD (na, nb) of G(z). Concerning the necessity of (C6.1.3), set L(z) =
I + LIZ in (C6.1.4), where L1 =1= 0 satisfies (C6.l.S). Then (C6.1.4) is another MFD
(na, nb) of G(z), in addition to {A (z), B(z)}.
There are G(z) matrices which do not satisfy the conditions (C6.1.2), (C6.1.3) (see
Example 6.8). However, these conditions are satisfied by almost all strictly proper G(z)
matrices. This is so since the condition rank (Ana B nb ) < ny imposes some nontrivial
restrictions on the matrices Ana, Bnb, which may hold for some, but only for a 'few'
systems. Note, however, that if the matrix (Ana B nb ) is almost rank-deficient, then use
of the FMFD for identification purposes may lead to ill-conditioned numerical problems.
Complement C6.2
Uniqueness of the parametrization and the positive definiteness of the
input-output covariance matrix
This complement extends the result of Complement CS.l concerning noise-free output,
to the multivariable case. For the sake of clarity we study the full polynomial form model
only. More general parametrizations of the coefficient matrices in (6.22), (6.23) can be
analyzed in the same way (see Soderstrom and Stoica, 1983). Consider equation (6.22) in
the noise-free case (e(t) == 0). It can be rewritten as
yet) = <pT(t)8
(C6.2.1)
where <p(t) and e are defined in (6.25). As in the scalar case, the existence of the least
squares estimate of 8 in (C6.2.1) will be asymptotically equivalent to the positive
definiteness of the covariance matrix
Chapter 6
R = E<(t)<T(t)
It is shown below that the condition R > 0 is intimately related to the uniqueness of the
Proof. Let 8 be a vector of the same dimension as 8, and consider the equation
(C6.2.2)
with 8 as unknown. The following equivalences can be readily verified (A(z) and B(z)
being the polynomials constructed from 0 exactly in the way A(z) and B(z) are obtained
from 8):
(C6.2.2) ~ <T(t)8 = 0 all t ~ yet)
<T(t)(8 - 8) all t
{A -l(q-l)B(q-l) - (A(q-l)
+ 1- A(q-l)]-l[B(q-l) - B(q-l)]}U(t)
== 0 all t
(C6.2.3)
(C6.2.4)
Hence the matrix R is positive definite if and only if the only solution of (C6.2.4) is
A(z) - 1== 0, B(z) == 0 (i.e. 8 = 0).
II
Chapter 7
PREDICTION ERROR
METHODS
(7.1a)
where
A(q-I)
B(q-l)
(7.1b)
In (7.1a) the term E(t) denotes the equation error. As noted in Chapter 6 the model (7.1)
can be equivalently expressed as
y(t)
= cpT(t)8 + E(t)
(7.2a)
where
cpT(t)
(-y(t - 1)
(3 =
-yet - na)
(7.2b)
hi'"
bl/b)T
The model (7.2) has exactly the same form as considered in Chapter 4. Hence, it is
already known that the parameter vector which minimizes the sum of squared equation
errors,
(7.3)
is given by
(7.4)
185
186
Chapter 7
This identification method is known as the least squares (LS) method. The name
'equation error method' also appears in the literature. The reason is, of course, that E(t),
whose sample variance is minimized, appears as an equation error in the model (7.1).
Note that all the discussions about algorithms for computing (see Section 4.5) will
remain valid. The results derived there depend only on the 'algebraic structure' of the
estimate (7.4). For the statistical properties, though, it is of crucial importance whether
<pet) is an a priori given quantity (as for static models considered in Chapter 4), or
whether it is a realization of a stochastic process (as for the model (7.1. The reason why
this difference is important is that for the dynamic models, when taking expectations of
various quantities as in Section 4.2, it is no longer possible to treat <P as a constant matrix.
Analysis
Consider the least squares estimate (7.4) applied to the model (7.1), (7.2). Assume that
the data obey
(7.5a)
or equivalently
(7.5b)
Here 60 is called the true parameter vector. Assume that vet) is a stationary stochastic
process that is independent of the input signal.
If the estimate &in (7.4) is 'good', it should be close to the true parameter vector 60 ,
To examine if this is the case, an expression is derived for the estimation error
&-
60 =
-l
[~ ~ <p(t)<pT(t) J ~ ~ <p(t)y(t)
-
{~ ~ <p(t)<pT(t) } 8 J
(7.6)
N ~ <p(t)<p TU)
IN
J-l[IV
1
~ <p(t)v(t)
N
Under weak conditions (see Lemma B.2) the sums in (7.6) tend to the corresponding
expected values as the number of data points, N, tends to infinity. Hence is consistent
(that is, tends to 80 as N tends to infinity) if
E<p(t)<pT(t) is nonsingular
(7.7a)
=0
(7.7b)
E<p(t)v(t)
Section 7.1
187
Explanations as to why (7.7a) does not hold under these special circumstances are given
in Complements C6.1, C6.2 and in Chapter 10.
Unlike (7.7a), condition (7.7b) is in most cases not satisfied. An important exception is
when vet) is white noise (a sequence of uncorrelated random variables). In such a case
vet) will be uncorrelated with aU past data and in particular with <pet). However, when
v(t) is not white, it will normally be correlated with past outputs, since yet) depends
(through (7.5a)) on v(s) for s ~ t. Hence (7.7b) will not hold.
Recall that in Chapter 2 it was examplified in several cases that the consistent
estimation of 8 requires vet) to be white noise. Examples were also given of some of the
exceptions for which condition (7.7a) does not hold.
Modifications
The least squares method is certainly simple to use. As shown above, it gives consistent
parameter estimates only under rather restrictive conditions. In some cases the lack of
consistency may be tolerable. If the signal-to-noise ratio is large, the bias will be small. If
a regulator design is to be based on the identified model, some bias can in general be
acceptable. This is because a reasonable regulator should make the closed loop system
insensitive to parameter variations in the open loop part.
In other situations, however, it can be of considerable importance to have consistent
parameter estimates. In this and the following chapter, two different ways are given
of modifying the LS method so that consistent estimates can be obtained under less
restrictive conditions. The modifications are:
.. Minimization of the prediction error for other 'more detailed' model structures. This
idea leads to the class of prediction error methods to be dealt with in this chapter.
Modification of the normal equations associated with the least squares estimate. This
idea leads to the class of instrumental variable methods, which are described in
Chapter 8.
It is appropriate here to comment on the prediction error approach and why the LS
method is a special case of this approach. Neglecting the equation error E(t) in the model
(7.1a), one can predict the output at time t as
yet)
+ b1u(t - 1)
(7.8a)
Cf,T(t) 0
Hence
E(t)
= yet) - yet)
(7.8b)
188
Chapter 7
= yet) - y(tlt - 1; 8)
(7.9)
is small. In (7.9), y(tlt - 1; 8) denotes a prediction of yet) given the data up to and
including time t - 1 (i.e. yet - 1), u(t - 1), yet - 2), u(t - 2), ... ) and based on the
model parameter vector 8.
To formalize this idea, consider the general model structure introduced in (6.1):
yet)
Ee(t)eT(s)
A(8)ol.s
(7.10)
Assume that C(O; 8) = 0, i.e. that the model has at least one pure delay from input to
output. As a general linear predictor, consider
(7.lla)
which is a function of past data only if the predictor filters L1 (q-l; 8) and L 2 (q--l; 8) are
constrained by
(7.llb)
The predictor (7.11) can be constructed in various ways for any given model (7.10). Once
the model and the predictor are given, the prediction errors are computed as in (7.9). The
parameter estimate is then chosen to make the prediction errors (1, 8), ... , (N, 8)
small.
To define a prediction error method the user has to make the following choices:
Choice of model structure. This concerns the parametrization of C(q-I; 8), H(q-I; 8)
and A(8) in (7.10) as functions of 8.
e Choice of predictor. This concerns the filters L 1(q-l; 8) and L 2 (q-J; 8) in (7.11), once
the model is specified .
.. Choice of criterion. This concerns a scalar-valued function of all the prediction errors
(1, 8), ... , (N, 8), which will assess the performance of the predictor used; this
criterion is to be minimized with respect to 8 to choose the 'best' predictor in the class
considered.
e
Section 7.2
189
The choice of model structure was discussed in Chapter 6. In Chapter 11, which deals
with model validation, it is shown how an appropriate model parametrization can be
determined once a parameter estimation has been performed.
The predictor filters L 1(q-l; 8) and L 2(q-l; 8) can in principle be chosen in many
ways. The most common way is to let (7.11a) be the optimal mean square predictor. This
means that the filters are chosen so that under the given model assumptions the prediction errors have as small a variance as possible. In Section 7.3 the optimal predictors are
derived for some general model structures. The use of optimal predictors is often
assumed without being explicitly stated in the prediction error method. Under certain
regularity conditions such optimal predictors will give optimal properties of the parameter estimates so obtained, as will be shown in this chapter. Note, though, that the
predictor can also be defined in an ad hoc nonprobabilistic sense. Problem 7.3 gives
an illustration. When the predictor is defined by deterministic considerations, it is
reasonable to let the weighting sequences associated with L 1(q-\ 8) and L 2(q-l; 8) have
. fast decaying coefficients to make the influence of erroneous initial conditions insignificant. The filters should also be chosen so that imperfections in the measured data are
well damped.
The criterion which maps the sequence of prediction errors into a scalar can be chosen
in many ways. Here the following class of criteria is adopted. Define the sample covariance matrix
(7.12)
where N denotes the number of data points. If the system has one output only (ny = 1)
then E(t, 8) is a scalar and so is RN(8). In such a case RN(O) can be taken as a criterion to
be minimized. In the multivariable case, RN(O) is a positive definite matrix. Then the
criterion
(7.13)
is chosen, where h(Q) is a scalar-valued function defined on the set of positive definite
matrices Q, which must satisfy certain conditions. VN(O) is frequently called a loss
function. Note that the number of data points, N, is used as a subscript for convenience
only. The requirement on the function h(Q) is that it must be monotonically increasing.
More specifically, let Q be positive definite and L\Q nonnegative definite. Then it is
required that
h(Q + L\Q)
h(Q)
(7.14)
Chapter 7
tr SQ
(7.15a)
+ ilQ) - h1(Q)
= GGT
for some
Thus the condition (7.14) is satisfied. Note also that if ilQ is nonzero then the inequality
will be strict.
Another possibility for h(Q) is
[
h 2 (Q) = det Q
Let Q
h 2 (Q
(7.15b)
ilQ) - h 2 (Q)
det[Q(I
Q-1ilQ)] - det Q
= det Q[det(I
+ G-TG- 1ilQ) - 1]
+ G-1ilQC- T )
det Q[det(I
1]
+ AB)
det(I
+ BA)
(see the Corollary 1 to Lemma A.S). Let G-1ilQG- T have eigenvalues At, ... , Any.
Since this matrix is symmetric and nonnegative definite, Ai :::::: O. As the determinant of
a matrix is equal to the product of its eigenvalues, it follows that
h 2 (Q
UJ
(1
+ Ai) - I
= 0 for all i.
J:: 0
O.
III
Remark. It should be noted that the criterion can be chosen in other ways. A more
general form of the loss function is, for example,
1
VN(B) =
N 2:
l(t, 8, E(l, 8
(7.16)
1=1
where the scalar-valued function let, 0, E) must satisfy some regularity conditions. It is
also possible to apply the prediction error approach to nonlinear models. The only
requirement is, naturally enough, that the models provide a way of computing the
Section 7.2
191
prediction errors E(t, 8) from the data. In this chapter the treatment is limited to linear
models and the criterion (7.13).
III
Some further comments may be made on the criteria of Example 7.1:
.. The choices (7 .15a) and (7 .ISb) require practically the same amount of computation
when applied off-line, since the main computational burden is to find RN(8). However,
the choice (7 .15a) is more convenient when deriving recursive (on-line) algorithms as
will be shown in Chapter 9.
e The choice (7 .ISb) gives optimal accuracy of the parameter estimates under weak
conditions. The criterion (7.15a) will do so only if S = A -1. (See Section 7.5 for proofs
of these assertions.) Since A is very seldom known in practice, this choice is not useful
in practical situations .
., The choice (7 .15b) will be shown later to be optimal for Gaussian distributed
disturbances. The criterion (7.16) can be made robust to outliers (abnormal data) by
appropriate choice of the function 1(, " .). (To give a robust parameter estimate,
1(', " .) should increase at a rate that is less than quadratic with E.)
The following example illustrates the choices of model structure, predictor and criterion in a simple case.
Example 7.2 The least squares method as a prediction error method
Consider the least squares method described in Section 7.1. The model structure is then
given by (7.1a). The predictor is given by (7.8a):
yet) =
[1 - A(q-l)]y(t) + B(q-l)U(t)
Note that the condition (7 .11b) is satisfied. Finally, the criterion is given by (7.3).
III
(7.17)
To evaluate the loss function at any value of 8, the prediction errors {E(t, 8)}~1 are
determined from (7.9), (7.11). Then the sample covariance matrix RN(8) is evaluated
according to (7.12).
Figure 7.1 provides an illustration of the prediction error method.
192
Chapter 7
u(t)
yet)
Process
2:
E(t, 0)
Predictor
with adjustable
parameters 0
Y(t, 0)
Algorithm for
minimizing some
function of E(I, 0)
yet) + ay(t - 1)
bu(t - 1)
e(t)
ce(t - 1)
e=
(a
(7.18)
C)T
Assume that u(t) and e(s) are independent for t < s. Hence the model allows feedback
from y(.) to u(). The output at time t satisfies
yet)
= [-ay(t -
1)
bu(t - 1)
ce(t - 1)]
[e(t)]
(7.19)
The two terms on the right-hand side of (7.19) are independent, since e(t) is white noise.
If y*(t) is an arbitrary prediction of yet) (based on data up to time t - 1) it therefore
follows that
E[y(t) - y*(tW
E[-ay(t - 1) + bu(t -- 1)
ce(t - 1) - y*(t)f
'}..2,
+ ,,? ;;:
(7.20)
'}..2
Section 7.3
Optimal prediction
193
predictor y(tlt - 1; 8) is one which minimizes this variance. Equality in (7.20) is achieved
for
y(tlt -
1; 8)
(7.21)
The problem with (7.21) is, of course, that it cannot be used as it stands since the term
e(t - 1) is not measurable. However, e(t - 1) can be reconstructed from the measurable
data yet - 1), u(t - 1), yet - 2), u(t - 2), ... , as shown below. Substituting (7.18)
in (7.21),
y(tlt -
1; 8)
= -ay(t -
1)
- ee(t - 2)]
=
(e - a)y(t - 1)
(7.22)
1-1
2: (e -
i= 1
t
+b
2: (--ey-1u(t -
i) - (-eye(O)
i=!
Under the assumption that lei < 1, which by definition is true for 8 E q; (see Example 6.1),
the last term in (7.22) can be neglected for large t. It will only have a decaying transient
effect. In order to get a realizable predictor this term will be neglected. Then
I-I
y(tlt -
1; 8)
2: (e -
a)(--e)i-Iy(t - i) - a(-ey-ly(O)
i=J
(7.23)
+b
2: (-e)i-Iu(t '- i)
i=1
The expression (7.23) is, however, not well suited to practical implementation. To derive
a more convenient expression, note that (7.23) implies
(7.24)
which gives a simple recursion for computation of the optimal prediction. The prediction
error E(t, 8), (7.9), will obey a similar recursion. From (7.9) and (7.24) it follows that
(7.25)
The recursion (7.25) needs an initial value 10(0, 8). This value is, however, unknown. To
overcome this difficulty, 10(0,8) is in most cases taken as zero. Since lei < 1 by
Chapter 7
assumption, the effect of 10(0, 8) will only be a decaying transient: it will not give any
significant contribution to E(t, 8) when t is sufficiently large. Note that setting 10(0, 8) =
in (7.25) is equivalent to using y(OI-I; 8) = yeO) to initialize the predictor recursion
(7.24). Further discussions on the choice of the initial value 10(0, 8) are given in Complement C7.7 and Section 12.6.
Note the similarity between the model (7.18) and the equation (7.25) for the
prediction error E(t, 8): e(t) is simply replaced by ECt, 8). This is how to proceed in
practice when calculating the prediction error. The foregoing analysis has derived the
expression for the prediction error. The calculations above can be performed in a more
compact way using the polynomial formulation. The output of the model (7.18) can then
be written as
bq-1
cq-1
= [. bq-l_ u(t)
1 + aq
=[
-1
- bq-1U(t)}]
( ) -1
[(1 +
aq
+ c - a q _ _ _ {(l +
1 + aq -1 1 + cq -1
q
u(t)
1 + aq -1
a -1) (t)
q
Y
[e(t)]
bl~~11 + cq
b -1
q
1 + cq
u(t)
1
) -1
J
+(
c - a q yet) +
1 + cq 1
[e(t)]
Thus
y(tlt - 1; 8) =
-1
1 + cq
u(t)
1
( ) -1
+ c - a q yet)
1 + cq 1
(7.26)
which is just another form of (7.24). When working with filters in this way it is assumed
that data are available from the infinite past. This means that there is no transient effect
due to an initial value. Since data in practice are available only from time t = 1
onwards, the form (7.26) of the predictor implicitly introduces a transient effect. In
(7.22) this transient effect appeared as the last term. Note that even if the polynomial
formalism gives a quick and elegant derivation of the optimal predictor, for practical
implementations a difference equation form like (7.24) must be used. Of course, the
result (7.26) can easily be reformulated as a difference equation, thus leading to a
II
convenient form for implementation.
Next consider the general linear model
Optimal prediction
Section 7.3
195
H(q-\ 8)e(t)
(7.27) I
Ee(t)eT(s) = A(8)o(,s
which was introduced in (6.1); see also (7.10). Assume that C(O; e) = 0, H(O; 8) = I and
that H- I(q-I; e) and H- I(q-l; 8)C(q-l; 8) are asymptotically stable (i.e. e E q;,
(6.2. Assume also that u(t) and e(s) are uncorrelated for t < s. This condition holds if
either the system operates in open loop with disturbances uncorrelated with the input,
or the input is determined by causal feedback.
The optimal predictor is easily found from the following calculations:
yet)
= C(q-I;
= [C(q-\
8)u(t)
- C(q-I; 8)u(t)}]
e(t)
I}H-1(q--l; 8){y(t)
(7.28a)
e(t)
I}e(t)
eel)
Note that z(t) and e(t) are uncorrelated. Let y*(t) be an arbitrary predictor of yet) based
on data up to time t - 1. Then the following inequality holds for the prediction error
covariance matrix:
E[y(t) - y*(t)][y(t) - y*(t)]T
= E[z(t)
eel) - y*(t)][z(t)
e(t) - y*(t)JT
+ A(e)
(7.28b)
~ A(e)
Hence z(t) is the optimal mean square predictor, and e(t) the prediction error. This can
be written as
y(tlt -
Note that the assumption C(O; 8) = 0 means that the predictor y(tlt - 1; 8) depends only
on previous inputs (i.e. u(t - 1), u(t - 2), ... ) and not on u(t). Similarly, since
H(O; 8) = I and hence H-I(O, 8) = I, the predictor does not depend on yet) but only on
former output values yet - 1), yet -- 2) ...
Further, note that use of the set q;, introduced in (6.2), means that 8 is restricted to
those values for which the predictor (7.29) is asymptotically stable. This was in fact the
reason for introducing the set q; .
For the particular case treated in Example 7.3,
b -1
C(q-I. e) =
,q
,
1 + aq-l
H(
-1.
q,
8)
= 1 + cq-l
1 + aq-l
196
Y'(tlt - 1; 8)
1++
1
aq
cq
-I
Chapter 7
--1
+ aq
u(t)
[1
+
1
+
-
a-I]
q
(t)
cq 1 Y
bq-l
(c - a)q-l
1 + cq-I u(t) + 1 + cq-l yet)
x(t + 1)
yet)
C(8)x(t) + e(t)
(7.30)
where vet) and e(t) are mutually uncorrelated white noise sequences with zero means and
covariance matrices R I (8) and R 2 (8), respectively. The optimal one-step predictor of yet)
is given by the Kalman filter (see Section B.7 in Appendix B; ct. also (6.33,
i(t + lit)
y(tlt -- 1)
C(8)i(tlt - 1)
(7.31)
K(8)
A(8)P(8)CT(8)[C(8)P(8)CT(8) + R 2 (8)]-1
(7.32a)
and where P(8) is the solution of the following algebraic Riccati equation:
P(8)
(7 .32b)
This predictor is mean square optimal if the disturbances are Gaussian distributed. For
other distributions it is the optimal linear predictor. (This is also true for (7.29).)
Remark. As noted in Appendix A6.1 for the state space model (7.30), there are strong
connections between spectral factorization and optimal prediction. In particular, the
factorization of the spectral density matrix of the disturbance term of yet), makes it
possible to write the model (7.30) (which has two noise sources) in the form of (7.27), for
III
which the optimal predictor is easily derived.
As an illustration of the equivalence between the above two methods for finding the
optimal predictor, the next example reconsiders the ARMAX model of Example 7.3 but
uses the results for the state space model (7.30) to derive the predictor.
Example 7.4 Prediction for a first-order ARMAX model, continued
Consider the model (7.18)
yet)
lei
ay(t - 1)
< 1
bu(t - 1)
+ e(t) + ce(t - 1)
(7.33)
Optimal prediction
Section 7.3
x(t + 1)
(~a ~) x(t)
yet)
(1
(~) u(t)
G)
e(t + 1)
197
(7.34)
O)x(t)
(ef. (A6.1.18a. As in Example A6.1.2 the solution of the Rieeati equation has the form
'2(I+a c)
P=A
c2
This gives a
K
(c - a - aa)2
+ a2a - --'--------'1+ a
(c - a)2
=
0 (since
lei
(~) c~ : -a aa
(c ~ a)
(7.35)
i(t + lit)
y(tlt - 1)
(~a ~) i(tlt -
1)
(c ~ a) {yet) -
(1
(1
O)i(tlt - 1)
(~) u(t)
O)i(tlt - I)}
(7.36)
i(t + lit)
(~c ~)i(tlt -
y(tlt - 1)
(1
1)
(~)U(t)
(c ~ a)y(t)
O)x(tlt -- 1)
y(tlt - 1)
(1
= _1_[bu(t) +
q + c
bq-l
1 + cq
(c - a)y(t)]
(c - a)q-l
u(t)
+ 1 + cq
yet)
198
Chapter 7
Thus the model assumption should be regarded as a tool to construct the predictor
(7.29). Note, however, that if e(t) is not white but correlated, the predictor (7.29)
will no longer be optimal. The same discussion applies to the disturbances in the state
space model (7.30). Complement C7.2 describes optimal multistep prediction of ARMA
processes.
(7.38)
There is a further important issue to mention in connection with PEMs, namely the
relation to the maximum likelihood (ML) method.
For this purpose introduce the further assumption that the noise in the model (7.10) is
Gaussian distributed. The maximum likelihood estimate of 8 is obtained by maximizing
the likelihood function, i.e. the probability distribution function (pdf) of the observations conditioned on the parameter vector 8 (see Section B.3 of Appendix B). Now there
is a 1-1 transformation between {y(t)} and {e(t)} as given by (7.10) ifthe effect of initial
conditions is neglected; see below for details. Therefore it is equally valid to use the pdf
of the disturbances. Using the expression for the multivariable Gaussian distribution
function (see (B.12), it is found that the likelihood function is given by
1 A(8)t/2 exp [ - 2:1 ~
~
L ( 8 ) -- (2n)Nny/2[det
(7.39)
Section 7.4
199
Taking natural logarithms of both sides of (7.39) the following expression for the loglikelihood is obtained:
log L(8)
2:N
= - 2"
+ constant
(7.40)
1=1
Strictly speaking, L(8) in (7.39) is not the exact likelihood function. The reason is that
the transformation of {yet)} to {E(t, O)} using (7.29) has neglected the transient effect
due to the initial values. Strictly speaking, (7.39) should be called the L-function
conditioned on the initial values, or the conditional L-function. Complement C7.7
presents a derivation of the exact L-function for ARMA models, in which case the initial
values are treated in an appropriate way. The exact likelihood is more complicated to
evaluate than the conditional version introduced here. However, when the number of
data points is large the difference between the conditional and the exact L-functions is
small and hence the corresponding estimates are very similar. This is illustrated in the
following example.
Example 7.5 The exact likelihood function for a first-order autoregressive process
y(t) + ay(t - 1)
lal
= e(t)
< 1
(7.41)
where e(t) is Gaussian white noise of zero mean and variance A,z. Set 8 = (a
evaluate the likelihood function in the following way. Using Bayes' rule,
For k
[D2 p(y(k)ly(k -
2,
p(e(k
1_ exp[-2~2(Y(k)
= __
V(2n)"
"
and
()
P y(1)
[1
LeC8)
I1
N
k=2
[1
+ ay(k -
l)?j
AY and
200
+ ay(k - 12]}
Chapter 7
1
eX P [-212i (1)]
\I(2n)o
. 0
k=2
!Og{
+ 109{
1
\I(2n)"-
exp[-2~2(Y(k)
1
exp [-- 21
\I(2n)o
0
-"2 log
+ ay(k - 12]}
2i(1)]}
1
2n - (N - 1) log "- - 2,,-2
2:
k=2
1
- 2 log
,,-2
1 - a2 2
1 _ a2 - ~y (1)
For comparison, consider the conditional likelihood function. For the model (7.41),
-2" log 2n
2:
k=2
(cf. (7.40. Note that log Lie) and log L(e) both are of order N for large N. However, the difference log Le(S) - log L(S) is only of order 0(1) as N tends to infinity.
Hence the exact and the conditional estimates are very close for a large amount of data.
With some further calculations it can in fact be shown that
8cxact =
8 cond
(~)
In this case the conditional estimate 8cond will be the LS estimate (7.4) and is hence easy
to compute. The estimate 8exact can be found as the solution to a third-order equation,
= -
1 N 1
A "2 N
2:
1=1
E2(t, e) -
Section 7.4
+ log A] + constant
201
(7.42)
a log
L(8, A) _ _ !i
aA
2
a2 log L(8,
A) _
aA 2
(_ RN(8)
A2
!i (_ 2RN(8)
2
A3
l.)
.l.-)
+
+
A2
Furthermore, the second-order derivative is negative at this point. Hence it is a maximum. The estimate of A is thus found to be
(7.43)
where 8 is to be replaced by its optimal value, which is yet to be determined. Inserting
(7.43) into (7.42) gives
A
=-
log L(8, A)
+ constant
so 8 is obtained by minimizing RN(8). The minimum point will be the estimate 8 and the
minimal value RN(8) will become the estimate A. So the prediction error estimate can be
interpreted as the maximum likelihood estimate provided the disturbances are Gaussian
distributed.
Next consider the multivariable case (dimy(t) = ny;::;: 1). A similar result as above for
scalar systems holds, but the calculations are a little bit more complicated. In this case
=-
log L(8, A)
2"[tr RN(8)A
_
1
(7.44)
The statement that L(8, A) is maximized with respect to A for A = RN(8) is equivalent
to
tr RA -1
+ log det
tr RA -1
+ log[det Ndet
tr RA -1
A ;::;: tr 1
log det RA -1
+ log det
R ~
R] ;::;: ny ~
;::;:
(7.45)
ny
Next write R as R = GGT and set Y = G TA -IG. The matrix Y is symmetric and positive definite, with eigenvalues Al ... Any. These clearly satisfy Ai > O. Now (7.45) is
equivalent to
tr GGTA- 1
;::;:
ny ~
202
2: Ai ;=1
Chapter 7
ny
log
I1 Ai
;:?;
fly ~
;=1
ny
2: [A; -
log A; - 1]
;:?;
;=1
Since for any positive A, log A ~ A-I, the above inequality is obviously true.
Hence the likelihood function is maximized with respect to A for
(7.46)
where e is to be replaced by its value 0 which maximizes (7.44). It can be seen from
(7.44) that 0 is determined as the minimizing element of
Vee) = det RNce)
(7.47)
This means that here one uses the function h2 ( Q) of Example 7.1.
The above analysis has demonstrated an interesting relation between the PEM and the
ML method. If it is assumed that the disturbances in the model are Gaussian distributed,
then the ML method becomes a prediction error method corresponding to the loss
function (7.47). In fact, for this reason, prediction error methods have often been known
as ML methods.
Theoretical analysis
Section 7.5
203
For part of the analysis the following additional assumption will be needed:
AS. The set DTV?, ,At') introduced in (6.44) consists of precisely one point.
Note that we do not assume that the system operates in open loop. In fact, as will be seen
in Chapter 10, one can often apply prediction error methods with success even if the data
are collected from a closed loop experiment.
Asymptotic estimates
When N tends to infinity, the sample covariances converge to the corresponding expected
values according to the ergodicity theory for stationary signals (see Lemma B.2; and
Hannan, 1970). Then since the function h() is assumed to be continuous, it follows that
(7.48)
where
(7.49)
It is in fact possible to show that the convergence in (7.48) is uniform on compact (i.e.
closed and bounded) sets in the 8-space (see Ljung, 1978).
If the convergence in (7.48) is uniform, it follows that
converges to a minimium
point of V",(8) (d. Problem 7 .15). Denote such a minimum by 8*. This is an important
result. Note, in particular, that Assumption AS has not been used so far, so the stated
result does not require the model structure to be large enough to cover the true system.
If the set DT(J, Jt) is empty then an approximate prediction model is obtained. The
approximation is in fact most reasonable. The parameter vector 8* is by definition such
that the prediction error E(t, 8) has as small a variance as possible. Examples were given
in Chapter 2 (see Examples 2.3, 2.4) of how such an approximation will depend on the
experimental condition. It is shown in Complement C7.1 that in the multivariable case 8*
will also depend on the chosen criterion.
eN
Consistency analysis
Next assume that the set DT(J, Jt) is nonempty. Let 8 0 be an arbitrary element of
DT(Jl, Jll). This means that the true system satisfies
(7.50)
where e(t) is white noise. If the set DT(J,Jlt) has only one point (Assumption AS) then
we can call 80 the true parameter vector. Next analyze the 'minimum' points of Roo(8). It
follows from (7.29), (7.50) that
H- 1(q-l; 8)[G(q-l; 8 0 )
G(q-l; 8)u(t)]
(7.51)
E(l, 8)
e(t)
Chapter 7
= J
for all 8,
Thus
(7.52)
This is independent of the way the input is generated. It is only necessary to assume that
any possible feedback from y(.) to u(-) is causal. Since (7.52) gives a lower bound that is
attained for 8 = 80 it follows that 8* = 8 0 is a possible limiting estimate. Note that
the calculations above have shown that 80 'minimizes' Rx/8), but they did not establish
that no other minimizers of Reo exist. In the following it is proved that for systems
operating in open loop, all the minimizers of Reo are the points 80 of D T . The case of
closed loop systems is more complicated (in such a case there may also be 'false'
minimizers not belonging to D T ), and is studied in Chapter 10.
Assume that the system operates in open loop so that u(t) and e(s) are independent for
all t and s. Then Roo(8) = A(8 0 ) implies (cf. (7.51:
H-I(q-I; 8)[C(q-l; 80 )
==
eo) ==
C(q-I; 8)]u(t)
H-1(q-l; 8)H(q-l;
==
H(q-I; 80 )
Using Assumption A2 and Property 5 of persistently exciting signals (see Section 5.4)
one can conclude from the first identity that
C(q-I; 8)
==
C(q-I; 80 )
eN
eN
Approximation
There is an important instance of approximation that should be pointed out. 'Approximation' here means that the model structure is not rich enough to include the true
system. Assume that the system is given by
(7.53)
and that the model structure is
(7.54)
Here the parameter vector has been split into three parts, 8 1 , 8 2 and 8 3 , It is crucial for
what follows that C and H in (7.54) have different parameters. Further assume that there
is a parameter vector 810 such that
Section 7.5
Theoretical analysis
Gs(q-l)
==
G(q-l; 8 10)
205
(7.55)
Let the input u(t) and the disturbance e(s) in (7.53) be independent for all t and s. This is
a reasonable assumption if the system operates in open loop. Then the two terms of
(7.51) are uncorrelated. Hence
Roo(8)
EE(t, 8)!7(t, 8)
(7.56)
Equality holds in (7.56) if 8 1 = 8 10 , This means that the limiting estimate 8* is such that
= 8 10 , In other words, asymptotically a perfect description of the transfer function
G,(q-l) is obtained even though the noise filter H may not be adequately parametrized.
Recall that G and H have different parameters and that the system operates in open
loop. Output error models are special cases where these assumptions are applicable,
since H(q-l; 8 2 ) == 1.
8t
eN,
eN
o=
VN(eN)T = VN(8 0 f
= VN(8 0)T
+ VN(8 0 )(eN - 80 )
+ V~(80)(eN - 80 )
(7.57)
The second approximation follows since V N(8 0) - ? V~(80) with probability 1 as N - ? 00.
Here V' denotes the gradient of V, and V" the Hessian. The approximation in (7.57) has
an error that tends to zero faster than I eN - 80 II . Since eN converges to 80 as N tends
to infinity, for large N the dominating term in the estimation error eN - 8 0 can be written
as
(7.58)
The matrix V~(eo) is deterministic. It is nonsingular under very general conditions when
Dr(J, Jt) consists of one single point eo. However, the vector V NVN(8 0 ? is a random
variable. It will be shown in the following, using Lemma B.3, that it is asymptotically
Gaussian distributed with zero mean and a covariance matrix denoted by Po. Then
Lemma BA and (7.58) give
,
VN(8 N
80 )
dist
--;.
JV(O, P)
(7.59)
with
(7.60)
206
Chapter 7
The matrices in (7.60) must be evaluated. First consider scalar systems (dim yet)
For this case
VN (8)
2:
1).
(7.61)
(t, 8)
1=1
'\)J (t,
8)
= _
(8E(1, 8)T
80
(7.62)
Vrv(8)
= -
2:
E(l, 8)'lVT(t, 8)
1=1
/I
V N(8)
2
N
f!:.
L.., '\)J(t, 8)'\)J
f!:.
(t, 8) + N L..,
~1
8
E(l, 8) 882 E(t, 8)
(7.63)
~J
For the model structures considered the first- and second-order derivatives of the
prediction error, i.e. '\)J(t, 8) and 82E(t, 8)/88 2 , will depend on the data up to time t - l.
They will therefore be independent of e(t) == E(t, 80), Thus
_ 2f!:.
L.., e(t)'\)!
V N(8 o) - - N
(t, 80 )
(7.64a)
1=1
(7.64b)
Since e(t) and '\)J(t, ( 0 ) are uncorrelated, the result (7.59) follows from Lemmas B.3, B.4.
The matrix Po can be found from
Po
lim ENVrv(8oyrVrv(80)
N-->oo
To evaluate this limit, note that e(t) is white noise. Hence eel) is independent of e(s) for
all s
t and also independent of '\)J(s, 8 0 ) for s < t. Therefore
4 N 1-1
Po = lim { N)'
Ee(t)E'\)J(t, 8 0)e(s)'\)!T(s, 8 0)
"*
N~oo
+N
~_
s=l
L L
1=1 s=l+l
(7.65)
Section 7.5
Finally, from (7.60), (7.64b), (7.65) the following expression for the asymptotic
normalized covariance matrix is obtained:
(7.66)
Note that a reasonable estimate of P can be found as
P=
A[ ~~
1jJ(t,
eN)1jJT(t, eN)]-l
(7.67)
eN
(aE~;)!!2) T=
cp(t)
= A[Ecp(t)cpT(t)]-l
(7.68)
It is interesting to compare (7.68) with the corresponding result for the static case (see
Lemma 4.2). For the static case, with the present notation, for a finite data length:
(a) {) is unbiased
(b) V N(e - 8 0 ) is Gaussian distributed fiCO, P) where
P
1
A ( N ~ cp(t)cpT(t)
)-1
(7.69)
In the dynamic case these results do not hold exactly for a finite N. Instead the following
asymptotic results hold:
(a) {) is consistent
(b) V N(e - 80 ) is asymptotically Gaussian distributed fiCO, P) where
P = A[Ecp(t)cpT(t)rl
II
ay(t - 1) = e(t)
ce(t - 1)
8 = (a
C)T
(7.70)
208
Chapter 7
Then
E(t, 8)
1 + aq-1
1 + cq--ly(t)
aE
q-l
aa (t, 8) = 1 + cq-ly(t)
aE
1 +
--1
-- (t 8) = aq
q-ly(t)
ac '
(1 + cq-1)2
Thus
P =
A(
E.l-(t - 1)2
_EyF(t - l)EF(t - 1)
-1
E(t 8)
1 + cq-l
,
= -
where
1
y (t) = 1 + cq-ly(t)
10
1
(t) = 1 + cq-1 E(t, 8)
10
(t)
= -1.
+ cq -1 e(t)
2
A ('A/(l
_. a )
-A/(l -- ac)
-A/(l - ac) ) - 1
A/(l - c2)
1
(
(1 -- a 2 )(1 - ac)2
(1 - a2)(1 - ac)(l - C2
- (c - a)2 (1 - a 2 )(1 - ac)(l - c2)
(1 - acf(l - c2)
(7.71)
The matrix P is independent of the noise variance A. Note from (7.71) that the
covariance elements all increase without bound when c approaches a. Observe that when
c = a, i.e. when the true system is a white noise process, the model (7.70) is overparametrized. In such a situation one cannot expect convergence of the estimate
to a
certain point. For that case the asymptotic loss function V 00(8) will not have a unique
minimum since all 8 satisfying a = c will minimize V 00(8).
II
eN
So far the scalar case (ny = 1) has been discussed in some detail. For the multivariable
case (ny ;?: 1) the calculations are a bit more involved. In Appendix A 7.1 it is shown that
(7.72)
Here 'l\J(t) denotes
'l\J(t)
= _
(acCt,
as
8))TI
(7.73)
8=80
Section 7.5
Theoretical analysis
209
'vet)
(7.74)
Recall that h(Q) defines the loss function (see (7.15. The matrix H is constant. The
derivative in (7.74) should be interpreted as
Optimal accuracy
It is also shown in Appendix A 7.1 that there is a lower bound on P:
(7.75)
with equality for
(7.76)
or a scalar multiple thereof.
This result can be used to choose the function h(Q) so as to maximize the accuracy. If
h(Q) = h1(Q) = tr SQ, it follows from (7.74) that H = S and hence the optimal accuracy for this choice of h( Q) can be obtained by setting S = A-I. However, this
alternative is not realistic since it requires knowledge of the noise covariance matrix A.
Next consider the case h( Q) = h 2 ( Q) = det Q. This choice gives H = ad j (Q) =
Q-l det Q. When evaluated at Q = A it follows from (7.76) that equality is obtained in
(7.75). Moreover, this choice of h(Q) is perfectly realizable. The only drawback is that it
requires somewhat more calculations for evaluation of the loss function. The advantage
is that optimal accuracy is obtained. Recall that this choice of h(Q) corresponds to the
ML method under the Gaussian hypothesis.
Underparametrized models
Most of the previous results are stated for the case when the true system belongs to the
considered model structure (i.e. J E j { , or equivalently DT(J, ,At) is not empty).
Assume for a moment that the system is more complex than the model structure. As
stated after (7.48), the parameter estimates will converge to a minimum point of the
asymptotic loss function
eN
-'?
-'? 00
(7.77~
210
Chapter 7
dis!
8 .) -- ,%(0, P)
VN(8 N
(7.78a)
(7. 78b)
N->oo
(cf. (7.59), (7.60. To evaluate the covariance matrix P given by (7.78b) it is necessary
Statistical efficiency
It will now be shown that for Gaussian distributed disturbances the optimal PEM (then
identical to the ML method) is asymptotically statistically efficient. This means (cf.
Section B.4 of Appendix B) that the covariance matrix of the optimal PEM is equal to
the Cramer- Rao lower bound. Note that due to the general consistency properties of
PEM the 'bias factor' 8y(8)/8e in the Cramer-Rao lower bound (B.31) is asymptotically
equal to the identity matrix. Thus the Cramer-Rao lower bound formula (B.24) for
unbiased estimators asymptotically applies also to consistent (PEM or other) estimators.
That the ML estimate is statistically efficient for Gaussian distributed data is a classical
result for independent observations (see, for example, Cramer, 1946). In the present
case, {yet)} are dependent. However, the innovations are independent and there is a
linear transformation between the output measurements and the innovations. Hence the
statistical efficiency of the ML estimate in our case is rather natural.
The log-likelihood function (for Gaussian distributed disturbances) is given by (7.44)
log L(8, A)
~
= -21 L...
l'
N
(t, 8)A - 1 (t, 8) - "2
log det A
+ constant
1=1
Assuming that (t, 8) and A are independently parametrized and that all the elements of
A are unknown,
8 log
~~8,
A) =
(7.79a)
1=1
8 log Le8, A)
8A
1~
= 2 L...
iJ
(t, 8)A
l'
-1
e;ej
-1
(t, 8) -
"2 det A
.
[adJ(A)h
1=1
(7.79b)
N [-1
="2
A RN ()
8 A -1] ij
--
N [ A -1] ij
"2
( 0 log L(8,
oA
l'
A))T
Computational aspects
Section 7.6
211
where the derivative with respect to A is expressed in an informal way. It follows from
(7.79) that the information matrix is block diagonal, since E(t, 00 ) and 1.jJ(t, 00 ) are
mutually uncorrelated and {E(t, Oo)} is an un correlated sequence. This can be shown as
follows:
2E
aA ij
8=8(1,;\=;\(1
2: 2:
1= 1 s= 1
1-1
2: 2:
I=J .1'=1
=0
2: .2:
1=1 s=1
We also have
E
= E
t=l
s~-l
2: 2:
8=00 ,;\=;\(1
1-1
2: 2:
(=
ao
.1'=
=,0
(7.81)
2:
1=1
eN
212
Chapter 7
Optimization algorithms
In most cases the minimum of VN(O) cannot be found analytically. For such cases the
minimization must be performed using a numerical search routine. A commonly used
method is the Newton-Raphson algorithm:
(7.82)
Here e(k) denotes the kth iteration point in the search. The sequence of scalars Uk in
(7.82) is used to control the step length. Note that in a strict theoretical sense, the
Newton-Raphson algorithm corresponds to (7.82) with Uk = 1 (see Problem 7.5).
However, in practice a variable step length is often necessary. For instance, to improve
the convergence of (7.82) one may choose Uk so that
Uk =
arg min
n
vNce(k) -
(7.83)
U[VN(e(kr1v,v(e(k)Yf )
Note that Uk can also be used to guarantee that e(k+l) E qz; for all k, as required.
The derivatives of V(8) can be found in (7.63), (A7.1.3), (A7.1.4). In general,
VIv(O)
= -
(7.84)
T(t, O)H't(!T(t, 8)
t=1
and
II
2
V N(O) = N
L
N
2 N
N
L
t= 1
t=1
2 ~
- N L..
()T't(! (t, 8)
aH
(t, 0) 00
(7.85)
TaT
t=l
Here oH/08 and 01j!/00 are written in an informal way. In the scalar case (i.e. ny = 1)
explicit expressions of these factors are easily found (see (7.63)).
At the global minimum point (t, 0) becomes asymptotically (as N tends to infinity)
white noise which is independent of
0). Then
VN(-)o) =
'vet,
't(!(t, OO)H1j!T(t, 60 )
(7.86)
t=l
(cf. (A7.1.4)). It is appealing to neglect the second and third terms in (7.85) when
using the algorithm (7.82). There are two reasons for this:
.. The approximate VN(O) in (7.86) is by construction guaranteed to be positive definite.
Therefore the loss function will decrease in every iteration if Uk is appropriately chosen
(cf. Problem 7.7) .
.. The computations are simpler since oHIaS and o1j!(t, 0)1a6 need not be evaluated.
The algorithm obtained in this way can be written as
Section 7.6
e(k+l)
N
Uk [ ~
t')<k
J-1
(7.87)
This is called the Gauss-Newton algorithm. When N is large the two algorithms (7.82)
and (7.87) will behave quite similarly if e(k) is close to the minimum point. For any N the
local convergence of the Newton- Raphson algorithm is quadratic, i.e. when e(k) is close
to (the minimum point) then 118(k+l) - 811 is of the same magnitude as 118(k) - 8112.
This means roughly that the number of significant digits in the estimate 8(k) is doubled in
every iteration. The Gauss-Newton algorithm will give linear convergence in general. It
will be quadratically convergent when the last two terms in (7.85) are zero. In practice
these terms are small but nonzero; then the convergence will be linear but fast (so-called
superlinear convergence; cf. Problem 7.7).
It should be mentioned here that an interesting alternative to (7.82) is to apply a
recursive algorithm (see Chapter 9) to the data a couple of times. The initial values for
one run of the data are then formed from the final values of the previous run. This will
often give considerably faster convergence and will thus lead to a saving of computer
time.
Evaluation of gradients
For some model structures the gradient '\jJ(t, 8) can be computed efficiently. The follow ..
ing example shows how this can be done for an nth-order ARMAX model.
Example 7.8 Gradient calculation for an ARMAX model
Consider the model
(7.88)
C(q-l)E(t, 8)
A(q-1)y(t) - B(q-l)U(t)
(7.89)
ai
to get
-1) GE(t, 8)
(
.)
C(.q.
Ba. = y t - I
l
Thus
GE(t, 8) ___1_._ (
oa; - C (q -1) Y t -
.)
I
(7.90a)
1, ... , n, it is sufficient to compute
214
Chapter 7
[lIC(q-I)]y(t). This means that only one filtering procedure is necessary for all the
elements aE(t, 8)/aai' i = 1, ... , n.
Differentiation of (7.89) with respect to hi (1 ~ i ~ n) gives similarly
(7.90b)
ab i
Ci
(1
which gives
aECt, 8) ____1_ ( _. 8)
aCi
C(q-l)
(7.90c)
l,
To summarize, for the model structure (7.88) the vector 'IjJ(t, 8) can be efficiently
computed as follows. First compute the filtered signals
u (t)
C(q-I) u(t)
F
1
E (t) = C(q-l) E(t)
-f
However, there is often no practical use for the numerical value of 'IjJ(t, 8) as such.
(Recursive algorithms are an important exception, see Chapter 9.) Instead it is important
to compute quantities such as
VIv(8)
= -
2:
E(t, 8)'4)T(t, 8)
1=1
As an example, all the first n elements of VIv(8) are easily computed from the signals
E(t, 8) and yF(t) to be
lVlv(8)]i =
E(t, 8)yF(t - i)
1, ... , n
t=l
Section 7.6
= q:?'(t)e + e(t)
(7.91b)
with
e=
cn)T
(7.91c)
(al ... an
bl
...
bn
Cl'"
(7.92)
To make (7.92) realizable, e(t - i) in (7.91c) must be replaced by some estimate e(t - i),
i = 1, ... , n. Such estimates can be found using a linear regression model
A(q-l)y(t)
= B(q-l)U(t) +
e(t)
(7.93)
of high order, say n. The prediction errors obtained from the estimated model (7.93) are
then used as estimates of e(t - i) in <:pet). For this approach to be successful one must
have
(7.94)
These approximations will be valid if n is sufficiently large. The closer the zeros of C(z)
are located to the unit circle, the larger is the necessary value of n.
II
Example 7.10 Initial estimates for increasing model structures
When applying an identification method the user normally starts with a small model
structure (a low-order model) and estimates its parameters. The model structure is
successively increased until a reasonable model is obtained. In such situations the
'previous' models may be used to provide appropriate initial values for the parameters of
higher-order models.
To make the discussion more concrete, consider a scalar ARMA model
(7.95a)
where
A(q--l) = 1 + a,q-l
C(q-l)
1+
Clq-l
+
+
(7.95b)
Suppose the parameters of a model of order n have been estimated, but this model is
216
Chapter 7
found to be unsatisfactory when assessing its 'performance'. Then one would like to try
an ARMA model of order n + 1. Let the parameter vectors be denoted by
en
nn+l=
IJ
(a] ... an
(al
C1'"
cSr
an+ 1 c1 ..
(7.96)
C
n+)1T
for the two model structures. A very rough initial value for
However, the vector
on+1
would be 0,,+1
O.
(7.97)
will give a better fit (a smaller value of the criterion). It is in fact the best value for the
nth-order ARMA parameters. Note that the class of nth-order models can be viewed as
a subset of the (n + 1)th-order model set. The initial value (7.97) for the optimization
algorithm will be the best that can be selected using the information available from en. In
particular, observe that if the true ARMA order is n then 8n + 1 given by (7.97) will be
very close to the optimal value for the parameters of the (n + 1)th-order model.
III
Summary
In Section 7.1 it was shown how linear regression techniques can be used for identification of dynamic models. It was demonstrated that the parameter estimates so obtained
are consistent only under restrictive conditions. Section 7.2 introduced prediction error
methods as a generalization of the least squares method. The parameter estimate is
determined as the minimizing vector of a suitable scalar-valued function of the sample
covariance matrix of the prediction errors. Section 7.3 gave a description of optimal
predictors. In particular, it was shown how to compute the optimal prediction errors
for a general model structure. In Section 7.4 the relationship of the PEM to other
identification methods was discussed. In particular, it was shown that the PEM can be
interpreted as a maximum likelihood method for Gaussian distributed disturbances.
The PEM parameter estimates were analyzed in Section 7.5, where it was seen that,
under weak assumptions, they are consistent and asymptotically Gaussian distributed.
Explicit expressions were given for the asymptotic covariance matrix of the estimates
and it was demonstrated that, under the Gaussian hypothesis, the PEM estimates are
statistically efficient (i.e. have the minimum possible variance).
Finally, in Section 7.6 several computational aspects on implementing prediction error
methods were presented. A numerical search method must, in general, be used. In
particular various Newton methods were described, including efficient ways for
evaluating the criterion gradient and choosing the initial value.
Problems
Problem 7.1 Optimal prediction of a nonminimum phase first-order MA model
Consider the process given by
Problems
217
YA(t + lit)
1- a
1 - aq
yet)
= (b J
b nb
(:1."
Cnc
d1
...
d nd
fl ... fnJ)T
V(O) =
.2: f2(t, 0)
1=1
where f(', 8) is a differentiable function of the parameter vector 8, can be obtained from
the Newton-Raphson algorithm by making a certain approximation on the Hessian
matrix (see (7.82)-(7.86)). It can also be obtained by 'quasilinearization' and in fact it is
sometimes called the quasilinearization minimization method. To be more exact, let e(k)
Chapter 7
denote the vector 8 obtained at iteration k, and let E(t, 8) denote the linear approximation of E(t, 8) around e(k):
where
1./1(t, 8)
_ ((1(t,
a8
8)T
Determine
N
2: e2(t, 8)
0)
t=l
and show that the recursion so obtained is precisely the Gauss-Newton algorithm.
Problem 7.7 Convergence rate for the Newton-Raphson and Gauss-Newton algorithms
Consider the algorithms
= X(k)
- as[V'(x(k)rr
Show that this algorithm gives a decreasing sequence of funCtion values (V(X(k+l ~
V(X(k) if a is chosen small enough.
(b) Apply the algorithms to the quadratic function Vex) = ~xTAx + xTb + c, where A is
a positive definite matrix. The minimum point of Vex) is x* = -A -lb. Show that
for A j ,
(Hence, Al converges in one step for quadratic functions.) Show that for A z,
[X(k+l) - x*] = [I - SA ][x(k) - x*]
(Assuming I - SA has all eigenvalues strictly inside the unit circle X(k) will converge,
and the convergence will be linear. If in particular S = A -1 + Q and Q is small, then
the convergence will be fast (superlinear).)
Problem 7.8 Derivative of the determinant criterion
Consider the function h( Q) = det Q. Show that its derivative is
Problems 219
Problem 7.9 Optimal predictor for a state space model
Consider a state space model of the form (7.30). Derive the mean square optimal
predictor present in the form (7.11).
(a) Use the result (7.31).
(b) Use (6.33) to rewrite the state space form into the input-output form (7.27). Then
use (7.29) to find the optimal filters Lj(q-l; fJ) and L 2(q-l; 8).
Problem 7.10 Hessian of the loss function for an ARMAX model
Consider the loss function
N
V(8) =
2:
2(t, fJ)
1=1
Derive the Hessian (the matrix of second-order derivatives) of the loss function.
Problem 7.11 Multistep prediction of an AR(J) process observed in noise
Let the process yet) be defined as
1
y(t) = -1- - - - 1 e(/) + Ct)
aq
where lal < 1, Ee(t)e(s) = A;O/,,, E(/)(s) = A~Ot,s and Ee(t)(s) = 0 for all t, s. Assume
that e(t) and (t) are Gaussian distributed. Derive the mean-square optimal k-step
predictor of yet + k). What happens if the Gaussian hypothesis is relaxed?
Problem 7.12 An asymptotically efficient two-step PEM
Let (t, 8) denote the prediction errors at time instant t of a general PE model with
parameters e. Assume that(t, 8) is an ergodic process for any admissible value of e and
that it is an analytic function of fl. Also, assume that there exists a parameter vector 8 0
(the 'true' parameter vector) such that {(t, flo)} is a white noise sequence. Consider the
following two-step procedure:
S1: Determine a consistent estimate 8 of 80
S2:
oCt,
8)
08
Assume that 8 has a covariance matrix that tends to zero as liN, as N tends to infinity.
Show that the asymptotic covariance matrix of 8 is given by (7.66). (This means that the
220
Chapter 7
(1
G(q-l)U(t)
eel)
where the noise e(t) is not correlated with the input signal u(t). Suppose the purpose of
identification is the estimation of the transfer function G(q-l), which may be of infinite
order. For simplicity, a rational approximation of G(q-l) of the form B(q-l)/A(q-l),
where A(q-l) and B(q-l) are polynomials, is sought. The coefficients of B(q-l) and
A(q-l) are estimated by using the least squares method as well as the output error
method. Interpret the two approximate models so obtained in the frequency domain,
assuming that the number of data points N tends to infinity and that u(t) and e(t) are
stationary processes.
Problem 7.14 An indirect PEM
Consider the system
yet)
bou(t - 1)
+1
+ aoq
e(t)
where u(t) and eel) are mutually independent white noise sequences with zero means and
variances OZ and "Z, respectively.
(a) Consider two ways of identifying the system.
Case (i): The model structure is given by
~4tI: yet)
bu(t - 1)
+1
1
1 e(t)
+ aq
A z: y(t) + ay(t - 1)
8z = (a
bl
= bluet - 1) + bzu(t
- 2)
+ eel)
b 2?
Problems
221
Por
(pTQP)-lpTQP e2 QP(pTQP)-1
82 and
Hint. Let 8 10 , 8 20 denote the true parameter vectors. Then by a Taylor series
expansion
dV(8 1 )\
=
d8 l o,=ej
-[82
f(SnFQP
Evaluate the right-hand side explicitly for the system above. Compare with the
covariance matrix for S1'
Remark. The choice Q = Pel is not realistic in practice. Instead Q = Pe 2 can be used.
(P e2 is an estimate of P e2 obtained
by
replacing E by liN r.~l in the expression for Pe2 .)
"
"
With this weighting matrix, 81' = 8] as shown in Complement C7.4.
Problem 7.15 Consistency and uniform convergence
Assume that V N (8) converges uniformly to V co(8) (as Ntends to infinity) in a compact set
Q, and that V 00(8) is continuous and has a unique global minimum point 8* in Q. Prove
that for any global minimum point of V N(O) in Q, it holds that S~ 8* as N tends to
infinity.
Hint. Let > 0 be arbitrary and set Q* = {8; 118 - 8* II < E}. Next choose a number
222
Chapter 7
IvN(8) -
+ ay(t -
1) = hu(t - 1)
+ e(t) + ce(t -
1)
where e(t) is white noise with zero mean and variance A? Assume that u(t) is white noise
of zero mean and variance 0 2 , and independent of the e(t) , and that the system is
identified using a PEM. Show that the normalized covariance matrix of the parameter
estimates is given by
p=
A2
b2 a2 (1 + ac)
A2
-+-(I - a2 )(1 - ac)(l - c2 ) 1 - a 2
bca 2
---(1 - ac)(l - c 2)
bca2
(1 - ac)(l - c2 )
02
A2
1 - c2
A2
--1 - ac
-1
--1 - ac
A2
1 - c2
(1 - a 2 )(1 - ac)2(1 - c 2 )
a2 [b 2 a2 + A?(C -- afl
ifA2
bC02 A2
----(1 - ac)(l - c 2 )
1-
bca2 A2
(1 - ac)(l - c2 )
x -
a 2A2
(1 - ac)
(J2A2
bca2 A2
(1 - ac)
(1 - ac)2
bC(J2A2
(1 - ac)2
+ Al(1 - acn
(1 - a2 )(1 - ac)2
(J2{b 2 (J2
Problem 7.17 Covariance matrix for the parameter estimates when the system
does not belong to the model set
Let yet) be a stationary Gaussian process with covariance function rk = Ey(t) yet - k) and
assume that rk tends exponentially to zero as k tends to infinity (which means that there
are constants C > 0, 0 < a < 1, such that hi < Calk I). Assume that a first-order
AR model
y(t) + ay(t - 1)
e(t)
YN(a - a*)
where a*
dis!
---7
JV(O, P)
as
N --';
00
= -rl/rO and
Problems 223
Hint. Use Lemma B.9.
(b) Show how a* and P simplify if the process yet) is a first-order AR process.
Remark. For an extension of the result above to general (arbitrary order) AR models,
see Stoica, Nehorai and Kay (1987).
Problem 7.18 Estimation of the parameters of an AR process observed in noise
Consider an nth-order AR process that is observed with noise,
AO(q-l)X(t) = v(t)
yet) = x(t)
+ e(t)
v(t), e(t) being mutually independent white noise sequences of zero mean and variances
AV 2 , Ae 2 .
(a) How can a model for yet) be obtained using a prediction error method?
(b) Assume that a prediction error method is applied using the model structure
A(q-l)X(t) = vet)
e = (a}
... an
A~
A;) T
Show how to compute the prediction errors E(t, 8) from the data. Also show that the
criterion
N
Vee) =
:2: E2(t, 8)
t=1
has many global minima. What can be done to solve this problem and estimate
8 uniquely?
Remark. More details on this type of estimation problem and its solution can be found
in Nehorai and Stoica (1988).
Problem 7.19 Whittle's formula for the Cramer-Rao lower bound
Let yet) be a Gaussian stationary process with continuous spectral density <p( ill)
completely defined by the vector 8 of (unknown) parameters. Then the Cramer-Rao
lower bound (CRLB) for any consistent estimate of 8 is given by Whittle's formula:
p
=
w
dZ}-l
Let yet) be a scalar ARMA process. Then an alternative expression for the CRLB is
given by (7.66). Prove the equivalence between the expression above and (7.66).
Remark. The above formula for the covariance matrix corresponding to the CRLB is
due to Whittle (1953).
Chapter 7
Problem 7.20 A sufficient condition for the stability of least squares input-output models
Let yet) and u(t) denote the (ny-dimensional) output and (nu-dimensional) input,
respectively, of a multivariable system. The only assumption made about the system is
that {u(t)} and {yet)} are stationary signals. A fun polynomial model (see 6.22),
A(q-l)y(t) = B(q-l)U(t)
A(q-l) = I
+ E(t)
B(q-l) = B1q-l
+ ... + Bnq-n
is fitted to input-output data from the system, by the least squares (LS) method. This
means that the unknown matrix coefficients {Ai, Bj } are determined such that
asymptotically (i.e. for an infinite number of data points)
(i)
where 1111 denotes the Euclidean norm. Since the model is only an approximation of the
identified system, one cannot speak about the consistency or related properties of the
estimates {Ai, Bj }. However, one can speak about the stability properties of the model,
as for some applications (such as controller design, simulation and prediction) it may be
important that the model is stable, i.e.
det[A(z)]
'* 0
for
Izl
~ 1
(ii)
The LS models are not necessarily stable. Their stability properties will depend on the
system identified and the characteristics of its input u(t) (see Soderstrom and Stoica,
1981b). Show that a sufficient condition for the LS model (i) to be stable is that u(t) is a
white signal. Assume that the system is causal, i.e. yet) depends only on the past values
of the input u(t), u(t - 1), etc.
Hint. A simple proof of stability can be obtained using the result of Complement CS.6
which states that the matrix polynomial
(mlm)
given by
where x(t) is any (m-dimensional) stationary signal, is stable (i.e. satisfies a condition
similar to (ii)).
Problem 7.21 Accuracy of noise variance estimate
Consider prediction error estimation in the single output model
Problems 225
Show that
lim NE(~2 - ,,2)
_,,2 dim El
N--.oo
= f.l
- ,,4
N-->oo
Hint. Write ~2 - ,,2 = V N(8 N) - V NCElO) + lINr.;:l e2 (t) - ,,2 and make a Taylor series
expansion of V N(El O) around 8N .
Remark. For Gaussian distributed noise, f.l = 3,,4. Hence in that case
= 2,,4
N--.oo
yet) =
B( -1)
q
u(t)
A(q-1)
+ E(t)
where deg A = deg B = n. The following iterative scheme is based on successive linear
least squares fits for determining A(q-l) and B(q-1)
.
(A" k+ 1, B k+ 1) = arg mm
L
N
(A,B) 1=1
{I
A(q -1 ) -;c--(_l)y(t)
Ak q
}
(i)
where AO(q-l) and BO(q-l) are coprime and of degree n, u(t) is persistently exciting of
order 2n and vet) is a stationary disturbance that is independent of the input. Consider
the asymptotic case where N, the number of data, tends to infinity.
(a) Assume that v(t) is white noise. Show that the only possible stationary solution of (i)
is given by A(q-l) = AO(q-l), B(q-l) = BO(q-l).
(b) Assume that vet) is correlated noise. Show that A(q-l) = AO(q-l), B(q-l) =
BO(q-l) is generally not a possible stationary solution to (i).
Remark. The method was proposed by Steiglitz and McBride (1965). For a deeper
analysis of its properties see Stoica and Soderstrom (1981b), Soderstrom and Stoica
(1988). Note that it is a consequence of (b) that the method does not converge to
minimum points of the output error loss function.
Chapter 7
Bibliographical notes
(Sections 7.2, 7.4). The ML approach for Gaussian distributed disturbances was first
proposed for system identification by Astrom and Bohlin (1965). As mentioned in the
text, the ML method can be seen as a prediction error identification method. For a
further description of PEMs, see Ljung (1976, 1978), Caines (1976), Astrom (1980), Box
and Jenkins (1976). The generalized least squares method (see (7.37 was proposed by
Clarke (1967) and analyzed by Soderstrom (1974). For a more formal derivation of the
accuracy result (7.59), (7.60), see Caines and Ljung (1976), Ljung and Caines (1979).
The output error method (i.e. a PEM applied to a model with H(q-l; 8) == 1) has been
analyzed by Kabaila (1983), SOderstrom and Stoica (1982). For further treatments of the
results (7.77), (7.78) for underparametrized models, see Ljung (1978), Ljung and Glover
(1981). The properties of underparametrized models, such as bias and variance of
the transfer function estimate G(e iW ; eN) for fixed ill, have been examined by Ljung
(1985a, b), Wahlberg and Ljung (1986).
(Section 7.3). Detailed derivations of the optimal prediction for ARMAX and state
space models are given by Astrom (1970), and Anderson and Moore (1979).
(Section 7.6). For some general results on numerical search routines for optimization,
see Dennis and Schnabel (1983) and Gill et al. (1981). The gradient of the loss function
can be computed efficiently by using the adjoint system; see Hill (1985) and van Zee and
Bosgra (1982). The possibility of computing the prediction error estimate by applying a
recursive algorithm a number of passes over the data has been described by Ljung
(1982); see also Young (1984), Solbrand et at. (1985). The idea of approximately estimating the prediction errors by fitting a high-order linear regression (cf. Example 7.9) is
developed, for example, by Mayne and Firoozan (1982) (see also Stoica, Soderstrom,
Ahlen and Solbrand (1984, 1985.
Appendix A7.1
Covariance matrix of PEM estimates for multivariable systems
It follows from (7.58) and Lemmas B.3 and B.4 that the covariance matrix P is given by
(A7.1.1)
Hence the primary goal is to find V N(8 o) and V::aC8 o) for a general multivariable model
structure and a general loss function
1
RN(O) = N
E;(t)
= a8
e(t, 8)
l
T (t,
8)
(A7.1.2)
Appendix A7.1
227
HN(8) = aQ h(Q)=RN(O)
H = Hoo(8 o)
ah aR
a8. V N (8)
j,k
Njk
2: ~ ~
Njk
= tr ( HN(8)
2
~ ~ {Ei(t)ET(t)
+ E(t)ET(t)})
2:
=N
ET(t)HN(8)Ei(t)
1=1
a2
2[
EET(t)R,,(8)Ei(t) + EET(t)
a~e;S) Ei(t)
Thus
Viv(8)e=o" = -
N 2:
eT(t)H1jJT(t)
(A7.1.3)
1=1
(A7.1.4)
The expression for the Hessian follows since E(t)O=oo = e(t) is uncorrelated with Ei(t) and
E;j(t). The central matrix in (A7.1.1) is thus given by
Po
~~-Too N ~2 E
tf,(t)He(t)
1=1
eT(s)H1jJT(s)
s=1
It can be evaluated as in the scalar case (see the calculations leading to (7.65. The result
is
Po
= 4[ E1jJ(t)HAlhj;'l'(t)]
(A7.1.5)
E (1jJ(t) HAHtf)T(t)
1jJ(t)H1pT(t)
1jJ(t)H1jJ'J~(t)
1jJ(t)A -l1jJr(t)
E (1jJ(t)HA) A -l(AH1jJT(t)
1jJ(t)
tflT(t
228
Chapter 7
p-l
which in turn is equivalent to (7.75). It is trivial to verify that (7.76) gives equality in
(7.75).
Complement C7.1
Approximation models depend on the loss function used in estimation
In Section 2.4 and 7.5 it has been demonstrated that if the system which generated the
data cannot be described exactly in the model class, then the estimated model may
depend drastically on the experimental condition even asymptotically (i.e. for N ~ 00, N
the number of data points).
For multivariable systems an additional problem occurs. Let E(t) denote the residuals
of the model with parameters 8, with dim E(t) > 1. Let the parameters 8 be determined
by minimizing a criterion of the form (7.13). When the system does not belong to the
model class considered, the loss function h() used in estimation will significantly
influence the estimated model even for N -~ 00. To show this we present a simple
example patterned after Caines (1978).
Consider the system
J: yet)
= e(t)
t = 1,2, ...
V(1 0_
8) yet -
1)
+ E(t)
Clearly j ' f Jt. That is to say, there is no 8 such that j ' == Jt(S). For e E (0,1) the
model is asymptotically stable and as a consequence e(t) is an ergodic process. Thus
1
lim N
N~oo
2: E(t)eT(t) = EE(t)eT(t) ~ Q
(=1
+ (48 2
(2~
COS
V(1
0_S)e(t -
0)
1 - 8
Therefore
h 1(Q) ~ tr Q
h2 (Q) ~ det
= 3 - 8 + 48 2
Q = (1 + 4( 2 )(2
- 8)
2 - 8
+ 88 2
48 3
1)f
Complement C7.2
229
and
ah 1 = 88 - 1
a8
ah 2
a8
-128 2
+ 168 - 1
which give
Complement C7.2
Multistep prediction of ARMA processes
Consider the ARMA model
(C7.2.1)
where
A(q--I) = 1
aIq-1
C(q-I) = 1 + Clq-l +
and where {e(t)}, t = 0, 1, ... is a sequence of uncorrelated and identically distributed
Gaussian random variables with zero mean and variance ').,2. According to the spectral
factorization theorem (See Appendix A6.1) one can assume without introducing any
restriction that A(z)C(z) has all zeros outside the unit cirde:
A(z)C(z)
=>
Izl
> 1
One can also assume thatA(z) and C(z) have no common zeros. The parameters {ai, Cj}
are supposed to be given. Let yt denote the information available at the time instant t:
yet + k) -
yet +
kit)
230
Chapter 7
l = max(na - 1, nc -- k)
through
C(z) == F(z)A(z)
+ zkG(z)
(C7.2.2)
Due to the assumption that A (z) and C(z) are coprime polynomials, this identity
uniquely defines F and G. (If 1< 0 take G(z) == 0.) Inserting (C7.2.2) into (C7.2.1) gives
yet
k)
= F(q-l)e(t +
k)
G( -1)
+ C(q-I)
q yet)
The first term in the right-hand side of the above relation is independent of yt. Thus
E[y(t + k) -
yCt
+ klt)P
E(~~:-:jy(t)
- yet + klt)r
(C7.2.3)
+ E[F(q-l)e(t + k)P
which shows that the optimal k-step predictor is given by
,------------------------_._------------,
(C7.2.4)
yet
+ k) - yet + kit)
F(q-l)e(t + k)
(C7.2.5)
with variance
(1
(C7.2.6)
must be uncorrelated for (C7.2.4) to hold. This condition is satisfied in the following
situations:
Assume that {e(t)} is a sequence of independent (not only uncorrelated) variables.
Note that this is the case if the sequence of uncorrelated random variables {e(t)} is
Gaussian distributed (see e.g. Appendix B.6). Then the two terms in (C7.2.6) will be
independent and the calculation (C7.2.3) will hold.
Fk-1(Z)
+ h_lZk- 1
(C7.2.7)
From (C7 .2. 7), (C7 .2.2) for k and k - 1, it follows that
A(z)Fk-1(Z)
which reduces to
(C7.2.8)
Assume for simplicity that na = nc = n and let
Gk(z)
= gZ +
glz
+ ... + gk-1z(n-l)
By equating the different powers of z in (C7.2.8) the following formulas for computing
{gDi~J for k = 2, 3, ... are obtained:
fk-1>
(C7.2.9)
io
1
i
= 0, ... , n - 1
Using the method presented above for calculating yet + kit) one can develop k-step
PEMs for estimating the ARMA parameters; i.e. PEMs which determine the parameter
estimates by minimizing the sample variance of the k-step prediction errors:
Ek(t) ~ yet + k) -
yet +
kit)
1,2, ...
(C7.2.1O)
(see Astrom, 1980; Stoica and Soderstrom, 1984; Stoica and Nehorai, 1987a). Recall that
in the main part of this chapter only one-step PEMs were discussed.
Multistep optimal prediction
In some applications it may be desirable to determine the parameter estimates by
minimizing a suitable function of the prediction errors {Ek(t)} for several (consecutive)
232
Chapter 7
values of k. For example, one may want to minimize the following loss function:
m
~1
[1 ~
N-k
(C7.2.11)
Ek(t)
In other words, the parameters are estimated by using what may be called a multistep PEM. An efficient method for computing the prediction errors {Ek(t)}k=l (for
t = 1, ... , N - k) needed in (C7.2.11) runs as follows (Stoica and Nehorai, 1987a):
Observe that (C7.2.7) implies
Ek(t) ~ Fk(q-l)e(t + k)
= Ek-l(t + 1) + h-lEl(t)
(k ;?; 2)
(C7.2.12)
where by definition El(t) = e(t + 1). Thus El(t) must first be computed as
El(t)
A(q-l)
= C(q l)y(t + 1)
= 1,2, ...
Next (C7.2.12) is used to determine the other prediction errors needed in (C7.2.11).
In the context of multistep prediction of ARMA process, it is interesting to note that
the predictions yet + kit), yCt + k - lit), etc., can be shown to obey a certain recursion.
Thus, it follows from Lemma B.16 of Appendix B that the optimal mean-square k-step
prediction of yet + k) under the Gaussian hypothesis is given by
yet +
kit)
E[y(t
+ k)1 yt]
(C7.2.13)
where
for i > 0
for i ~ 0
and where it is assumed that k ~ nco If k > nc then the right-hand side of (C7.2.13) is
zero. Note that for AR processes, the right-hand side of (C7.2.13) is zero for all k;?; 1.
Thus for pure AR processes, (C7.2.13) becomes a simple and quite intuitive multistep
prediction formula.
The relationship (C7.2.13) is important from a theoretical standpoint. It shows that for
an ARMA process the predictor space, i.e. the space generated by the vectors
( yet1+2lit)
t="
) (
... '
y(t +
21t) )
is finite dimensional. More exactly, it has dimension not larger than nco (See Akaike,
1981, for an application of this result to the problem of stochastic realization.) The
recursive-in-k prediction formula (C7.2.13) also has practical applications. For instance,
it is useful in on-line prediction applications of ARMA models (see Holst, 1977).
Complement C7.3
233
Complement C7.3
Least squares estimation of the parameters offull polynomialform models
Consider the following full polynomial form model of a MIMO system (see Example
6.3):
yet) + A 1y(t - 1) + ... + Anay(t - na)
B1U(t - 1)
+ ...
(C7.3.1)
where dim y = ny, dim u = nu, and where all entries of the matrix coefficients {Ai, B j }
are assumed unknown. It can be readily verified that (C7.3.1) can be rewritten as
= Effcp(t) + e(t)
yet)
(C7.3.2)
where
(nyl(na . ny + nb . nu))
cpT(t)
na . ny + nb . n~)11)
Note that (C7.3.2) is an alternative form to (6.24), (6.25).
As shown in Lemma A.34, for matrices A, B of compatible dimensions,
vec(AB) = (J @ A)vec B
(BT @ l)vec A
(C7.3.3)
where for some min matrix H with the ith column denoted by hi, vec H = (hI ... h~Y.
Using this result, (C7.3.2) can be rewritten in the form of a multivariate linear
regression:
= vec[cpT(t)8] + f(t)
= <j>T(t)8 + f(t)
yet)
(C7.3.4)
= vec 8
(ny(na ny + nb nu)11)
Q=
2:
E(t)ET(t)
1=1
Chapter 7
R =
2: cp(t)q?(t)
1=1
2: cp(t)yT(t)
1=1
Q =
2:
1=1
2: y(t)yT(t) -
(C7.3.5)
1=1
[e - R- 1rfR[e - R- 1r] +
2: y(t)yT(t) -
rTR-lr
1=1
Since the matrix R is positive definite, and the second and third terms in (C7.3.5) do not
depend on e it follows that
Q;;:: Qle=~
where
(C7.3.6)
The LS estimate (C7. 3.6) is usually derived by minimizing tr Q. However, as shown
above it 'minimizes' the whole sample covariance matrix Q. As a consequence of this
strong property,
will minimize any non decreasing function of Q, such as det Q or
tr WQ, with W any positive definite weighting matrix. Note that even though such a
function may be strongly nonlinear in e, it can be shown that (C7.3.6) is its only
stationary point (Soderstrom and Stoica, 1980).
Next consider the model (C7.3.4). The LS estimate of 8 in this model is usually defined
as
1=1
R=
2: <p(t)W<pT(t)
1=1
N
2: <pCt) Wy(t)
t=l
Then
Complement C7.3
N
2:
235
ET(t)WE(t)
2:
t=1
t=1
2: yT(t)Wy(t) - r
T8
- 8T
+ STRS
1=1
N
= [8 - k- 1r]TR[S - R-1r] +
2: yT(t)Wy(t)
- rTR-if
1=1
Since R is a positive definite matrix it is readily concluded from the above equation that
(C7.3.7)
The LS estimate (C7.3.7) seems to depend on W. However, this should not be so. The
two models (C7.3.2) and (C7.3.4) are just two equivalent parametrizations of the same
model (C7.3.1). Also, the loss function minimized by can be written as tr WQ and
hence is in the class of functions minimized bye. Thus must be the minimizer of the
whole covariance matrix Q, and indeed
e
e
e = vec e
(C7.il}]
This can be proved algebraically. Note that by using the Kronecker product (see
Definition A.1O and Lemma A.34) we can write
8 = R- 1r =
= {W
2:
cp(t)][W l]y(t)}
~ CP(t)cpT(t)} {~ (W CP(t)]y(t)}
-1
[/
t=l
2:
[I R-lcp(t)]y(t)
(C7.3.9)
t=l
e = vec (R- r) = (I R1
= (I R- 1)
N
2:
1)
vee
2: vec [cp(t)yT(t)]
t=l
[I R-1][1 cp(t)]y(t)
1=1
2:
[I K-lcp(t)]y(t)
t=l
(C7.3.1O)
236
Chapter 7
The conclusion is that there is no reason for using the estimate (C7.3. 7), which is much
more complicated computationally than the equivalent estimate (C7.3.6). (Note that the
matrix R has dimension na' ny + nb nu while R has dimension ny . (na ny + nb . nu).)
Finally, note that if some elements of the matrix coefficients {Ai, B j} are known, then
the model (C7.3.4) and the least squares estimate of its parameters can be easily adapted
to use this information, while the model (C7.3.2) cannot be used in such a case.
Complement C7.4
The generalized least squares method
Consider the following model of a MIMO system:
A(q-l)y(t) = B(q-l)U(t)
+ vet)
(C7.4.1)
D(q-l)V(t) = E(t)
where
A(q-l)
= J
B(q-l)
B 1q-l
D(q-l) = I
+ ... +
D1q-l
Bnbq-nb
+ ... +
Dndq-nd
(nylny)
(nylnu)
(C7.4.2)
(nylny)
Thus the equation errors {v(t)} are modeled as an autoregression. All the elements of
the matrix coefficients {Ai, B j , Dd are assumed to be unknown. The problem is to
determine the PE estimates {A;, Bj , Dd of the matrix coefficients. Define these
estimates as
where
N
= L.,
'"
t=l
and where N denotes the number of data points, and the dependence of the prediction
errors E on {A;, B j , Dd was stressed by notation. The minimization can be performed in
various ways.
A relaxation algorithm
The prediction errors E(t) of the model under study have a special feature. For given D
they are linear in A and B, and vice versa. This feature can be exploited to obtain a
simple relaxation algorithm for minimizing tr Q. Specifically, the algorithm consists of
iterating the following two steps until convergence.
Complement C7.4
Sl: For given
A, E =
237
D determine
D)
A,B
D=
A and E determine
~he results of Complement C7.3 can be used to obtain explicit expressions for
D.
For step Sl let
..
A, E and
Bnb)T
<j)T(t) = I <pT(t)
Then
E(t, A, B,
D)
= D(q-l)[A(q-l)y(t) - B(q-l)U(t)]
= D(q-l)[y(t) - <j)T(t)8]
= YF(t) -- <j)J(t)8
where
YF(t) = D(q-l)y(t)
<j)J(t) = D(q-l)<j)T(t)
D)
(C7.4.3)
1jJ(t)
= -
A(q-l)y(t) - E(q-l)U(t)
(OT(r - 1) ... OT(t - ncT
= (D! ... D nd )
(Cl.4.4) I
238
Chapter 7
Thus each step of the algorithm amounts to solving an LS problem. This gave the name
of generalized LS (GLS) to the algorithm. It was introduced for scalar systems by Clarke
(1967). Extensions to MIMO systems have been presented by Goodwin and Payne
(1977) and Keviczky and Banyasz (1976), but those appear to be more complicated than
the algorithm above. Note that since relaxation algorithms converge linearly close to the
minimum, the convergence rate of the GLS procedure may be slow. A faster algorithm
is discussed below. Finally note that the GLS procedure does not need any special parametrization of the matrices {Ai, B j , Dd. If, however, some knowledge of the elements
of A, B or D is available, this can be accommodated in the algorithm with minor
modifications.
An indirect GLS procedure
Consider the model (C7.4.1) in the scalar case (lty
A(q-I)y(t) = B(q--l)U(t) +
D(~-l)
nu
= 1)
E(t)
(C7.4.5)
A(q-l)D(q-l)
1 + !lq-l + ... +
G(q-l) = B(q-l)D(q-l)
glq-l
F(q-l)
f~jq-nf
(n!
(ng
+ nd)
= nb + nd)
na
(C7.4.6)
EaCt, a)
F(q--l)y(t) - G(q-I)U(t)
yet) - rpT(t)a
a=
arg min
a
E;(t, a)
t=l
satisfies
LN
aEa(t. a) 'IT
En(t, a) -- aa -- a=a = -
1=1
LN
Ea(t, a)Cf!(t) = 0
(C7.4.7)
.=1
and is given by
a=
With 8 denoting the vector which contains the (unknown) coefficients of A(q-l), B(q-l)
and D(q-l), let a(8) denote the function induced by (C7.4.6). For example, let
na = nb = nd = 1. Then the function a(8) is given by
a(8)
(al
+ d 1 aI d 1 hi bId l )T
y(t) - cpT(t)a(8)
E2(t) =
t=l
E~(t, a)
+ [a - a(8)]TP[a - a(8)]
t=l
(C7.4.8)
+ 2[a - a(8)]T
Ea(t, a)cp(t)
(=1
where
N
cp~t)cpT(t)
t=l
The third term in (C7.4.8) is equal to zero (see (C7.4.7, and the first one does not
depend on 8. Thus the PE estimate of 8 which minimizes the left-hand side of (C7.4.8)
can alternatively be obtained as
8=
(C7.4.9)
_ 8 _ _ _ _ _- - - - - - - '
Note that the data are used only in the calculation of a. The matrix P follows as a
byproduct of that calculation. The loss function in (C7.4.9) does not depend explicitly on
the data. Thus it may be expected that the 'indirect' GLS procedure (C7.4.9) is more
efficient computationally than the relaxation algorithm previously described. Practical
experience with (C7.4.9) has shown that it may lead to considerable computer time
savings (Soderstrom, 1975c; Stoica, 1976). The basic ideas of the indirect PEM described
above may be applied also to more general model structures (Stoica and Soderstrom,
1987).
Complement C7.5
The output error method
Consider the following SISO system:
yet)
= GO(q-l)U(t) +
HO(q-l)e(t)
where GO(q-l) is stable, and HO(q-l) is stable and invertible. Furthermore, let
GO(q-1) =
SO( -1)
q
AO(q-1)
b~q-1
+ ... +
b~bq-nb
-:--~"--'-----=:"7r---=
1 + a~q 1 + ... + a~aqna
240
Chapter 7
and
Vo -_
l)
bO1 .
al .. . ana
bO)T
nb
..
Assume that B(-) and AO(o) have no common zeros, and that the system operates in
open loop so that Eu(t)e(s) = 0 for all t and s.
The output error (OE) estimate of 80 is by definition
N
8=
arg min
8
2.: f2(t, 8)
(C7:Dj
B( -1)
f(t, 8) = yet) - A(:-l) u(t)
1=1
where N is the number of data points, and 8, B and A are defined similarly to 8o , BO and
AO. For simplicity, na and nb are assumed to be known. In the following we derive the
asymptotic distribution of 8, compare the covariance matrix of this distribution with that
corresponding to the PEM (which also provides an estimate of the noise dynamics), and
present sufficient conditions for the unimodality of the loss function in (C7.5.1).
Asymptotic distribution
First note that 8 will converge to 80 as N tends to infinity (cf. (7.53)-(7.56.
Further it follows from (7.78) that
,
VN(e _. 80)
dis!
-0>
N-,>oo
JV(O,
(C7.5.2)
POEM)
POEM
(C7.5.3)
N-;.oo
where
{VN(8)}TI8=Oo =
~ -
f(t, 80 )1jJ(t)
1=1
where
1jJ(t)
-l)T
_ (BO(q-l)
BO(q-l) .
-1
- AO(q-l?U(t-l) ... AO(q 1)2 u (t-na) AO(q l)u(t-l) ... AO(q l)u(t-nb)
it follows that
Complement C7.5
241
(C7.5.4)
Next note that e(t) and '\jJ(s) are independent for all t and s. Using the notation
2: hjq-i
HO(q-l) =
i=O
N E
2:
HO(q-l)e(p)'\jJ(p)
~l
p=l
= ~~oo
4
N
2: HO(q-l)e(t)'\jJT(t)
(C7.5.5)
2: 2: [E'\jJ(p)1jJ T(t)][EW(q-l)e(p)HO(q-l)e(t))
p=l 1=1
~~oo ~
(N - ILI)r(L)E'\jJ(t)'\jJT(t
+ L)
T=-N
The term containing ILl in (C7.5.5) tends to zero as N tends to infinity (see Appendix
AS.1 for a similar discussion). The remaining term becomes (set hi = 0 for i < 0)
't=-co
i=~co
(C7.5.6)
i=-oo j=-OO
(C7.5.S)
242
Chapter 7
a:~) I9
and
0 ,0.0
a:~) I13
(C7.5.9)
( notice that
aE(t)laeloo,uo == HO(~
1) '\jJ T(t)
Next compare the accuracies of the OEM and PEM. One may expect that
(C7.S.lO)
The reason for this expectation is as follows. For Gaussian distributed data PPEM is equal
to the Cramer-Rao lower bound on the covariance matrix of any consistent estimator of
e. Thus (C7.S.lO) must hold in that case. Now the matrices in (C7.S.1O) depend on the
second-order properties of the data and noise, and not on their distributions. Therefore
the inequality (C7.S.l0) must continue to hold for other distributions.
A simple algebraic proof of (C7.S.lO) is as follows. Let <P1jJ(w) denote the spectral
density matrix of '\jJ(t). The matrix
J"
-"
(I
HOC
'\jJ(t)'\jJT(t)
HO(q--l )'\jJ(t)HO(q-l)'\jJT(t)
~iUlW <P1jJ( w)
<P1jJ( w)
<P1jJ( CD)
dw
.2-1
= A.
~
PPEM
[E1j!(t}'4,T'(t)]
Complement C7.5
243
2:
u(t) =
ak sine OJk t
CjJk)
k=1
ak, CjJk E
R;
+ nb)/2
(na
OJk E
(0,
Jt)
(C7.5.11)
it holds that
(C7.5.12)
Observe that in (C7.5.11) we implicitly assumed that na + nb is even. This was only done
to simplify to some extent the following discussion. For na + nb odd, the result (C7.5.12)
continues to hold provided we allow OJk = Jt in (C7.5.11) such that the input still has a
spectral density of exactly na + nb lines.
The result above may be viewed as a special case of a more general result presented
in Grenander and Rosenblatt (1956) and mentioned in Problem 4.12. (The result of
Grenander and Rosenblatt deals with linear regressions. Note, however, that POEM and
PPEM can be viewed as the asymptotic covariance matrices of the LSE and the Markov
estimate of the parameters of a regression model with '4J(t) as the regressor vector at time
instant t and HOC q -1) as the noise shaping filter.)
There follows a simple proof of (C7.5.12) for the input (C7.5.11). The spectral density
of (C7.5.11) is given by
(BO(q-1) -1
BO(q-l) -na
_q-l
_q-nb
AO(q I? q ... AO(q I? q
AO(q 1) ... AO(q 1)
)T
Then
'4J(t) = F(q-l)U(t)
and thus
EH O(q--l)'4J(t) . HO(q-l)'4JT(t)
=
2:
iW )
aiIHO(eiwk)i2{Re[F(eiWk)FT(e-iWk)]}/2
k=l
2:
k=l
(C7.5.13)
244
Chapter 7
gI
fI
g~
f~
(C7.S.13)
where gk and h denote the real and imaginary parts, respectively, of GkF(e iwk )/Y2.
Similarly, one can show that
1
()
1
E HO(q 1) 'IjJ t HO(q 1)
'tjJ
T( )
-1
MH M
E'IjJ(t)'ljJT(t) = MMT
Since A o and Bo are coprime and the input (C7.S.11) is pe of order na + nb, the matrix
MHMT and hence M must be nonsingular (see Complement CS.1). Thus
POEM
= ",ZM-TM-1MHMTM-TM--l
PPEM
",2M- T HM- 1
and
optimization problem (C7.S.1) may occur. For N < 00 the loss function is a random
variable whose shape depends on the realization. Consider the asymptotic (for IV --'> (0)
loss function
W(8)
= E [yet)
_ A(ql)
B(q-1) U(t)]2
The second term above does not depend on 8. To find the stationary points of W(8),
differentiate the first term with respect to ai and bj This gives
Complement C7.5
i
j
245
= 1, ... , na
(C7.S.14a)
= 1,
(C7.S.14b)
... , nb
k=l
V =
L
00
-1)
-1)] }2
{[BO(
B(
Ao(q -1) - A(q -1) u(t)
k=l
where Eu(t)u(s) = Ot,s' Therefore, the result stated above is relevant also for the
problem of rational approximation of an estimated weighting sequence.
To prove that (C7.5.14) has a unique solution, it will be shown first that there is no
solution with B(z) == O. If B(z) == 0 then (C7.5.14b) implies that
. BO(q-l)
j = 1, ... , n
(C7.S.1S)
BO(q-1)
Let u(t)
= AO(q I)
u(t). Then
= E
=0
l)BO(q-l)u(t)] - a?E
u(t - l)u(t)
A(;-l) u(t -
n)u(t)
246
Chapter 7
The first term is zero since the input u(t) is white noise. The other terms are also zero, by
(C7.5.15). Similarly,
1
_
E A(q-l) u(t - k)u(t)
=0
all k
Thus
(C7.5.16)
= B(z)L(z)
liz
+ ... +
Imzm
mE
[0,
n -
1]
Since the polynomial L (including its degree) is arbitrary, any pair of polynomials A, B
can be written as above. Inserting (C7.5.16) into (C7.5.14) gives
0
n{
n{
u(t - 1)
hI ... h n- m
0
0 0 hi ... hn-m
1 {l1 ... {In-m 0
L
1
E A(q l)A (q 1)
at ... an - m
=v _
'/(B, A)
aCt)
== 0
u(t -- 2n + m)
where
Since A and B have no common zeros, the Sylvester matrix ./(13, A), which has
has full rank (see Lemma A.30 and its corollaries). Hence
dimension (2nl(2n-
m,
i = 1, ... , 2n - m
= 2n - m)
l)~(q
= E A(q
m - l)u(t)
1) u(t - 2n + m - l)[ll(q-l)U(t)
=0
In a similar manner it can be shown that
all i
J2 _
- E
[1l(q-l)A(q-l)A(q-1)
1
G(q 1)
A(q l)A(q 1) u(t)
=0
Thus, ll(z) == 0, or equivalently A(z)BO(z) == AO(z)B(z). Since AO(z) and SO(z) have no
common zeros, it is concluded that A(z) == A o(z) and B(z) == BO(z).
The above result on the unimodality of the (asymptotic) loss function associated with
the OEM can be extended slightly. More exactly, it can be shown that the result
continues to hold for a class of ARMA input signals of 'sufficiently low order' (compared
to the order of the identified system) (see Soderstrom and Stoica, 1982).
In the circuits and systems literature the result shown above is sometimes called
'Stearns' conjecture' (see, for example, Widrow and Stearns, 1985, and the references
therein). In that context the result is relevant to the design of rational infinite impulse
response filters.
Complement C7.6
Unimodality of the PEM loss function for ARMAprocesses
Consider an ARMA process y(t) given by
AO(q-1)y(t)
CO(q-l)e(t)
Ee(t)e(s) = j,?Ot,s
where
AO(q-l) = 1
CO(q-l) =
248
Chapter 7
According to the spectral factorization theorem (see Appendix A6.1), there is no restriction in assuming that AO(z) and CO(z) have no common roots and that AO(z)Co(z) =I=- 0
for Iz I ::::; 1. The prediction error (PE) estimates of the ( unknown) coefficients of A o( q -1)
and CO(q-I) are by definition
(C7.6.1)
where N denotes the number of data points, A(q-l) and C(q-I) are polynomials defined
similarly to AO(q-l) and CO(q-l), and 8 denotes the unknown parameter vector
Cnc)T
Cl'"
For simplicity we assume that na and nc are known. The more general case of unknown
na and nc can be treated similarly (Astrom and Soderstrom, 1974; Stoica and
Soderstrom, 1982a).
In the following it will be shown that for N sufficiently large, the loss function in
(C7.6.1) is unimodaL The proof of this neat property follows the same lines as that
presented in Complement C7.5 for OEMs. When N tends to infinity the loss function in
(C7.6.1) tends to
W(8) ~ Ef?(t, 8)
The stationary points of W(8) are the solutions of the following equations:
1
EE(t, 8) A(q 1) E(t - i, 8)
i = 1, ... , na
(C7.6.2)
EE(t, 8)
C(~
1) t(t - j, 8)
j = 1, ... , nc
Since the solutions A(q-l) and C(q-I) of (C7.6.2) are not necessarily coprime
polynomials, it is convenient to write them in factorized form as
A(z) = A(z)D(z)
C(z)
(C7.6.3)
C(z)D(z)
d1z
+ ... +
dndz nd
nd
1 Cl ...
na{
Cnc--nd
0
1 Cl ...
-
Cnc-nd
1 al ... ana-nd
nc{
0
-
1 al ...
J(C,A)
--
ana-nd
E{
E(t,8) =0
1
A(q-l)C(q-l) E(t - na -- nc + nd, 8)
(C7.6.4)
1, ... , na
+ nc - nd
(C7.6.5)
Introduce
ng
na
+ nc - nd
A( -l)CO( -I)
q q . e(t)
G(q-l)
(C7.6.6)
and
A(q~ l)lC(q
= E ACq-l)lC(q-l) E(t - ng -
1, 8)A(q-l)CO(q-l)e(t)
=0
1
E A(q l)C(q 1) E(t - k, 8)E(t, 8) = 0
for all k ?: 1
EE(t - k, 8)E(t, 8)
= E{A(q-l)C(q-l)[i1(q_l)lC(q_l) E(t - k,
=
8)
J E(t, 8)}
for an k ?: 1
Thus E(t, 8), with 8 satisfying (C7.6.2), is a sequence of independent random variables.
In view of (C7.6.6) this is possible if and only if
(C7.6.7)
Since AO(z) and CO(z) are coprime polynomials, the only solution to (C7.6.7) is A(z)
A Cf') , C(z) = CO(z), which concludes the proof.
Complement C7.7
Exact maximum likelihood estimation of AR and ARMA parameters
The first part of this complement deals with ARMA processes. The second part will
specialize to AR processes, for which the analysis will be much simplified.
Chapter 7
Ee(t)e(s)
= "..2o/,s
(C7.7.1)
where
2~2 yTQ-1y)
(C7.7.2)
where
1
Q ~ 'A 2 cov(Yle)
The maximum likelihood estimates (MLE) of e and 'A are obtained by maximizing
(C7.7.2). A variety of search procedures can be used for this purpose. They all require
the evaluation of p(yle, 'A) for given e and 'A. As shown below, the evaluation of the
covariance matrix Q can be made with a modest computational effort. A direct
evaluation of p (yl e , 'A) would, however, require O( N 2 ) arithmetic operations if the
Toeplitz structure of Q (i.e. Qij depends only on Ii - jj) is exploited, and O(N3)
operations otherwise. This could be a prohibitive computational burden for most
applications. Fortunately, there are fast procedures for evaluating p( Yle, 'A), which need
only O(N) arithmetic operations. The aim of this complement is to describe the
procedure of Ansley (1979), which is not only quite simple conceptually but seems also to
be one of the most efficient numerical procedures for evaluating p( Yle, 'A). Other fast
procedures are presented in Dent (1977), Gardner et al. (1980), Gueguen and Scharf
(1980), Ljung and Box (1979), Newbold (1974), and Dugre et al. (1986).
Let
(C7.7.3)
m = max(na, nc)
and let
1
2: = 'A2 cov(ZI8)
{)Z
{)y
1
1
ana .. al
0
ana' .. al
Hence the Jacobian of (C7.7.3) (i.e. the determinant of the matrix aZlay) is equal to
one. It therefore follows that
(C7.7.4)
Next consider the evaluation of 2: for given 8 and 'A. Let s
(a) For 1
m, t + s
m,
O. Then:
m,
;:?:
(C7.7.5a)
+ s > m,
Ez(t)z(t + s) = Ey(t)C(q--l)e(t + s) ~ as
(C7.7.5b)
Ez(t)z(t + s) = EC(q-l)e(t)C(q--l)e(t + s)
(C7.7.5c)
O s > nc
The co variances {rs} of yet) can be evaluated as follows. Let
fJk
= Ey(t)e(t - k)
252
Ey(t - k)C(q-l)e(t) =
for k
Clearly ak =
produces
Qj
: : ;
Chapter 7
CkQO
+ ... +
cncQnc-k
(k
(C7.7.7a)
0)
+ alQj--l + ... +
anaQj--na =
{o2
"A
j> nc
(C7.7.7b)
0::::; J ::::; nc
Cj
Since Qj = for j < 0, the sequence {Q j} can be obtained easily from (C7. 7. 7b). Note that
to compute the sequence {ad requires only {Q j} 7~o. Thus the following procedure can
be used to evaluate {rs }, for given 8 and "A:
(a) Determine {QJ, j E [0, nc] (Qj = 0, j < 0) from (C7.7.7b).
Evaluate {ad, k E [0, nc] (ak = 0, k > nc) from (C7.7.7a).
(b) Solve the following linear system (cf. (C7.7.6) to obtain {rd, k
I
al
+
ana
ana--l
. _. I
al
[0, na):
. . . ana
a2
ana
ana
The other covariances {rd, k> na, can then be obtained from (C7.7.6).
Note that the sequence {a,J is obtained as a by-product of the procedure above. Thus
there is a complete procedure for evaluation of 2:.
Next note that 2: is a banded matrix, with the band width equal to 2m. Now the idea
behind the transformation (C7. 7.3) can be easily understood. Since in general m ~ N,
the lower triangular Cholesky factor L of 2:,
(C7.7.8)
is also a banded matrix with the band width equal to m, and it can be determined in D(N)
arithmetic operations (see Lemma A.6 and its proof; see also Friedlander, 1983). Let the
(Nil) vector e be defined by
Le = Z
(C7.7.9)
Computation of e can be done in D(N) operations. Evaluation of det L also needs D(N)
operations.
Inserting (C7.7.8) and (C7.7.9) in (C7.7.4) produces
p(yle, "A) = (2JtA?)-NI2(det
= canst
~ log "A2
- log(det L) -
2~2 eTe
(C7.7.1O)
oLe8, 'A.)
o'A. 2
N 1
~ ~ ~_e_T_e
_________
= const
AR processes
For AR processes C(z) == 1, and the previous procedure for exact evaluation of the
likelihood function simplifies considerably. Indeed, for nc = 0 the matrix 2: becomes
where
R
= :2covCYI8)
Chapter 7
Both R- 1 and det R can be computed in O(na 2 ) arithmetic operations from A and 8 by
using the Levinson-Durbin algorithm (see, e.g., Complement C8.2).
Note that it is possible to derive a closed form expression for R- 1 as a function of 8. At
the end of this complement a proof of this interesting result is given, patterned after
Godolphin and Unwin (1983) (see also Gohberg and Heinig, 1974; Kailath, Vieira and
Morf, 1978).
Let
A=
B=
o
Also introduce {ad through
L.J ai z
i/',_l_
=
(C7.7.13)
A(z)
,=0
with ao = 1 and ak = 0 for k < O. This definition is slightly different from (C7. 7. 5b) (ai in
(C7.7.13) is equal to a-i in (C7.7.5b). Similarly, introduce ao = 1 and ak = 0 for k < O.
Multiplying both sides of
A(q-l)y(t) = e(t)
by yet - k), k;::::, 1, and yet
anark-na =
anark+na = akA 2
k ;::::, 1
(C7.7.14)
k ;::::, 0
Thus
na
2: ajak-i =
for k ;::::, 1
(C7.7.15)
;=0
A-I =
(C7.7.16)
ak-iUj-k
na
apUj-i-p
p=l-i
k=l
= {1
L
p=O
which proves that the matrix in (C7.7.16) is really the inverse of A. Let
r2
RAT
+ RB
= A-I
(C7.7.17a)
RBT
+ RA
== 0
(C7.7.17b)
na
"
L.. ri-kak-j
na
2:
ri-kak-j
k=j
2:
rna+k-jana+k-j
k=l
na-j
rna+k-iana+k-j
k=l
k=l
==
"
L..
2:
na
ri-p-jap
p=O
2:
rj_p_jap
p=na-j+l
na
2:
ri-p-jap
p=O
If j
i this is equal to
fla
2: 'p
_i+jap
= A2 Uj_i
= A2[A -lk
p=O
according to (C7.7.14). Similarly, for i > j the first part of (C7.7.14) gives
na
2:
ri-p-jap
= 0 = A2 [X- 1k
p=O
Hence (C7.7.17a) is verified. The proof of (C7.7.17b) proceeds in a similar way. The i, j
element of the left-hand side is given by
256
Chapter 7
k=l
k=l
na
ri-kana+j-k
k=j
rna+k-iaj-k
k=l
na
=L
2.:
j-I
ri--na-j+pap
p=j
2.:
rna-i+j-pap
p=o
na
2.:
rna-i+j-p ap
p=o
=0
The last equality follows from (C7.7.14) since na - i
The relations (C7. 7 .17) give easily
RAT - RBTA -lB = A- 1
+j
or
It is easy to show that the matrices A and B commute. Hence it may be concluded from
the above equation that
(C7.7.18)
Complement C7.B
ML estimation from noisy input-output data
Consider the system shown in Figure C7.S.1; it can be described by the equations
xCt)
L-.--~r"J-_ _ _
yet)
(Measured)
u(t)
(Measured)
(C7.8.1)
where A * (q-l) etc. are polynomials in the unit delay operator q-l, and
Eei(t)ei(s) = ATOt,s
Ee2(t)e3(s) = 0
= 2, 3
for all t,
The special feature of the system (C7.8.1) is that both the output and the input
measurements are corrupted by noise. The noise corr~pting the input may be due, for
example, to the errors made by a measurement device. The problem of identifying
systems from noise-corrupted input-output data is often caned errors-in-variables.
In order to estimate the parameters of the system, the noise-free input x(t) must be
parametrized. Assume that x(t) is an ARMA process,
G*(q-l)
x(t)
(C7.8.2)
= H*(q-l) eI(t)
Eel(t)el(s) = MOt,s
Eej(t)ei(S)
i = 2, 3, for all t, s
The input signals occurring in the normal operating of many processes can be well
described by ARMA models. Assume also that all the signals in the system are
stationary, and that the description (C7.8.1) of the system is minimal.
Let 8 denote the vector of unknown parameters
(3 =
(coefficients of {A *, B*, C*, D*, G*, H*, K*, L *}; AI, A2,
)'3)T
and let U and Y denote the available measurements of the input and output, respectively,
U
(C7.8.3)
= C(q-l)eI(t) + D(q-l)e2(t)
B(q-l)U(t) = G(q-l)eI(t) + H(q-l)e3(t)
A(q-l)y(t)
(C7.8.4)
258
Chapter 7
The results of Complement C7. 7 could be used to derive an 'exact algorithm' for the
evaluation of (C7.8.3). The algorithm so obtained would have a simple basic structure
but the detailed expressions involved would be rather cumbersome. In the fonowing, a
much simpler approximate algorithm is presented. The larger N, the more exact the
result obtained with this approximate algorithm will be.
Let
t(t)
= (A(q-I)y(t)
(C7.8.5)
B(q-l)U(t)
Clearly t is a bivariate MA process of order not greater than m (see (C7.8.4)). Thus, the
covariance matrix
is banded, with the band width equal to 4m + 1. Now, since t(t) is an MA(m) process it
follows from the spectral factorization theorem that there exists a unique (212)
polynomial matrix Seq-I), with S(O) = I, deg S = m, and det S(z) =1= 0 for Izl ~ 1,
such that
t(l)
= S(q-l)c(t)
(C7.8.6)
where
A>O
(see, e.g., Anderson and Moore, 1979).
From (C7.8.5), (C7.8.6) it follows that the prediction errors (or the innovations) of the
bivariate process (yT(t) UT(t))T are given by
(C7.8.7)
The initial conditions needed to start (C7.8.7) may be chosen to be zero. Next, the
likelihood function can be evaluated approximately (cf. (7.39 as
There remains the problem of determining S(q-I) and A for a given 8. A numerically
efficient procedure for achieving this task is the Cholesky factorization of the infinitedimensional covariance matrix ~.
First note that ~ can be readily evaluated as a function of 8. For s ;?: 0,
Complement C7.8
Ai(cog, + ...
+ cng-,gng)
259
where Ck = 0 for k < 0 and k > nc, and similarly for the other coefficients. A numerically
efficient algorithm for computation of the Cholesky factorization of i:,
i: = LLT
is presented by Friedlander (1983); see also Lemma A.6and its proof. L is a lower
triangular banded matrix. The (212) block entries in the rows of L converge to the
coefficients of the MA description (C7.8.6). The number of block rows which need
to be computed until convergence is obtained is in general much smaller than N.
Computation of a block row of L requires O(m) arithmetic operations. Note that
since Seq-i) obtained in this way is guaranteed to be invertible (i.e. det S(z) =I=- 0
for Izl : : : ; 1), the vector e need not be constrained in the course of optimization.
Finally note that the time-invariant innovation representation (C7 .8. 7) of the process
(yT(t) 'UT(tT can also be found by using the stationary Kalman filter (see Soderstrom,
1981).
This complement has shown how to evaluate the criterion (C7.8.3). To find the
parameter estimates one can use any standard optimization algorithm based on criterion
evaluations only. See Dennis and Schnabel (1983), and Gill et ai. (1981) for examples
of such algorithms.
For a further analysis of how to identify systems when the input is corrupted by
noise, see Anderson and Deistler (1984), Anderson (1985), and Stoica and Nehorai
(1987c).
Chapter 8
INSTRUMENTAL VARIABLE
METHODS
8.1 Description of instrumental variable methods
The least squares method was introduced for static models in Section 4.1 and for dynamic
models in Section 7.1. It is easy to apply but has a substantial drawback: the parameter
estimates are consistent only under restrictive conditions. It was also mentioned in
Section 7.1 that the LS method could be modified in different ways to overcome this
drawback. This chapter presents the modification leading to the class of instrumental
variable (IV) methods. The idea is to modify the normal equations.
This section, which is devoted to a general description of the IV methods, is organized
as follows. First the model structure is introduced. Next, for completeness, a brief review
is given of the LS method. Then various classes of IV methods are defined and it is shown
how they can be viewed as generalizations of the LS method.
Model structure
The IV method is used to estimate the system dynamics (the transfer function from the
input u(t) to the output yet~. In this context the model structure (6.24) will be used:
(8.1)
where yet) is the ny-dimensional output at time t, cpT(t) is an (nylnG) dimensional matrix
whose elements are delayed input and output components, G is a n8-dimensional
parameter vector and E(t) is the equation error. It was shown in Chapter 6 that the model
(8.2)
where A(q-l), B(q-l) are the polynomial matrices
A(q-l)
B(q-l)
(8.3)
can be rewritten in the form (8.1). If all entries of matrix coefficients {Ai, Bj } are
unknown, then the so-called full polynomial form is obtained (see Example 6.3). In this
case
260
Section 8.1
(S.4a)
<pT(t)
(_yT(t - 1)
((8?T)
8= (~1)
8 ny
_yT(t - na)
B1
...
B nb )
(S.4b)
(S.5)
(ony)T
In most of the analysis which follows it is assumed that the true system is given by
(S.6)
where v(t) is a stochastic disturbance term and 8 0 is the vector of 'true' parameters.
(3
(S.7)
Using the true system description (S.6), the difference between the estimate (3 and the
true value 80 can be determined (cf. (7.6)):
(3 - 80 =
[1~
N
<j>(t)<j>T(t)
J-1[1 ~
N
<j>(t)v(t) ]
(S.S)
(3 - 80
= [E<j>(t)<pT(t)r1[E<p(t)v(t)]
(S.9)
Thus the LS estimate 8, (8.7), will have an asymptotic bias (or expressed in another way,
it will not be consistent) unless
E<p(t)v(t)
(S.lO)
However, noting that <j>(t) depends on the output and thus implicitly on past values of
vet) through (S.6), it is seen that (S.10) is quite restrictive. In fact, it can be shown
that in general (S.10) is satisfied if and essentially only if v(t) is white noise. This
was illustrated in Examples 2.3 and 2.4. This disadvantage of the LS estimate can be
seen as the motivation for introducing the instrumental variable method.
Basic IV methods
The idea underlying the introduction of the IV estimates of the parameter vector 80 may
be explained in several ways. A simple version is as follows. Assume that Z(t) is a
Chapter 8
(nzlny) matrix, the entries of which are signals uncorrelated with the disturbance v(t).
Then one may try to estimate the parameter vector 8 of the model (8.1) by exploiting
this property, which means that 8 is required to satisfy the following system of linear
equations:
1
N
2: Z(t)(t)
(=1
1
N
2:
Z(t)[y(t) - <j>T(t)8] = 0
(8.11)
(=1
(8.12)
where the inverse is assumed to exist. The elements of the matrix Z(t) are usually called
the instruments. They can be chosen in different ways (as exemplified below) subject to
certain conditions guaranteeing the consistency of the estimate (8.12). These conditions
will be specified later. Evidently the IV estimate (8.12) is a generalization of the LS
estimate (8.7): for Z(t) = <j>(t), (8.12) reduces to (8.7).
Extended IV methods
8=
arg mJn
(8.13)
Here Z(t) is the IV matrix of dimension (nzlny) (nz;::: nS), F(q-l) is an (nylny)
dimensional, asymptotically stable (pre-)filter and Ilxll~ = x T Qx, where Q is a positive
definite weighting matrix. When F(q-l) == I, nz = nS, (Q = I), the basic IV estimate
(8.12) is obtained. Note that the estimate (8.13) is the weighted least squares solution of
an overdetermined linear system of equations. (There are nz equations and n8
unknowns.) The solution is readily found to be
T
-1 T
S = (RNQRN)
RNQrN
A
(8.14a)
where/
RN
Z(t)F(q-l)<j>T(t)
(8.14b)
Z(t)F(q-l)y(t)
(8.14c)
1=1
1
rN = N
2:
t=l
Section 8.1
u(t)
yet)
rZ E
denotes
This form of the solution is merely of theoretical interest. It is not suitable for implementation. Appropriate numerical algorithms for implementing the extended IV
estimate (8.13) are discussed in Sections 8.3 and A.3.
Figure S.l is a schematic illustration of the basic principle of the extended IVM.
Choice of instruments
The following two examples illustrate some simple possibilities for choosing the IV
matrix Z(t). A discussion of the conditions which these choices must satisfy is deferred
to a later stage, where the idea underlying the construction of Z(t) in the following
examples also will be clarified. In Complement C8.1 it is described how the IV
techniques can be used also for pure time series (with no input present) and how they
relate to the so-called Yule- Walker approach for estimating the parameters of ARMA
models.
Example 8.1 An IV variant for SISO systems
One possibility for choosing the instruments for a SISO system is the following:
Z(t)
(S.15a)
D(q-l)U(t)
(S.15b)
The coefficients of the polynomials C and D can be chosen in many ways. One special
choice is to let C and D be a priori estimates of A and B, respectively. Another special
Chapter 8
(8.1Sc)
Note that a reordering or, more generally, a nonsingular linear transformation of the
instruments Z(t) has no influence on the corresponding basic IV estimate. Indeed, it can
easily be seen from (8.12) that a change of instruments from Z(t) to TZ(t) , where T is a
II
nonsingular matrix, will not change the parameter estimate 8 (cf. Problem 8.2).
Example 8.2 An IV variant for multivariable systems
One possible choice of the instrumental variables for an MIMO system in the full
polynomial form is the following (cf. (8.4:
Z(t)
Z(t)
( 0
(8.16a)
with
(8.I6b)
vet) = H(q--l)e(t)
(8.17)
where the matrix filter H(q-l) is asymptotically stable and invertible, H(O) = I,
and Ee(t)eT(s) = Aol,s with the covariance matrix A positive definite. (See, for
example, Appendix A6.1.)
A4: The input u(t) and the disturbance vet) are independent. (The system operates
in open loop.)
AS: The model (8.1) and the true system (8.6) have the same transfer function
Section 8.2
Theoretical analysis
265
matrix (if and) only if 8 = 8 0 , (That is, there exists a unique member in the
model set having the same noise-free input-output relation as the true system.)
A6: The IV matrix Z(t) and the disturbance v(s) are uncorrelated for all t and s.
The above assumptions are mild. Note that in order to satisfy AS a canonical parametrization of the system dynamics may be used. This assumption can be rephrased as:
the set DT(J?, vIt) (see (6.44 consists of exactly one point. If instead a pseudo-canonical
parametrization (see the discussion following Example 6.3) is used, some computational advantages may be obtained. However, AS will then fail to be satisfied for some
(but only a few) systems (see Complement C6.1, and Soderstrom and Stoica, 1983, for
details). Concerning assumption A6, the matrix Z(t) is most commonly obtained by
various operations (filtering, delaying, etc.) on the input u(t). Provided the system
operates in open loop, (see A4), assumption A6 will then be satisfied. Assumption A6
can be somewhat relaxed (see Stoica and Soderstrom, 1983b). However, to keep the
analysis simple we will use it in the above form.
Consistency
2: Z(t)F(q-l)V(t)
=N
(8.18)
t=1
Then it is readily established from the system description (8.6) and the definitions
(8.l4b, c) that
rN
= RN8 0 + qN
(8.19)
e- 8
= (Rf.;QRN)-lRf.;QqN
(8.20)
= EZ(t)F(q-l)<j>T(t)
--?
00,
and
(8.2la)
N-,,"oo
lim qN = EZ(t)F(q-l)V(t) ~ q
(8.2lb)
N->oo
= EZ(t)F(q-l)<i>T(t)
(8.22)
where <i>(t) is the noise-free part of <j>(t). To see this, recall that the elements of <j>(t)
consist of delayed input and output components. If the effect of the disturbance v()
(which is uncorrelated with Z(t is subtracted from the output components in <j>(t), then
the noise-free part <i>(t) is obtained. More formally, <i>(t) can be defined as a conditional
mean:
<i>(t)
= E[<j>(t)lu(t
(8.23)
Chapter 8
As an illustration the following example shows how ~(t) can be obtained in the SISO
case.
Example 8.3 The noise-free regressor vector
Consider the SISO system
(S.24a)
for which
<l>(t)
(S.24b)
(S.24c)
A(q-l)X(t) = B(q-l)U(t)
(S.24d)
Return now to the consistency analysis. From (S.20) and (S.21) it can be seen that the IV
estimate (S.13) is consistent (Le. lim 8 = 00 ) if
N-~oo
(S.25a)
EZ(t)F(q-l)V(t) = 0
(S.25b)
It follows directly from assumption A6 that (S.25b) is satisfied. The condition (S.25a) is
more intricate to analyze. It is easy to see that (S.25a) implies that
nz
rank EZ(t)ZT(t) ?= nO
?= nO
(S.26)
= {gig
Q,
s is not true}
Section 8.2
The Lebesgue measure is discussed, for example, in Cramer (1946) and Pearson (1974).
In loose terms, we require M to have a smaller dimension than Q. If s is generically true
with respect to Q and Q E Q is chosen randomly, then the probability that Q E M is
zero, i.e. the probability that s is true is one. (In particular, if s is true for all Q E Q,
it is trivially generically true with respect to Q.) Thus the statements's is generically
true' and's is true with probability one' are equivalent.
The idea is now to consider the matrix Rasa function of a parameter vector Q. A
possibility for doing this is illustrated by the following example.
Example 8.4 A possible parameter vector Q
Consider the IV variant (8.15). Assume that F(q-l) is a finite-order (rational) filter
specified by the parameters fl' ... ,Inf' Further, let the input spectral density be specified
by the parameters VI, ... , Vnv ' For example, suppose that the input u(t) has a rational
spectral density. Then {Vi} could be the coefficients of the ARMA representation of u(t).
Now take
Q
= (al
... ana
bI
...
bnb
fl ... fn!
Cl'"
Cnc
d1
...
dnd
VI ... Vnv)T
Additional and similar constraints may be added on Vb ... , Vnv , depending on how
these parameters are introduced.
Note that all elements of the matrix R will be analytic functions of every element of Q.
II
E Q
(8.27)
Note that for the example above,
Q* =
hI ... b nb 0 ... 0
al'"
ana
bI
..
b nb
VI". Vnv)T
which gives F(q-l) == 1 and Z(t) = 4>(t). The matrix in (8.27) is nonsingular if the model
is not overparametrized and the input is persistently exciting (see Complement C6.2). It
should be noted that the condition (8.27) is merely a restriction on the structure of the IV
matrix Z(t). For some choices of Z(t) (and Q) one cannot achieve Z(t) = 4>(t), which
implies (8.27) (and essentially, is implied by (8.27)). In such cases (8.27) can be relaxed:
it is required only that there exists a vector Q* E Q such that R(Q*) has full rank (equal
to nO.) This is a weak condition indeed.
Now set
f(g)
= det[RT(Q)R(Q)J
(8.28)
268
Chapter 8
Here f(Q) is analytic in Q and f(Q*) =1= 0 for some vector Q*. It then follows from the
uniqueness theorem for analytic functions that f(Q) =1= 0 almost everywhere in Q (see
Soderstrom and Stoica, 1983, 1984, for details). This result means that R has full rank for
almost any value of Q in Q. In other words, if the system parameters, the prefilter, the
parameters describing the filters used to create the instruments and the input spectral
density are chosen at random (in a set Q of feasible values), then (8.25a) is true with
probability one. The counterexamples where R does not have full rank belong to a subset
of lower dimension in the parameter space. The condition (8.25a) is therefore not a
problem from the practical point of view as far as consistency is concerned. However, if
R is nearly rank deficient, inaccurate estimates will be obtained. This follows for example
from (8.29), (8.30) below. This phenomenon is illustrated in Example 8.6.
Asymptotic distribution
Next consider the accuracy of the IV estimates. The following result holds for the
asymptotic distribution of the parameter estimate 8:
(8.29)
PIV
= (RTQR)-lRTQ{ E[~)
x
Z(t
i)KiJ
(8.30)
Ki Zi
== F(z)H(z)
(8.31 )
i=O
In the scalar case (ny = 1) the matrix in (8.30) between curly brackets can be written as
(8.32)
Section 8.2
Violation of the assumptions
It is of interest to discuss what happens when some of the basic assumptions are not
satisfied. Assumption A4 (open loop operation) is not necessary. Chapter 10 contains a
discussion of how systems operating in closed loop can be identified using the IV method.
Assume for a moment that AS is not satisfied, so that the true system is not included in
the considered model structure. For such a case it holds that
(8.33a)
In the general case it is difficult to say anything more specific about the limit 8* (see,
however, Problem 8.12). If nz = n8, the limiting estimate 8* can be written as
(8.33b)
where
r = EZ(t)F(q-l)y(t)
In this case, the estimates will still be asymptotically Gaussian distributed,
y' N(e - 8*)
PlY
~ .%(0, PIV )
(8.34)
= R-1SR- T
where
[N
1 E ~ Z(t)F(q-l){y(t) - <l>1(t)8*}
;~oo N
[~Z(S)F(q-l){y(s) -
<l>T(s)8*}
JI'
This can be shown in the same way as (8.29), (8.30) are proved in Appendix A8.I. A
more explicit expression for the matrix S cannot in general be obtained unless more
specific assumptions on the data are introduced. For the case nz > n8 it is more difficult
to derive the asymptotic distribution of the parameter estimates.
Numerical illustrations
The asymptotic distribution (8.29), (8.30) is considered in the following numerical
examples. The first example presents a Monte Carlo analysis for verifying the theoretical
expression (8.30) of the asymptotic covariance matrix. Example 8.6 illustrates the
concept of generic consistency and the fact that a nearly singular R matrix leads to poor
accuracy of the parameter estimates.
Chapter 8
A:
= 1.0U(t - 1) + (1 + O.5q-l)e(t)
(1 - 1.5q-l + 0.7q-2)y(t) = (1.0q-l + 0.5q-2)U(t) + (1 - 1.0q-l + 0.2q-2)e(t)
(1 - O.5q-l)y(t)
./2:
generating for each system 200 realizations, each of length 600. The input u(t) and e(t)
were in all cases mutually independent white noise sequences with zero means and unit
variances.
The systems were identified in the natural model structure (8.2) taking na = nb = 1 for
./1 and na = nb = 2 for '/2'
For both systems two IV variants were tried, namely
51: Z(t) = (u(t -- 1) ... u(t - na - nb)yr
(see (8.15c
and
1
52: Z(t) = AO(q-l) (u(t - 1) ... u(t - na - nb)yr
(see (8.15a, b
where Ao is the A polynomial of the true system. From the estimates obtained in each of
the 200 independent realizations the sample mean and the sample normalized covariance
matrix were evaluated as
p= ~
(8.35)
i=1
where 8i denotes the estimate obtained from realization i, N = 600 is the number of data
points in each realization, and m :::: 200 is the number of realizations. When m and N
tend to infinity, 8 -,> 80 , P-,> P rv . The deviations from the expected limits for a finite
value of m can be evaluated as described in Section B.9 of Appendix B. It foHows from
the theory developed there that
as N
-,> 00
A scalar fiCO, 1) random variable lies within the interval (-1.96, 1.96) with probability
0.95. Hence, by considering the above result for 8 - 80 componentwise, it is found that,
with 95 percent probability,
E[P} k
P!Vd
J
= -m
P 1VJ'k
1 - 11m
m
PIV,PIVkk
JJ
Section 8.2
IPjk -
~ 1.96
PIVjkl
IV:
( p IVjj P
IP]i P-
PIVjjl
P this
Pjk
is
+ p2 )112
IVjk
2.77
Ym
rvjj
In the present example the right-hand side has the value 0.196. This means that the
relative error inPjj should not be larger than 20 percent, with a probability of 0.95.
The numerical results obtained from the simulations are shown in Table 8.1. They are
well in accordance with the theory. This indicates that the asymptotic result (8.29), (8.30)
can be applied also for reasonable lengths, say a few hundred points, of the data series .
()
Yt
=1_
1.0q-l
2Qq
Q2q
2 U
Q vector
()
t
()
+et
u(t)
= wet) -
2Q 2w(t - 2)
Q4 w(t - 4)
TABLE 8.1 Comparison between asymptotic and sample behavior of two IV estimators.
The sample behavior shown is estimated from 200 realizations of 600 data pairs each
Distribution
parameters:
means and
normalized
co variances
System ./!
Variant),!
System./2
Variant)'2
Variant),!
Variant)'2
Asympt.
expect.
values
Sample
estim.
values
Asympt.
expect.
values
Sample
estim.
values
Asympt.
expect.
values
Sample
estim.
values
Asympt.
expect.
values
Sample
estim.
values
Ed!
-0.50
-0.50
-0.50
-0.50
E~2
E~l
1.00
1.00
1.00
1.00
1.25
-0.50
1.26
-0.63
1.31
-0.38
1.37
-0.56
1.25
1.25
1.25
1.23
-1.50
0.70
1.00
0.50
5.19
-7.27
-0.24
6.72
10.38
0.27
-9.13
2.04
-1.44
10.29
-1.51
0.71
1.00
0.49
6.29
-8.72
-0.98
6.74
12.27
1.46
-9.08
2.36
-1.97
8.74
-1.50
0.70
1.00
0.50
0.25
-0.22
--0.08
0.71
0.20
0.06
-0.59
2.04
-1.28
3.21
-1.50
0.70
tOO
0.50
0.27
-0.23
0.02
0.65
0.22
-0.01
-0.54
2.15
--1.03
2.76
Eb 2
PlI
P!2
P 13
P!4
P 22
P 23
P24
P33
P34
P44
Chapter 8
Here e(t) and wet) are mutually independent white noise sequences of unit variance. Let
the instrumental variable vector Z(t) be given by (8.15a, b) with
na
nb = 1
C(q-I)
2Qq-I
Q2q-2
D(q-l) = q-l
Some tedious but straightforward calculations show that for this example (see Problem
8.5)
R
R(Q)
1 - 4Q2 + Q4
( 2(>(1 - (>2)
4Q3
-2Q(1 _ (2)
1 - 4Q2 + Q4
Q2(2 _ (>4)
S(Qb = 1
16(>2
36Q4
16Q6
(>8
g12
Q6
S(>b = (>2
S(>h3 = 1
16(>4
4(>2
+
(>4
2(>9
16Q6
(>8
4Q1O
(8)
and nonzero for all other values of (> in the interval (0, 1).
For (> dose to but different from Q, although the parameter estimates are consistent
(since R(Q) is nonsingular), they will be quite inaccurate (since the elements of P rv (
are large).
For numerical illustration some Monte Carlo simulations were performed. For various
values of g, 50 different realizations each of length N = 1000 were generated. The IV
method was applied to each of these realizations and the sample variance calculated for
each parameter estimate according to (8.35).
A graphical illustration of the way in which Prv(g) and PIV vary with Q is given in
Figure 8.2.
The plots demonstrate dearly that quite poor accuracy of the parameter estimates can
be expected when the matrix R(g) is almost singular. Note also that the results obtained
..
by simulations are quite dose to those predicted by the theory.
Optimal IV method
Note that the matrix P rv depends on the choice of Z(t), F(q-l) and Q. Interestingly
enough, there exists an achievable lower bound on PlY:
105
)t
PlY
)(.
104
103
102
101
0,5
LO
(a)
105
PlY
><
104
103
102
101
0,5
LO
(b)
105
PlY
104
103
102
101
(c)
FIGURE 8.2 (a) PIy(gh,l and PlY],] versus g, (b) P1y(gh.2 and PrY2,2 versus g.
(c) P,y(gh,3 and Ply },3 versus Q.
Chapter 8
~
(Pry - p~f/ is nonnegative definite) where
(8.37)
~(t) being the noise-free part of <p(t). Moreover, equality in (8.36) is obtained, for
example, by taking
'---_Z_C_t)_=_[A_-_l_H_-_l(_q_-l_)<p_-T_U_)l_T __F_C_q-_-l_)_=__H_-l_(q_-_-l_)_ _
Q_=_I_ _ _(8.38)]
These important results are proved as follows. Introduce the notation
w(t)
= RTQZ(t)
aCt)
2:
wet + i)K;
i=O
~(t) = [H-1(q-l)<f>T(t)f
RTQR
Ew(t)F(q-l)<f>T(t)
= Ew(t)
;=0
00
2:
Ew(t + i)KiH-l(q-l)~T(t)
Ea(t)~T(t)
i=O
Thus
PLY
[Ea(t)~T(t)rl[EaCt)AaT(t)][E!3(t)aT(t)]-l
From (8.37),
p~)tl
[E~(t)A-l~T(t)rl
The inequality (8.36) which, according to the calculations above, can be written as
[Ea(t)AaT(t)] - [Ea(t)~T(t)][E~(t)A --l~T(t)rl[E~(t)aT(t)] ~ 0
now follows from Lemma A.4 and its remark (by taking Zl(t) = a(t)AlI2, Z2(t) =
~(t)A -112). Finally, it is trivial to prove that the choice (8.38) of the instruments and the
prefilter gives equality in (8.36).
Section 8.2
yet)
= q:,T(t)8 +
H(q-\ 8, ~)e(t, 8, ~)
(S.39)
Here, H(q-\ 8, ~) isa filter that should describe the correlation properties of the
disturbances (cf. (S.17. It depends on the parameter vector 8 as well as on some
additional parameters collected in the vector ~. An illustration is provided in the
following example.
Example 8.7 Model structures for .'lISa systems
For SISO systems two typical model structures are
A( A(q-l)y(t) = B(q-l)U(t) +
Z~:=:~ e(t)
(S.40)
and
(S.41)
where in both cases
=
D(q-l) =
C(q-I)
Clq-l
+ '" + cncq-nc
(S.42)
Note that both these model structures are special cases of the general model set (6.14)
introduced in Example 6.2. Introduce
~
(Cl . . . Cnc
d1
..
d nd ) T
~I{2
He
corresponds to
--1.
8 P.)
,I"
A(q-l) C(q--l)
D(q-l)
(S.43b)
III
H(q-l; 80 , ~)
== H(q--l)
<-.> ~
130
(S.44a)
for a unique vector ~o which will be called the true noise parameter vector. Further,
introduce the filter fi(q-\ 8, (3) through
H(q-\ 8, ~)
== A(q-l)fi(q-\ 8, 13)
(S.44b)
Chapter 8
P opt
IV
(8.45)
PEM
When H(q-l; 8, (3) does not depend on 8, the optimal IVM and PEM will have the same
asymptotic distribution (i.e. equality holds in (8.45, as shown in Appendix A8.2. Note
for example that for the structure.Alz, (8.41), the filter H(q-l; 8, (3) does not depend
on 8.
An approximate implementation of the optimal IV estimate (8.38) can be carried out
as follows.
Step 1. Apply an arbitrary IV method using the model structure (8.1). As a result, a
consistent estimate 8[ is obtained.
Step 2. Compute
OCt)
yet) - cp T(t)8,
yet)
cpT(t)8 1 + H(q-l;
8[,
(3)e(t,
811
(3)
using a prediction error method (or any other statistically efficient method). Call
the result ~2' Note that when the parametrization H(q-l; 81 , (3) = lID(q-l) is
chosen this step becomes a simple least squares estimation.
Step 3. Compute the optimal IV estimate (8.38) using 81 to form <P(t) and 81 , ~2 to form
H(q-l) and A. Call the result 83 ,
Step 4. Repeat Step 2 with 83 replacing 8[. The resulting estimate is called ~4'
In practice one can repeat Steps 3 and 4 a number of times until convergence is attained.
Theoretically, though, this should not be necessary if the number of data is large
(see below).
In Soderstrom and Stoica (1983) (see also Stoica and S6derstr6m, 1983a, b) a detailed
analysis of the above four-step algorithm is provided, as well as a comparison of the
accuracy of the estimates of 8 and f3 obtained by the above algorithm, with the accuracy
of the estimates provided by a prediction error method applied to the model structure
(8.39). See also Appendix A8.2. The main results can be summarized as follows:
All the parameter estimates 81 , ~2' 83 and ~4 are consistent. They are also
asymptotically Gaussian distributed. For 83 the asymptotic covariance matrix is
P IV = P opt
IV
(8.46)
Assume that the model structure (8.39) is such that H(q--l; 8, ~) (see (8.44b)) does not
depend on 8. (This is the case, for example, for.Al 2 in (8.41. Then the algorithm
converges in three steps in the sense that ~4 has the same (asymptotic) covariance
matrix as ~2' Moreover, (8I ~nT has the same (asymptotic) covariance matrix as the
277
Computational aspects
Section 8.3
i = 1, ... , ny
(8.47)
and thus
8
(~1)
(S.48)
8 ny
This was indeed the case for the full polynomial form (see (8.5)). Assume that Z(t),
F(q-l) and Q are constrained so that they also have a diagonal structure, i.e.
Z(t) = diag[z;(t)]
Q
(8.49)
= diag[Q;]
with dim zJt) ;::;: dim cp;(t), i = 1, ... , ny. Then the IV estimate (8.13) reduces to
8i = arg ~~n
1-1
(8.50)
Chapter 8
The form (8.50) has advantages over (8.13) since ny small least squares problems are to
be solved instead of one large. The number of operations needed to solve the least
squares problem (8.50) for one output is proportional to (dim El i )2(dim Zi)' Hence the
reduction in the computational load when solving (8.50) for i = 1, ... , ny instead of
(8.13) is in the order of ny. A further simplification is possible if none of epi(t), Zi(t) or
fi(q-l) varies with the index i. Then the matrix multiplying Eli in (8.50) will be
independent of i.
[~ Z(t)F(q-l)<J>T(t)] El
= Q1I2
[~ Z(t)F(q-l)y(t)]
(8.51)
(there. are nz equations and nEl unknowns). It is straightforward to solve this problem
analytically (see (8.14)). However, the solution computed using the so-called normal
equation approach (8.14) will be numerically sensitive (cf. Section AA). This means
that numerical errors (such as unavoidable rounding errors made during the computations) will accumulate and have a significant influence on the result. A numerically sound
way to solve such systems is the orthogonal triangularization method (see Section A.3
for details).
= <J>T(t)8 + E(t)
.4: yet)
(8.52a)
<l>T(t) =
(<I>T(t)
<l>i(t
8 = (ST
eiY
(8.52b)
= <J>T(t)El l + E(t)
(8.52c)
which can be seen as a subset of A. A typical case of such nested structures is when A is
of a higher order than .At'. In such a case the elements of <l>2(t) will have larger time lags
than the elements of <J>1(t).
Computational aspects
Section 8.3
279
Z(t)
(Zl(t)
Z2(t)
(8.52d)
~ ~ Z(t)<pT(t) = ~~ (~:~~D(<pT(t)
<pi(t)
(;:: ;::)
(8.52e)
R- 1
N 2:
(8.52g)
Z(t)y(t)
1=1
Using the formula for the inverse of a partitioned matrix (see Lemma A.2),
--I) }
1
(8.52h)
N 2: Z(t) yet)
1=1
E(t)
r
= y(t) - <pT(t)S;
1 N
N
Z2(t)E(t)
2:
1=1
82 =
A
81
1 N
N
Z2(t)y(t) - R 21 8i
2:
1=1
= 8; - Rji R 1282
Two expressions for r are given above. The first expression allows some interesting
interpretations of the above relations, as shown below. The second expression for r
Chapter 8
Remark. Example S.S is based on pure algebraic manipulations which hold for a
general Z(t) matrix. In particular, setting Z(t) = <p(t) produces the corresponding
relations for the least squares method.
II
Summary
Instrumental variable methods (IVMs) were introduced in Section S.1. Extended IVMs
include prefiltering of the data and an augmented IV matrix. They are thus considerably
more general than the basic IVMs.
The properties of IVMs were analyzed in Section S.2. The main restriction in the
analysis is that the system operates in open loop and that it belongs to the set of models
considered. Under such assumptions and a few weak additional ones, it was shown that
the parameter estimates are not only consistent but also asymptotically Gaussian
distributed. It was further shown how the covariance matrix of the parameter estimates
can be optimized by appropriate choices of the prefilter and the IV matrix. It was also
discussed how the optimal IV method can be approximately implemented and how it
compares to the prediction error method. Computational aspects were presented in
Section S.3.
Problems
Problem 8.1 An expression for the matrix R
Prove the expression (S.22) for R.
Problems
281
+ ay(t - 1)
= bluet -
1)
+ b2 u(t - 2) + vet)
whose parameters are estimated with the basic IV method using delayed inputs as
instruments
Z(t) = (u(t - 1)
u(t - 2)
u(t - 3T
nT -
x(t)
wet)
1---_+
u(t)
vet)
yet)
Chapter 8
Show how m: can be chosen so that the consistency condition (8.25b) is satisfied. Discuss
how nt should be chosen to satisfy also the condition (8.25a).
Problem 8.5 Generic consistency
Prove the expressions for R(g) and det R(g) in Example 8.6.
Problem 8.6 Parameter estimates when the system does not belong to the model structure
Consider the system
+ ay(t - 1)
= bu(t - 1)
+ E(t)
Derive the asymptotic (for N - ? 00) expressions of the LS estimate and the basic IV
estimate based on the instruments
u(t - 2T
z(t) = (u(t - 1)
of the parameters a and b. Examine also the stability properties of the models so
obtained.
Problem 8.7 Accuracy of a simple IV method applied to a first-order ARMAX system
Consider the system
yet)
ay(t - 1)
bu(t - 1)
eel)
ce(t - 1)
where u(t) and e(t) are mutually uncorrelated white noise sequences of zero means
and variances 0 2 and ".2, respectively. Assume that a and b are estimated with an IV
method using delayed input values as instruments:
z(t)
u(t - 2T
(u(t - 1)
~ p~~t _
and that
a = c.
PPEM
P always
;;=: 0
Problems 283
Problem 8.9 A necessary condition for consistency of IV methods
Show that (8.26) are necessary conditions for the consistency of IV methods.
Hint. Consider the solution of
xT[EZ(t)ZT(t)]X = 0
with respect to the nz-dimensional vector x. (The solutions will span a subspace - of what
order?) Observe that
xTR
= BO(q-l)U(t) + vet)
where the polynomials A and BO have the same structure and degree as A and B in
(8.3), and where vet) and u(s) are un correlated for all t and s. Let u(t) be a white noise
sequence with zero mean and variance (J~. Finally, assume that the polynomials A O(q-l)
and BO(q-l) have no common zeros.
Show that under the above conditions the IV vector (8.15c) satisfies the consistency
conditions (8.25a, b) for any stable prefilter F(q--l) with F(O) *- O.
Hint. Rewrite the vector </>(t) in the manner of the calculations made in Complement
C5.1.
Problem 8.11 The accuracy of extended IV methods does not necessarily improve when
the number of instruments increases
Consider an ARMA (1, 1) process
yet) + ay(t - 1) = eel) + ce(t -1)
Let
lal <
1;
Ee(t)e(s) =
or,s
II [~ Z(t)y(t -
1)]a +
a is
n;;:1
a is given
by
(i)
where
R = EZ(t)y(t - 1)
C(q-I) = 1
+ cq-l
Chapter 8
1, ... , na + nb
That is to say, the model weighting sequence exactly matches the first (na
coefficients of the system weighting sequence. Comment on this property.
nb)
Bibliographical notes
The IV method was apparently introduced by Reiers01 (1941) and has been popular in
the statistical field for quite a long period. It has been applied to and adapted for dynamic
systems both in econometrics and in control engineering. In the latter field pioneering
work has been carried out by Wong and Polak (1967), Young (1965, 1970), Mayne
(1967), Rowe (1970) and Finigan and Rowe (1974). A well written historical background
to the use of IV methods in econometrics and control can be found in Young (1976), who
also discusses some refinements of the basic IV method.
For a more recent and detailed appraisal of IV methods see the book Soderstrom and
Stoica (1983) and the papers Soderstrom and Stoica (1981c), Stoica and SOderstrom
(1983a, b). In particular, detailed proofs of the results stated in this chapter as well as
many additional references on IV methods can be found in these works. See also Young
(1984) for a detailed treatment of IV methods.
The optimal IV method given by (8.38) has been analyzed by Stoica and Soderstrom
(1983a, b). For the approximate implementation see also the schemes presented by
Young and lakeman (1979), lakeman and Young (1979).
For additional topics (not covered in this chapter) on IV estimation the following
Appendix A8.1
285
papers may be consulted: Soderstrom, Stoica and Trulsson (1987) (IVMs for closed loop
systems), Stoica et al. (1985b), Stoica, Soderstrom and Friedlander (1985) (optimal IV
estimation of the AR parameters of an ARMA process), Stoica and Soderstrom (1981)
(analysis of the convergence and accuracy properties of some bootstrapping IV
schemes), Stoica and Soderstrom (1982d) (IVMs for a certain class of nonlinear
systems), Benveniste and Fuchs (1985) (extension of the consistency analysis to
non stationary disturbances), and Young (1981) (IV methods applied to continuous time
models).
Appendix AB.}
Covariance matrix of IV estimates
A proof is given here of the results (8.29)-(8.32) on the asymptotic distribution of the IV
estimates. From (8.6) and (8.14),
(A8.I.1)
- 1 2: Z(t)F(qN
v'N
dis!
(A8.I.2)
00
1=1
where
Po
= :~oo
N E
1=1
s=l
2: Z(t)F(q-l)V(t) 2: (F(q-l)V(S
"
I
ZT(S)
according to Lemma B.3. The convergence in distribution (8.29) then follows easily from
the convergence of RN to R and Lemma B.4. It remains to show that Po as defined by
(A8.I.2) is given by
Po =
(A8.I.3)
2:
2:
00
2: 2:
OO
2: 2: 2: 2:
1
~~oo N E
2: 2: (N T=-N ;=0
b;I)Z(t)KiAK1+TZT(t + 't)
Chapter 8
,=-00 ;=0
- lim
N~oo
00
N L It I L
't=-N
EZ(t)K;AKT+,z1.(t
+ t)
i=O
(A8.1.4)
which proves (A8.1.3). The second equality of (A8.1.4) made use of assumption A6. The
last equality in (A8.1.4) follows from the assumptions, since
<
a < 1. Hence
Lltl Cal'l
2
~N
T=-N
00
L Cltlal'l
-!>
0, as N
-!> 00
,=()
Next turn to (8.32). In the scalar case (ny = 1), K; and A are scalars, so they commute
and
Po
ALL
K;KjEZ(t
+ i)ZT(t + j)
i=O j=O
i=O j=O
= AE[#a KjZ(t -
j)
J[~ KiZT(t - i) J
(A8.1.5)
= AE[K(q-l)Z(t)][K(q-l)Z(t)V
= AE[ F(q-l)H(q-l)Z(t)][ F(q-l)H(q-l)Z(t)]T
which is (8.32).
Appendix AB.2
Comparison of optimal IV and prediction error estimates
This appendix will give a derivation of the covariance matrix PPEM of the prediction error
estimates of the parameters 8 in the model (8.39), as well as a comparison of the matrices
PPEM and P1frt . To evaluate the covariance matrix PPEM the gradient of the prediction
e(t, 1'])
1']
= lr\q-I; 8,
~
~)[y(t) - <j)T(t)8]
(A8.2.1)
~T]T
[ST
iJe(t, 1'])1
. = -H-I(q-I) iJH(q-l; 'YJ)I
e(t, 1']0)
iJS i 1)=1)"
iJS i
1)=1)0
- H- 1(q-l) ~ <j)T(t)S
(A8.2.2a)
iJS i
iJe(t;1'])1
iJlJ l
1)=1)0
e(t)
(A8.2.2b)
1)=1)0
where ei denotes the ith unit vector. (Note that e(t, 1']0) = e(t).) Clearly the gradient
W.r. t. S will consist of two independent parts, one being a filtered input and the other
filtered noise. The gradient W.r.t. ~ is independent of the input. To facilitate the
forthcoming calculations, introduce the notation
ee(t)
~ iJe~8 1']) 1 __
= ee(t) + ea(t)
1)-1)0
( ) 6
ef:l t
iJe(t, 1']) I
iJP.
IJ
(A8.2.3)
1)=1)0
In (A8.2.3) Ee(t) denotes the part depending on the input. It is readily found from
(A8.2.2a) that
Ee(t) = _H-l(q-l)~T(t)
(A8.2.4)
Qee
6
=
T
-I
EEe(t)A
eeCt)
T.
Eee(t)A
- I Ef:I(t)
(A8.2.S)
QI1e ~ QJf:\
Qflfl ~ EE~'(t)A -Ief:l(t)
(A8.2.6)
Chapter 8
(P~tt)-l +
PEM
QstJ
Q(3tJ
QtJl3)-l
Qf113
(A8.2.8)
(A8.2.~
The inequality (A8.2.9) confirms the conjecture that the optimal prediction error
estimate is (asymptotically) at least as accurate as an optimal IV estimate. It is also found
that these two estimates have the same covariance matrix if and only if
Q=O
(A8.2.1O)
Ee(t)
(A8.2.ll)
= 0
since then QStJ = 0, Qel3 = O. Note that using the filter H defined by (8.44b), the
model structure (8.39) can be rewritten as
(A8.2.l2)
Hence the prediction error becomes
(A8.2.13)
from which it is easily seen that if the filter H(q-l; e, ~) does not depend on
(A8.2.ll) is satisfied. Hence in this case equality holds in (A8.2.9).
e then
Complement CB.l
Yule- Walker equations
This complement will discuss, in two separate subsections, the Yule- Walker equation
approach to estimating the AR parameters and the MA parameters of ARMA processes,
and the relationship of this approach to the IV estimation techniques.
Complement C8.1
289
AR parameters
Consider the following scalar ARMA process:
(CS.l.l)
where
C(q-I) = 1
+ CIq-l +
Ee(t) = 0
+ cncq-nc
Ee(t)e(s) = 'f.?6 t s
A(z)C(z)
*-
for
Izl
~ 1
Let
rk ~
E,y(t)y(t - k)
0, 1, 2, ...
EC(q-l)e(t)y(t - k) = 0
for k > nc
(CS.1.2)
Thus, multiplying both sides of (CS.l.l) by yet - k) and taking expectations give
k
= nc +
1, nc
+ 2,
...
(CS.1.3)
This is a linear system in the AR parameters {a;}. The equations (CS.1.3) are often
referred to as Yule- Walker equations. Consider the first m equations of (CS.1.3), which
can be written compactly as
Ra = -r
where m
(CS.1.4)
na, and
The matrix R has full rank. The proof of this fact can be found in Stoica (19S1c) and
Stoica, Soderstrom and Friedlander (1985).
The covariance elements of Rand r can be estimated from the observations available,
say {y(l), ... , yeN)}. Let Rand f denote consistent estimates of Rand r. Then a
obtained from
Ra
Chapter 8
(CS.1.5)
= -f
liRa + fl12
= min
liRa + fll~
= min
12: (Y (t N
t=
f =
N 2:
t=l
nc -
1) (y(t -
1) ... y(t - na
Y (t - nc - m)
(y(t - nc -
1) yet)
(CS.1.6)
y (t - nc - m)
yet) =
z(t)
leads precisely to (CS.1.5), (CS.1.6). Other estimators of Rand f are asymptotically (as
~ (0) equivalent to (CS.1.6) and therefore lead to estimates which are asymptotically
equal to the IVE (CS.1.5), (CS.1.6). The conclusion is that the asymptotic properties of
the YW estimators follow from the theory developed in Section S.2 for IVMs. A detailed
analysis for ARMA processes can be found in Stoica, Soderstrom and Friedlander (19S5).
For pure AR processes it is possible also to obtain a simple formula for A? We have
N
A?
~Ee2(t) =
=
Ee(t)A(q-l)y(t)
= Ee(t)y(t)
(CS.1.7)
Therefore consistent estimates of the parameters {a;} and {A} of an AR process can be
obtained replacing {ri} by sample covariances in
(CS.l.S)
Complement C8.1
291
(cf. (CS.1.4), (CS.1.7). It is not difficult to see that the estimate of a so obtained is, for
large samples, equal to the asymptotically efficient least squares estimate of a. A
computationally efficient way of solving the system (CS.l.S) is presented in Complement
CS.2, while a similar algorithm for the system (CS.1.4) is given in Complement CS.3.
MA parameters
As shown above, see (CS.1.3), the AR parameters of an ARMA satisfy a set of linear
equations, called YW equations. The coefficients of these equations are equal to the
covariances of the ARMA process. To obtain a similar set of equations for the MA
parameters, let us introduce
Yk
Jt
/':, _1_
=
(2:rr/
-Jt
_1_
<p(w) e
iwkd
(CS.1.9)
where <p(w) is .the spectral density of the ARMA process (CS.1.1). Yk is called the
inverse covariance of y(t) at lag k (see e.g. Cleveland, 1972; Chatfield, 1979). It can
also be viewed as the ordinary covariance of the ARMA process x(t) given by
EE(t)E(S)
:2
(CS. 1. 10)
Of,S
Note the inversion of the MA and AR parts. Indeed, x(t) defined by (CS.l.lO) has the
spectral density equal to 1/[(2Jt)2<p(W)] and therefore its covariances are {yd. It is now
easy to see that the MA parameters {cJ obey the so-called 'inverse' YW equations:
na
+1
(CS.l.l1)
To use these equations for estimation of {cJ one must first estimate {yd. Methods for
doing this can be found in Bhansali (19S0). Essentially all these methods use some
estimate of <p( w) (either a nonparametric estimate or one based on high-order AR
approximation) to approximate {yd from (CS.1.9).
The use of long AR models to estimate the inverse covariances can be done as follows.
First, fit a large-order, say n, AR model to the ARMA sequence {yet)}:
L(q-l)y(t)
= wet)
(CS.1.12)
where
L(q-l)
ao
and where w(t) are the AR model residuals. The coefficients {a;} can be determined
using the YW equations (CS.1.5). For large n, the AR model (CS.1.12) will be a good
description of the ARMA process (CS.1.1), and therefore wet) will be close to a white
noise of zero mean and unit variance. Since L in (CS.1.12) can be interpreted as an
approximation to AI( CA) in the ARMA equation, it follows that the closer the zeros of
C are to the unit circle, the larger n has to be for (CS.1.12) to be a good approximation of
the ARMA process (CS.1.1).
The discussion above implies that, for sufficiently large n a good approximation of the
inverse spectral density function is given by
Chapter 8
which readily gives the following estimates for the inverse covariances:
l/-k
Yk =
'L
(CS.l.13)
6"JJ'i+k
i=O
Inserting (CS.l.13) into the inverse YW equations, one can obtain estimates of the MA
parameters of the ARMA process.
It is interesting to note that the above approach to estimation of the MA parameters of
an ARMA process is quite related to a method introduced by Durbin in a different way
(see Durbin, 1959; Anderson, 1971).
Complement CB.2
The Levinson- Durbin algorithm
Consider the following linear systems of equations with {ak,i}~! and
k = 1, ... , n
Gk
as unknowns:
(CS.2.1)
where
(CS.2.2)
Qk-l
Ok
Qk-2'"
Qo
and where {Qi}f=o are given. The matrices {Rd are assumed to be nonsingular. Such
systems of equations appear, for example, when fitting AR models of orders k =
1, 2, ... , n to given (or estimated) covariances {Qi} of the data by the YW method
(Complement CS.1). In this complement a number of interesting aspects pertaining to
(CS.2.1) are discussed. First it is shown that the parameters {Ok> Gk}k~l readily
determine a Cholesky factorization of R;~I' Next a numerically efficient algorithm,
the so-called Levinson-Durbin algorithm (LDA), is presented for solving equations
(CS.2.1). Finally some properties of the LDA and some of its applications will be given.
QJ...
Ql
Qo
Qm
1
am ,1
0
'1
Qrn
Qm-l'"
Qo
am,m
... a1,1
'----.r-----'
'--v---'
Rm+l
Um + 1
1
(C8.2.3)
To see this, note first that (C8.2.3) clearly holds for m = 1 since
Ga~'l)(~:
~~)C:'1 ~) (~1 ~J
=
(01
which proves the assertion. Thus by induction, (C8.2.3) holds for m = 1, 2, ... , n.
Now, from (C8.2.3),
(C8.2.4)
which shows that U m + 1 D;;'~~ is the lower triangular Cholesky factor of the inverse
covariance matrix R;;,L.
The Levinson - Durbin algorithm
The following derivation of the Levinson-Durbin algorithm (LDA) relies on the
Choiesky factorization above. For other derivations, see Levinson (1947), Durbin
Chapter 8
(1960), Soderstrom and Stoica (1983), Nehorai and Morf (1985), Demeure and Scharf
(1987).
Introduce the following convention. If x = (Xl ... xn)T then the reversed vector
(x" Xn-l'" Xl)T is denoted by XR. It can be readily verified that due to the Toeplitz
structure of the matrices Rk, equations (C8.2.1) give
(C8.2.5)
Inserting the factorization of RJ:~l (see (C8.2.4)) into (C8.2.5) gives
( 1I0k
Elk/Ok
ElI/Ok) (Qk+l)
RJ:l
-(Qk+l
-Elk(Qk+l
r~
+ ElkElI!Ok
+ Elr r~)/Ok
T
(C8.2.6)
)
R
Next note that making use of (C8.2.1), (C8.2.6) one can write
0k+l
Qo
Elk+ 1rk+l
= Qo
(ak+l,k+l
(enr r~ +
Qo
'-----v----'
Ok
= Ok -
Qo
R
+ (El kR+ 1) T rk+I
(C8.2.7)
a~+l,k+l Ok
It now follows from (C8.2.6) and (C8.2.7) that the following recursion, called the LDA,
holds (for k = 1, ... , n - 1):
ak,kQl)/Ok
i=l, ... ,k
(C8.2.8)
-Q1/Qo
01 = Qo
Q1a1,1
Complement C8.2
295
O(n4) operations for computation of all {8k, Ok}k=l. Hence for large values of n, the
LDA will be much faster than a standard routine for solving linear systems.
Note that the LDA should be stopped if for some k, ak+l,k+! = 1. In numerical
calculations, however, such a situation is unlikely to occur. For more details on the LDA
and its numerical properties see Cybenko (1980), Friedlander (1982).
It is worth noting that if only 8 n and On are required then the LDA can be further
simplified. The so-called split Levinson algorithm, which determines 8n and On, is about
twice faster than the LDA; see Delsarte and Genin (1986).
for k
det Rm+l = go
IT
Ok
k=l
m = 1,2
Thus the matrix Rn+l is positive definite if and only if go> 0 and Ok > 0, k = 1, ... , n.
However the latter condition is equivalent to (i).
Next consider the equivalence (i) ~ (iii). We make use of Rouche's theorem to prove
this equivalence. This theorem (see Pearson, 1974, p. 253) states that if.f(z) and g(z) are
two functions which are analytic inside and on a closed contour C and Ig(z)1 < If(z)1 for
z E C, then.f(z) and.f(z) + g(z) have the same number of zeros inside C. It follows from
(C8.2.8) that
(C8.2.9)
where AkCz) and A k+ 1(z) are defined similarly to An(z). If (i) holds then, for Izi = ],
IAk(z)1
Thus, according to Rouche's theorem A k+1 (z) and Ak(Z) have the same number of zeros
inside the unit circle. Since Al (z) = 1 + aI, 1 z has no zero inside the unit circle when (i)
holds, it follows that Ak(z), k = 1, ... ,n have all their zeros strictly outside the unit disc.
Next assume that (iii) holds. This clearly implies lan,nl < 1. It also follows that
for
Izi
~ 1
(C8.2.1O)
For if A n - 1(z) has zeros within the unit circle, then Rouche's theorem, (C8.2.9) and
lan,nl < 1 would imply that An(z) also must have zeros inside the unit circle, which is
Chapter 8
a contradiction. From (C8.2.1O) it follows that lan-l,n-Il < 1, and we can continue in the
same manner as above to show that (i) holds.
For more details on the above properties of the LDA and, in particular, their extension
to the 'boundary case' in which lak,kl = 1 for some k, see Soderstrom and Stoica (1983),
Stoica and SOderstrom (1985).
For some optimization problems in system identification, the independent variables are
coefficients of certain polynomials (see e.g. Chapter 7). Quite often these coefficients
need to be constrained in the course of optimization to guarantee that the corresponding
polynomial is stable. More specifically, let
A(z)
aiz
+ ... +
anz"
and
q; = {{ai}~l I A(z)
=1=
for
Izl
~ I}
It is required that {ai} belong to q;. To handle this problem it would be convenient to
reparametrize A(z) in terms of some new variables, say {Ui}?=I, which are such that
when {Ui} span 9l n, the corresponding coefficients {ai} of A (z) will span q;. Then the
optimization could be carried out without constraint, with respect to {Ui}' Such a
reparametrization of A (z) can be obtained using the LDA.
Let
k
1, ... , n
Uk E
9l
k = 1, ... , n
(cf. Jones (1980. Note that the last equation above is the second one in the LDA
(C8.2.8). When {Uk} span 9l n , the {<Pk} span the interval (-1,1). Then the coefficients
{an,i} of A,,(z) span q;, according to the equivalence 0) <? (iii) proved earlier in this
complement.
Complement C8.2
297
{QiIRn+l > O}
Let
k
1, ... , n,
k = 1, ... , n
By rearranging the LDA it is possible to determine the sequence {Qi }i'=o corresponding
to {ak,k}k=l' The rearranged LDA runs as follows. For k = 1, ... , n - 1,
Qk+l = -Oka k+1.k+1 -ak, I Qk {
0k+l
ak+l
= Ok(1
= ak,1
a~+l,k+l)
(C8.2.11)
ak+l,k+l ak.k+l-1
1, ... , k
=
=
-al.1
1-
aL
When {ad span '!Jl n, the {(l>k = ak, d span the interval ( - 1, 1) and, therefore, {Q i} span
(cf. the previous equivalence (i) > (ii. Thus the optimization may be carried out
without constraints with respect to {ai}'
Note that the reparametrizations in terms of {ak,k} discussed above also suggest
simple means for obtaining the nonnegative definite approximation of a given sequence
{Qo, ... , Qn}, or the stable approximant of a given A (z) polynomial. For details on these
approximation problems we refer to Stoica and Moses (1987). Here we note only that to
approach the second approximation problem mentioned above we need a rearrangement
of the LDA which makes it possible to get the sequence {ak,k}k=l from a given sequence
of coefficients {a n ,i}i'=l' For this purpose, observe that (C8.2.9) implies
q)
"*
<Pk+lAk(Z)
Zk+lA k (z-l)
(CS.2.12)
or, equivalently
ak,i = (ak+l,i - ak+l,k+l ak+1.k+1-i)/(1 -
a~+l,k+1)
i = 1, ... , k and
k = n - 1, ... , 1
(C8.2.14)
298
Chapter 8
According to the equivalence (i) > (iii) proved above, the stability properties of the
given polynomial An(z) are completely determined by the sequence {<h = ak,dk=l
provided by (C8.2.14). In fact, the backward recursion (C8.2.14) is one of the possible
implementations of the Schur-Cohn procedure for testing the stability of AnCz) (see
Vieira and Kailath, 1977; Stoica and Moses, 1987, for more details on this interesting
connection between the (backward) LDA and the Schur-Cohn test procedure).
Complement C8.3
A Levinson-type algorithm for solving nonsymmetric
Yule- Walker systems of equations
Consider the following linear systems of equations with {an, J i~ I as unknowns:
Rn 8n
-Qn
= 1, 2,
... , M
(C8,3.1)
where
8n
= (an, I
"
rm
an,n) T
rm +l
rm-l
rm
rm+n--l
rm+n -2
Rn =
rm+l-- n
rm +2-n
,
..
Qn =
rm
and where m is some fixed integer and {rd are given reals. Such systems of equations
appear for example when determining the AR parameters of ARMA-models of orders
(1, m), (2, m), ' . " (M, m) from the given (or estimated) covariances {rd of the data by
the Yule-Walker method (see Complement C8.1).
By exploiting the Toeplitz structure of the matrix Rn it is possible to solve the systems
(C8.3,1) in 3M 2 operations (see Trench, 1964; Zohar, 1969, 1974) (an operation is
defined as one multiplication plus one addition), Note that use of a procedure which does
not exploit the special structure of (C8.3,1) would require O(M4) operations to solve for
8j, ",,8 M , An algorithm is presented here which exploits not only the Toeplitz
structure of Rn but also the specific form of Qn. This algorithm needs 2M2 operations
to provide 8], ',., 8 M and is quite simple conceptually. Faster algorithms for solving
(C8,3,1), which require in the order of M(log M)2 operations, have been proposed
recently (see Kumar, 1985). However, these algorithms impose some unnecessarily
strong conditions on the sequence of {ri}' If these conditions (consisting of the nonsingularity of some matrices (others than Rn) formed from {rJ) are not satisfied, then
the algorithms cannot be used.
Throughout this complement it is assumed that
det Rn
=1=
for n
= 1, ... , M
(C8.3.2)
Complement C8.3
299
cases condition (CS.3.2) is stronger than necessa y (e.g. note that for '", = 0 and
oF 0, det R2 oF 0 while det R j = 0).
As a starting point, note the following nested structures of Rn+l and Qn+l:
'",-I'm+l
(CS.3.3)
Q,,+1 =
where
(CS.3.4)
and where the notation i = (xn ... Xjyr denotes the vector x = (Xl ... xn)T, i.e. with
the elements in reversed order. Note also that
(CS.3.5)
Using Lemma A.2 and (CS.3.3)-(CS.3.5) above,
= -{
G)
(~n)
R;l(I 0) +
(-R;1f:3n)C
__ ~
-TR n- 1
1
1)/( ~ _
-TR-1R
)}(
n ~
Q1I .. )
'm+n+1
_;:( ~n)
where
/'0,
an = 'm -
-TR-1R
Qn
n Fn -
'm
-TA
+ Qnun
An order update for an does not seem possible. However, order recursions for Ll n and an
can be derived easily. Thus, from Lemma A.2 and (CS.3.3)-(CS.3.5),
-{ C)R;l(O
x
1)
('m~:-l)
Chapter 8
0,,+1
-T
R n+1PI1+1
-] A
Qn+l
= rm
+ (r m +
II
T) { ( L"!.n
0)
+1 QI1
f.!n ( 81 )}
- an
n
where
A simple order update formula for !-tn does not seem to exist.
Next note the following readily verified property. Let x be an (nil) vector, and set
y = Rnx
(or x = R,-;I y)
Then
an
Q~~'R;;-T~fl
rm -
rm
+ Q!A" =
rm
+ Q!L"!.n =
On
With this observation the derivation of the algorithm is complete. The algorithm is
summarized in Table C8.3.1, for easy reference. Initial values for 8, L"!. and are
obtained from the definitions. The algorithm of Table C8.3.l requires about 4n
operations per iteration and thus a total of about 2M2 operations to compute all {8 11 } ,
n = 1, ... , M.
Next observe that the algorithm (C8.3.6)-(C8.3.1O) should be stopped if On = 0 is
obtained. Thus the algorithm works if and only if
for
n= 1, ... , M - 1
and
rm
'* 0
(C8.3.11)
TABLE CS.3.1 A Levinson-type algorithm for solving the non symmetric Toeplitz systems
of equations (C8.3.1)
Initialize:
For
81
~l
= -r m __ I /'111
01
= r",(1 - ()1~1)
11
-rm+l/rm
= 1 to (M - 1) put:
an
= rm+n+l
+ Q~'8n
+ ~~~-Lln
(CR.3.6)
(CS.3.7)
(CS.3.S)
(CS.3.9)
(C8.3.1O)
Complement C8.3
301
=1=
=?
{an
=1=
0 and det Rn
=1=
(C8.3.13)
O}
G~~)Rn+l(;n
~) = (~n
:J
(CS.3.14)
This equation resembles a factorization equation which holds in the case of the
Levinson-Durbin algorithm (see Complement CS.2).
It is interesting to note that for m = 0 and r -i = ri, the algorithm above reduces to the
LDA presented in Complement CS.2. Indeed, under the previous conditions the matrix
Rn is symmetric and !)n = Qn Thus An =
and Iln = an, which implies that the
equations (CS.3.7) and (CS.3.9) in Table CS.3.1 are redundant and should be
eliminated. The remaining equations are precisely those of the LDA.
In the case where m =1= 0 orland r _; =1= ri, the algorithm of Table CS.3.1 is more
complex than the LDA, as might be expected. (In fact the algorithm (CS.3.6)-(CS.3.1O)
resembles more the multivariable version of the LDA (which will be discussed in
Complement C8.6) than the scalar LDA.) To see this more clearly, introduce the
notation
en
fln,nZn
1 + (z ". zn)8 n
(CS.3.15a)
and
(CS.3.15b)
Then premultipJying (C8.3.8) by (z ... zn+l) and (CS.3.9) by (1
respectively 1 and zn+l to both sides of the resulting identities,
An+1Cz)
= An(z) - knzBn(z)
B n +1(z)
Z ...
zBnCz) - knAn(z)
or equivalently
(C8.3.16)
Chapter 8
where
= an
an,
According to the discussion preceeding (C8.3.15), in the symmetric case the recursion
(C8.3.16) reduces to
(C8.3.17)
where k n = an/an = -an+l,n+l and where A"Cz) denotes the 'reversed' AnCz) polynomial (compare with (C8.2.9. Evidently (C8.3.17), which involves one sequence of
scalars {k n } and one sequence of polynomials {AnCz)} only, is less complex than
(C8.3.16). Note that {k n } in the symmetric case are often called 'reflection coefficients',
a term which comes from 'reflection seismology' applications of the LDA. By analogy,
kn and kn in (C8.3.16) may also be called 'reflection coefficients'.
Next, recall that in the symmetric case there exists a simple relation between the
sequence of reflection coefficients {k n} and the location of the zeros of An(z), namely:
Ikpl < 1 for p = 1, ... , n ~An(z) 0 for Izl :;:::; 1 (see Complement C8.2). By analogy it
might be thought that a similar result would hold in the nonsymmetric case. However,
this does not seem to be true, as is illustrated by the following example.
'*
A 1(z) = 1 - koz
B1(z) =
-ko +
and
where
IkII
k1(1
--
< 1
ko) <
1 -
ko
(C8.3.18)
/(1(1 + k o) < 1 + ko
Observe that in the symmetric case where k; = ki ~ k;, the inequalities (C8.3.18) reduce
to the known condition Ikol < 1, Ikll < 1. In the nonsymmetric case, however, in
general k; k; so that (C8.3.18) does not seem to reduce to a 'simple' condition on ko,
ko and k j
..
'*
Complement CB.4
Min-max optimal IV method
Consider a scalar difference equation of the form (S.l), (S.2) whose parameters are
estimated using the IV method (S.13) with nz = n8 and Q = I. The covariance matrix
of the parameter estimates so obtained is given by (S.30). Under the conditions introduced above, (S.30) simplifies to
Pry = A?R-1[EK(q-l)Z(t)K(q-l)ZT(t)]R- T
where
(see (S.32.
The covariance matrix PlY depends on the noise shaping filter H(q-l). If one tries to
obtain optimal instruments i(t) by minimizing some suitable scalar or multivariable
function of Pry then in general the optimal i(t) will also depend on H(q-I) (see e.g.
(S.3S. In some applications this may be an undesirable feature which may prevent the
use of the optimal instruments i(t).
A conceptually simple way of overcoming the above difficulty is to formulate the
problem of accuracy optimization on a min-max basis. For example,
i(t) = arg min
max
PlY
Z(I)EC, f-{(q-l)ECI/
where
Cz
Cf-{
= {z(t)ldet R =1= O}
= {H(q-l)IE[H(q-l)e(t)f
(CS.4.2a)
~
a}
(CS.4.2b)
for some given positive real a. It is clearly necessary to constrain the variance of the
disturbance H(q-l)e(t), for example directly as in (CS.4.2b). Otherwise the inner
optimization of (CS.4.I) would be meaningless. More exactly, without a constraint on
the variance of the disturbance, the 'worst' covariance matrix given by the inner
'optimization' in (CS.4.I) would be of infinite magnitude. It is shown in Stoica and
Soderstrom (19S2c) that the min-max approach (C8.4.I), (CS.4.2) to the optimization of
the accuracy of IV methods does not lead to a neat solution in the general case.
In the following the problem is reformulated by redefining the class Cf-{ of feasible
H(q-l) filters. Specifically, consider
(CS.4.3)
where H(ei(J) is some given suitable function of w. The min-max problem (CS.4.1),
(CS.4.3) can be solved easily. First, however, compare the conditions (CS.4.2b) and
(CS.4.3) on H(q-l). Let H(q-l) be such that
304
Chapter 8
The converse is not necessarily true. There are filters H(q-l) which satisfy (CS.4.2b) but
whose magnitude IH(eiw)1 takes arbitrarily large values at some frequencies, thus
violating (CS.4.3). Thus the explicit condition in (CS.4.3) on H(q-l) is more restrictive
than the implicit condition in (CS.4.2b). This was to be expected since (CS.4.3) assumes
a more detailed knowledge of the noise properties than does (CS.4.2b). However, these
facts should not be seen as drawbacks of (CS.4.3). In applications, disturbances with
arbitrarily large power at some frequencies are unlikely to occur. Furthermore, if there is
no a priori knowledge available on the noise properties then H in (CS.4.3) can be set to
W E
(-Jt, Jt)
(CS.4.4)
for some arbitrary ~. As will be shown, the result of the min-max optimization problem
(CS.4.1), (CS.4.3), (CS.4.4) does not depend on ~.
Turn now to the min-max optimization problem (CS.4.1), (CS.4.3). Since the matrix
J~J1:
[IH(eiwW - [H(eiWW1IF(eiw)[2</>z(eiW)dw
where <pz(e iw ) denotes the spectral density matrix of z(t), is nonnegative definite for all
H(q-l) E CH , it follows that
This problem is of the same type as that treated in Section S.2. It follows from (S.3S) that
and
(CS.4.5)
The closer H(q-l) is to the true H(q-l), the smaller are the estimation errors associated
with the min-max optimal instruments and prefilter above.
Note that in the case of ll(q-l) given by (CS.4.4)
i(t)
= (jJ(t)
(CS.4.6)
(since the value of ~ is clearly immaterial). Thus (CS.4.6) are the min-max optimal
instruments and prefilter when no a priori knowledge on the noise properties is available
(see also Wong and Polak, 1967; Stoica and Soderstrom, 19S2c).
Observe, however, that (CS.4.6) assumes knowledge of the noise-free output (which
enters via CP(t)). One way of overcoming this difficulty is to use a bootstrapping iterative
procedure for implementation of the IVE corresponding to (CS.4.6). This procedure
Complement CB.5
305
should be fed with some initial values of the noise-free output (or equivalently of the
system transfer function coefficients). Another non-iterative possibility to overcome the
aforementioned difficulty associated with the implementation of the min-max optimal
IVE corresponding to (C8.4.6) is described in Stoica and Nehorai (1987a).
For the extension of the min-max optimality results above to multivariable systems,
see de Gooijer and Stoica (1987).
Complement CB.5
Optimally weighted extended IV method
Consider the covariance matrix PlY, (S.30), in the case of single-output systems (ny = 1).
According to (S.32) the matrix Prv then takes the following more convenient form:
Prv = 'A?(RTQR)-lRTQSQR(RTQR)-1
where
S = EK(q-l)Z(t)K(q-l)ZT(t)
K(q-l)
R
H(q-l)F(q-l)
= Ez(t)F(q-l)q7(t)
= Ez(t)F(q-l)(jJT(t)
In Section S.2 the following lower bound P~tt on PlY was derived:
P lV ~ :t..2 [EH- 1(q-l)(jJ(t)H- 1(q-l)(jJT(t)t 1 ~ P~tt
(CS.5.1)
(nz = ne)
H-1(q-l)
(CS.5.2)
Q= I
(see (S.36)-(S.3S. The IV estimate corresponding to the choices (CS.5.2) of the IV
vector z, the prefilter F and the weighting matrix Q is the following:
8=
[~H-l(q-l)(jJ(t)H-\q-l)c:pT(t)
x
[~ H-
r'
(CS.5.3)
1 (q-l)(jJ(t)H- 1(q-l)y(t) ]
Here the problem of minimizing the covariance matrix PlY of the estimation errors is
approached in a different way. First the following lower bound on P IV is derived with
respect to Q:
Chapter 8
Q = S-l
where x (as well as tjJ, Q and 0 below) denote some quantity whose exact expression is
not important for the present discussion, then
Proof. The nested structure (C8.5.4) of the IV vectors induces a similar structure on
the matrices Rand S, say
Rnz+l = ( Rnz)
tjJT
P;;zl+ J
(lIA2)(R~z
tjJ){
(~)S;;zl(J
x (0 - QTS;;)Q)-I(QTS;;zl
=
0) + (S:lQ)
-1) } (:';)
tjJ]T/(O - QTS;;} Q)
Since Snz+ I is positive definite it follows from Lemma A.3 that 0 - QTS;;} Q > O. Hence
P;;}+ I ;?! P~l, which completes the proof.
Complement CB.5
307
Thus, the positive definite matrices {P nz} form a monotonically nonincreasing sequence
for nz = n8, n8 + 1, n8 + 2, ... In particular, this means that Pnz has a limit Fyo as
nz-J. 00. In view of (S.36), this limiting matrix must be bounded below by p~~t:
~
Pnz ~
opt
P lV
(CS.S.S)
for all nz
This inequality will be examined more closely, and in particular, conditions will be
given under which the limit Pw equals the lower bound p~~t. The vector cp(t) can be
written (see Complement CS.1) as
cp(t)
(CS.S.6)
where '/(-B, A) is the Sylvester matrix associated with the polynomials -B(z) and
A(z), and
1
<p,,(t) = A(q 1) (u(t -- 1) ... u(t - na - nb
(C8.S.7)
)[EF(q-l)<Pu(t)ZT(t)][ EK(q-l)Z(t)K(q-l)ZT(t)]-l
X [Ez(t)F(q-1)<p~(t)]./T( -B,
A)
(C8.S.S)
F(q-l) = H-1(q-l)
(C8.5.9)
and set
cp(t) = lr\q-l)<pu(t)
Then (CS.5.S) becomes
(CS.S.lO)
(Note in passing that this inequality follows easily by generalizing Lemma A.4 to
dim cp =1= dim z.) Assume further that there is a matrix M of dimension (na + nb)lnz
such that
cp(t)
Mz(t) + xC!)
(C8.S.11)
where
as nz -)
Set
00
Chapter 8
R z = EZ(t)ZT(t)
Rzx
Rx
Ez(t)xT(t)
= Ex(t)xT(t)
Under the assumptions introduced above, the difference between the two sides of
(CS.5.1O) tends to zero as nz ~ 00, as shown below.
[EZ(t)cpT(t)]
as nz
(CS.5.12)
~ 00
Here it has been assumed implicitly that R; I remains bounded when nz tends to
infinity. This is true under weak conditions (cf. Complement CS.2). The conclusion to
draw from (CS.S.12) is that under the assumptions (CS.S.9), (CS.S.ll),
I1m
p~
I1Z
popt
(e8.S.B) I
IV
nz...........:;.oo
~ 1) )
(C8.5.14)
u(t - nz)
where L(q-l) is an asymptotically stable and invertible filter. It will be shown that z(t)
given by (CS.S.14) satisfies (CS.S.ll). Define M(q-l) and {mj} by
M(q-I) =
2: mjq-j g l/(A(q-I)H(q-l)L(q-l))
j=O
and set
M __
(mo m:l
o
..
~l'
. . '.
:::=~
:
mo ... nlnz-na-nb
Then
CP(t)
= M(q-l)Z(t) = Mz(t)
+ x(t)
na
+ nb)jnz)
(C8.5.15)
Complement CB.5
Xi(t)
2:
309
mkL(q-l)u(t - i - k)
k=nz-i+l
Since M(q-l) is asymptotically stable by assumption, it follows that mk converges
exponentially to zero as k ~ 00. Combining this observation with the expression above
for x(t), it is seen that (CS.S.11) holds for z(t) given by (CS.S.14).
In the calculation leading to (CS.S.13) it was an essential assumption that the prefilter
was chosen as in (CS.S.9). When F(q-l) differs from H-1(q-l) there will in general be
strict inequality in (CS.S.S), even when nz tends to infinity. As an illustration consider
the following simple case:
= 1 + fq-l
H(q-l) = 1
F(q-l)
0<
If I <
na = 0
nb = 1;
<p(t)
= u(t - 1)
In this case
1 - f2
f
R = Ez(t)F(q-l)U(t - 1) =
o
S=I
and for nz
2,
(CS.S.16)
S-1I2
(CS.S.17)
310
Chapter 8
The estimate 8 is called the optimally weighted extended IV estimate. Both (C8.S.3)
and (C8.S.17) rely on knowledge of the noise shaping filter H(q-l). As in the case of
(C8.S.3), replacement of H(q-l) in (C8.S.17) by a consistent estimate will not worsen
the asymptotic accuracy (for N - ? 00) (provided nz does not increase too fast compared
to N).
The main difference between the computational burdens associated with (C8.S.3) and
(C8.S.17) lies in the following operations.
For (C8.S.3):
Computation of the optimal IV vector {H-l(q-l)<p(t)}~l' which needs O(N) operations. Also, some stability tests on the filters which enter in <p(t) are needed to
prevent the elements of <p(t) from exploding. When those filters are unstable they
should be replaced by some stable approximants.
For (C8.S.17):
.. Computation of the matrix S from the data, which needs O(nz x N) arithmetic
operations. Note that for the IV vector z(t) given by (C8.5.14), the matrix S is Toeplitz
and symmetric. In particular this means that the inverse square root matrix S-1/2 can
be computed from S in 0(nz2) operations using an efficient algorithm such as the
Levinson-Durbin algorithm described in Complement CS.2 .
Creating and solving an overdetermined system of nz equations with nEl unknowns.
Comparing this with the task of creating and solving a square system of n8 equations,
which is associated with (CS.S.3), there is a difference of about 0(nz2) + O(nz . N)
operations (for nz ~ na).
Thus from a computational standpoint the IV estimate (CS.S.17) may be less attractive
than (C8.S.3). However, (C8.S.17) does not need astable estimate of the system transfer
function B(q--l)/A(q-l)" as does (C8.S.3). Furthermore, in the case of pure time series
models where there is no external input signal, the optimal IV estimate (C8.S.3) cannot
be applied; while the IV estimate (C8.S.17) can be used after some minor adaptations,
Some results on the optimally weighted extended IV estimate (C8.S.17) in the case of
ARMA processes are reported in Stoica, Soderstrom and Friedlander (198S) and Stoica,
Friedlander and Soderstrom (198Sb). For dynamic systems with exogenous inputs there
is as yet no experience in using this IV estimator.
Complement CB.6
The Whittle- Wiggins-Robinson algorithm
The Whittle- Wiggins- Robinson algorithm (WWRA) is the extension to the multivariate case of the Levinson-Durbin algorithm discussed in Complement CS.2. To
motivate the need for this extension, consider the following problem.
Let yet), t = 1, 2, 3, ... , be an ny-dimensional stationary stochastic process, An
autoregressive model (also called a linear prediction model) of order n of yet) will have
the form
(CS.6.1)
(CS.6.2)
Let
k = 0, 1,2, ...
= RI.
(CS.6.3)
and
Qn = Ro
A I1 ,lR_ 1
+ ... +
An,nR-n
= (Qn
0 ... 0)
(CS.6.4)
The algorithm
An efficient solution to (CS.6.4) was presented by Whittle (1963) and Wiggins and
Robinson (1966). Here the derivation of the WWRA follows Friedlander (19S2). Note
that there is no need to maintain the assumption that {R;} are theoretical (or sample)
covariances. Such an assumption is not required when deriving the WWRA. It was used
above only to provide a motivation for a need to solve linear systems of the type
312
Chapter 8
(CS.6.3). In the following it is assumed only that Rk = R~k for k = 0, 1,2, ... , and that
the (symmetric) matrices f n - 1 are nonsingular such that (CS.6.3) has a unique solution
for all n of interest.
Efficient computation of the prediction coefficients {An,;} involves some auxiliary
quantities. Define {B n ,;} and Sri by
(CS.6.5)
where fn was introduced in (CS.6.4). These quantities can be given the following
interpretation. Consider the backward prediction model
f(t)
with respect to the matrix coefficients {Bn,i} will lead precisely to (C8.6.5); note the
similarity to (C8.6.4), which gives the optimal forward prediction coefficients.
The WWRA computes the solution of (CS.6.4) recursively in n. Assume that the
nth-order solution is known. Introduce
(CS.6.6)
and note that
p"
~ R"., +
~ {R_("+"
CJr
(CS.6.7)
An
Bn,n'"
1 ...
An n
0)
Bn,l
fn+l
(Qn
T
Pn
o ...
o ...
(CS.6.S)
A n+1 ,n+l)
fn+l
= (Qn+l
0 ... 0
0 ... 0
0)
(CS.6.9)
Sn+l
(see (CS.6.4), (CS.6.5. Therefore the idea is to modify (CS.6.S) so that zeros occur in
the positions occupied by Pn and P~. This can be done by a simple linear transformation
of (CS.6.S):
Complement CB.6
_P~SI~l)G
( _p{Q;;l
=
~) f,,+l
IQ _ P S--lpT 0
0"" n
0
(n
0 S" -
313
(CS.6.1O)
P~'Q;;lpJ
where Qn and Sn are assumed to be nonsingular. (Note from (CS.6.13) below that the
nonsingularity (If Qnand Sn for n = 1, 2, ... is equivalent to the nonsingularity of f n for
n = 1,2, .... ) Assuming that fn is nonsingular, it follows from (CS.6.9), (CS.6,1O) that,
for n = 1, 2, ... ,
A n +1,1
...
An+1,n An+1,n+l)
B n + 1,n
...
B n + 1 ,1
( -P;Q-I
n
Qn+I = Qn -- PnS;;1
Sn+1
01.)
pr
(CS.6.1l)
prQ;;lp"
= Sn -
Q1 =
Ro
BI1
AI. 1 R f
SJ
=
-RfRol
Ro + B1,IR I
R o, Po
= R]).
The quantities
Kn+l = PnS;;l
_.
-1
Kn+l = Qn Pn
are the so-called reflection coefficients (also called partial correlation coefficients).
To write the WWR recursion (CS.6.1l) in a more compact form, introduce the
following matrix polynomials:
An(z)
Bn(z) = B n,,,
These recursions induce a lattice structure, as depicted in Figure CS.6.1, for computing
the polynomials An(z) and Bn(z).
The upper output of the kth lattice section is the kth-order predictor polynomial
AkCZ), Thus, the WWRA provides not only the nth-order solution A,,(z), but also all
Chapter 8
r---------I
I An(z)
I
I
I
I
.. .... --r----<.__----t(+ } - - y - - +
IAn-
I
I
I
I
I
I
I
(z)
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
,
I
IB 1(z) B n _ 1(z) I
I __________ JI
L
...
I
I
I
I
I
I
I
I
I
I
I Bn(z)
I __________ -.lI
L
1 A n ,l
An,n
An-
1 ,n-l
U=
urn
(C8.6.12)
where X denotes elements whose exact expressions are not important for this discussion.
From (C8.6.12) and the fact that urnuT is a symmetric matrix it follows immediately
that
(C8.6.13)
Ro
The implications of this result are important. It shows that:
{fn > O}
{Ro
'>
n}
The implication '=:>' is immediate. Indeed, if rn > 0 then U exists; hence (C8.6.13)
holds, which readily implies that Ro > 0 and Qi > 0, i = 1, ... , n. To prove the
implication '-=', assume that Ro > 0 and Qi > 0 for i = 1, ... , n. Since Ro > 0 it follows
that A I 1 exists; thus the factorization (C8.6.13) holds for n = 1. Combining this
. observation with the fact that Q, > 0, it follows that f, > 0. Thus Az, I and Az,z exist;
and so on.
It also follows from (C8.6.13) that if f n > 0 then the matrix
o
Ql
llZ
R O-1I2
is a block lower-triangular Cholesky factor of r;;l. Note that since the matrix U is
triangular it can be efficiently inverted to produce the block-triangular Cholesky factor of
the covariance matrix f n'
Similarly, it can be shown that the matrix coefficients {Bn,d provide a block uppertriangular Cholesky factorization of f~l. To see this, define
I
V=
B 1,1
B n ,n-l ... B n ,l
Bn,n
(CR.6.14)
Sn
which concludes the proof of the assertion made above. Similarly to the preceding result,
it follows from (C8.6.14) that
{fn > O}
'>
1, ... , n}
Chapter 8
> 0
"*
Izi
:::s; 1
?
Note that Ac is a block companion matrix associated with the polynomial An(z). It is well
known that det z"A,Jz- l ) is the characteristic polynomial of Ac (see e.g. Kailath, 1980).
Hence det An(z) has all its roots outside the unit circle if and only if all the eigenvalues of
Ac lie strictly within the unit circle. Let A denote an arbitrary eigenvalue of Ac and let
fi
(~l) "*
fill
fin - A~.n-lfil
=
=
-A~~~n!-!l
Ail
(C8.6.15)
AI111 -1
Alln
"*
"*
(CS.6.16)
Recall from (C8.6.1) that 8 = (An. I
the Toeplitz structure of f n that
ll*fn.- 1 !-! = (rt*
...
o)rn(~)
= {!-!f(I
= lliQn!-!l
8)+A*(0
+
1l*)}fn{(~T)lll+A(~)}
IAI2!-!*fn_l!t
(CB.6.17)
Since rn > 0 implies rn- 1 > 0 and Qn > 0, it follows from (CB.6.17) that IA.I < 1, and
the proof of the implication (i) => (iv) is concluded.
To prove the implication (iv) => (i), consider the following AR model associated with
AnCz):
(CB.6.1B)
where EW(t)WT(S) = QnOt,s' Note that Qn is a valid covariance matrix by assumption.
Since the AR equation above is stable by assumption, x(t) can be written as an infinite
moving average in wet),
x(t)
Ew(t)xT(t) = Qn
Ew(t)xT(t - k) =
(CB.6.19)
for k ;:;: 1
With (C8.6.19) in mind, postmultiply (CB.6.1B) by xT(t - k) for k = 0,1,2,. . and take
expectations to get
Ro
An,lR-: 1
+ ... +
(An, 1
.,.
(CB.6.20a)
An,nR-:n = Qn
An,n)r~-l =
-(R1 , ..
R~)
(C8.6.20b)
and
k;:;: 1
(C8.6.20c)
where r~-l is defined similarly to r n - 1. The solution {Ro ... R~} of (C8.6.20a, b), for
given {An,i} and Qn, is unique. To see this, assume that there are two solutions to
(C8.6.20a, b). These two solutions lead through (C8.6.20c) to two different infinite
covariance sequences associated with the same (AR) process, which is impossible.
Next recall that {An,i} in (C8.6.20a, b) are given by (C8.6.4), With this in mind, it is
easy to see that the unique solution of (C8,6.20a, b) is given by
k
= 0, ".,
(C8.6.21)
III
Remark 1. Note that the condition Qn > 0 in (iv) is really necessary. That is to say, it is
not implied by the other conditions of (iv). To see this, consider the following first-order
bivariate autoregression:
Chapter 8
A 1(q--l)y(t) = (e(t)
eel)
where
aq --1 )
-J
1 + aq
lal
< 1
Ee(t)e(s) = Ot,s
Let Rk
Ro
+ a2
(1 - a2 )3 1 _ a2
1 (1
1-
a2 )
(1 _ a2 )2
which is positive definite. Furthermore, by the consistency properties of the Yule- Walker
estimates, the solution of (C8.6.3) for n = 1 is given by A 1 ,1 above. The corresponding
AJ(z) polynomial (see above) is stable by construction. Thus, two of the conditions in
(iv) (the first and third) are satisfied. However, the matrix r 1 is not positive definite (it is
only positive semidefinite) as can be seen, for example from the fact that
(I
Al,l)rJ ( ; )
A
1.1
E(e(t)(e(t) e(t
e(t)
(1
1)
1
Ql
is singular. Note that for the case under discussion the second condition of (iv) (i.e.
Ql > 0) is not satisfied.
III
Remark 2. It is worth noting from the above proof of the implication (iv) => (i), that
the following result of independent interest holds:
rn > 0 then the autoregression (C8.6.18), where {An,;} and Qn are given by
(C8.6.4), matches the given covariance sequence {Ro, ... , Rn} exactly
III
If
It is now shown that the conditions (ii) and (iii) above are equivalent to requiring that
some matrix reflection coefficients (to be defined shortly) have singular values less than
one. This is a direct generalization of the result presented in Complement C8.2 for the
scalar case.
For Qn > 0 and 8 n > 0, define
(C8.6.22)
The matrix square root
Qlf2(Q1f2yr
Qlf2
Qlf2QTI2
=Q
Then
(C8.6.23)
(C8.6.24)
(ef. (C8.6.11)). Let 0n+1 denote the maximum singular value of K n + 1 . Clearly (ii) or (iii)
imply
0i
< 1, i = 1, ... , n}
(C8.6.25)
Using a simple inductive reasoning it can be shown that the converse is also true. If
Ro = Qo = So> 0 and 01 < 1, then (C8.6.23), (C8.6.24) give QJ > 0 and SI > 0, which
together with 02 < 1 give Q2 > 0 and S2 > 0, and so on. Thus, the proof that the
statements (i), (ii) or (iii) above are equivalent to (CB.6.25) is complete.
The properties of the WWRA derived above are analogous to those presented in
Complement CB.2 for the (scalar) Levinson-Durbin algorithm, and therefore they may
find applications similar to those described there.
Chapter 9
RECURSIVE IDENTIFICATION
METHODS
9.1 Introduction
In recursive (also called on-line) identification methods, the parameter estimates are
computed recursively in time. This means that if there is an estimate 8(t - 1) based on
data up to time t - 1, then 8(t) is computed by some 'simple modification' of 8(t - 1).
The counterparts to on-line methods are the so-called off-line or batch methods, in which
all the recorded data are used simultaneously to find the parameter estimates. Various
off-line methods were studied in Chapters 4, 7 and 8.
Recursive identification methods have the following general features:
They are a central part of adaptive systems (used, for example, for control or signal
processing) where the (control, filtering, etc.) action is based on the most recent
model.
Their requirement on primary memory is quite modest, since not all data are stored.
They can be easily modified into real-time algorithms, aimed at tracking time-varying
parameters.
They can be the first step in a fault detection algorithm, which is used to find out
whether the system has changed significantly.
Most adaptive systems, for example adaptive control systems (see Figure 9.1), are
based (explicitly or implicitly) on recursive identification. Then a current estimated
model of the process is available at all times. This time-varying model is used to
determine the parameters of the (also time-varying) regulator. In this way the regulator
will be dependent on the previous behavior of the process (through the information flow:
process -jo model-jo regulator). If an appropriate principle is used to design the regulator
then the regulator should adapt to the changing characteristics of the process.
As stated above, a recursive identification method has a small requirement on primary
memory since only a modest amount of information is stored. This amount will not
increase with time.
There are many variants of recursive identification algorithms designed for systems
with time-varying parameters. For such algorithms the parameter estimate e(t) will not
converge as t tends to infinity, even for time-invariant systems, since the algorithm is
made to respond to process changes and disregards information from (very) old data
points.
Fault detection schemes can be used in several ways. One form of application is failure
320
Section 9.2
I
I
Process
I
I
Regulator
Recursive
identifier
I
I
I
rparameters
Estimated
I
J
diagnosis, where it is desired to find out on-line whether the system malfunctions. Fault
detection is also commonly used with real-time identification methods designed for
handling abrupt changes of the system. When such a change occurs it should be noticed
by the fault detection algorithm. The identification algorithm should then be modified so
that previous data points have less effect upon the current parameter estimates.
Many recursive identification methods are derived as approximations of off-line
methods. It may therefore happen that the price paid for the approximation is a reduction in accuracy. It should be noted, however, that the user seldom chooses between
off-line and on-line methods, but rather between different off-line methods or between
different on-line methods.
yet) = b + e(t)
where e(t) denotes a disturbance of variance ",2.
In Example 4.4 it was shown that the least squares estimate of
mean,
A
O(t)
e = b is the arithmetic
~ yes)
(9.1)
5=1
= t1
['-1
5~ yeS)
+ yet)
1 t[(t
Chapter 9
1)8(t - 1)
A
+ yet)]
(9.2)
1
B(t - 1) + [[yet) - 8(t - 1)]
A
The result is quite appealing. The estimate of 8 at time t is equal to the previous estimate
(at time t - 1) plus a correction term. The correction term is proportional to the
deviation of the 'predicted' value O(t -- 1) from what is actually observed at time t,
namely yet). Moreover, the prediction error is weighted by the factor 1/t, which means
that the magnitude of the changes of the estimate will decrease with time. Instead, the
old information as condensed in the estimate O(t - 1) will become more reliable.
The variance of e(t) is given, neglecting a factor of )...2, by
pet)
(<1>T<1-l
=!
<1>T = (1 ... 1)
(lit)
(9.3)
(see Lemma 4.2). The algorithm (9.2) can be complemented with a recursion for P(t).
From (9.3) it follows that
p-l(t) = t = p-I(t - 1)
+1
and hence
p(t)
l(t - 1)
pet - 1)
pet - 1)
(9.4)
To start the formal discussion of the recursive least squares (RLS) method, consider the
scalar system (dim y = 1) given by (7.2). Then the parameter estimate is given by
OCt) =
[~cp(S)cpT(S)J-T~ CP(S)Y(S)]
(9.5)
(d. (4.10) or (7.4. The argument t has been used to stress the dependence of 0on time.
The expression (9.5) can be computed in a recursive fashion. Introduce the notation
pet)
]-1
(9.6)
+ cp(t)cpT(t)
(9.7)
.~ C(J(S)cpT(S)
Since trivially
p-1(t)
= r1(t -
1)
it follows that
Set)
pet)
[~ cp(s)y(s) + cp(t)y(t) J
Section 9.2
Thus
OCt) =
O(t -
1)
+- K(t)E(t)
(9.8a)
K(t) = P(t)cp(t)
(9.8b)
(9.8c)
Here the term E(t) should be interpreted as a prediction error. It is the difference
between the measured output y(t) and the one-step-ahead prediction }lUI t - 1; 8(t - 1
= cpT (t)O(t - 1) of y (t) made at time t - 1 based on the model corresponding to the
estimate 8(t - 1). If E(t) is small, the estimate 8(t - 1) is 'good' and should not be
modified very much. The vector K(t) in (9.8b) should be interpreted as a weighting or
gain factor showing how much the value of E(l) will modify the different elements of the
parameter vector.
To complete the algorithm, (9.7) must be used to compute pet) which is needed in
(9.8b). However, the use of (9.7) needs a matrix inversion at each time step. This would
be a time-consuming procedure. Using the matrix inversion lemma (Lemma A.l),
however, (9.7) can be rewritten in a more useful form. Then an updating equation for
pet) is obtained, namely
P(t)
= pet -
+- <pT(t)p(t - l)cp(t)]
(9.9)
Note that in (9.9) there is now a scalar division (a scalar inversion) instead of a matrix
inversion. The algorithm consisting of (9.8)-(9.9) can be simplified further. From (9.8b),
(9.9),
+- <pT(t)p(t - l)cp(t)]
(9.10)
This form for K(t) is more convenient to use for implementation than (9.8b). The reason
is that the right-hand side of (9.10) must anyway be computed in the updating of pet)
(see (9.9).
The derivation of the recursive LS (RLS) method is now complete. The RLS algorithm
consists of (9.8a), (9.8c), (9.9) and (9.10). The algorithm also needs initial values
0(0) and P(O). For convenience the choice of these quantities is discussed in the next
section.
For illustration, consider what the general algorithm (9.8)-(9.10) becomes in the
simple case discussed in Example 9.1. The equation (9.9) for P(t) becomes, since
cp(t) == 1,
_
pet) - pet - 1) - 1
p 2 (t - 1)
+- pet -
1)
= K(t)
lit
PCt .- 1)
1
+- pet -
1)
324
= eCt
1)
+ -[yet) t
Chapter 9
8(t - 1)]
Forgetting factor
The approach in this case is to change the loss function to be minimized. Let the modified
loss function be
Vr ({))
2.:
Af-SE2(S)
(9.11)
s=l
The loss function used earlier had A = 1 (see Chapters 4 and 7 and Example 9.1) but
now it contains the forgetting factor A, a number somewhat less than 1 (for example
0.99 or 0.95). This means that with increasing t the measurements obtained previously
are discounted. The smaller the value of A, the quicker the information in previous
data will be forgotten. One can rederive the RLS method for the modified criterion (9.11)
(see problem 9.1). The calculations are straightforward. The result is given here since a
recursive prediction error method is presented in Section 9.5 for which the recursive
LS method with a forgetting factor will be a special case.
The modified algorithm is
=
t(t) =
K(t) =
P(t) =
e(t)
S(t - 1) + K(t)E(t)
y(t) - cpT(t)8(t - 1)
(9.12)
Section 9.3
Real-time identification
325
(9.13)
x(t + 1)
= x(t)
yet) = cpT(t)X(t) + e(t)
(9.14a)
(9.14b)
x(t + 1)
= x(t) + vet)
(9.16)
This means that the parameter vector is modeled as a random walk or a drift. The
covariance matrix Rican be used to describe how fast the different components of 8 are
expected to vary. Applying the Kalman filter to the model (9.16), (9.14b) gives the
following recursive algorithm:
OCt)
O(t - 1) + K(t)E(t)
(9.17)
Chapter 9
convergence properties and small variances of the estimates for time-invariant systems
(which requires A close to 1 or R J close to 0) on the other. Note that the algorithm (9.17)
offers more flexibility than (9.12) does, since the whole matrix R] can be set by the user.
It is for example possible to choose it to be a diagonal matrix with different diagonal
elements. This choice of Rl may be convenient for describing different time variations for
the different parameters.
Initial values
The Kalman filter interpretation of the RLS algorithm is useful in another respect. It
provides suggestions for the choice of the initial values 8(0) and P(O). These values are
necessary to start the algorithm. Since pet) (times A2) is the covariance matrix of e(t) it is
therefore reasonable to take for 8(0) an a priori estimate of 8 and to let P(O) reflect the
confidence in this initial estimate 8(0). If P(O) is small then K(t) will be small for all t and
the parameter estimates will therefore not change too much from 8(0). On the other
hand, if P(O) is large, the parameter estimates will quickly jump away from 8(0).
Without any a priori information it is common practice to take
8(0)
P(O)
(9.18)
QI
p-1(t)
= p-\O) +
<p(S)<pT(S)
s=1
Now set
x(t)
= P-l(t)8(t)
Then
x(t) = P-l(t)8(t -- 1)
+ <p(t)c(t)
= [p-l(t - 1)
= xU - 1) +
<p(t)y(t)
= x(O)
2:
<p(s)y(s)
8=1
and hence
8(t) = P(t)x (t)
=
(9.19)
Section 9.4
If p--\O) is small, (i.e. P(O) is large), then B(t) is close to the off-line estimate
Satr(t) =
(9.20)
The expression for p-\t) can be used to find an appropriate P(O). First choose a time to
when Set) should be approximately BOff(t). In practice one may take to = 10-25. Then
choose P(O) so that
to
p-J(O) ~
2: cp(S)cpT(S) = toEcp(t)cpT(t)
(9.21)
s=1
For example, if the elements of cp(t) have minimum variance, say 0 2 , then P(O) can be
taken as in (9.18) with Q ~ 1/[to02 ].
The methods discussed in this section are well suited to systems that vary slowly with
time. In such cases A is chosen close to 1 or R J as a small nonnegative definite matrix.
If the system exhibits some fast parameter changes that seldom occur, some modified
methods are necessary. A common idea in many such methods is to use a 'fault detector'
which tests for the occurrence of significant parameter changes. If a change is detected
the algorithm can be restarted, at least partly. One way of doing this is to decrease the
forgetting factor temporarily or to increase R J or parts of the P(t) matrix.
Set)
-1
(9.22)
Note the algebraic similarity with the least squares estimate (9.5). Going through the
derivation of the RLS algorithm, it can be seen that the estimate (9.22) can be computed
recursively as
= 8(t - 1) + K(t)E(t)
E(t) = y(t) - cpT(t)S(t - 1)
K(t) = P(t)z(t) = pet - l)z(t)/[l + fflT(t)p(t - 1)z(t)]
PCt) = pet - 1) - pet - l)z(t)cpT(t)P(t - 1)/[1 + cpT(t)p(t - l)z(t)]
S(t)
(9.23)
This is similar to the recursive LS estimate: the only difference is that cp(t) has been
changed to z(t) while cpT(t) is kept the same as before. Note that this is true also for the
'off-line' form (9.22).
328
Chapter 9
z(t)
= cp(t)
(9.24)
= BO(q-l)U(t)
AO(q-l)X(t)
(9.25)
The signal x(t) is not available by measurements and cannot be computed from (9.25)
since AO(q--l) and BO(q-l) are unknown. However, x(t) could be estimated using the
parameter vector OCt) in the following adaptive fashion:
z(t)
x(t) = z T(t)8(t)
The procedure initialization as well as the derivation of real-time variants (for tracking
time-varying parameters) of the recursive instrumental variable method are similar to
those described for the recursive least squares method.
The recursive algorithm (9.23) applies to the basic IV method. In Complement C9.1 a
recursive extended IV algorithm is derived. Note also that in Section 8.3 it was shown
that IV estimation for multivariable systems with ny outputs often decouples into ny
estimation problems of MISO (multiple input, single output) type. In all such cases the
algorithm (9.23) can be applied.
"2
(9.27)
s=l
where Q is a positive definite weighting matrix. (For more general criterion functions
that can be used within the PEM, see Section 7.2.) For" = 1, (9.27) reduces to the loss
function (7.15a) corresponding to the choice h1(R) = tr R, which was discussed
previously (see Example 7.1). Use of the normalization factor 112 instead of lit will be
more convenient in what follows. This simple rescaling of the problem will not affect
the final result. Note that so far no assumptions on the model structure have been
introduced.
The off-line estimate, Of' which minimizes Vf(8) cannot be found analytically (except
for the LS case). Instead a numerical optimization must be performed. Therefore it is not
possible to derive an exact recursive algorithm (of moderate complexity) for computing
8t . Instead some sort of approximation must be used. The approximations to be made
are such that they hold exactly for the LS case. This is a sensible way to proceed since
Section 9.5
the LS estimate, which is a special case of the PEM (with E(t, 8) a linear function of 8),
can be computed exactly with a recursive algorithm as shown in Section 9.2.
Assume that 8(t - 1) minimizes VI-l(O) and that the minimum point of VI(O) is close to
8(t - 1). Then it is reasonable to approximate VICO) by a second-order Taylor series
expansion around 8(t - 1):
'
,A
(9.28)
The right-hand side of (9.28) is a quadratic function of 8. Minimize this with respect to 8
and let the minimum point constitute the new parameter estimate 8(t). Thus
8(t)
= 8(t -
(9.29)
which corresponds to one step with the Newton-Raphson algorithm (see (7.82
initialized in 6(t - 1). In order to proceed, recursive relationships for the loss function
VI(O) and its derivatives are needed. From (9.27) it is easy to show that
VI(O) = AVI_I(O) +
~ET(t,
8)QE(t, 0)
(9.30a)
(9.30b)
(9.30c)
The last term in (9.30c) is written in an informal way since E"(t, 8) is a tensor if E(t, 8) is
vector-valued. The correct interpretation of the last term in this case is
(9.30d)
where Ei(f, 8) and Qij are scalars with obvious meanings, and EJ(t, 8) is the matrix of
second-order derivatives of Ei(t, 8).
Now make the following approximations:
V;_1(8(t - 1 = 0
(9.31)
V;'-1(8(t - 1 = V;'-1(8(t - 2
(9.32)
(9.33)
The motivation for (9.31) is that 8(t - 1) is assumed to be the minimum point of Vt-I(O).
The approximation (9.32) means that the second-order derivative V;'-l (8) varies slowly
with 8. The reason for (9.33) is that E(t, 0) at the true parameter vector will be a white
process and hence
This implies that at least for large t and 8 close to the minimum point, one can indeed
neglect the influence of the last term in (9.30c) on V7(8). The approximation (9.33) thus
allows the last term of the Hessian in (9.30c) to be ignored. This means that a
Chapter 9
e(t)
= e(t X
V7(e(t - 1))
QE(t, e(t -
(9.34a)
(9.34b)
This algorithm is as it stands not well suited as a recursive algorithm. There are two
reasons for this:
The inverse of V7 is needed in (9.34a), while the matrix itself (and not its inverse), is
updated in (9.34b) .
For many model structures, calculations of E(t, 8(t - 1)) and its derivative will for
every t require a processing of all the data up to time t.
The first problem can be solved by applying the matrix inversion lemma to (9.34b), as
described below. The second problem is tied to the model structure used. Note that so
far the derivation has not been specialized to any particular model structure. To produce
a feasible recursive algorithm, some additional approximations must be introduced. Let
(9.35)
denote approximations which can be evaluated on-line. The actual way of implementing
these approximations will depend on the model structure. Example 9.2 will illustrate how
to construct E(t) and 1jJ(t) for a scalar ARMAX model. Note that in (9.35) the quantities
have the following dimensions:
= [v7(e(t -
1))]-1
(9.36)
pet)
Section 9.5
Note that at each time step it is now required to invert a matrix of dimension (nyiny)
instead of an (nSI nS) matrix, as earlier. Normally the number of parameters (nS) is much
larger than the number of outputs (ny). Therefore the use of the equation (9.38) leads
indeed to an improvement of the algorithm (9.34).
A general recursive prediction error algorithm has now been derived, and is
summarized as follows:
= O(t - 1) + K(t)E(t)
K(t) = P(t)ljJ(t)Q
SCt)
pet)
(9.39a)
(9.39b)
(9.39c)
Similarly to the case of the scalar RLS previously discussed, the gain K(t) in (9.39b) can
be rewritten in a more convenient computational form. From (9.38),
x 'tpT(t)p(t - l)ljJ(t)Q/A
=
that is,
K(t)
(9.40]
The algorithm (9.39) is applicable to a variety of model structures. The model structure
will influence the way in which the quantities E(l) and ljJ(t) in the algorithm are computed
from the data and the previously computed parameter estimates. The following example
illustrates the computation of E(t) and tjJ(t).
Example 9.2 RPEM for an ARMAX model
Consider the model structure
where ny
= nu
= 1, and
=1+
Clq--l
+ ... + cnq-n
(9.41)
332
Chapter 9
This model structure was considered in Examples 6.1, 7.3 and 7.8, where the following
relations were derived:
'(t,8)
= -( -yF(t-l, 8)
(9.42a)
... _yF(t - n, 8)
(9.42b)
where
(9.42c)
(9.42d)
(9.42e)
It is clear that to compute ret, 8) and '(t, 8) for any value of 8, it is necessary to process
all data up to time t. A reasonable approximate way to compute these quantities
(9.43a)
1jJ(t)
(9.43c)
(9.43d)
(9 .43e)
yF(t)
The idea behind (9.43) is simply to formulate (9.42a, c-e) as difference equations. Then
these equations are iterated only once using previously calculated values for initialization. Note that 'exact' computation of E(S, .) and E' (s, .) would require iteration of
(9.43) with S(t - 1) fixed, from t = 1 to t = s.
The above equations can be modified slightly. When updating Set) in (9.39a) it is
necessary to compute E(t). Then, of course, one can only use the parameter estimate
S(t - 1) as in (9.43a). However, E(t) is also needed in (9.43a, b, e). In these relations it
would be possible to use O(t) for computing ret) in which case more recent information
could be utilized. If the prediction errors computed from Set) are denoted by E(t), then
E(t)=y(t)+al(t-1)y(t-1)+ ... +an (t-l)y(t-n)
- btCt -
l)u(t - 1) - ... -
bn(t -
l)u(t -- n)
(9.44a)
Section 9.5
uF(t-1) .. uF(t-n)
F(t)
E(l) -
c] (t)tf(t -
1) ... - cnCt)ff(t - n)
(9.44d)
(9.45)
epT(t)8 + e(t)
yet)
ep(t)
(9.46a)
(9.46b)
(9.46c)
One could try to apply the LS method to this model. The problem is, of course, that the
noise terms e(t - 1), ... , e(t - n) in ep(t) are not known. However, they can be replaced
by the estimated prediction errors. This approach gives the algorithm
8(t - 1) + K(t)E(t)
Set)
E(t)
= yet) - epT(t)8(t - 1)
(9.47)
pet)
ep(t)
This algorithm can be seen as an approximation of the RPEM algorithm (9.39), (9.43).
The 'only' difference between (9.47) and RPEM is that the filtering in (9.42c-e) is
Chapter 9
neglected. This simplification is not too important for the amount of computations
involved in updating the variables B(t) and pet). It can, however, have a considerable
influence on the behavior of the algorithm, as will be shown later in this chapter
(cf. Examples 9.4 and 9.5). The possibility of using the E(t) in <:pet) as in (9.44) still
III
remains.
(9.48a)
L
A
= ~(t - l)Z(t)[AQ-l
(9.48b)
+ '\IJT(t)p(t - l)Z(t)]-l
(9.48c)
For all methods discussed earlier, except the instrumental variable method, Z(t) = '\IJCt).
The algorithm (9.48) must be complemented with a part that is dependent on the
model structure where it is specified how to compute Eft) and '\IJ(t).
The algorithm above is a set of nonlinear stochastic difference equations which depend
on the data u(l), y(l), u(2), y(2), ... It is therefore difficult to analyze it except for the
case of the LS and IV methods, where the algorithm is just an exact reformulation of
the corresponding off-line estimate. To gain some insights into the properties of the
algorithm (9.48), consider some simulation examples.
Example 9.4 Comparison of some recursive algorithms
The following system was simulated:
1.0q-l
y(t) = 1 _ O.9q
u(t) + e(t)
where u(t) was a square wave with amplitude 1 and period 10. The white noise sequence
e(t) had zero mean and variance 1. The system was identified using the following
estimation methods: recursive least squares (RLS), recursive instrumental variables
(RIV), PLR and RPEM. For RLS and RIV the model structure was
b)T
Theoretical analysis
Section 9.6
y(t)
+ ay(t - 1)
bu(t - 1)
8 = (a
335
+ eel) + ce(t - 1)
C)T
In all cases initial values were taken to be e(O) = 0, P(O) = 101. In the RIV algorithm
the instruments were chosen as Z(t) = (u(t - 1) u(t- 2)f. The results obtained from
300 samples are shown in Figure 9.2.
The results illustrate some of the general properties of the four methods:
RLS does not give consistent parameter estimates for systems with correlated equation
errors. RLS is equivalent to an off-line LS algorithm, and it was seen in Example 2.3
(see system '/2), that RLS should not give consistency. For low-order systems the
deviations of the estimates from the true values are often smaller than in this example.
For high-order systems the deviations are often more substantial.
In contrast to RLS, the RIV algorithm gives consistent parameter estimates. Again,
this follows from previous analysis (Chapter 8), since RIV is equivalent to an off-line
IV method.
e Both PLR and RPEM give consistent parameter estimates of a, band c. The estimates
i1 and b converge more quickly than c. The behavior of PLR in the transient phase may
be better than that of RPEM. In particular, the PLR estimates of c may converge
II
faster than for RPEM.
e
1.0u(t - 1)
e(t)
where u(t) was a square wave with amplitude 1 and period 10. The white noise sequence
e(t) had zero mean and variance 1. The system was identified using RLS and a first-order
model
yet)
+ ay(t - 1)
bu(t - 1)
+ e(t)
(a(o)
A
b(O)
= 0
P(O) = QI
Various values of Q were tried. The results obtained from 300 samples are shown in
Figure 9.3.
It can be seen from the figure that large and moderate values of Q (i.e. Q = 10 and
Q = 1) lead to similar results. In both cases little confidence is given to 9(0), and feU)}
departs quickly from this value. On the other hand, a small value of Q (such as 0.1 or
0.01) gives a slower convergence. The reason is that a small Q implies a small K(t) for
t ~ 0 and thus the algorithm can only make small corrections at each time step.
II
The behavior of RLS exhibited in Example 9.5 is essentially valid for all methods.
The choice of Q necessary to produce reasonable performance (including satisfactory
RLS
6(1)
I.rr/~
~~
./'
-1
200
100
300
(a)
RIV
6(1)
\.I"""" '-"'-.I'
1.....
-1
.~
200
100
300
(b)
FIGURE 9.2 Parameter estimates and true values for: (a) RLS, (b) RIV, (c) PLR
(d) RPEM, Example 9.4.
PLR
6(1)
'vJ
~~
~~
~
-I
a,C
la
200
100
300
(e)
RPEM
6(1)
~
'V~~
.......
~
\a
-1
a,
100
200
(d)
300
6(1)
II
= 10
o ~-----------------------------------------------1
-1
100
200
300
(a)
6(1)
Ii! = I
2
-1
300
200
100
(b)
10, (b)
= 1,
9(/)
!! = 0.1
2
1\
-1
~v~v-~----------------------------------------------------4
200
100
300
(c)
6(1)
II
= 0.01
-1
100
200
(d)
300
340
Chapter 9
convergence rate) has been discussed previously (see (9.18)-(9.21). See also Problem
9.4 for a further analysis.
Example 9.6 Effect of the forgetting factor
The following ARMA process was simulated:
yet) - O.9y(t - 1)
eel) + O.ge(t -- 1)
The white noise sequence e(t) had zero mean and variance 1. The system was identified
using RPEM and a first-order ARMA model
yet) + ay(t - 1)
= e(t) + ce(t
- 1)
c estimates
this example.
6(1)
.. = 1.0
fh
"
lV
\~
\L
400
(a)
FIGURE 9.4 Parameter estimates and true values: (a) A = 1.0, (b) A = 0.99,
(c) A = 0.95, Example 9.6.
13(1)
A = 0.99
(b)
e(t)
). = 0.95
-1
400
(e)
342
Chapter 9
The features exhibited in Example 9.6 are valid for other methods and model structures.
These features suggest the use of a time-varying forgetting factor A(t). More exactly,
A(t) should be small (that is slightly less than 1) for small t to make the transient
phase short. After some time A(t) should get close to 1 (in fact A(t) should at least tend to
1 as t tends to infinity) to enable convergence and to decrease the oscillations around
the true values. More details on the use of variable forgetting factors are presented in
Section 9.7.
Simulation no doubt gives useful insight. However, it is also clear that it does not
permit generally valid conclusions to be drawn. Therefore it is only a complement to
theory. The scope of a theoretical analysis would in particular be to study whether the
parameter estimates a(t) converge as t tends to infinity; if so, to what limit; and also
possibly to establish the limiting distribution.
Recall that for the recursive LS and IV algorithms the same parameter estimates are
obtained asymptotically, as in the off-line case, provided the forgetting factor is 1 (or
tends to 1 exponentially). Thus the (asymptotic) properties of the RLS and RIV estimates
follow from the analysis developed in Chapters 7 and 8.
Next consider the algorithm (9.48) in the general case. Suppose that the forgetting
factor A is time varying and set A = A(t). Assume that A(t) tends to 1 as t increases. The
algorithm (9.48) with A replaced by A(t) corresponds to minimization of the criterion
Vt (8) =
Lit
A(k)
}T(S, 8)QE(S, 8)
(9.49)
y(1) = 1
The reason for using the name 'step length' will become apparent later (see (9.51)).
Roughly speaking, computing Set) when 8(t - 1) is known, corresponds to taking one
Gauss-Newton step, and y(t) controls the length of this step.
Note that A(t) == 1 gives yet) = lit. It is also possible to show that if A(t) ~ 1, then
ty(t) ~ 1.
Next define the matrix R(t) by
(9.50)
The matrix pet) will usually be decreasing. If A(t) == 1 it will behave as lit for large t. The
matrix R(t) will under weak assumptions have a nonzero finite limit as t ~ 00.
Now (9.48a) can be rewritten as
a(t) = e(t - 1)
+ y(t)R-\t)Z(t)QE(t)
= A(t)P-1(t -
1) + Z(t)Q1jJT(t)
(9.51a)
Section 9.6
With the substitutions (9.49), (9.50) this becomes
1)R(t - 1) + y(t)Z(t)Q1jJT(t)
+ y(t)[Z(t)Q1jJT(t) - R(t --
= R(t - 1)
(9.51b)
1)]
x(t) = x(t - 1)
where
~(t)
y(t)~(t)
(9.52a)
x(t) == x(O) +
2:
y(k)~(k)
(9.52b)
k=l
Note that E~=l y(k) diverges. (Since y(k) = 11k asymptotically, ELI y(k) = log t.) Hence
if (9.52) converges we must have
E'r,,(t) = 0
(9.53)
~=
~~
EZ(t)QE(t, 0)
= EZ(t)Q1jJT(t, 0)
(R*)-lf(8*)
=0
G(8*) - R* == 0
(9.S5a)
In particular, it is found that the possible limit points of (9.51a) must satisfy
f(8*) = 0
(9.S5b)
Chapter 9
:'t 8,
R;lf(e,)
(9.56)
d
d'tRT = G(e.,) - HI:
Assume that (9.56) is solved numerically by a Euler method and computed for
... Then
8/2 = 8 /1
Rl2
+ (t2 - t1)R;;lfCel)
= RII + (t2
- t1)[G(e 1)
't
= t1 , t2 ,
(9.57)
RIJ
Note the similarity between (9.57) and the algorithm (9.51). This similarity suggests
that the solutions to the deterministic ODE will be close to the paths corresponding to
the algorithm if (9.48) if used with the time scale
't
2:
(9.58)
y(k) = log t
k=l
The above analysis is quite heuristic. The link between the algorithm (9.51) and the
ODE (9.56) can be established formally, but the analysis will be quite technical. For
detailed derivations of the following results, see Ljung (1977a, b), Ljung and Soderstrom
(1983).
1. The possible limit points of {S(t)} as t tends to infinity are the stable stationary points
of the differential equation (9.56). The estimates Set) converge locally to stable
stationary points. (If 8(to) is close to a stable stationary point e* and the gains {K(t)},
t ~ to are small enough, then S(t) converges to 8*.)
2. The trajectories of the solutions to the differential equations are expected paths of the
algorithm (9.48).
3. Assume that there is a positive function Vee, R) such that along the solutions of the
differential equation (9.56) we have
:'t V(eT' R
T)
::::;
O. Then as
00,
the estimates
{e, I:'t
R
V(8 T , R,) =
o}
or to the boundary of the set q; given by (6.2). (It is assumed that the updating of G(t)
in (9.48a) is modified when necessary to guarantee that G(t) E q;, i.e. the model
corresponding to Set) gives a stable predictor; this may be done, for instance, by
reducing the step length according to a certain rule, whenever Set) If q;.)
Next examine the (locally) stable stationary points of (9.56), which constitute the
possible limits of Set). Let 8*, R* = G(e*) be a stationary point of the ODE (9.56). Set
r* = vec(R*), r, = vec(H,J, d. Definition A.lO. If the ODE is linearized around
(8*, R*), then (recall (9.55)
Theoretical analysis
Section 9.6
0
i. ( 81: - 8*) _ (CR*)-1 f(8)1
as 8=8*
dT
r - r*
1:
0)(8
-8*)
345
(9.59)
-/ \r: - r*
In (9.59) the block marked X is apparently of no importance for the local stability
properties. The stationary point 8*, R * will be stable if and only if the matrix
L(S*)
[G(8*)]-1 af (8)/
08 8=8*
------------
-----~
o=
(9.61)
af(8) I
as
8=8
= E 01j!
as (t,
8=8
(9.63a)
VE(t, 8) /
ae
_
8--8
= -/
(9.63b)
Chapter 9
which evidently has all eigenvalues in the left half-plane. Strictly speaking, in the
multivariable case (ny > 1), 'tjJ(t, e) is a matrix (of dimension (nelny and thus 8'tjJ/ae
a tensor. The calculations made in (9.63) can, however, be justified since a'tjJ1a8 and
E(t, 8 0 ) are certainly uncorrelated. The exact interpretation of the derivative 8'tjJ/8e is
similar to that of (9.30d).
As shown above, (8(h G(e o is a locally stable solution to (9.56). To investigate global
stability, use result 3 quoted above and take V",,(8) as a candidate function for V(8, R).
Clearly V oo (8) is positive by construction. Furthermore, from (9.61),
d
(9.64)
Dc = {S, Rlf(e) = O}
= {e,
(9.65)
R/:8voo(e) = 0, R
= G(e)}
This set consists of the stationary points of the criterion V 00(8). If eo is a unique stationary
point, it follows that the RPEM gives consistent parameter estimates under weak
conditions.
It can also be shown that the RPEM parameter estimates are asymptotically Gaussian
distributed with the same distribution as that of the off-line estimates (see (7.59), (7.66),
(7.72. Note the word 'asymptotic' here. The theory cannot provide the value of t for
which it is applicable. The result as such does not mean that off- and on-line PEM
II
estimates are equally accurate, at least not for a finite number of data points.
Example 9.8 Convergence analysis of PLR for an ARMAX model
In order to analyze the convergence properties of PLR for an ARMAX model, the
relation
(9.66)
is proposed as a tentative (Lyapunov) function for application of result 3 above. The
true system, which corresponds to eo, will be denoted by
(9.67)
+ (8'[ - 80 ) T (G(e,) -
(8~ - 6oyrf(8,J
R~)(R( -
80 )
(9.68)
Section 9.6
f(8 ,,)
Theoretical analysis
Ecp(t)E(t) = Ecp(t){ E(t) - e(t)}
(9.69)
This follows since cp(t) depends only on the data up to time t 'uncorrelated with e(t). It now follows from (9.46), (9.47) that
E(t) - e(t)
347
1. It is hence
-1
(9.70)
Define
0(8,;)
Ecp(t) C /
oq
(9.71)
1) cpT(t)
Thus a sufficient convergence condition will be obtained if it can be shown that under
that condition the matrix
(9.73)
is positive definite. Let x be an arbitrary n8 vector, set z(t)
spectral density of z(t). Then
x TfIx
C;o(~--T5 z(t)
2Ez(t)
?-
2E[Z(t)h~o(~-1) - ~}z(t)J
= 2
- EZ2(t)
xTR.,x
rt
-11:
{I
<PzC m) Re Co(~j",)
So if
all w
I} dm
'2
(9.74)
348
Chapter 9
then it can be concluded that Set) converges to 80 globally, The condition (9,75) is often
expressed in words as 'lICo-1I2 is a strictly positive real filter'. It is not satisfied for all
possible polynomials Co(q-l). See, for example, Problem 9.12.
Consider also the matrix L(8 0 ) in (9.60), which determines the local convergence
properties. In a similar way to the derivation of (9.63a),
= - Ecp(t)1jJ T (t)le=eo
where (d. (9.42), (9.46
1jJ(t) =
CCq
I) cp(t)
Hence for PLR applied to ARMAX models the matrix L(8 0 ) is given by
(9.76)
For certain specific cases the eigenvalues of L(Oo) can be determined explicitly and hence
conditions for local convergence established. A pure ARMA process is such a case. Then
the eigenvalues of L(Oo) are
-1 with multiplicity n
1
k = 1, ... , n
9.7
Practical aspects
The preceding sections have presented some theoretical tools for the analysis of
recursive identification methods, The analysis is complemented here with a discussion of
some practical aspects concerning:
Search direction.
Choice of forgetting factor.
Numerical implementation.
Practical aspects
Section 9.7
349
Search direction
The general recursive estimation method (9.48) may be too complicated to use for some
applications. One way to reduce its complexity is to use a simpler gain vector than K(t) in
(9.48). In turn this will modify the search direction, i.e. the updating direction in (9.48a).
The following simplified algorithm, which is often referred to as 'stochastic approximation' (although 'stochastic gradient method' would be a more appropriate name) can be
used:
Set)
= e(t - 1)
+ K(t)E(t)
K(t) = Z(t)Qlr(t)
(9.77)
Forgetting factor
The choice of the forgetting factor A in the algorithm is often very important.
Theoretically one must have A = 1 to get convergence. On the other hand, if A < 1 the
algorithm becomes more sensitive and the parameter estimates change quickly. For that
reason it is often an advantage to allow the forgetting factor to vary with time. Therefore
substitute Aeverywhere in (9.48) by A(t). A typical choice is to let A(t) tend exponentially
to 1. This can be written as
A(t)
= 1 - Ab[1 - A(O)]
(9.78)
Typical values for Ao and A(O) are Ao = 0.99 and A(O) = 0.95. Using (9.78) one can
improve the transient behavior of the algorithm (the behavior for small or medium
values of t).
Chapter 9
Numerical implementation
The updating (9.39c) for P(t) can lead to numerical problems. Rounding errors may
accumulate and make the computed pet) indefinite, even though pet) theoretically is
always positive definite. When P(t) becomes indefinite the parameter estimates tend to
diverge (see, for example, Problem 9.3). A way to overcome this difficulty is to use a
square root algorithm. Define Set) through
pet)
S(t)ST(t)
(9.79)
and update Set) instead of pet). Then the matrix pet) as given by (9.79) will automatically
be positive definite. Consider for simplicity the equation (9.39c) for the scalar output
case with Q = 1 and A = 1, i.e.
PU) = pet - 1) - pet -- 1)tv(t)1vT(t)P(t .- 1)/[1 + tpT(t)p(t - l)tp(t)]
(9.80)
The updating of Set) then consists of the following equations:
f(t)
ST(t - IN(t)
~(t)
+ fT(t) f(t)
(9.81)
S(t - l)f(t)
K(t) = L(t)/f3(t)
(9.82)
In practice it is not advisable to compute K(t) as in (9.82). Instead the updating of the
parameter estimates can be done using
Set)
8(t - 1) + L(t)[E(t)IB(t)]
(9.83)
Proceeding in this way requires one division only, instead of nO divisions as in (9.82).
In many practical cases the use of (9.80) will not result in any numerical difficulties.
However, a square root algorithm or a similar implementation is recommended (see
Ljung and Soderstrom, 1983, for more details), since the potential numerical difficulties
will then be avoided.
In Complements C9.2 and C9.3 some lattice filter implementations of RLS for AR
models and for multivariate linear regressions are presented, respectively. The lattice
algorithms are fast in the sense that they require less computation per time step than
does the algorithm (9.81), (9.83). The difference in computational load can be significant
when"the dimension of the parameter vector e is large.
Summary
In this chapter a number of recursive identification algorithms have been derived. These
algorithms have very similar algebraic structure. They have small requirements in terms
of computer time and memory.
Problems 351
In Section 9.2 the recursive least squares method (for both static linear regressions and
dynamic systems) was considered; in Section 9.4 the recursive instrumental variable
method was presented; while the recursive prediction error method was derived in
Section 9.5.
The recursive algorithms can easily be modified to track time-varying parameters, as
was shown in Section 9.3.
Although the recursive identification algorithms consist of rather complicated nonlinear transformations of the data, it is possible to analyze their asymptotic properties.
The theory is quite elaborate. Some examples of how to use specific theoretical tools for
analysis were presented in Section 9.6. Finally, some practical aspects were discussed
in Section 9.7.
Problems
Problem 9.1 Derivation of the real-time RLS algorithm
Show that the set of equations (9.12) recursively computes the minimizing vector of the
weighted LS criterion (9.11).
Problem 9.2 Influence of forgetting factor on consistency properties
of parameter estimates
Consider the static-gain system
t = 1, 2, 3 ...
where
Ee(t) = 0
Ee(t)e(s) =
Of,S
b=
arg
min 2: AN-1[y(t) -
bu(t)j2
(i)
1=1
where N denotes the number of data points, and the forgetting factor A satisfies
o < A ~ 1. Determine var(b) = E(b - b)2. Show that for A = 1,
val' (b)
---7
as N
---7
00
(ii)
(i.e. that b is a consistent estimate, in the mean square sense). Also show that for A < 1,
there are signals u(t) for which (ii) does not hold.
Hint. For showing the inconsistency of b in the case A< 1, consider u(t) = constant.
Remark. For more details on the topic of this problem, see Zarrop (1983) and Stoica
and Nehorai (1988).
Problem 9.3 Effects of PCt) becoming indefinite
As an illustration of the effects of pet) becoming indefinite, consider the following simple
example:
352
System: yet)
80
Ee(t)e(s)
Model: y(t) = 8
Chapter 9
+ eel)
= "f?Ot,s
+ E(t)
Let the system be identified using RLS with initial values 8(0) =
80
oaf
(b) Assume that Po < 0 (which may be caused by rounding errors from processing
previous data points). Then show that Vet) will be increasing when t increases to (at
least) t = -lIPo. (If Po is small and negative, Vet) will hence be increasing for a long
period!)
Problem 9.4 Convergence properties and dependence on initial conditions
of the RLS estimate
Consider the model
y(t)
= epT(t)8 + E(t)
Let the off-line weighted LS estimate of 8 based on y(l), y(2), ... , yet), ep(l), ... , ep(t)
be denoted by 8t .
8t =
[,tl
At-sep(S)epT(S)
J~T~ AHep(S)y(s) J
Consider also the RLS algorithm (9.12) which provides recursive (on-line) estimates B(t)
of 8.
(a) Derive difference equations for p-l(t) and P-l(t)8(t). Solve these equations to find
how O(t) depends on the initial values 8(0), P(O) and on the forgetting factor A.
Hint. Generalize the calculations in (9.18)-(9.20) to the case of A :::; 1.
(b) Let P(O) = QI. Prove that for every t for which Ot exists,
lim O(t)
(c) Suppose that
= fit
lim [e(t) - et ]
t-~oo
as t~
00.
Prove that
112>
(i)
Problems
353
Comment on the fact that (i) is a sufficient condition for global convergence of PLR,
in view of the interpretation of PLR as an 'approximate' RPEM (see Section 9.5).
(b) Determine the set (for n = 2)
for
ill E
(-n, n)}
8(t + 1)
()loS'
= 8(t) + y(t)[y(t +
1) - 8(t)]
8(0) =0
(i)
where either
1
yet) = - t + 1
(ii)
yet) = y
(iii)
or
E
(0, 1)
2:
AI - k [y(k) -
of
(iv)
k=l
with A = 1; while Set) given by (i), (iii) is asymptotically equal to the minimizer of (iv)
with A = 1 - y. Then, assuming that eCt) = 0 = constant, use the results of Problem 9.2
to establish the convergence properties of (i), (ii) and (i), (iii). Finally, assuming that
O(t) is slowly time-varying, discuss the advantage of using (iii) in such a case over the
choice of yet) in (ii).
Chapter 9
with the convention IT;+I 'A(k) = 1 (d. the expression before (9.49. Show that the
RPEM approximately minimizing Vtce) is given by (9.48) with Z(t) == 'IjJ(t) and 'A
replaced by 'A(t):
Set)
S(t -
1) + K(t)E(t)
K(t) = P(t)'IjJ(t)Q
=
Set)
= arg min
E2(S, 8)
s=t-m+l
K 2 (t)
PCt - 1)(cp(t)
X {I + (
pet)
cp(t - m)
.~T(t)
-cp (t - m)
= pet - 1) - (KJ(t) Ki t
m}-l
:T(t)
) pet - 1)
-cp (t - m)
Remark. The identification algorithm above can be used for real-time applications,
since by its construction it has a finite memory. See Young (1984) for an alternative
treatment of this problem.
is a possible local convergence point (Le. a point for which (9.55b) holds and L(O*) has
all eigenvalues in the left half-plane). Also show that other types of stationary points of
Voc(S) cannot be local convergence points to the RPEM.
Hint. First show the following result. Let A > 0 and B be two symmetric matrices.
Then AB has all eigenvalues in the right half-plane if and only if B > O.
Problems
355
where A is a positive definite matrix, <p(t) is some stationary full-rank random vector and
for
W E
(-n, n)
M=
B- 1MB = BT[E<P(f)
1
<pT(t)JB
C(q 1)
(i)
for any real vector h ::/= O.
(b) Determine the set (for n = 2)
Dl = {cr, c2lRe(1
cle im
c2e 2im )
> 0
for
WE
(-n, n)}
Izi
~ 1}
Problem 9.13 Local convergence of the P LR algorithm for a first-order ARMA process
Consider a first-order ARMA process
pet)
= [/ -
(i)
Chapter 9
where K(t) is defined in (9.12). A clear advantage of (i) over (9.9) is that use of (i)
preserves the nonnegative definiteness of pet), whereas due to numerical errors which
affect the computations, pet) given by (9.9) may be indefinite. The recursion (i) appears
to have an additional advantage over (9.9), as now explained. Let ~K denote a small
perturbation of K(t) in (9.9) or (i). Let ~P denote the resulting perturbation in P(t).
Show that
for (9.9)
and
for (i)
qJ(t)
(b j
bnb
al'"
ana)
Assume that the RLS algorithm (9.12) is used to estimate the unknown parameters 8.
Find u(t) at each time instant t such that
u(t)
arg
min
u(t); lu(t)1 ~ 1
det pet
+ 1)
Since for t sufficiently large, pet + 1) is proportional to the covariance matrix of the
estimation errors, it makes sense to design the input signal as above.
Problem 9.16 Illustration of the convergence rate for stochastic approximation algorithms
Consider the following noisy static-gain system:
=a
Let a > 0 (the case a < 0 can be treated similarly). The unknown parameter 8 is
recursively estimated using the following stochastic approximation (SA) algorithm:
1
8(t) = S(t - 1) + ([y(t) - as(t - 1)]
A
E[8(N) -
Sf
(i)
Compare the convergence rates of the SA algorithm above to the convergence rate of the
recursive least squares (LS) estimator which corresponds to 0) with lit replaced by
lI(at).
Hint. For large N it can be shown that
{N(I!+l)/(!l
kl! =
k=l
+ 1) for!l > -1
log N
for !l = -1
constant
for !l < - 1
(ii)
aCt)
e = ae
e(t) = e(t -- 1)
(iii)
Here the parameter a which drastically influences the accuracy of the estimates, appears
as a factor in the gain sequence. Note that for the RLS algorithm alt is replaced by lit
~(i~.
Bibliographical notes
Recursive identification is a rich field with many references. Some early survey papers
are Saridis (1974), Isermann et at. (1974), Soderstrom et al. (1978), Dugard and Landau
(1980). The paper by Soderstrom et al. (1978) also contains an analysis based on the
ODE approach which was introduced by Ljung (1977a, b). For a thorough treatment of
recursive identification in all its aspects the reader is referred to the book by Ljung and
Soderstrom (1983). There is also a vast literature on adaptive systems for control and
signal processing, for which recursive identification is a main topic. The reader is referred
to Astrom and Wittenmark (1988), Astrom et at. (1977), Landau (1979), Egardt (1979),
Goodwin and Sin (1984), Macchi (1984), Widrow and Stearns (1985), Honig and
Messerschmitt (1984), Chen (1985), Haykin (1986), Alexander (1986), Treichler et at.
(1987) and Goodwin, Ramadge and Caines (1980). Astrom (1983b, 1987) and Seborg
et al. (1986) have written excellent tutorial survey papers on adaptive control, while
Basseville and Benveniste (1986) survey the area of fault detection in connection with
recursive identification.
358
Chapter 9
(Section 9.3). Some further aspects on real-time identification are given by Bohlin
(1976), Benveniste and Ruget (1982), Benveniste (1984, 1987), and Benveniste et al.
(1987). Identification of systems subject to large parameter changes is discussed for
example by Hagglund (1984), Andersson (1983), Fortescue et al. (1981).
(Section 9.4). The adaptive IV method given by (9.26) was proposed by Wong and
Polak (1967) and Young (1965, 1970). See also Young (1984) for a comprehensive
discussion. The extended IV estimate, (8.13), can also be rewritten as a recursive
algorithm; see Friedlander (1984) and Complement C9.l.
(Section 9.5). The recursive prediction error algorithm (9.39) was first derived for
scalar ARMAX models in Soderstrom (1973), based on an idea by Astrom. For similar
and independent work, see Furht (1973) and Gertler and Banyasz (1974). The RPEM
has been extended to a much more general setting, for example, by Ljung (1981). Early
proposals of the PLR algorithm (9.47) were made by Panuska (1968) and Young (1968),
while Solo (1979) coined the name PLR and also gave a detailed analysis.
(Section 9.6). The ODE approach to analysis is based on Ljung (1977a, 1977b); see
also Ljung (1984). The result on the location of the eigenvalues of L(Bo) is given by
Holst (1977), Stoica, Holst and Soderstrom (1982). The analysis of PLR in Example 9.6
foHows Ljung (1977a). Alternative approaches for analysis, based on martingale theory
and stochastic Lyapunov functions, have been given by Solo (1979), Moore and Ledwich
(1980), Kushner and Kumar (1982), Chen and Guo (1986); see also Goodwin and Sin
(1984). The asymptotic distribution of the parameter estimates has been derived for
RPEM by Ljung (1980); see also Ljung and Soderstrom (1983). Solo (1980) and
Benveniste and Ruget (1982) give some distribution results for PLR. A detailed
comparison of the convergence and accuracy properties of PLR and its iterative off-line
version is presented by Stoica, Soderstrom, Ahlen and Solbrand (1984, 1985).
(Section 9.7). The square root algorithm (9.81) is due to Potter (1963). Peterka (1975)
also discusses square root algorithms for recursive identification schemes. For some
other sophisticated and efficient ways (in particular the so-called UD factorization) of
implementing the update of factorized pet) like (9.81), see Bierman (1977), Ljung and
Soderstrom (1983). For more details about lattice algorithms see Friedlander (1982),
Samson (1982), Ljung (1983), Cybenko (1984), Haykin (1986), Ljung and Soderstr()m
(1983), Honig and Messerschmitt (1984), Benveniste and Chaure (1981), Lee et at.
(1982), Porat et al. (1982), OJ<;er et al. (1986), and Karlsson and Hayes (1986, 1987).
Numerical properties and effects of unavoidable round-off errors for recursive identification algorithms are described by Liung and Liung (1985) and Cioffi (1987).
Complement C9.1
The recursive extended instrumental variable method
Consider the extended IV estimate (8.13). Denote by Set) the estimate based on t data
points. For simplicity of notation assume that F(q-l) == 1, Q = 1 and that the system is
scalar. Then
Set) =
P(t)RT(t)r(t)
Complement C9.1
359
where
t
ret) =
2.:
z(s)y(s)
s=l
t
R(t) =
2.:
Z(S)cpT(S)
s=l
pet) = (RT(t)R(t)r 1
= [RT(t -
1)
+ cp(t)zT(t)]{r(t -
1)
+ z(t)y(t)
(C9.l.l)
The first term in (C9.1.1) is equal to zero by the definition of e(t - 1). The remaining
terms can be written more compactly as
( (t)
cp
RT(t - 1)z(t) +
= (RT(t - l)z(t)
(n8Il)
<p(t) = (w(t)
(neI2)
A-I t () -
cpU))
(0 1)
1 ZT(t)Z(t)
and
v(t)
zT(t)r(t = (
yet)
1))
(212)
(211)
- (;:g;)e(t - 1)}
Chapter 9
= p-l(t -
1)
= p-l(t - 1) + (w(t)
CP(tG
ZT(t;z(t)(;:gn
e.
~t)
(ne!l)
(nBI2)
cpCt) = (w(t)
(neI2)
cp(t)
vet)
(nSI1)
(_ZT~)Z(t) ~)
(212)
1)
(211)
zT(t)r(t yet)
R(t)
= R(t - 1) + Z(t)cpT(t)
ret)
= ret -
pet)
1)
(nzlnB)
+ z(t)y(t)
(nzI1)
pet - 1) -- K(t)cpT(t)p(t - 1)
(nBlne)
A simple initialization procedure which does not require extra computations is given by
=0
P(O)
gI
reO) = 0
R(O)
e(o)
The extended RIV algorithm above is more complex numerically than the basic RIV
recursion (9.23). However, in some applications, especially in the signal processing field,
the accuracy of e(t) increases with increasing dim z = nz. In such applications use of the
extended RIV is wen worth the effort. See Friedlander (1984) for a more detailed
discussion of the extended RIV algorithm (called overdetermined RIV there) and for a
description of some applications.
Complement C9.2
361
Complement C9.2
Fast least squares lattice algorithm for AR modeling
Let yet) be a stationary process. Consider the following nth-order autoregressive model
of y(t) (also called a linear (forward) prediction model):
yet)
n)
+ En(t)
(C9.2.1)
Here {an,i} denote the coefficients and En(t) the residual (or the prediction error) of the
nth-order model. Writing (C9.2.1) for t = 1, ... , N gives
y(l)
y(2)
y(1)
0
0
y(l)
yeN)
yeN - 1)
yeN - n)
v
'--.,----'
cf>'I(N)
YN
En(1)
a n ,l
..J
an,n
(C9.2.2)
En(N)
'--.,----'
8n
For convenience, assume zero initial conditions. For large N this will have only a
negligible effect on the results that follow. The vector 8n in (C9.2.2) is determined
by the least squares (LS) method. Thus 8n is given by
(C9.2.3)
(ct. (4.7, where the dependence of 8n on N is shown explicitly. From (C9.2.2),
(C9.2.4)
where P'N is the Nth unit vector:
!IN = (0 ... 0
If'
With
(C9.2.5)
equation (C9.2.4) can be written more compactly as
En(N) = !l~Pn(N)YN
(C9.2.6)
The problem is to determine En(N) and 8n(N) for n = 1, ... , M (say) and N = 1,2, ...
Use of a standard on-line LS algorithm for this purpose requires O(M3) operations
(multiplications and additions) per time step. Any algorithm which provides the
aforementioned quantities in less than O(M3) operations is called a fast algorithm. In the
following a fast lattice LS algorithm is presented.
The problem of computing En(N) for n = 1, ... , M and N = 1, 2, ... is considered
first. Determination of En(N) is important for prediction applications and for other
applications of AR models (e.g. in geophysics) for which 8n (N) is not necessarily
needed. As will be shown, it is possible to determine En(N) for n = 1, ... , M and given
Chapter 9
Order update
Let
n + 1
Then
(C9.2.7)
-I
1)/[
1/!;; 1/!n
With
(C9.2.8)
this gives
r--------------------------------------------------------------,
I
I
P,,+!(N)
= Pn(N) - Pn(NNn(NNnCN)Pn(N)/on(N)
(C9.2.9)
L ______________________________________________________________ I
Complement C9.2
363
Time update
r-------------------------------(-----------)--------------------;
I
PnCN) - Pn(N)[lN[lTvPn(N)/Yn(N)
Pn(N - 1)
L __________________________________
I
I
(C9.2.1O)
___________________________
:
~
where
(C9.2.11)
The result (C9.2.1O) will be useful in the following even though it is not a true time
update (it does not give a method for computing Pn(N) from Pn(N - 1).
To prove (C9.2.1O) note from (C9.2.7)-(C9.2.9) that the left-hand side of (C9.2.1O) is
equal to
Q ~ 1-
(<I>(N)
[IN) {(
<PT(N)
[lTv
(<I>(N)
ltN)
}-l(<I>T(N)
[lTv
(where the index n has been omitted). Let cpT denote the last row of <I>(N). Note that
cp
<I>T(N)<I>(N)
<I>T(N)[lN
and
<I>T(N - 1)<I>(N - 1)
+ cpcpT
Thus
Q
= I - (<P(N) [IN){
G)(O
1)
I ) [<I>T(N)<I>(N) - cpcpT]-l(I
+ ( _cpT
't'
t""
= I -
= I -
(~ ~)
= (P(No- 1) ~)
which concludes the proof of (C9.2.1O).
Time and order update
+ 1)
= ( ____0____)
YN <I>n(N)
Thus
-en)} (<I>'~(TvN)
0)
(C9.2.12)
With
QnCN)
g y'J.JP,,(N)YN
(C9.2.13)
this becomes
,----------------------------------_._--------------------------,
:I
I
P,,+I(N + 1)
(1
0
T .
PrIeN) - Pn(N)YNYNP,,(N)/Qn(N)
(C9.2.14)
L ______________________________________________________________
II
I
~
It is important to note that the order update and time- and order update formulas,
(C9.2.9) and (C9.2.14), hold also for n = 0 provided that Po(N) = I. With this
convention it is easy to see that the time update formula (C9.2.1O) also holds for n = O.
The update formulas derived above are needed to generate the desired recursions for
the prediction errors. It should already be quite clear that in order to get a complete set
of recursions it will be necessary to introduce several additional quantities. For easy
reference, Table C9.2.1 collects together the definitions of all the auxiliary variables
needed to update E. Some of these variables have interesting interpretations. However,
to keep the discussion reasonably brief, these variables are considered here as auxiliary
quantities (refer to Friedlander, 1982, for a discussion of their 'meanings').
TABLE CY.2.1 The scalar
variables used in the fast
prediction error algorithm
E,,(N)
= ~;:;P,,(N)YN
a,,(N)
y,,(N)
= I!;:;P,,(N)~N
Q,,(N)
'll';;(N)P,,(N)1j!,,(N)
y;:;Pn(N)YN
13,,(N) = ~;:;P,,(N)'ll'n(N)
I),,(N) = 'll'~(N)Pn(N)YN
Complement C9.2
365
Note that some of the auxiliary variables can be updated in several different ways.
Thus, the recursions for updating E can be organized in a variety of ways to provide a
complete algorithm. The following paragraphs present all these alternative ways and
indicate which seem preferable. While all these possible implementations of the update
equations for E are mathematically equivalent, their numerical properties and computational burdens may be different. This aspect, however, is not considered here.
Equations will be derived for updating the quantities introduced in Table C9.2.1. For
each variable in the table all of the possible updating equations for Pn(N) will be
presented. It is implicitly understood that if an equation for updating Pn(N) is not
considered, then this means that it cannot be used.
From (C9.2.6) and (C9.2.9),
En+l(N)
= EnCN) -
~n(N)bn(N)/(Jn(N)
(C9.2.15)
1J!n+l(N + 1)
(1J!n~N)
(C9.2.16)
and
tlN+l
(0 )
(C9.2.17)
tlN
~n+l(N
1)
~n(N)
- En(N)bnCN)/Qn(N)
(C9.2.18)
1J!n(N)
= ( 1J!n(N - 1) )
yeN - n - 1)
(C9.2.19)
and
YN =
(;~~;)
(C9.2.20)
bnCN)
= Dn(N -
1)
~nCN)En(N)/Yn(N)
(C9.2.21)
Next we discuss the update ofQ. Either (C9.2.9) or (C9.2.1O) can be used: use of
(C9.2.9) gives
Qn+l(N)
= QnCN) -
D~(N)/(Jn(N)
(C9.2.22a)
QnCN)
= QIl(N -
1)
E~(N)/Yn(N)
(C9.2.22b)
We recommend the use of (C9.2.22a). The reason may be explained as follows. The
recursion (C9.2.22a) holds for n = 0 provided
366
Chapter 9
and
Qo(N) = YfvYN
oo(N)
= 1jJJ(N)1jJoCN)
Note that
and similarly for ooCN) and oo(N). Thus (C9.2.22a) can be initialized simply and exactly.
This is not true for (C9.2.22b). The larger n is, the more computations are needed to
determine exact initial values for (C9.2.22b). Equation (C9.2.22b) can be initialized by
setting Qn(O) = 0 for n = 1, ... , M. However, this initialization is somewhat arbitrary
and may lead to long transients.
Next consider o. Either (C9.2.10) or (C9.2.14) can be used. From (C9.2.1O) and
(C9.2.19),
(C9.2.23a)
on+l(N + 1)
= on(N) -
o~(N)/Qn(N)
(C9.2.23b)
For reasons similar to those stated above when discussing the choice between (C9.2.22a)
and (C9.2.22b), we tend to prefer (C9.2.23b) to (C9.2.23a).
Finally consider the update equations for y. From (C9.2.9),
Yn+l(N)
= Yn(N) -
~~(N)/on(N)
(C9.2.24a)
Yn+l(N + 1)
Yn(N) - f~(N)/Qn(N)
(C9.2.24b)
Both (C9.2.24a) and (C9.2.24b) are convenient to use but we prefer (C9.2.24a) which
seems to require somewhat simpler programming.
A complete set of recursions has been derived for updating fn(N) and the supporting
variables of Table C9.2.1. Table C9.2.2 summarizes the least squares prediction
algorithm derived above, in a form that should be useful for reference and coding. The
initial values for the variables in Table C9.2.2 follow from their definitions (recalling that
Po(N) = I).
Next note the following facts. Introduce the notation
knCN) = onCN)/on(N)
(C9.2.25)
kn(N) == onCN)/QnCN)
Then the updating equations for
fn+l(N)
~n+l(N)
and
can be written as
= fn(N) - kn(NWn(N)
= ~n(N - 1) - kn(N - l)f n(N - 1)
(C9.2.26)
These equations define a lattice filter for computation of the prediction errors, which is
depicted in Figure C9.2.1. Note that k and k are so-called reflection coefficients.
The lattice filter of Figure C9.2.1 acts on data measured at time instant N. Its
parameters change with N. Observe that the first n sections of the lattice filter give the
Complement C9.2
TABLE C9.2.2 The least squares lattice prediction algorithm
Given: M
00(1)
0; Eo(l)
Set: Qo(l)
2, 3, ... :
[30(1)
y(l); 0,,(1) = 0
for n
1, ... , M *
Initialize:
eo(N) = yeN)
~o(N) =
yeN - 1)
Qo(N)
Qo(N - 1)
oo(N)
Qo(N - 1)
Oo(N) = oo(N - 1)
+ y2(N)
yeN - 1)y(N)
Yo(N) = 1
For n
~n+1(N) = ~,,(N
0,,+1(N)
E"+1(N)
f.nCN) -
~,,(N)On(N)/on(N)
yeN)
.. II II ..
.....:.........:....4-....-----(
........ ---+-+-----1>{
FIGURE C9.2.1 The lattice filter implementation of the fast LS prediction algorithm
for AR models.
367
Chapter 9
nth-order prediction errors En(N). Due to this nested modular structure the filter of
Figure C9.2.1 is called a lattice or ladder filter.
The lattice LS algorithm introduced above requires only ~ 10 M operations per time
step. However, it does not compute the LS parameter estimates. As will be shown in the
next subsection, computation of Eln(N) for n = 1, ... , M and given N needs O(M2)
operations. This may still be a substantial saving compared to the standard on-line LS
method, which requires O(M3) operations per time step.
Note that for sufficiently large N a good approximation to Eln(N) can be obtained in the
following way. From (C9.2.1) it follows that
En(N)
A,;!(q-l)y(N)
A';!(z)
1 - a';!,lz -
Next introduce
and
Then
~n(N) = yeN -
n-
b'::l)
n ( ~
b n n
=
B';!(q-l)y(N - 1)
The above equation is the backward prediction model of y(.) and {~n (N)} are the socalled backward prediction errors.
Inserting the above expressions for E and ~ in the lattice recursions (C9.2.26) gives
(C9.2.27)
Note that since the 'filters' acting on yO in (C9.2.27) are time-varying it cannot be
concluded that they are equal to zero despite the fact that their outputs are identically
zero. However, for large N the lattice filter parameters will approximately converge and
then the following identities will approximately hold (the index N is omitted):
Complement C9.2
369
Pn(N) = I - <Pn(N)R,,(N)
Order update
From (C9.2.9) it follows that
(C9.2.30)
r--------------------------------------------------------------~
Rn+l(N)
(Rn~N)
(C9.2.32) 1
~---------------------------------------------------------------
Time update
It follows from (C9 .2.10) that
<P(N - l)R(N = (
0
1)
0)1
Premultiplying this identity by the N - l)IN) matrix (I 0), the following equation is
obtained:
(<P(N - 1)R(N - 1) 0)
Chapter 9
370 Recursive identification methods
r--------------------------------------------------------------,
:I
Rn(N) - Rn(N)!!N!!~Pn(N)/yn(N) = (Rn(N - 1) 0)
(C9.2.33) :
L ___ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ - - - - ______ I
~
Rn+1(N + 1)
= C)[cI>!t(N)cI>n(N)]-l(O
+ (
1 )(0
-RnCN)YN
cI>~'(N)
y~Pn(NIrJfl(N)
which gives
r--------------------------------------------------------------,
I
II
Rn+1(N + 1)
( ' )
( )
(0 YNPr.(N/Qn(N)
(C9.2.34) :I
:L ______________________________________________________________
0 Rn(N)
-RnCN)YN
JI
The formulas derived above can be used to update en(N) and any supporting quantities.
As will be shown, the update of enCN) may be based on the recursions derived earlier for
updating E,,(N). It will be attempted to keep the number of additional recursions needed
for updating 8n (N) to a minimum.
Note that either (C9.2.32) or (C9.2.33) can be used to update 8,,(N). First consider the
use of (C9.2.32). It follows from (C9.2.32) that
(C~2.35)
where 0 and
(J
is the vector of the coefficients of the backward prediction model (see the discussion at
the end of the previous subsection).
To update bn(N), either (C9.2.33) or (C9.2.34) can be used. From (C9.2.33) and
(C9.2.19),
b,,(N)
bn(N - 1) +
cn(N)~n(N)!Yn(N)
(C9.2.36a)
where
c"CN) ~ Rn(NhtN
The update of c is discussed a little later. Using equation (C9.2.34) and (C9.2.16) gives
bn+1(N + 1)
= (
0 ) + (
1 ) on(N)/Qn(N)
bn(N)
-en(N)
(C9.2.36b)
Complement C9.2
371
Next consider the use of (C9.2.33) to update 8,,(N). Using this equation and (C9.2.20),
(C9.2.37)
Concerning the update of c, either (C9.2.32) or (C9.2.34) can be used. Use of (C9.2.32)
results in
(C9.2.-38a)
cn+l(N + 1)
= ( 0 ) + (
c,,(N)
1 ) E,,(N)/Q,,(N)
-8 nCN)
(C9.2.38b)
The next step is to discuss the selection of a complete set of recursions from those above,
for updating the LS parameter estimates. As discussed earlier, one should try to avoid
selection of the time update equations since these need to be initialized rather arbitrarily.
Furthermore, in the present case one may wish to avoid time or time and order
recursions since parameter estimates may not be needed at every sampling point. With
these facts in mind, begin by defining the following quantities:
(C9.2.39)
(C9.2.40)
Note from (C9.2.36a) and (C9.2.37) that
b~(N) =
bn(N - 1)
8~(N) =
8,,(N - 1)
l'
bn+1(N)
(0) (1)
The following equations can be used to update the LS parameter estimates: (C9.2.35),
(C9.2.38a), (C9.2.39), (C9.2.40) and (C9.2.41). Note that these equations have the
following important feature: they do not contain any time update and thus may be used
to compute 8 n (N) when desired. The auxiliary scalar quantities 0, 0, ~,y, E, and Q which
appear in the equations above are available from the LS lattice prediction algorithm. The
initial values 8 1(N), bl(N) and cl(N) for the algorithm proposed are given by
372
Chapter 9
2: yet BI(N)
l)y(t)
1=2
oo(N)
G()(N)
=--
N-I
2:
(C9.2.42a)
y2(t)
1=1
N-j
2:
b 1(N) =
yet - 1)y(t)
1=2
= oo(N - 1)
(C9.2.42b)
G()(N)
N-I
2: y2(t)
1= I
and
cI(N) = yeN - 1) = ~()(N)
N-j
Go(N)
y2(t)
1=1
(C9.2.42c)
2:
where 00, Go and ~o are also available from the LS lattice prediction algorithm.
Note that the fast lattice LS parameter estimation algorithm introduced above requires
~2.5 M2 operations at each time sampling point for which it is applied. If the LS
parameter estimates are desired at all time sampling points, then somewhat simpler
algorithms requiring _M 2 operations per time step may be used. For example, one may
use the update formulas (C9.2.35), (C9.2.36b) to compute Bn(N) for n = 1, ... , M and
N = 1,2,3, ... The initial values for (C9.2.35) and (C9.2.36b) are given by (C9.2.42a,
b). It might seem that (C9.2.36b) needs to be initialized arbitrarily by setting bn(O) = 0,
n = 1, ... , M. However, this is not so since (C9.2.36b) is iterated for n from 1 to
min (M - 1, N - 2). This iteration process is illustrated in Table C9.2.3.
Finally consider the following simple alternative to the fast LS parameter estimation
algorithms described above. Update the covariances
f~ = ~ ~k y(t)y(t + k)
= ft- l
~[Y(N -
k)y(N) -
f~-l]
1= 1
as each new data point is collected. Then solve the Yuie- Walker system of equations
associated with (C9.2.1) (see equation (C8.2.1) in Complement C8.2), when desired,
using the Levinson-Durbin algorithm (Complement C8.2), to obtain estimates of
TABLE C9.2.3 Illustration of the iteration of (C9.2.36b) for M
N
{bn(N)}
b l (2)
b l (3)
b l (4)
b l (5)
"
b2 (3)
"
I'"
b 2 (4)
'"
bi4)
"
b2 (5)
bo(5)
b4 (5)
=4
6
b l (6)
I'"
I'"
} equation (C9.2.42b)
b2(6)
bo(6)
b4 (6)
} eqwtioo (C9.2.36b)
Complement C9.3
373
8 1 , .. , 8 M' This requires - M2 operations. For large N the estimates obtained in this
way will be close to those provided by the fast LS algorithms described previously.
Complement C9.3
Fast least squares lattice algorithm for multivariate regression models
Consider the following multivariate nth-order regression model:
t
= 1,2,
(C9.3.1)
where x is an (nxI1) vector, y is an (nyI1) vector, the (nxlny) matrices Cn,; denote the
coefficients, and the nx-vector en denotes the prediction errors (or the residuals) of the
model. As it stands equation (C9.3.1) may be viewed as a multivariate (truncated)
weighting function model. Several other models of interest in system identification can
be obtained as particular cases of (C9.3.1). For x(t) = yet), equation (C9.3.1) becomes a
multivariate autoregression. A little calculation shows that a difference equation model
can also be written in the form (C9.3.1). To see this, consider the following multivariable
full polynomial form difference equation: this was discussed in Example 6.3 (see
(6.22)-(6.25, and its use in identification was described in Complement C7.3. This
model structure is given by
x(t) + AlX(t - 1) + ... + Anx(t - n)
= BlU(t - 1) + ...
(C9.3.2)
where x, u and en are vectors, and Ai and Bi are matrices of appropriate dimensions.
With
yet)
X(t)
= ( u(t)
and
Cn,l,= (-A-1 B,')
= 1, ... , n
o
(x(l) ... x(N)
'---v-----'
xJv
(nxIN)
= (Cn,l
yeN - 1)
y(l)
.. , Cn,n)
'---v-----'
8~
(nxlny . n)
o ...
\
0 y(l)
4>J(N)
yeN - n)
J
(C9.3.3)
(ny . nlN)
Chapter 9
2: I,N-te~(t)en(t)
(C9.3.4)
1=1
where", E (0, 1) is a forgetting factor (cf. (9.11). It follows from (4.7) and some simple
calculation that 8n is given by
8n (N)
[<I>~(N)AN<I>n(N)]-l<I>~(N)ANXN
(C9.3.S)
AN
(C9.3.6)
'"
=
=
<l>n(N)
AW<I>nCN)
Pn(N)
=I
IlN
(0 ... 0
l)T
(C9.3.7)
- <Pn(N)[<P~~(N)<Pn(N)rl<l>~(N)
T
T)] I-!N
= [XNT - 8,,(N)<I>n(N
= x'}:.r{I - AN<I>n(N)[<I>~(N)AN<I>n(N)]-I<I>~(N)} IlN
= xArAW Pn(N)A NlI2I-!N
-1'
= xNPn(NhtN
The problem considered in this complement is the determination of e'l(N) and 8 n (N) for
= 1, 2, ... and n = 1 to M (some maximum order). This problem is a significant
generalization of the scalar problem treated in Complement C9.2. Both the model and
the LS criterion considered here are more general. Note that by an appropriate choice of
'" in (C9.3.4), it is possible to discount exponentially the old measurements, which
are given smaller weights than more recent measurements.
In the fonowing a solution is provided to the problem of computing en(N) for n = 1,
... , M and N = 1,2, .... As will be shown, a good approximation (for large N) of the
LS parameter estimates 8 n (N) may be readily obtained from the parameters of the
lattice filter which computes cn(N). Exact computation of 8 n (N) is also possible and
an exact lattice LS parameter estimation algorithm may be derived as in the scalar case
(see, e.g., Friedlander, 1982; Porat ct al., 1982). However, in the interest of brevity we
concentrate on the lattice prediction filter and leave the derivation of the exact lattice
parameter estimation algorithm as an exercise for the reader.
Since the matrix PrI(N) plays a key role in the definition of en(N) we begin by studying
various possible updates of this matrix. Note that Pn(N) has the same structure as in the
scalar case. Thus, the calculations leading to update formulas for Pn(N) are quite similar
to those made in Complement C9.2. Some of the details will therefore be omitted.
Complement C9.3
375
Order update
Let
1jJn(N)
(0 ... 0
'---,--'
~n(N)
+ 1 columns
A'w1jJn(N)
Then
(j)T
n+l
(N)
(~~'(N)
1jJ~(N)
r--------------------------------------------------------------,
I
-1
- T
:L ______________________________________________________________
Pn+l(N) = PnCN) - Pn(N)1jJn(N)on (N)'4'n(N)Pn(N)
(C9.3.8) :
where
Time update
Let ep denote the last column of (j)~(N). Then
(j)~(N)I!N
= ep
0)
+ epepT
With these equations in mind, the following time update formula is obtained in exactly
the same way as for the scalar case (see Complement C9.2 for details):
r--------------------------------------------------------------~
Pn(N)
0) + Pn(N)[!NflNP,,(N)/Yn(N)
(Pn(N - 1)
0
0
(C9.3.9) :
IL ______________________________________________________________
where
Chapter 9
iJ.r = (y(l)
... yeN))
Thus
where
YN
AWYN
The nested structure of if> n+ 1 (N + 1) above leads exactly as in the scalar case to the
following update formula:
r--------------------------------------------------------------,
:
I
n+l(
+ 1)
(1
C9 3 10
..)
:
I
I
~--------------------------------------------------------------~
where
Qn(N) = yJvPnCNHN
Note that the update formulas introduced above also hold for n = 0 provided PoCN) = 1.
The update formulas for Pn(N) derived previously can be used to update enCN) and
any supporting quantities. Table C9.3.1 summarizes the definitions of all the variables
needed in the algorithm for updating en(N).
Several variables in Table C9.3.I may be updated in order and in time as well as in
time and order simultaneously. Thus, as in the scalar case treated in Complement C9.2,
the recursive-in-time-and-order calculation of enCN) may be organized in many different
ways. The following paragraphs present one particular implementation of the recursive
algorithm for computing the prediction errors {en(N)}. (See Complement C9.2 for
details of other possible implementations.)
(nxll)
(nylny)
y,,(N) = I-tPn(N)I-tN
(111)
gn(N) = y';;P,,(N)YN
(nylny)
P,,(N) = ~~(N)P,,(N)I-tN
(nyll)
on(N) = ~~(N)Pn(N)YN
(nylny)
dn(N) = x';;Pn(N)~n(N)
(nxlny)
En(N)
(nyil)
y';;Pn(N)I-tN
Complement C9.3
377
1(,J+l(N + 1) = (0 1(,J(N
""A,;+J
""A,;)
= (0
1(,~'(N) = (/,Y21(,J(N - 1)
yA,;
(AlI2yA,;_1
yeN~
xA,;
OY2x];,_1
x(N
yeN - n - 1
With the above identities in mind, a straightforward application of the update formulas
for PnCN) produces the following recursions:
= en(N) - dnCN)o;;l(N)~n(N)
~n+l(N + 1) = ~n(N) - OnCN)g;;l(N)EnCN)
On(N) = Aon(N - 1) + ~n(N)EJ(N)lYn(N)
Qn+l(N) = QnCN) - O~'(N)o;;l(N)On(N)
on+l(N + 1) = on(N) - OnCN)g;;l(N)oJ(N)
Yn+l(N) = Yn(N) -- ~~'(N)o;;l(N)~n(N)
dn(N) = Adn(N - 1) + en(NW~(N)/Yn(N)
En+l(N) = En(N) - OJ(N)o;;l(N)~n(N)
en+l(N)
The initial values for the recursions above follow from the definitions of the involved
quantities and the convention that Po(N) = I:
eo(N) = x(N)
~o(N) =
yeN - 1)
=1
Eo(N) = yeN)
oo(N)
= AOo(N -
1)
+ yeN - l)yT(N)
Chapter 9
Kn+l(N) = dn(N)a;;l(N)
Kn+l(N) = On(N)Q;;l(N)
Kn+l(N) = o~(N)a;;l(N)
Then the recursions for e,
~,
and
can be written as
1) = ~n(N) - Kn+l(N)En(N)
En+l(N)
= En(N) -
(C9.3.11)
Kn+l(N)~n(N)
These equations define a lattice filter for computation of the prediction errors, as
depicted in Figure C9.3.1. Note that, as compared to the (scalar) case analyzed in
Complement C9.2, the lattice structure of Figure C9.3.1 has one more line which
involves the parameters e and K.
r-------]
r-------~
yeN)
Eo(N)
:....;....~-_~'-4--t-----<
I
I
I
IEl(N)
EM-l(N)
EM(N)
. . . .. ...... -----,I__-<r---+\ + ) - - - - 1 - I
I
I
I
I
I
~l(N)
----~I--~----~
I
I
I-KM(N)
I
I
I
I
I
x(N)
leo(N)
lel(N)
--------~----~
L _______ J
1st lattice stage
....
eM-leN)
-----.--------~
FIGURE C9.3.1 The lattice filter implementation of the fast LS prediction algorithm
for multivariable regression models.
I
I
I
I
I
I~M(N)
Complement C9.3
379
Next it will be shown that a good approximation (for large N) of the LS parameter
estimates {8 n (N)} can be obtained from the parameters of the fast lattice predictor
introduced above. To see how this can be done, introduce the following notation:
C;:(q-l)
(B;:'l ..
B;:'n)
-tV~(N)ANcDn(N)[cD~(N)ANcDn(N)rl
= x~ANcDn(N)[cD~(N)ANcDn(N)]-l
enCN)
= x(N) - C;:(q-l)y(N)
~n(N)
nCN)
= yeN) + (A;:' 1
1)
Y(N :
yeN - n)
= B;:(q-l)y(N - 1)
and
...
A;;:n)(Y(N:-
1)) =
A;:(q-l)y(N)
yeN - n)
Substituting the above expressions for e, ~ and
=0
[B;::lCq-l) - q- 1 B;:(q-l) + Kn+1(N)A;:(q-I)]y(N) = 0
[A;:+l(q-l) - A;:(q-l) + Kn+1(N)q-lB;:(q-l)]y(N) = 0
[C;:+l(q-l) - C;:(q-l) -- Kn+l(N)q-lB;:(q-l)]y(N)
(C9.3.12)
As the number of data points increases, the predictor parameters will approximately
converge. Then, it can be concluded from (C9.3.12) that the following identities will
approximately hold (the superscript N is omitted):
(C9.3.13)
Chapter 9
Kn} which are provided by the lattice prediction algorithm. Note that this computation of {8 n (N)} may be performed when desired (i.e. for some specified values of N).
Note also that the hardware implementation of (C9.3.13) may be done using a lattice
filter similar to that of Figure C9.3.1.
Chapter 10
IDENTIFICATION OF
SYSTEMS OPERATING IN
CLOSED LOOP
10.1
Introduction
381
10.2
Chapter 10
Identifiability considerations
The situation to be discussed in this and the following sections is depicted in Figure 10.1.
The open loop system is assumed to be given by
T -
(10.1)
where e(t) is white noise. The input u(t) is determined through feedback as
u(t) = -F(q-I)y(t) + L(q-l)V(t)
(1O.2a)
In (1O.2a) the signal vet) can be a reference value, a setpoint or noise entering the
regulator. F(q-l) and L(q-l) are matrix filters of compatible dimensions (the notations
dim y(t) = dim eel) = ny, dim u(t) = nu, dim v(t) = nv are used in what follows). Most
often the goal of identification of the system above is the determination of the filters
Gs(q--l) and HsCq-l). Sometimes one may also wish to determine the filter F(q-l) of the
feedback path.
A number of cases will be considered, including:
.. The feedback F(q-l) mayor may not be known .
.. The external signal vet) mayor may not be measurable.
Later (see (10.17, the feedback (1O.2a) will be extended to a shift between r different
time-invariant regulators:
(1O.2b)
The reason for considering such an extension is that this special form of time-varying
regulator will be shown to give identifiability under weak conditions.
For the system (10.1) with the feedback (1O.2a) the closed loop system can be shown
to be
yet)
[1 + Gs(q-l)F(q-l)]-l[G.,(q-l)L(q-l)V(t) + Hs(q-l)e(t)]
u(t)
(10.3)
Section 10.2
The following general assumptions are made:
I>
I>
I>
I>
The open loop system is strictly proper (i.e. it does not contain any direct term). This
means that q,(O) = O. This assumption is weak and is introduced to avoid algebraic
loops in the closed loop system. (An algebraic loop occurs if neither Gs(q-l) nor
F(q-l) contains a delay. Then yet) depends on u(t), which in turn depends on yet). To
avoid such a situation it is assumed that the system has a delay so that yet) depends
only on past input values.)
The subsystems from v and e to y of the closed loop system are asymptotically stable
and have no unstable hidden modes. This implies that the filters L(q-l), Hs(q-l) and
[1 + Gs(q-l)F(q-l)]-l are asymptotically stable.
The external signal v(t) is stationary and persistently exciting of a sufficient order.
What 'sufficient' means will depend on the system. Note that it is not required that the
signal w(t) ~ L(q-l)V(t) is persistently exciting. For example, if vet) is scalar and the
filter L(q-l) is chosen to have zeros on the unit circle that exactly match the frequencies for which <1> v ( w) is nonzero, then w(t) will be persistently exciting of a lower
order than vet) (see Section 5.4). It will be convenient in the analysis which follows to
assume that vet) is persistently exciting, and to allow L(q-l) to have an arbitrary form.
The external signal v(t) and the disturbance e(s) are independent for all t and s.
Spectral analysis
The following two general examples illustrate that a straightforward use of spectral
analysis will not give identifiability.
Example 10.1 Application of spectral analysis
Consider a SISO system (nu = nv = ny = 1) as in Figure 10.1. Assume that L(q-l) == 1.
For convenience, introduce the signal
(lO.4a)
From (10.3) the following descriptions of the input and output signals are obtained:
yet)
G s(ql1)F(q I) [ Gs(q-l)V(t)
F(:-l) z(t) ]
1
u(t) = 1 + Gs(q l)F(q Ij[V(t) - z(t)]
(lO.4b)
(lO.4c)
Hence
<1>u(w) =
11 +
(lO.4d)
<1>yu(w) =
jl +
(lO.4e)
Assuming that the spectral densities <1>" (w) and <1> yu (w) can be estimated exactly, which
should be true at least asymptotically as the number of data points tends to infinity, it is
384
Chapter 10
(lO.4f)
=
== 0),
(lO.4h)
Here the result is the negative inverse of the feedback. In the general case it follows from
(1O.4f) that
1 + Gs(e-iW)F(e- i ",)
<Pz(w)
(lO.4i)
F(e iw)
<Pv(w) + <Pz(w)
which shows how the spectral densities <Pv(w) and <PAw) influence the deviation of
G(e- iw ) from the true value Gs(e- iW ).
For the special case when F(q-l) == 0, equation (10.4i) cannot be used as it stands.
However, from (lO.4a) ,
(lO.4D
and (lO.4i) gives for this special case
G(e- iw ) - Gs(e- iw )
I
F=O
(lO.4k)
This means that for open loop operation (F(q-l) == 0) the system is identifiable (cf.
Section 3.5).
To summarize, the spectral analysis applied in the usual way gives biased estimates if
there is a feedback acting on the system.
II
The next example extends the examination of spectral analysis to the multivariable case.
-F(q-l)y(t)
(10.5)
<Pu(w)
<Pyu(w)
-<py(w)FT(e iW )
(10.6)
Section 10.2
G(e- iUJ )
~yu(W)~~l(W)
(10.7)
(see (3.33. Assuming that the spectral density estimates ~yU and ~u are exact, (10.6),
(10.7) give
(10.8)
ey(t)
Hs(q-l)e(t)
eu(t)
(1O.9a)
These are the parts of yet) and u(t), respectively, that depend on the disturbances e(),
see (10.3). Since
yet) - Gs(q-l)U(t)
= #,(q-l)e(t) = ey(t)
one gets
(1O.9b)
This expression must be zero if the estimate (10.7) is to be consistent. It is easily seen that
the expression is zero if p(q-l) == 0 (i.e. no feedback), or expressed in other terms, if the
disturbance Hs(q-l)e(t) and the input u(t) are uncorrelated. In the general case of
P(q-t) '1= 0, however, (1O.9b) will be different from zero. This shows that the spectral
analysis fails to e,ive identifiability of the open loop system transfer function.
Sometimes, however, a modified spectral analysis can be used. Assume that the
external input signal vet) is measurable. A simple calculation gives, (d. (10.1
Gs(e-iw)<puv(w)
(10.10)
If v(t) has the same dimension as u(t) (i.e. nv = nu) G.\.(q-l) can therefore be estimated
by
(10.1~
This estimate of G,(e- iw ) will work, unlike (10.7). It can be seen as a natural extension of
III
the 'open loop formula' (10.7) to the case of systems operating under feedback.
It is easy to understand the reason for the difficulties encountered when applying spectral
analysis as in Examples 10.1 and 10.2. The model (10.8) provided by the method, that is
386
Chapter 10
yet) = -F-1(q-l)U(t)
gives a valid description of the relation between the signals u(t) and yet). This relation
corresponds to the inverse of the feedback law yet) ~ u(t) (see (10.5. Note that:
(i) The feedback path is noise-free, while the relation of interest u(t) ~ y(t) corresponding to the direct path is corrupted by noise.
(ii) Within spectral analysis, the noise part of the output is not modeled.
(iii) The nonparametric model used by the spectral analysis method by its very definition
has no structural restrictions. Hence it cannot eliminate certain true but uninteresting relationships between u(t) and yet) (such as the inverse feedback law model
(10.8)).
The motivation for the results obtained by the spectral analysis identification method lies
in the facts noted above.
The situation should be different if a parametric model is used. As will be shown later
in this chapter, the system will be identifiable under weak conditions if a parametric
identification method is used. Of the methods considered in the book, the prediction
error method shows most promise since the construction of IV methods normally is
based on the assumption of open loop experiments. (See, however, the next subsection.)
The IV methods can be extended to closed loop systems with a measurable external
input. A similar extension of the spectral analysis was presented in Example 10.2. This
extension of IV methods is illustrated in the following general example.
Example 10.3 IV method for closed loop systems
Consider the scalar system
A(q-l)y(t)
B(q-l)U(t)
+ wet)
(10.12)
Section 10.2
Identifiability considerations
387
One can apply IV estimators to (10.12) as in Chapter 8. The IV vector Z(t) will now
consist of filtered and delayed values of the external input vet). Then the second
consistency condition (8.25b) (EZ(t)w(t) = 0) is automatically satisfied. Fulfilment of
the first consistency condition, rank[EZ(t)<pT(t)] = dim 8, will depend on the system, the
experimental condition and the instruments used. As in the open loop case, this condition will be generically satisfied under weak conditions. The 'noise-free' vector <P(t)
will in this case be the part of <p(t) that depends on the external input. To be more exact,
note that from (10.12) and (10.13) the following description for the closed loop system is
obtained:
<p(t)
= (-y(t -
1)
-yet - na)
u(t - 1)
u(t - nbT
(1O.15a)
<P(t)
= (-.9(t -- 1)
-.9(t - na)
a(t - 1)
a(t - nb))T
(HU5b)
_
u(t)
= A(q
A(q-l)T(q-l)
l)R(q 1) + B(q l)S(q 1) vet)
(10. 15c)
_
yet)
= A(q
B(q-l)T(q-l)
l)R(q 1) + B(q l)S(q
(1O.15d)
In the following it will generally be assumed that a prediction error method (PEM) is
used. In most cases it will not be necessary to assume that the external input is measurable, which makes a PEM more attractive than an IV method. A further advantage
is that a PEM will give statistically efficient estimates under mild conditions. The
disadvantage from a practical point of view is that a PEM is more computationally
demanding than an IV method.
As a simple illustration of the usefulness of the PEM approach to closed loop system
identification, recall Example 2.7, where a simple PEM was applied to a first-order
system. It was shown that the appropriate requirement on the experimental condition in
order to guarantee identifiability is
E(-y(t)(_y(t)
u(t)
u(t)) > 0
(10.16)
388
Chapter 10
This condition is violated if and only if u(t) is not persistently exciting of order 1 or if u(t)
is determined as a static (i.e. zero order) linear output feedback, u(t) = -ky(t). It is easy
to see that for a proportional regulator the matrix in (10.16) will be singular (i.e. positive
semidefinite). Conversely, to make the matrix singular the signals yet) and u(t) must be
linearly dependent, which implies that u(t) is proportional to yet). For a linear higherorder or a nonlinear feedback the condition (10.16) will be satisfied and the PEM will
give consistent estimates of the open loop system parameters.
In the following a generalization will be made of the experimental condition (1O.2a).
As already stated a certain form of time-varying regulator will be allowed. Assume that
during the experiment, r different constant regulators are used such that
i
1, ... , r
(10.17)
1, ... , r
(10.18)
1
;=1
For example, if one regulator is used for 30 percent of the total experiment time and a
second one for the remaining 70 percent, then YI = 0.3, Y2 = 0.7.
When dealing with parametric methods, the following model structure will be used
(cf. (6.1:
(10. 19a)
Gu(t)
+ He(t)
Ee(t)eT(t)
(1O.19b)
For convenience the subscript s will be omitted in the true system description in the
following calculations. Note that the model parametrization in (1O.19a) will not be
specified. This means that we will deal with system identifiability (SI) (cf. Section 6.4).
We will thus be satisfied if the identification method used is such that
G=G
H=H
(10.20)
It is then a matter of the parametrization of the model only, and not of the experimental
condition, whether there is a unique parameter vector 8 which satisfies (10.20). In
Chapter 6 (see (6.44 the set DT(J,.At) was introduced to describe the parameter
vectors satisfying (10.20). If DT(J,./It) consists of exactly one point then the system is
parameter identifiable (PI) (cf. Section 6.4).
In the following sections three different approaches to identifying a system working in
closed loop will be analyzed:
Direct identification. The existence of possible feedback is neglected and the recorded
data are treated as if the system were operating in open loop .
Indirect identification. It is assumed that the external setpoint vet) is measurable and
that the feedback law is known. First the closed loop system is identified regarding vet)
Section 10.3
Direct identification
389
as the input. Then the open loop system is determined from the known regulator and
the identified closed loop system .
.Joint input-output identification. The recorded data u(t) and yet) are regarded as
outputs of a multivariable system driven by white noise, Le. as a multivariable
(ny + nu )-dimensional time series. This multivariable system is identified using the
original parameters as unknowns.
For all the approaches above it will be assumed that a PEM is used for parameter
estimation.
10.3
Direct identification
With this approach the recorded values of {u(t)} and {yet)} are used in the estimation
scheme for finding 8 in (1O.19a) as if no feedback were present. This is of course an
attractive approach if it works, since one does not have to bother about the possible
presence of feedback.
To analyze the identifiability properties, first note that the closed loop system
corresponding to the ith feedback regulator (10.17) is described by the following
equations (cf. (10.3)):
Yi(t)
= (I + GFJ-l(GLiv(t) + He(t
(10.21)
(10.22)
(see (1O.19b). Let Ii denote the time interval(s) when the feedback (10.17) is used. Ii
may consist of the union of disjoint time intervals. The length of Ii is YiN. The asymptotic
loss function associated with the PEM (see Chapter 7) is given by
v=
h(Roo(8
1
Ree(8) = lim N
N-'?oo
2:
EE(t)Er(t)
t=l
where h is a scalar increasing function such as tr or det. Note that if E(t) is a stationary
process then Rc/J reduces to Rc" = EE(t)ET(t), which was the case dealt with in Chapter 7.
In the present case E(t) is nonstationary. However, for t Eli, E(t) = Ei(t) which is
stationary. Thus
390
N->oo
lIV
Chapter 10
YiN EEi(t)ET(t)
;=1
2: y;EEi(t)ET(t)
(10.23)
;=1
Recall that as N tends to infinity, the prediction error estimates tend to the global
minimum points of the asymptotic loss function heR,,). Thus, to investigate the
identifiability (in particular, the consistency) properties of the PEM, one studies the
global minima of heR,,).
To see if the direct approach makes sense, consider a simple example.
Example 10.4 A first-order model
Let the system be given by
yet) + ay(t - 1)
= bu(t -
1)
+ e(t)
(cf. Example 2.8). The input is assumed to be determined from a time-varying proportional regulator,
-fly(t)
-f2y(t)
u(t) = {
Then
Yi(t) + (a + bfi)Yi(t - 1)
e(t)
Ei(t)
Yi(t) + (a + bfi)Yi(t - 1)
which gives
",2[1
.
+ (a + bli - a - bfi)2]
1 - (a + bf;)2
(1O.24a)
b)
?: ",2 = Vea, b)
(1O.24b)
Section 10.3
Direct identification
391
As shown in Section 7.5, the limits of the parameter estimates are the minimum points
of the loss function V(a, 6). It can be seen from (10.24) that a = a, 6 = b is certainly a
minimum point. To examine whether a = a, b = b is a unique minimum, it is necessary
to solve the equation V(a, 6) = ;.2 with respect to a and 6. Then from (10.24a),
+ btl - a - btl
a + biz -
a -
biz
= 0
(1O.24c)
= 0
or
(1
1
(1O.24d)
*'
i = 1,2
(1O.25a)
Then
v.
.",z
,y,
[1 + (d + bfi(a- +a bfi)2
- bfi)2]
(1O.25b)
1 -
and
(1O.25c)
Now VI does not have a unique minimum point. It is minimized for all points on the
line (in the a, b space) given by
(1O.25d)
The true parameters a = a, b = b are a point on this line. Similarly, the minimum points
of V2 are situated on the line
a + biz
= a
+ biz
(1O.25e)
The intersection of these two lines will thus give the minimum point of the total loss
function V and is the solution to (1O.24d). The condition t1 iz is necessary to get an
III
intersection.
*'
Return now to the general case. It is not difficult to see that (using G(O)
H(O) = il(o) = /)
Yi(t)
e(t)
C(O) = 0,
Using these results a lower bound on the matrix Roo (10.23) can be derived:
Chapter 10
;=1
r
2: y;Ee(t)eT(t)
Ee(t)eT(t)
;=1
If equality can be achieved then this will characterize the (global) minimum points of any
suitable scalar function V = heR,,).
To have equality it is required that
i
= 1, ... ,
(10.26)
o[1 -
F;(1
+ GF;)-lG
- OU - Fj (1
iI- 1 (1
(10.27)
+ GFi)-lG}]L;v(t) == 0
+ OF;)(1 + GFi)-lH == I
1, ... , r
(10.28)
These two conditions describe the identifiability properties of the direct approach.
Note that (10.27), (10.28) are satisfied by the 'desired solution' 0 = G, iI = H. The
crucial problem is, of course, whether other solutions exist or not. The special cases
examined in the following two examples will give useful insights into the general
identifiability properties.
Example 10.5 No external input
Assume that vet) == 0 and r = 1. Then (10.27) carries no information and (10.28) gives
iI- 1 (1 + OF1 )(1 + GFI)-lH == I
(10.29)
However, this relation is not sufficient to conclude (10.20), i.e. to get identifiability. It
depends on the parametrization and the regulator whether or not
H==H
are obtained as a unique solution to (10.29). Compare also Example 2.7 where the
identity (10.29) is solved explicitly with respect to the parameter vector e.
The identity (10.29) can be rewritten as
Direct identification
Section 10.3
393
If the regulator FI has low order, then the transfer function fI-1(1
fI-1ML1v(t) == 0
Since LJv(t) is pe it follows from Property 4 of pe signals (Section 5.4) that
which implies that M == O. Now use the fact that
iI- 1M == 0,
== 0
or
which implies
G=G
For this case (where use is made of an external persistently exciting signal) identifiability
II
is obtained despite the presence of feedback.
Returning to the general case, it is necessary to analyze (10.27), (10.28) for i = 1, ... , r.
It can be assumed that vet) is persistently exciting of an arbitrary order. The reason is that
L;(q-l) can be used to describe any possible nonpersistent excitation property of the
signal L;(q-I)V(t) injected in the loop. Then v(t) in (10.27) can be omitted (cf. Property 4
of Section 5.4). Using again the relations
394
Chapter 10
GF;)-IG = ] - F;G(1
F;G)-1 = (1
F;G)-1
(iI- l
H- l )G(I
F;G)-IL; - (iI- 16
H- 1G)(1
F;G)-IL;
== 0
H- 1 G)F;(1
G}i)-lH
== 0
(10.30)
(iI- 1
H- l )(1
(iI- l
H- l )
GF;r- I H
+ (iI- 16 -
or
+ (iI- 16 -
H-1G)F;
== 0
(10.31)
G(1 + F;G)-IL;)
-(1 + FiG)-lL;
0) (I
L;
(10.32)
G(1 + F;G)-IL;)
-I
The last matrix appearing in (10.32) is clearly nonsingular. Thus considering simultaneously all the r parts of the experiment,
(10.33)
Hence if the matrix
'76
(~ ::_F_~ ~_1_:_::_L_Or_)
__
(_10_.3_4_)~
___________________________
which trivially gives the desired relation (10.20). This rank condition is fairly mild. Recall
that the block matrices in g; have the following dimensions:
]
Since Fi and Li in g; are filters in the operator q-l, the rank condition above should be
explained. Regard for the moment q-I as a complex variable. Then we require that
rank g; = nu + ny holds for almost every q-l (as already stated immediately following
(10.34.
Indirect identification
Section 10.4
395
r(ny + nv)
(10.35)
This gives a lower bound on the number r of different regulators which can guarantee
identifiability. If in particular nu = nv, then r ?! 1. The case nu = nv, r = 1 was examined
in Example 10.6. Another special case is when nv = 0 (no additional external input).
Then r?! 1 + nu/ny. If in particular ny = nu (as in Example 10.4), then r?! 2, and it is
thus necessary and in fact also sufficient (see (10.34)) to use two different proportional
regulators.
The identifiability results for the direct approach can be summarized as follows:
.. Identifiability cannot be guaranteed if the input is determined through a noise-free
linear low-order feedback from the output .
.. Identifiability can be obtained by using a high-order noise-free linear feedback. The
order required will, however, depend on the order of the true (and unknown) system.
See Complement CI0.1 for more details .
.. Simple ways to achieve identifiability are: (1) to use an additional (external) input such
as a time-varying setpoint, and/or (2) to use a regulator that shifts between different settings during the identification experiment. The necessary number of settings
depends only on the dimensions of the input, the output and the external input.
H;(q-l) ~ (I
+ GF;)-lH
GF;)-lGL;
i = 1, ... , r
(10.37)
The parameters of the system (10.36) can be estimated using a PEM (or any other
parametric method giving consistent estimates) as in the open loop case, since vet) is
assumed to be persistently exciting and independent of e(t).
If Gi and Hi are parametrized by the original parameters of G and H (Li and Fi in the
expressions of G; and Hi are known), then the second step above is not needed since the
parameters of interest are estimated directly. In the general case, however, when
unconstrained 'standard' parametrizations are used for G; and Hi, step 2 becomes
396
Chapter 10
necessary. It is not difficult to see that both these ways to parametrize a; and H; lead to
estimation methods with identical identifiability properties. In the following, for
conciseness, it is assumed that the two,-step pr9cedure above is used. Therefore assume
that, as a result of step 1, estimates a; and Hi of ai and Hi are known. In step 2 the
equations
(I
+ GF;)-IGL; ==
(I
-I'
G;
-"-
+ GF;) H == Hi
(10.38)
i=l, ... ,r
if.
Hi' In view of
(10.39)
which obviously is equivalent to (HU8). Consider next the first part of (10.38). This
equation can also be written
GL i
or
(10.40)
which is equivalent to (10.27) for persistently exciting vet). Thus the same identifiability
relations, namely (10.27), (10.28), are obtained as for the direct identification. In
particular, the analysis carried out for the direct approach, leading to the rank requirement on P:7 in (10.34) is still perfectly valid. This equivalence between the direct and
the indirect approaches must be interpreted appropriately, however. The identifiability
properties are the same, but this does not mean that both methods give the same result in
the finite sample case or that they are equally easy to apply. An advantage of the direct
approach is that only one step is needed. For the indirect approach it is not obvious how
the 'identities' (10.38) should be solved. In the finite sample case there may very well be
no exact solution with respect to 8. It is then an open question in what sense these
'identities' should be 'solved'. Accuracy can easily be lost if the identities are not solved
in the 'best' way. (See Problem 10.2 for a simple illustration.) To avoid this complication
of the two-step indirect approach i and H; can be parametrized directly in terms of the
original parameters of G and H. Then solving the equations (10.38) is no longer
necessary, as explained above. Note, however, that the use of constrained parametrizations for i and Fi; complicates the PEM algorithm to some extent. Furthermore, there
is a general drawback of the indirect identification method that remains. This drawback
is associated with the need to know the regulators and to be able to measure vet), which,
moreover, should be a persistently exciting signal.
Section 10.5
(y(t))
z(t) g
u(t)
(10.41)
as an (ny + nu )-dimensional time series, and apply standard PEM techniques for
estimating the parameters in an appropriately structured model of z(t). To find the
identifiability properties one must first specify the representation of z(t) that is used. As
long as one is dealing with prediction error methods for the estimation it is most natural
to use the 'innovation model' see (6.1), (6.2) and Example 6.5. This model/representation is unique and is given by
z(t) = ;/C'(q-l; 8)iU)
(10.42)
Ei(t)iT(s) = Ai(8)ot,s
where
;/C'-1 (q -1; 8) is asymptotically stable
;/C'(O; 8)
(10.43)
I
Since the closed loop system is asymptotically stable it must be assumed that ;/C'(q-\ 8) is
also asymptotically stable.
The way in which the filter ;/C'(q-\ 8) and the covariance matrix Ai(8) depend on the
parameter vector 8 will be determined by the original parametrization of the model.
Using the notational convention of (10.19), it follows from the uniqueness of (10.42) as a
model for z(t) and from the consistency of the PEM applied to (10.42), that the relations
determining the identifiability properties are
~_~___A_i_=___A_i_______________________________(1__0.~
L-_;/C'
__=
__
Similarly to the indirect approach, the joint input-output method could be applied in
two ways:
Parametrize the innovation model (10.42) using the original parameters. Determine
these parameters by applying a prediction error method to the model structure
(10.42) .
The second alternative is a two-step procedure. In the first step the innovation model
(10.42) is estimated using an arbitrary parametrization. This gives estimates, say
X(q-l) and Ai. Then in the second step the identities
(10.45)
are solved with respect to the parameter vector 8.
The identifiability properties of both these methods are given by (10.44). For practical
use the first way seems more attractive since the second step (consisting of solving
(10.45), which can be quite complicated) is avoided.
For illustration of the identity (10.44) and of the implied identifiability properties,
consider the following simple example.
398
Chapter 10
iel
+ ay(t -
1) = bu(t - 1)
+ e(t) + ce(t -
1)
(1O.46a)
A;
+ vet)
(1O.46b)
where vet) is white noise (Ev 2 (t) = A~) and is independent of e(s) for all t and s. Assume
that the closed loop system is asymptotically stable, i.e.
+ bf
(10.46c)
+-
aq-l)y(t)
bq-1v(t) + (1 + cq-1)e(t)
and hence
z(t) =
1
-1 (
+ aq
b q - 1 ) (e(l)
1 + aq-l
vet)
1 + cq---l
-f(l
+ cq-l)
(10.47a)
This is 'almost' an innovation form. The inverse of the (212) matrix filter in (1O.47a) has a
denominator given by (1 + cq-1)(1 + aq-1) and is hence asymptotically stable.
However, the leading (zeroth-order) term of the filter differs from I. So to obtain the
innovation form the description (1O.47a) must be modified. This can be done in the
following way:
1
(1
+
+
aq
from which
-1
:7{'(q
; e) = 1
(1O.47b)
(c + bf)q--l
f(a - C)q-l
1 (1 +
+aq-l
(c + bf)q-l
f(a _ C)q-l
bq - 1 )
1 + aq-l
(10.48)
A(e) = E (
e(t)
) (e(t)
-fe(t) + vet)
-feet)
+ v(t = (
A;
-fAe
Now assume that a PEM is used to fit a first-order ARMA model to z(t). Note that in
doing so any unique parametrization of the first-order ARMA model may be used; in
other words, there is no need to bother about the parametrization (10.48) at this stage.
Next, using the estimates ;Ie and A provided by the previous step, determine the
unknown parameters a, b, etc., of the original system, by solving the identities (10.44).
At this stage the parametrization (10.48) is, of course, important. The quantities a, b, c
and A; are regarded as unknowns, as are f and A~ (which characterize the feedback
Joint ident~fication
Section 10.5
399
1
+
(1 + +
1 (1
+
(6
bq - 1 )
1 + aq-l
bj)q-l
jCa - 6)q-l
b/)q--l
+ (c + bf)q-l
- 1 + (a
~2
(_;~;
bf)q
_j~2)
+e ~~
p~;
bq- 1 )
1 + aq-l
f(a - C)q-l
(,,2
-f";
(10.49)
--fA?)
f 2 ,,; +e ,,~
a= a
6= c
~2e = ,,2e
j=f
~2v
,,2v
(10.50)
Thus both the system and the regulator parameters can be identified consistently.
As a general case consider the following feedback for the system (10.1):
=
vet) =
u(t)
-F(q-l)y(t)
+ vet)
K(q-l)W(t)
Ew(t)w T(S)
(10.51)
Aw<)".,
Assume that:
The closed loop system is asymptotically stable.
K(q-l) and K- 1(q-l) are asymptotically stable, K(O) = I.
wet) and e(s) are independent for all t and s .
.. H-1(q-l)G(q-l) and K- 1 (q-l)F(q-l) are asymptotically stable.
The model to be considered is
yet)
G(q-\ 8)u(t)
+ H(q-l; 8)e(t)
+ K(q-l; 8)w(t)
Ee(t)eT(t)
Ae(8)
Ew(t)wT(t) = Aw(8)
(10.52)
One thus allows both the regulator F(q-l) and the filter K(q-l) to be (at least partly)
unknown. Note that since the parametrization will not be specified, the situations when
F(q-l) and/or K(q-l) are known can be regarded as special cases of the general model
(l0.52).
In Appendix AlO.1 it is shown that for the above case one can identify both the system,
the regulator and the external input shaping filter. More specificaUy, it is shown that the
identifiability relations (10.44) imply
H=H
F=F
K=K
(10.53)
The joint input-output approach gives identifiability under essentially the same
conditions as the direct and indirect identification approaches. It is computationally more
demanding than the direct approach, but on the other hand it can simultaneously identify
the open loop system, the regulator and the spectral characteristics of the external signal
vet).
400
Chapter 10
To get a more specific relation to the direct approach, consider again the case given by
(10.52). Assume that some parameters, say 81> were used to model G and H, while
others, say 821 were used to model F and K. If a direct approach is used for estimation of
8 1 the matrix
(10.54)
(10.55)
For the joint input-output method some appropriate scalar function of the matrix
1
N
2)
1=1
(10.56)
is to be minimized in some suitable sense, where the prediction error is given by (see
(10.42) and Appendix A1O.1)
El(t,8 1) )
E2(t, 8], 8 2 )
Let
(10.58)
2(t, 82 ) = K- 1(q-\ 82)[u(t)
+ F(q-\ 82)y(t)]
Assume that the regulator has a time delay such that F(O)
1 L.J
~ (El(f,
81
S N(8 b 82 ) = N
_
1=1
E2(t,
82 )
[E1
(t, 8 1)
-T
E2 (t,
(10.59)
= O.
Then
8 2 )]
(10.60)
Choose the criterion functions, for some weighting matrices Ql and Q2, as
Vdir (8 t ) = tr QIRN(8t)
Vjoint(8j, 8
2)
= tr{
(~1
;J
tr[QI RN(8 1)
= Vdir (8 1)
(10.61)
SN(8 1 , 82)}
+ Q2 RN(8 2)]
(10.62)
Then the corresponding estimates 81 (which m1l11mlZe the two loss functions) are
identical. Note that (10.62) holds for finite N and for all parameter vectors. Thus, in this
Section 10.6
Accuracy aspects
401
case the direct and the joint input-output approaches are equivalent in the finite sample
case, and not only asymptotically (for N ~ (0).
lal <
(10.63)
Ey2(t)
~ E[ e(tW
(10.64)
= ",z
Moreover, equality in (10.64) is obtained precisely for the so-called minimum variance
regulator,
u(t)
= by(t)
(10.65)
Assume that the parameters a and b of (10.63) are estimated using a PEM applied in a
model structure of the same form as (10.63). The PEM will then, of course, be the simple
LS method. Take the determinant of the normalized covariance matrix of the parameter estimates as a scalar measure of the accuracy. The accuracy criterion is thus
given by
402
Chapter 10
(10.66)
== ry(O)ru(O) - r~,,(O)
with ry(O) = Ey2(t), ru(O) = Eu 2(t), ryu(O) = Ey(t)u(t) (see (7.68)). The problem is to
find the experimental conditions that minimize this criterion. Clearly V can be made
arbitrarily small if the output or input variance can be chosen sufficiently large. Therefore it is necessary to introduce a constraint. Consider a constrained output variance.
In view of (10.64), impose the following constraint:
(10.67)
where () > 0 is a given value.
First the optimum of V under the constraint (10.67) will be determined. Later it will be
shown how the optimal experimental condition can be realized. This means that the
optimization problem will be solved first without specifying explicitly the experimental
condition. This will give a characterization of any optimal experimental condition in
terms of ry(O) , ryu(O) and ruCO). The second step is to see for some specific experimental
conditions if (and how) this characterization can be met.
For convenience introduce the variables rand z through
ry(O)
A2(1 + b2r)
Ey(t)y(t + 1)
A2bz
(10.68)
ry,,(O)
Ey(t)u(t)
b1 Ey(t)[y(t +
+ abr)
1
b2[(1 + a2)A2(1 + b2r) +
,,2 + 2aA2bz -
2,,2]
One can now describe the criterion V, (10.66), using rand z as free variables. The
result is
Section 10.6
Accuracy aspects
403
(10.70)
Consider now minimization of V, (10.70), under the constraint (10.69). It is obvious that
the optimum is obtained exactly when
z = 0
(1O.71a)
v = 0(1
b2
(lO.71b)
+ 0)
So far the optimization problem has been discussed in general terms. It now remains
to see how the optimal condition (10.71) can be realized. In view of (10.68), equation
(lO.71a) can also be written as
Ey(t)y(t + 1)
=0
(10.72)
-1--2: ~ A,2(1 + 0)
- a
i.e.
a2
1 - a
(10.73)
0;:;;':--2
since otherwise the constraint (10.67) cannot be met at all. Introduce now
wet)
1
-1- - = 1 u(t)
+ aq
Then
yet) = bw(t - 1) + 1
+ aq Je(t)
A,2
a2 )
(10.74)
404
Chapter 10
or
lal
~ 1 - lal
= eel)
b yet) +
(10.75)
v(t)
which is a minimum variance regulator plus an additive white noise vet) of variance "'~'
which is independent of e(s) for all t and s. The closed loop system then becomes
y(t)
= bv(t -
1)
+ e(t)
",Z = ",2(1
+ b)
(10.76)
which gives "'~ = I.?b/b 2 Note that by an appropriate choice of A~ one can achieve the
optimal accuracy (lO.71b) for any value of b. This is in contrast to the case of open loop
operation.
As a third possibility consider the case of shifting between two proportional regulators
to be used in proportion Yi, i = 1, 2
(10.77)
Yi(t) + (a + bkJy;(t - 1)
e(t)
Since the data are not stationary in this case the covariances which occur in the definition
of V, (10.66), should be interpreted with care. Analogously to (10.23),
2
2
ry(O) = y\EYl(t)
+ Y2 EYZ(t)
, etc.
Section 10.6
r =
Y2r2 =
61b 2
z =
YIZI
Y2Z2 =
(10.78)
61b 2
+ 6)
(10.79)
i = 1, 2
r1 = r2
= EYT(t) = 1 _ (a
bk;)2
or
(10.80)
The second requirement in (10.78) gives
(a + bk 1)
/.,.2
Y1i - (a + bk1f
(a + bk2 )
+ Y2 1 - (a + bk2f
/.,.2 -
Yl = Y2 = 0.5
Thus this alternative consists of using two different proportional regulators, each to be
used during 50 percent of the total experiment time, and each giving an output variance
1 + 0 -I----lL..--4-------------7f
O~~r_--_+----------~----------~~--_r--~
-1 - a
bk 1
-a
bk2
1- a
bk
FIGURE 10.2 The output variance versus the regulator gain k, Example 10.8.
406
Chapter 10
equal to the bound A,z(1 + 0). Note again, that by using feedback in this way one can
achieve the optimal accuracy (1O.71b) for all values of o.
The optimal experimental condition (10.80), (10.81) is illustrated in Figure 10.2,
where the output variance is plotted as a function of the regulator gain for a proportional
feedback u(t) = -ky(t).
The minimum variance regulator k = -alb gives the minimal value ",z of this variance.
The curve happens to be symmetric around k = -alb. The two values of k that give
Ey2(t) = A2(1 + 0) are precisely those which constitute the optimal gains given by
II
(10.80).
The above example shows that the experimental conditions involving feedback can really
be beneficial for the accuracy of the estimated model. In practice the power of the input
or the output signal must be constrained in some way. In order to get maximally
informative experiments it may then be necessary to generate the input by feedback. Note
that the example illustrates that the optimal experimental condition is not unique, and
thus that the optimal accuracy can be obtained in a number of different ways. Open loop
experiments can sometimes be used to achieve optimal accuracy, but not always. In
contrast to this, several types of closed loop experiments are shown to give optimal
accuracy.
Summary
Section 10.2 showed that for an experimental condition including feedback it is not
possible to use (in a straightforward way) nonparametric identification methods such as
spectral analysis. Instead a parametric method for which the corresponding model has an
inherent structure must be used. Three parametric approaches based on the PEM to
identify systems which operate in closed loop were described:
.. Direct identification (using the input--output data as in the open loop case, therefore
neglecting the possible existence of feedback; Section 1003) .
.. Indirect identification (first identifying the closed loop system, next determining the
open loop part assuming that the feedback is known; Section 10.4) .
.. Joint input-ooutput identification (regarding u(t) and yet) as a multivariable (nu + ny)dimensional time series and using an appropriately structured model for it; with this
approach both the open loop process and the regulator can be identified; Section
10.5).
These approaches will all give identifiability under weak and essentially the same conditions. From a computational point of view, the direct approach will in most cases be
the simplest one.
H was shown in Section 10.6 that the use of feedback during the experiment can be
beneficial if the accuracy of the estimates is to be optimized under constrained variance
of the output signal.
Problems 407
Problems
Problem 10.1 The estimates of parameters of (nearly unidentifiable' systems
have poor accuracy
From a practical point of view it is not sufficient to require identifiability, as illustrated in
this problem.
Consider the first-order system
yet) + ay(t - 1)
bu(t - 1)
e(t)
where e(t) is white noise of zero mean and variance }..2. Assume that the input is given by
u(t)
= -fy(t) + vet)
where vet) is an external signal that is independent of e(s) for all t and s. Then the system
is identifiable if v(t) is pe of order l.
(a) Let vet) be white noise of zero mean and variance 0 2 . Assume that the parameters a
and b are estimated using the LS method (the direct approach). Show that the
variances of these parameter estimates fi and Gwill tend to infinity when 0 2 tends to
zero. (Hence a small positive value of 0 2 will give identifiability but a poor accuracy.)
(b) The closed loop system will have a pole in - a, with a = a + bf. Assume that the
model with parameters fi and Gis used for pole assignment in -a. The regulator gain
will then be j = (a - fi)/G. Determine the variance of j and examine what happens
when 0 2 tends to zero.
Hint.
f - f
a a - a
b(a -r
- -b-=
fi
Hence vare!)
(-lib
fi) -
b2
G(a - a)
--lib 2 )
-(a -- a)/b
yet)
bu(t - 1)
e(t)
+ vet)
where k is assumed to be known and vet) is white noise of zero mean and variance
The noises vet) and e(s) are independent for all t and s.
0 2.
(a) Assume that direct identification (with the LS method) is used to estimate b in the
model structure
408
Chapter 10
bu(t - 1) + E(t)
hI = b
h _ kii + b
2-1+k2
bl , 62 and 63 . For 63
yet) + ay(t - 1)
e - a
Assume that v(s) and the white noise e(t) are independent.
Assume that direct identification using the PEM applied to a first-order ARMAX
model structure is applied. Examine the identifiability condition in the following three
cases:
(a) vet) == 0
(b) vet) == m (a nonzero constant)
(c) vet) = a sin wt (0 < w < Jr, a > 0)
Problems
409
Problem 10.4 On the use of the output error method for systems operating under feedback
The output error method (OEM) (mentioned briefly in Section 7.4 and discussed in some
detail in Complement C7.5) cannot generally be used for identification of systems
operating in closed loop. This is so since the properties of the method depend critically
on the assumption that the input and the additive disturbance on the output are uncorrelated. This assumption fails to hold, in general, for closed loop systems. To illustrate
the difficulties in using OEM for systems operating under feedback, consider the following simple closed loop configuration:
direct path:
yet) = gu(t - 1)
+ vet)
Igl
< 1
-yet) + ret)
The assumption Ig I < 1 guarantees the stability of the closed loop system. For simplicity,
consider the following special (but not peculiar) disturbance vet) and reference signal
ret):
vet) = (1
ret)
+ gq-1)e(t)
= (1 +
gq-l)E(t)
where
o;o(,S
=
Ee(t)E(s) =
o~o(,S
Ee(t)e(s)
EE(t)E(S)
0 for all t,
The output error estimate g of g determined from input-output data {u(t), y(t)} is
asymptotically (i.e. for an infinite number of data points) given by
AO(q-l)y(t)
BO(q-l)U(t) + Co(q-l)e(t)
where
Bo(q -1)
CO(q-l)
= bo]q -1 + ... + b
Onq- n
= 1 + COlq--1 + ... + conq--n
are assumed to have no common factor. Assume that z"B o(Z-1) has all zeros inside
the unit circle and that z = 0 is the only common zero to Bo(z) and Ao(z) - Co(z).
Assume further that the system is controlled with a minimum variance regulator.
This regulator is given by
410
Chapter 10
Bo(q 1)
Suppose that the system is identified using a direct PEM in the model structure (an
nth-order ARMAX model)
.At: A(q-I)y(t)
B(q-l)U(t)
+ C(q-l)e(t)
Show that the system is not identifiable. Moreover, show that if b i and C(q-I) are
fixed to arbitrary values then biased estimates are obtained of A(q-l) and the
remaining parameters of B(q-I). Show that the minimum variance regulator for the
biased model coincides with the minimum variance regulator for the true system.
(Treat the asymptotic case N ~ 00.)
(b) Consider the system
yet)
+ aoy(t -
1) = bou(t - 1)
+ e(t) + coe(t -
1)
Assume that it is identified using the recursive LS method in the model structure
~u(t
- 1)
+ e(t)
aCt)
~
yet)
Investigate the possible limits of the estimate aCt). What are the corresponding
values of the output variance? Compare with the minimal variance value, which can
be obtained when the true system is known.
Hint. The results of Complement C1O.1 can be used to answer the questions on
identifiability.
Problem 10.6 Another optimal open loop solution to the input design
problem of Example 10.8
Consider the optimization problem defined in Example 10.8. Suppose that
is satisfied so that an open loop solution can be found. Show that an optimal input can be
generated as a sinusoid
u(t)
A sin
OJt
OJ.
Problems
411
r?!O
(b) Use the result of (a) to show that any values of rand z can be obtained by using a
shift between two proportional regulators.
Problem 10.8 Optimal accuracy with bounded input variance
Consider the system (10.63) with lal < 1 and the criterion V, (10.66). Assume that Vis to
be minimized under the constraint
1..2
(0) 0)
Eu 2 (t) :;:;:; b 2 0
(a) Use the variables rand z, introduced in (10.68), to solve the problem. Show that
optimal accuracy is achieved for
0(1
+ a2 ) + a2 (1 - a2 )
b2(1 _ a2?
r =
= -
a(20 + 1 - a 2 )
b(l _ a2)2
(b) Show that the optimal accuracy can always be achieved with open loop operation.
Problem 10.9 Optimal input design for weighting coefficient estimation
The covariance matrix of the least squares estimate of {h(k)} in the weighting function
equation model
M
yet)
~ h(k)u(t - k)
+ e(t)
Ee(t)e(s) = ".lOt,s
1, "."' N
k=l
1..2 { E (
1) (u(t -
}-l
u(t - M)
Show that both det P and [P]ii as well as Amax( P) achieve their minimum values over fi'if
u(t) is white noise of zero mean and unit variance.
Hint. Apply Lemma A.35.
Problem 10.10 Maximal accuracy estimation with output power constraint
may require closed loop experiments
To illustrate the claim in the title consider the following all-pole system:
yet)
+ aly(t -
1)
1)
+ e(t)
(i)
412
Chapter 10
where e(t) is zero-mean white noise with variance ).,.2. The parameters {aj}i'=l and bare
estimated using the least squares method. For a large number of data points the
estimation errors {v'N(ai - ai)}i'=h v'N(b - b) are Gaussian distributed with zero
mean and covariance matrix P given in Example 7.6. The optimal input design problem
considered here is to minimize det P under output power constraint
u(t)
mm
arg
det P
(6
0)
Show that a possible solution is to generate u(t) by a minimum variance feedback control
together with a white noise setpoint perturbation of variance A.26/b2 , i.e.
u(t)
Ew(t)w(s)
= [aly(t) + a2y(t =
(A.26/b 2 )6 t ,s
+ ... + any(t - n +
Ew(t)e(s) = 0 for all t, s
1)
l)]/b
+ wet)
(ii)
Hint. Use Lemma A.5 to evaluate det P and Lemma A.35 to solve the optimization
problem.
Bibliographical notes
The chapter is primarily based on Gustavsson et af. (1977, 1981). Some further results
have been given by Anderson and Gevers (1982). See also Gevers and Anderson (1981,
1982), Sin and Goodwin (1980), Ng et al. (1977) and references in the above works.
The use of IV methods for systems operating in closed loop is described and analyzed
by Soderstrom, Stoica and Trulsson (1987).
Sometimes it can be of interest to design an open loop optimal input. Fedorov (1972)
and Pazman (1986) are good general sources for design of optimal experiments. For
more specific treatments of such problems see Mehra (1974, 1976, 1981), Goodwin and
Payne (1977), Zarrop (1979a, b), Stoica and Soderstrom (1982b). For a further
discussion of experiment design, see also Goodwin (1987), and Gevers and Ljung (1986).
For some practical aspects on the choice of the experimental condition, see, for
example, Isermann (1980).
Appendix Al 0.1
Analysis o/the joint input-output identification
In this appendix the result (10.53) is proved under the assumptions introduced in
Section 10.5.
The identifiability properties are given by (10.44). To evaluate them one first has to
find the innovation representation (10.42). The calculations for achieving this will
resemble those made in Example 10.7. Set
Appendix A10.l
Fo
Fo = F(O)
413
= F(O, 0)
(y(t)
u(t)
(
=
(1 + GF)-l(GKw(t) + He(l))
)
{I - F(I + GF)-lG}Kw(t) - F(1 + GF)-lHe(t)
= ( (1
= (
(I + GF)-1H
-F(1 + GF)-lH
G(1 +FG)-lK)( I
(1 + FG)-lK
Fo
0)(
I
(AlO.1.1)
I
o)(e(t)
-Fo I
wet)
G(I + FG)-lK) (
e(t)
)
(1 + FG)-lK
-Foe(t) + w(t)
eel)
)
- Foe(t) + wet)
Since the closed loop system is asymptotically stable, then so also is the filter de'. Using
the variant (A.4) of Lemma A.2,
de'-l
(~
K- 1 (1+ FGJ
(~
Thus the filter J{>-l is asymptotically stable since by assumption H- 1 , H-1G, K- 1 and
K- 1F are asymptotically stable.
Since G(O) = 0, H(O) = I, K(O) = I, F(O) = Fo, it is easy to verify that de'(0) = I. It is
also trivial to see that i(t) is a sequence of zero mean and uncorrelated vectors (a white
noise process). Its covariance matrix is equal to
414
Chapter 10
(AlO.1.2)
FoAe = FoAe
A
AT
= FoAe F 0 + Aw
(I + OFr-l(iI + OKFo) == (I + GF)-l(H + GKFo)
0(1 + FO)-lK == G(I + FG)-lK
(I + FO)-l( -FiI + KFo) == (I + FG)-l( -FH + KF'o)
(I + For-iK == (I + FG)-lK
FoAe F 0 + Aw
(A1O.1.3c)
(A1O.L3d)
(A1O.1.3e)
(A1O.1.3f)
(A1O.1.3g)
(A1O.1.4)
Now, equation (A1O.l.3e) gives
(I + OF)-lOK
= (I + GFr-1GK
or
(A1O.l.Sa)
(A1O.l.Sb)
(A1O.LSc)
Appendix A10.l
Equation (A1O.1.3f) gives (recall from (A1O.1.4) that
-(hi -
PH)
+ (K - K)Fo
FH - KFo
fto
415
= Fo)
+ (1 + ftO)(1 + FG)-l
x (-FH + KFo)
= ((1 +
ftO)(1 + FG)-l - J]
(A1O.l.5d)
x (-FH + KFo)
= (fto
Using
OK- GK = O(K - K)
+ (0 - G)K
O(fto - FG)(1
[O(fto - FG)
or
which implies
(A1O.1.6a)
O==G
Similarly, using
ftiI - FH = fteiI - H)
+ (ft - F)H
-ft(oft- GF)(l
- (fto - FG)(1
or
which implies
F==F
(A1O.1.6b)
H==H
K==K
(A1O.1.6c)
The conclusion is that the system is identifiable using the joint input-output approach.
The identities (AlO.1.4) and (AlO.1.6) are precisely the stated identifiability result
(10.53).
416
Chapter 10
Complement C10.1
Identifiability properties of the P EM applied to ARMAX systems
operating under general linear feedback
Consider the ARMAX system
(CIO.I.l)
where
A(q-I) = 1
(ClO.I.2)
The integer k, k ;?; 1, accounts for the time delay of the system. Assume that the system is
controlled by the linear feedback
(C10.I.3)
where
R(q-l) = 1
S(q-I) =
So
rlq-l
+ ... +
rnrq-nr
(ClO.I.4)
Note that the feedback (ClO.I.3) does not include any external signal. From the
viewpoint of identifiability this is a disadvantage, as shown in Section 10.3. The closed
loop system is readily found from (ClO.I.1) and (ClO.1.3). It is described by
(ClO.I.S)
The open loop system (ClO.I.1) is assumed to be identified by using a direct PEM
approach applied to an ARMAX model structure.
We make the following assumptions:
AI. There is no common factor to A, Band C; C has all its zeros outside the
unit circle.
A2. There is no common factor to Rand S.
A3. The closed loop system (ClO.I.5) is asymptotically stable.
A4. The integers na, nb, nc and k are known.
Assumptions A1- A3 are all fairly weak. Assumption A4 is more restrictive from a
practical point of view. However, the use of A4 makes the details in the analysis much
easier, and the general type of conclusions will hold also if the polynomial degrees are
overestimated. For a detailed analysis of the case when A4 is relaxed, see Soderstrom
et at. (1975).
Write the estimated model in the form
(ClO.1.6)
Complement CiO.i
417
where E(t) is the model innovation. The estimated polynomials A, Band Care
(asymptotically, when the number of data points tends to infinity) given by the
identifiability identity (10.26):
E(t)
(C1O.1.7)
e(t)
(C1O.1.8)
(C1O.1.9)
where we have dropped the polynomial arguments for convenience. This is the identity
that must be analyzed in order to find the identifiability properties. Consider first two
examples.
Example CIO.1.1 Identifiability properties of an nth-order system with an
mth-order regulator
Assume that na = nb = ne = n, nr = ns = m. Assume further that C and AR + q-kBS
are coprime. The latter assumption implies that the numerators and the denominators of
(C1O.1.9) must satisfy
C==C
'1R + q-kBS
== AR + q--kBS
(ClO. 1. lOa)
(A -
A)R
== -q-k(B - B)S
(ClO.1.lOb)
This identity can easily be rewritten as a system of linear equations by equating the
different powers of q-l. There are 2n + I unknowns (the coefficients of A and B) and
k + n + m equations. Hence, in order to have a unique solution we require
2n+1::S: k+n+m
which is equivalent to
k+m~n+l
(CIO.l.l~
418
Chapter 10
A-
q-kS is a factor of
A
(CIO.1.lOd)
R is a factor of B - B
In view of the degree condition (ClO.1.lOc) this is impossible. Thus (CIO.1.lOb) implies
A == 0, B - B == 0, which proves sufficiency.
To summarize,
A-
B==B
A==A
c==c
if and only if the degree condition (ClO.1.lOc) holds. Thus the system is identifiable if
and only if the delay and/or the regulator order are large enough.
II
S == G
== BF
For this type of regulator there is cancellation in the right-hand side of the identity
(ClO.1.9), which reduces to
ABF
q - k BG
==
C
or equivalently
B(AF -
C) ==
_q-k BG
(ClO.l.llb)
Since in this case AR + q-kBS = BC, a basic assumption of the analysis in Example
ClO. 1. 1 is violated here. The polynomial identity (ClO.l.llb) can be written as a system
of linear equations with na + nb + nc + 1 unknowns. The number of equations is
max(nb + na + k - 1, nb + nc, k + nb + deg G) = max(na + nb + k - 1, nb + nc, nb +
nc, na + k + nb - 1) = max(nb + nc, na + nb + k - 1). A necessary condition for
identifiability is then readily established as
na
+ nb + nc + 1
max(nb
+ nc, na + nb +
k - 1)
which gives
k ? nc
+2
(C10.1.11c)
Complement C10.1
419
Assume now that G, as defined by (ClO.l.lla), and B are coprime. (This assumption is
most likely to be satisfied.) Since B is a factor of the left-hand side of (~lO.l.11b) it must
also appear in the right-hand side. This is only possible by taking B == B. Next,
C == AF +
q-kG
(C -
C) ==
(A -
A)F
(CIO.l.lld)
C - C and thus C ==
C. Then it
AR
C==
+ q-kBS ==
CoH
(ClO.1.12)
PoH
Co(AR
-k
+ q BS) == PoC
A
(ClO.l.13)
C==
AR
CoM
(ClO.1.14)
+ q-kBS == PoM
for some polynomial M. The degree of M cannot exceed nh. Some straightforward
calculations give
H(AR
which implies
R(HA - MA) == q-kS(-HB
+ MB)
(ClO.1.1S)
= nc
- nh
420
Chapter 10
+ ns - na) - 1
(ClO.1.16)
Next we show that this condition is also sufficient for identifiability. We would first like to
conclude from (ClO.1.15) that
HA - MA
== 0
HB - MB
== 0
(ClO. l. 17)
Since Rand q-kS are coprime by assumption, (ClO.l.17) follows from (ClO.1.15)
provided that
deg(HA - MA) < deg q-kS
or
+ nh <
+ ns
or
nh
+ nb < nr
== HCoM == CM
(ClO.1.18)
Since M by assumption Al is the only common factor to AM, BM and CM, it follows
from (ClO.l.17), (ClO.1.18) that M == H and hence
A==A
B==B
C=C
This completes the proof that (ClO.1.16) is not only a necessary but also a sufficient
identifiability condition.
Note that, as before, identifiability is secured only if the delay and/or the order of the
regulator are large enough.
In the general case when the model polynomials have degrees fla ;?!: na, fib ;?!: nb,
flc ;?!: nc and delay k ::::; k the situation can never improve (see Soderstrom et al. (1975)
for details). The reason is easy to understand. When the model degrees are increased,
the number of unknowns will automatically increase. However, the number of 'independent' equations that can be generated from the identity (C10.1.9) mayor may not
increase. It will certainly not increase faster than the number of unknowns.
For illustration, consider the main result (ClO. 1. 16) for the two special cases discussed
previously in the examples.
In Example C10.1.1, (ClO.1.16) becomes
o ::;
max(m - n, k + m - n) - 1 = k + m - n - 1
+ k - 1 - nb, k + ns - na) - 1
with
ns
max(nc - k, na - 1)
Complement C10.l
421
+ nc - k, k - na + na - 1) - 1 = max(k - 1, nc - na) - 1
or equivalently
nc~k-2
Chapter 11
11.1 Introduction
In system identification both the determination of model structure and model validation
are important aspects. An overparametrized model structure can lead to unnecessarily
complicated computations for finding the parameter estimates and for using the estimated model. An underparametrized model may be very inaccurate. The purpose of
this chapter is to present some basic methods that can be used to find an appropriate
model structure.
The choice of model structure in practice is influenced greatly by the intended use of
the model. A stabilizing regulator can often be based on a crude low-order model,
whereas more complex and detailed models are necessary if the model is aimed at giving
physical insight into the process.
In practice one often performs identification for an increasing set of model orders (or
more generally, structural indices). Then one must know when the model order is
appropriate, i.e. when to stop. Needless to say, any real-life data set cannot be modeled
exactly by a linear finite-order model. Nevertheless such models often give good
approximations of the true dynamics. However, the methods for finding the 'correct'
model order are based on the statistical assumption that the data come from a true
system within the model class considered.
When searching for the 'correct' model order one can raise different questions, which
are discussed in the following sections.
Is a given model flexible enough? (Section 11.2)
Is a given model too complex? (Section 11.3)
Which model structure of two or more candidates should be chosen? (Section 11.5)
It
422
Section 11.2
423
Some typical tests will be described in Examples 11.1-11.3. The tests are formulated
for single output systems (ny = 1). For multivariable systems the tests have to be
generalized. This can be done in a fairly straightforward manner, but the discussion will
be confined to the scalar case to avoid cumbersome notation. The statistical properties of
the tests will be analyzed under the null hypothesis that the model assumptions actually
are satisfied. Thus all the distribution results presented below hold under a certain null
hypothesis.
Example 11.1 An autocorrelation test
This test is based on assumption AI. If E(t) is white noise then its covariance function is
zero except at L = 0:
424
Chapter 11
'* 0
l'
(11.2)
1
'E(L) = N
N-c
2:
+ L)e(t)
e(t
(11.3)
(=1
(cf. (3.25. Under assumption AI, it follows from Lemma B.2 that
Til') ~ 0
TE(O)
l'
,,2
'* 0
as N
Ef,2(t)
(11.4)
00
Te(L)
'E(O)
(11.5)
'*
According to (11.4) one can expect Xc to be small for 't 0 and N to be large provided
E(t) is white noise. However, what does 'small' mean? To answer that question a more
detailed analysis is necessary. The analysis will be based on assumption AI. Define
r=
~: 1) E(t)
1 N (E(t
N 2:
E(t - m)
(=1
('E~1))
:
(11.6)
TE(m)
where, for convenience, the inferior limit of the sums was set to 1 (for large N this will
have a negligible effect). Lemma B.3 shows that r is asymptotically Gaussian distributed.
(11.7)
where the covariance matrix is
lim ENrrf
N-~co
=
=
lim
N~oo
ff
IN
lim { N
1
N
EE(t - i)E(t)E(S)E(S - j)
s=1
(=1
(-1
2: 2:
N~oo
EE(t - i)E(t)E(S)E(S - j)
t=1 s=1
2: 2:
EE(t - i)E(t)E(S)E(S - j)
1=1 s=(+l
+
=
~ ~ EE(t -
lim
N->oo
{o +
i)E2(t)e(t - j) }
~f
(=1
,,40i
,j
Section 11.2
425
Hence
(11.8)
NrTrlr~(O) ~
x2 (m)
> 0 that
(11.10) I
P(x
> x~(m))
(11.11)
for some given a which typically is chosen between 0.01 and 0.1. Then, if
NrTr/f~(O) > x~(m)
NrTr/f~(O) o:S x~(m)
Evidently the risk of rejecting Ho when Ho holds (which is called the first type of risk) is
equal to a. The risk of accepting Ho when it is not true depends on how much the
properties of the tested model differ from Ho. This second type of risk cannot, in general,
be determined for the statistics introduced previously, unless one restricts considerably
the class of alternative hypotheses against which Ho is tested. Thus, in applications the
value of a (or equivalently, the value of the test threshold) should be chosen by
considering only the first type of risk. When doing so it should, of course, be kept in mind
that when a decreases the first type of risk decreases, but the second type of risk
increases. A frequently used value of a is a = 0.05. For some numerical values of xz,(m),
see Table B.1 in Appendix B.
426
Chapter 11
Xl:,
value. Since
v'Nx-r; - - S(O, 1) we have (for large N) X, ~ .%(0, liN) and hence P( IX, I ::s
1.96/v'N) = 0.95. The nun hypothesis (that rE(L) = 0) can therefore be accepted if
Ix,1 ::s 1.96/v'N (with an unknown risk), and otherwise rejected (with a risk of 0.05).
dis!
rw(L)
+ T)U(t) =
f,(t
(11,12)
There are good reasons for considering both positive and negative values of L when
testing the cross-correlation between input and residuals. If the model is not an accurate
representation of the system, it is to be expected that rw(t) for L ~ 0 is far from zero.
For T < 0, rw(T) mayor may not be zero depending on the autocorrelation of u(t);
for example, if u(t) is white then rfuC'L) = 0 for T < 0 even if the model is inaccurate,
Next, assume that the model is an exact description of the system. Then rw(T) = 0 for
T ~ O. However, it should be the case that rW(L) =1= 0 for L < 0, if there is feedback
acting on the process during the identification experiment. Hence one can also use the
test fof checking the existence of possible feedback (see Section 12.10 for a further
discussion of this point.)
As a normalized test quantity take
X
T
i\w(T)
(11.13)
-::-------'=-:C-'-:-::--;v;
where
flT)
1
N
N-max(,.O)
2:
1=
E(t
+ T)U(t)
(11.14)
1--min(O. T)
\
One can expect x, to be 'small' when N is large and assumptions A3 and A4 are satisfied.
To examine what 'small' means, one must perform an analysis similar to that of Example
11.1. For that purpose introduce
Ru = ~
f=m+1
(U(t
r = (fw('r: + 1)
, .. feller:
mT
~ (U(t - ~ - 1)
f= I
(11.15)
u(t - m)
0 for
::s
0)
E(t)
u(t - ;; - m)
Assume that E(t) is white noise (Assumption AI). In a similar way to the analysis in
Example 11.1, one obtains
Section 11.2
dis!
427
(11.16)
YNr - - JV(O, P)
with
t-1)
:
(u(t -
UC
lim E NrrT =
)...2
N~oo
1) ... u(t - m
(11.17)
u(t - m)
Thus from Lemma B .12
NrTp-1r ~
x2 (m)
and also
(11.18)
, /
dis!
dis!
--+
Remark 1. The residuals E(t, eN) differ slightly from the 'true' white noise sequence
E(t, 80 ), It turns out that this deviation, even if it is small and in fact vanishes as N --? 00,
invalidates the expressions (11.8) and (11.17) for the P matrices. As shown in Appendix
AI1.1, if E(t, eN) are considered, with eN being the PEM estimate of 80 , then the true
(asymptotic) covariance matrices are smaller than (11.8), (11.17). This means that if the
tests are used as described above the risk of rejecting Ho when it holds is smaller than
expected but the risk of accepting Ho when it is not true is larger. Expressed in another
way, if, in the tests above, E(t, eN) is used rather than E(t, 80), then the correct threshold
value, corresponding to a first type of risk equal to a, is smaller than x~(m). This fact is
not a serious practical problem, especially if the number of estimated parameters n8 is
much less than m (see Appendix A11.1 and in particular (Al1.1.18) for calculations
supporting this claim). Since the exact tests based on the expressions of the (asymptotic)
covariance matrices P derived in Appendix Al1.1 are more complicated than (11.9) and
(11.18), in practice the tests are often performed as outlined in Examples 11.1 and 11.2.
The tests will also be used in that way in some subsequent examples in this and the next
chapter.
II
Remark 2. The number m in the above correlation tests could be chosen from 5 up to
N/4. Sometimes the integer T in the cross-correlation test must be chosen with care. The
reason is that for some methods, for example the least squares method, fu('t) is
constrained by construction to be zero for some values of t. (For instance for the 'least
squares' model (6.12), fu(t) = for L = 1, ... , nb by the definition of the LS estimate;
III
cf. Problem 11.1. These values of t must not be considered in the r-vector.)
Chapter 11
428
O.
= {1
if E(i)E(i
0 if E(i)E(i
Then
N-l
:iN
2:
0i
i=1
(11.19)
:iN - - dt""(m, P)
with
m = E:iN
= (N -
P = E(:i N
m)2 = E:i~ - m 2 = E
2: 2:
;=1
= [(N -
1)2 - (N - 1)](EOi)2
= (N - 1)[E(0;) - (EOi)2]
OiOj - (N - 1)2/4
j=1
(N - l)E(oT) _. (N - 1)2/4
= (N --
1)/4
= NI4
According to this analysis we have (for large N)
__
_____________________________"
(11.~0
Section 11.2
- NI21
liNYNI2
429
:::::; 1.96
1.96
1.96
(11.21)
III
0.75
0.5
f,(1:)/7,(0)
0.25
o
o
J"-..-
~/
V
5
/'...
10
(a)
0.75
0.5
0.25
o
o
10
(b)
FIGURE 11.1 (a) Normalized covariance function of E(t). (b) Normalized covariance
function of u(t). (c) Normalized cross-covariance function.
Section 11.2
431
0.75
0.5
reu(t)/[f,,(O)rc(O) J1I2
0.25
+-------------~~~~~~----~
-10
10
(c)
The test quantity (11.18) is applied twice. First, the test quantity is used for T = 1
and m = 10. The numerical value obtained is 20.2. Since the test quantity is approximately X2 (1O) distributed, one finds from Table B.1 that the threshold value is
18.3 for a = 0.05 and 20.5 for a = 0.025. One can accept the null hypothesis (u(t) and
toes) uncorrelated for t ::::;: s) with a significance level of 0.025.
Second, the test is applied for T = -11 and m = 10. The numerical value obtained
was 296.7. Under the null hypothesis (u(t) and toes) uncorrelated) the test variable is
,approximately X2(10) distributed. Table B.1 gives the threshold value 25.2 for a = 0.005.
The null hypothesis is therefore rejected.
III
It should be noted that the role of the statistical tests should not be exaggerated. A
drastic illustration of the fact that statistical tests should not replace the study of plots of
measured data and model output and the use of common sense is given in the following
example.
Example 11.5 Statistical tests and common sense
The data were measured at a distillation column, u(t) being the reflux ratio and yet) the
top product composition. The following model structure was fitted to the data:
1
A(q-l)y(t) = B(q-l)U(t) + D(q 1) e(t)
with na = nb = nc = 2. A prediction error method was used to estimate the model
parameters. It was implemented using a generalized least squares algorithm (cf.
Model 1
Model 2
foCr)
1.0
1.0
1.0
1.0
FIGURE 11.2 Normalized covariance functions. The dashed lines give the 95 percent
significance interval.
50
100
150
200
::~rJ\~
5900
::~~
6300~
6100
Section 11.3
433
Complement C7.4). Depending on the initial conditions 8(0) for the numerical optimization, two different models were obtained. The correlation functions corresponding to
these two models are shown in Figure 11.2. From this figure both models seem quite
reasonable, since they both give estimated autocorrelation functions flr)!fiO) that
almost completely lie inside the 95 percent significance interval.
However, a plot of the measured signals and the model outputs (Figure 11.3) reveals
that model 1 should be selected. For model 2 the characteristic oscillating pattern in the
output, which is due to the periodic input used (the period of u(t) is about 80), is
modeled as a disturbance. A closer analysis of the loss function and its local minima
corresponding to the two models is given by Soderstrom (1974).
This example shows that one should not rely on statistical tests only, but should also
use graphical plots and common sense as complements. Note, however, that this
example is extreme. In most practical cases the tests on the correlation functions give a
+ Co(q-l)e(t)
(11.22)
where AO(q-l), BO(q--l) and CO(q-l) are assumed to be coprime. The system is further
assumed to be identified in the model structure
(11.23)
where
(1l.24)
Then it is easily seen that all models satisfying
A(q-l)
==
AO(q-l)M(q-l)
== BO(q-l)M(q-l)
C(q-l) == Co(q-l)M(q-l)
B(q-l)
M(q-l)
nm
(M(q-l)
= 1 + mlq-l + ... +
= min(na - nao, nb = 1 if nm = 0),
( 11.25)
rnnmq-nm
nb o, nc - ncO)
434
Chapter 11
give a correct description of the system, i.e. the corresponding values of 8 lie in the set
DT(J", A) (see Example 6.7). Note that ml, ... , mnm are arbitrary and that strict
inequalities (11.24) are crucial for nm :::;:, 1 to hold.
Assume that a prediction error method is used for the parameter estimation. The
effect of the overparametrization will be analyzed in two ways:
Determination of the global minimum points of the asymptotic loss function.
The singularity properties or, more generally, the null space of the information matrix.
To simplify the analysis the following assumptions are made:
The input signal is persistently exciting of order max(na + nb o, nao + nb). This type of
assumption was introduced in Chapter 7 (see assumption A2 in Section 7.5) when
analyzing the identifiability properties of PEMs.
The system operates in open loop. It was noted in Chapter 10 that a low-order
feedback will prevent identifiability. The assumption of open loop operation seems
only partially necessary for the result that follows to hold, but it will make the analysis
considerably simpler.
The two assumptions above are sufficient to guarantee that identifiability is not lost due
to the experimental condition. The point of interest here is to see that parameter identifiability (see Section 6.4) cannot hold true due to overparametrization of the model.
First evaluate the global minimum points. The asymptotic loss function is given by
vee)
= E2(t) =
E[C(~
I) {A(q-I)y(t) -- B(q-l)U(t)}
[ A(q-l) {Bo(q-l)
I) Ao(q I) u(t)
= EC(q
= E[e(tW
Co(q-l)}
B(q-I)
]2
Ao(q I) e(t) - C(q 1) u(t)
(11.26)
E [A(q-I)Bo(q-l) - AO(q-I)B(q-l) U(t)]2
Ao(q 1)C(q I)
The last equality follows since e(t) is white noise and therefore is independent of e(s),
s ~ t - 1. Thus the global minimum points must satisfy
[A(q-l)Bo(q-l) - AO(q-I)B(q-l)] u(t)
AO(q-I)C(q-l)
(w.p. 1)
(11.27)
(w.p. 1)
The true parameter values for which A(q-l) == AO(q--l), B(q-l) == BO(q-I), C(q-l) ==
CO(q-l), clearly form a possible solution to (11.27).
Since eel) is persistently exciting of any order (see Property 2 of Section 5.4), and u(t)
is persistently exciting of order p = max(na + nb o, nao + nb), it follows from Property 4
of Section 5.4 that
Section 11.3
1
Ao(q l)C(q 1) U(t)
435
1
Ao(q l)C(q 1) e(t)
and
are persistently exciting of order p and of any finite order, respectively. From Property 5
of Section 5.4 and (11.27) it follows that
== 0
(l1.28a)
A(q-l)CO(q-l) - Ao(q-l)C(q-l) == 0
(l1.28b)
A(q-1)Bo(q-l) - AO(q-l)B(q-l)
However, the identities (11.28) are precisely the same as (6.48). Therefore the solution is
given by (11.25). Thus (11.25) describes all the global minimum points of the asymptotic
loss function.
Next consider the information matrix. At this point assume the noise to be Gaussian
distributed. Since the information matrix, except for a scale factor, is equal to the
Hessian (the matrix of second-order derivatives) of the asymptotic loss function one can
expect its singularity properties to be closely related to the properties of the set of global
minima. In particular, this matrix must be singular when there are infinitely many nonisolated global minimum points.
The information matrix for the system (11.22) is given by (see (B.29) and
(7.79)-(7.81)):
I = A2 EtV(t)tV (t)
(11.29)
where
1
tV (t) = Co(q 1)
(11.30)
Let
X
= (al ...
ana
B1'"
Bnb
Y1'"
YnJ
a(q-l)
,=
2: aiq-i
nc
nb
B(q-l) =
2: 13;q-i
y(q-1)
;=1
i=l
2: Yiq-i
i=1
Ix
=0
(11.31)
1pT(t)X = 0
"*>
(w.p. 1)
(11.32)
It can be seen from (11.32) that if the system is controlled with a noise-free (low-order)
linear feedback ~(q-l)U(t) = a(q-l)y(t) with na < na and n~ < n~ then I is singular.
436
Chapter 11
Indeed, in such a case (11.32) will have (at least) one nontrivial solution a(q-l) =
= q-I~(q-l) and y(q-l) = O. Since J is proportional to the inverse of
the covariance matrix of the parameter estimates, its singularity implies that the system is
not identifiable. This is another derivation of the result of Section 10.3 that a system
controlled by a low-order noise-free linear feedback is not identifiable.
Now (11.22) and (11.32) give, under the general assumptions introduced above,
q-l6.(q-l), ~(q-l)
[a(q-l)
(w.p. 1)
(w.p. 1)
from which one can conclude as before (u (t), e(t) being pe) that
a(q-l)Bo(q-l) - AO(q-l)~(q-l)
== 0
(11.33)
a(q-l)Co(q-l) - AO(q-l)y(q-l) == 0
These identities are similar to (11.28). To see this more clearly, consider again (11.28)
and make the following substitutions:
A(q-l) = AO(q-l)
+ a(q-l)
=
C(q-I) =
BO(q-l)
~(q-l)
CO(q-l)
y(q-I)
B(q-I)
Then (11.28) is easily transformed into (11.33). Thus the properties of the null space of J
can be deduced from (11.25).
Note especially that if nm = 0 in (11.25) then x = 0 is the only solution to (11.31).
This means that the information matrix is nonsingular. However, if nm ;?; 1 then the
general solution to (11. 31) is given by
a(q-l)
==
Ao(q-l)L(q-l)
[)(q--l) == BO(q--l)L(q--I)
y(q-l) == Co(q-I)L(q-l)
where
L(q-l) = llq--l
- 1
Section 11.3
Note that when nc = nco = 0 ('the least squares case') there will be no pole-zero
cancellations or singular information matrices. In this case nm = 0 due to nc = nco
(see Problem 11.2).
II1II
R= ~
Z(t)F(q-l)T(t)
1=1
438
Chapter 11
Numerical problems
The main drawback of the methods described in this section lies in the difficulty of
designing good statistical test criteria and in the numerical problems associated with
evaluation of the rank of a matrix. The use of a singular value decomposition seems to be
the numerically soundest way for the test of non singularity or more generally for
determining the rank of a given matrix. (See Section A.2 for a detailed discussion of
singular value decomposition.) Due to their drawbacks these methods cannot be recommended as a first choice for SISO systems. For multivariable systems, however,
there are often fewer alternatives and rank tests have been advocated in many papers.
eN
eN.
wee)
W(8 o)
all 8
(11.35)
eN
The variance of the one-step prediction errors, when the model is applied to future
data. This possibility will be considered later in this section.
The variance of the multi-step prediction error (cf. Problem 11.5).
The deviation of the estimated transfer function from that of the true system. Such a
deviation can be expressed in the frequency domain.
Section 11.4
439
~-----------~----~-------------------------~8
eN
Assume that an optimal regulator is based on the identified model and applied to the
(true) system. If the model were perfect, the closed loop system would perform
optimally. Due to the deviation of the model from the true system the performance
of the closed loop system will deteriorate somewhat. One can take the performance
of the closed loop system as an assessment criterion. (See Problem 11.4 for an
illustration. )
In what follows a specific choice of the assessment criterion is discussed. Let W N be the
prediction error variance when the model corresponding to ON is used to predict future
data. This means that
(11.36)
In (11.36) and also in (11.37) below, the expectation is conditional with respect to past
data in the sense that
WN = E[2(t, ON)ION]
If the estimate ON were exact, i.e. ON = 80, then the prediction errors {E(t, 0N)} would
be white noise, {e(t)}, and have minimum variance A. Consider now how much the
prediction error variance W N is increased due to the deviation of 0N from the true value
00 , A Taylor series expansion of E(t, ON) around 80 gives
440
Chapter 11
= A
=
(1l.37)
The second term in (11.37) shows the increase over the minimal value A due to the
deviation of the estimate 8N from the true value 8 0 , Taking expectation of WN with
respect to the past data, which are implicit in 8N , gives
EWN = A + tr[E(8 N
00)(8 N
8N
(11.38)
Note that 'tjJ(t, 00 ) in (11.37), (11.38) is to be evaluated for the future (fictitious) data,
was found.
while 'tjJp(t, 8 0 ) in (11.39) refers to the past data, from which the estimate
If the same experimental condition is assumed for the past and the future data, then
the second order properties of 1V p(t, 8 0 ) and 'tjJ(t, 8 0 ) are identical. For such a case
(11.38), (11.39) give
eN
(11.40)
where p = dim O.
This expression is remarkable in its simplicity. Note that its derivation has not been
tailored to any special model structure. It says that the expected prediction error
variance increases with a relative amount of piN. Thus, there is a penalty in using models
with unnecessarily many parameters. This can be seen as a formal statement of the
parsimony principle.
In Complement C11.1 it is shown that the parsimony principle holds true under much
more general conditions. The expected value of a fairly general assessment criterion
N) will, under very mild conditions, increase when the model structure is expanded
beyond the true structure.
Wee
Section 11.5
associated with the (nested) model structures in question. This is conceptually the same
problem as that discussed in Section 4.4 for linear regressions.
TheF-test
Let Aland A 2 be two model structures, such that Ale A 2 (AI is a subset of j { 2; for
example, A1 corresponds to. a lower-order model than A2)' In such a case they are
called hierarchical model structures. Further let Vlv denote the minimum of V N(e) in the
structure j{; (i = 1, 2) and let Ai have Pi parameters. As for the static case treated in
Chapter 4, one may try
x
=N
v1-2 V~
(11.41)
VN
as a test quantity for comparing the model structures A] and A 2 . If x is 'large' then one
concludes that the decrease in the loss function from v1 to V~ is significant and hence
that the model structure j { 2 is significantly better than AI' On the other hand, when x
is 'small', the conclusion is that Aland ~It z are almost equivalent and according to
the parsimony principle the smaller model structure A1 should be chosen as the more
appropriate one.
The discussion above leads to a qualitative procedure for discriminating between j { 1
and A 2 . To get a quantitative test procedure it is necessary to be more exact about what
is meant by saying that x is 'large' or 'small'. This is done in the following.
First consider the case when J' ~ AI, i.e. Al is not large enough to include the true
system. Then the decrease v,k;- V~ in the criterion function will be 0(1) (that is, it does
not go to zero as N ~ (0) and therefore the test quantity x, (11.41), will be of magnitude
N.
vitI
L~
v,k;- V~ dist Z
V~ --7X (P2 - Pi)
(11.43) ]
The result (11.43) can be used to conceive a simple test for model structure selection. At
a significance level a (typical values in practice could range from 0.01 to 0.1) the smaller
model structure .At 1 is selected over A z if
x ~ X~(P2 - PI)
(11.44)
v,k;
~ VMl
(1/N)x~(P2 - PI)]
(11.45)
Chapter 11
442
The test (11.44) is closely related to the F-test which was developed in some detail in
Section 4.4 for the static linear regression case. It was shown there that
x
N- P2
N(P2 - PI)
is exactly F(P2 - P1, N - P2) distributed. This implies that x is asymptotically (for
N -7 (0) X2(P2 - PI) distributed as in (11.43) (see Section B.2 of Appendix B). Although the result on F-distribution refers to static linear regressions, the test procedure
is often called 'F-test' also for the 'dynamic' case where the X2 distribution is used.
where B(N, p) is a function of N and the number of parameters P in the model, which
should increase with p (in order to penalize too complex (overparametrized) model
structures in view of the parsimony principle) but should tend to zero as N -7 00 (to
guarantee that the penalizing term in (11.46) will not obscure the decrease of the loss
function VN(ON) with increasing underparametrized model structures). A typical choice
is B(N, p) = 2p/N.
An alternative is to use the criterion
(11.47)
where the additional term yeN, p) should penalize high-order models. The choice
yeN, p) = 2p will give the widely used Akaike's information criterion (AIC) (a justification for this choice is presented in Example 11.7 below):
(11.48)
It is not difficult to see that the criteria (11.46) and (11.47) are asymptotically
equivalent provided yeN, p) = N~(N, p). Indeed, for large N
log VN(ON)
1
Section 11.5
which shows that the logarithm of (11.46) asymptotically is proportional to (11.47). Since
the logarithm function is monotonically increasing, it follows that for large N the two
criteria will be minimized by the same model structures.
Several proposals for the terms yeN, p) in (11.47) have appeared in the literature. As
mentioned above, the choice yeN, p) = 2p corresponds to Akaike's information
criterion. Other choices, such as yeN, p) = p log Nand yeN, p) = 2pc log(log N), with
c ~ 1 being a constant, allow yeN, p) to grow slowly with N. As will be shown later in
this section, this is a way of obtaining consistent order estimates (cf. Example 11.9).
Example 11.7 The FPE criterion
Consider a single output system. The assessment criterion W N is required to express the
expected prediction error variance when the prediction of future data is based on the
model determined from past data. This means that
(11.49)
where the expectation is with respect to both future and past data. Since W N in (11.49) is
not directly measurable, some approximations have to be made. It was shown in Section
11.4 that (for a flexible enough structure)
W N = A(l
piN)
(11.50)
Now, A in (11.50) is unknown and must be replaced by some known quantity. Recall that
the loss function used in the estimation is
1 N
N 2.:
VN(8) =
I}(t, 8)
(11.51)
(=1
Hence the variance A could possibly be substituted by V N(e N)' However, V N(ON)
deviates from A due to the fit of 8N to the given realization. More exactly, due to the
fitting of the model Jlt(8 N ) to the realization at hand, VN C8 N ) will be a biased estimate
of A. To investigate the asymptotic bias VN(8 N) - A, note that
(11.52)
Here the first term is a constant. The expected value of the second term is zero, and the
expected value of the third term can easily be evaluated. Recall from (7.59) that
VN(8 N
Hence
80 ) ~ .K(O, 2A[V~(8oW-l)
444
-:21 E
tr{(8 o - 8 N )(8 o - 8 N )
Chapter 11
V~(8())}
-~ tr{2A[V~(8o)rlV"oo(8o)IN}
-A piN
It follows from the above calculations that an asymptotically unbiased estimate of A is
given by
(11.53)
The above estimate is identical to the unbiased estimate of A derived in the linear
regression case (see (4.13. Note that in contrast to the present case, for linear regressions the unbiasedness property holds for all N (not only for N ~ (0).
Combining (11.50) and (11.53) now gives a realizable criterion, namely
W N
(8 N ) ~J!H:i
~
1 - piN -
(10]
FPE
This is known as the final prediction error (FPE) criterion (Davisson, 1965, Akaike,
1969). Note that for large N (11.54) can be approximated by
WN
VN(8 N )[ 1 +
~/;NJ
~(N,
VNC8N)[ 1 +
p)
= 2pl N.
~]
2p
N.
Section 11.5
When the test quantity x is used, the smaller model structure, .At 1, is selected if
(11.55)
(see (11.45).
Assume that one of the criteria W N , (11.46), (11.47), is used to select one of the two
model structures .Al] and .Al2 with j{l C .Al2. If the criterion (11.46) is used, o/f{l is
selected if
i.e. if
(11.56)
Comparing (11.55) and (11.56), it is seen that the criterion (11.46) used as above can be
interpreted as a X2 test with a prespecified significance level. In fact it is easy to show that
X2(p _ P ) = N ~(N, P2) - ~(N, P1)
u
2
. 1
1 + ~(N, PJ)
Jill
(11.57)
is selected if
V N1
,:;;
(11.58)
By comparing (11.55) and (11.58) it can be seen that for the problem formulation
considered, the X2 test and the approach based on (11.47) are equivalent provided
X~(P2 - PI)
=N
(11.59)
The above calculations show that certain significance levels can be associated with the
FPE and AlC criteria when these are used to select between two model structures. The
details are developed in the following example.
Example 11.8 Significance levels for FPE and Ale
Consider first the FPE criterion. From (11.46) and (11.54),
~(N,
p)
= 2p/(N - p)
2(
Xu P2
_
/h
2N(p2 - PI)
N (N + PI)(N _ P2) = 2(P2 - PI)
446
Chapter 11
y(N, p) = 2p
and (11.59) gives
(11.61)
where again the approximation holds for large values of N. When P2 = P1 + 1 the risk of
overfitting is 15.7 percent, exactly as in the case of the FPE (this was expected since as
shown previously, AIC and FPE are asymptotically equivalent). Note that the above risk
of overfitting does not vanish when N ~ 00, which means that neither the AIC nor the
FPE estimates of the order are consistent.
III
As a further comparison of the model order determination methods introduced above,
consider the following example based on Monte Carlo simulations.
Example 11.9 Numerical comparisons of the F-test, FPE and Ale
The following first-order AR process was simUlated, in which e(t) is white noise of zero
mean and unit variance:
(11.62)
generating 100 different long realizations (N = 500) as well as 100 short ones (N = 25).
First- and second-order AR models, denoted below by j{t and j{z, were fitted to the
data using the least squares method. The criteria FPE and AIC as well as the F-test
quantity were computed for each realization. Using the preceding analysis it will be
shown that the following quantities are theoretically all asymptotically X2 (1) distributed:
f ~ vJv;~ V~
(N - 2)
(11.63)
(11.64)
(11.65)
That
f ~ X2(1)
= vJv =
V~ and use
Section 11.5
N[FPE(Ji t )
=
FPE(.A2 )] + 2
N [Viv(1 +
2PI ) N - PI
= N viv - V~ + 2N[Viv
A
I - 2[P2 - pd + 2
V~(l
2P2
N - P2
PI
N - PI
dist
V~
)l/A
l/A
+2
P2
N - P2
+2
1 - X-(1)
To derive the asymptotic distribution of (11.65), use (11.43) once more and proceed as
follows.
AlC(,Ad - AIC(Ji2 ) + 2
V~ - 2P2
viv
+2
= I ~X2(1)
d
1.0
Relative
frequency
0.5
500
(a)
FIGURE 11.5 (a) Normalized curves for AIC (.All) -- AIC (.Al 2 ) + 2 (the AIC will
select first-order models for the realizations to the left of the arrow). (b) Normalized
curves for N[FPE(.Ald - FPE(vPi 2 )] + 2 (the FPE criterion will select first-order
models for the realizations to the left of the arrow). (c) Normalized curves for f (the Ftest with a 5 percent significance level will select first-order models for the realizations
to the left of the arrow).
1.0
Relative
frequency
"'-_ .....Vj./'
_ _ .J
E,,,,,,im,,'" ",ultfoe N
2l
= 500
,.,.'
2
(b)
1.0
Relative
frequency
0.5
Experimental
result for N ~~ 500
3
( c)
Section 11.5
It can be seen from Figure 11.5 that the experimental distributions, except for that
of the FPE for the case N = 25, show very good agreement with the theoretical X2 (1)
distribution. Note also that the quite short realizations with N = 25 gave a result close
to the asymptotically valid theoretical results.
It was evaluated explicitly how many times a second (or higher) order model has been
chosen. The numerical results, given in Table 11.1, are quite congruent with the theory.
TABLE 11.1 Numerical evaluation of the risk of overfitting
when using Ale, FPE and the F-test for Example 11.9
Number of
samples
Criterion
N = 500
AIC
FPE
AIC
FPE
0.20
0.20
0.04
(0.16)
(0.16)
(0.05)
0.18
0.23
0.09
F-test
=
25
F-test
II
Consistency analysis
In the previous subsection it was found that the FPE and AIC do not give consistent
estimates of the model order. There is a nonzero risk, even for a large number of data
points, of choosing too high a model order. However, this should not be seen as too
serious a drawback. Both the FPE and AlC enjoy certain properties in the practically
important case where the system does not belong to the class of model structures
considered (see Shibata, 1980; Stoica, Eykhoff et at., 1986). More exactly, both AlC
and FPE determine good prediction models no matter whether or not the system belongs
to the model set.
In the following it will be shown that it is possible to get consistent estimates of the
model order by using criteria of the type (11.46) or (11.47). (An order selection rule is
said to be consistent if the probability of selecting a wrong order tends to zero as the
number of data points tends to infinity.)
It was found previously that the selection between two model structures with the
criteria (11.46) and (11.47) is equivalent to the examination of the inequality (11.55) for
X~(P2 - Pi) given by (11.57) and (11.59), respectively. The analysis that follows is based
on this simple but essential observation.
Consider first the risk of overfitting of the selection rules (11.46) or (11.47). Therefore
assume J E ./lt l C A 2 . The probability of choosing the too large model structure
will then be
P(VJv> VJ,;(1
>
X~(P2 - PI = a
where x is the F-test quantity defined in (11.43). To eliminate the risk of overfitting a
must tend to zero as N tends to infinity, or equivalently
450
Chapter 11
(11.66)
~ 00
i ,All
Jt 2. The probability
In this case, when the system does not belong to the model set, the difference vJv - V~
should be significant. More exactly, vJv - V~ = 0(1), i.e. vJv - V~ does not tend
to zero as N tends to infinity; otherwise the two models J{ I and .Al2 would be
(asymptotically) equivalent which would be a contradiction to the working hypothesis.
This implies that N(VJv - V~ )!V~ is of order N and thus X~(P2 - PI) must be small
compared to N, i.e.
as N
-7
(11.67)
00
The following example demonstrates how the requirements (11.66), (11.67) for a consistent order determination can be satisfied by appropriate choices of ~(N, p) or yeN, p).
Example 11.10 Consistent model structure determination
Consider first the criterion (11.46). Choose the term f3(N, p) as
~(N,
p)
(11.68a~
= ~ feN)
Itf(N) = 1 this gives approximately the FPE criterion. Now suppose thatf(N) is chosen
such that
feN)
00
and
fCN)/N
as N
~ 00
There are many ways to satisfy this condition. The functions fl (N)
log N are two possibilities.
Inserting (11.68a) into (11.57), for large N
(11.68b)
V Nand heN)
(11.69)
It is then found from (11.68b) that the consistency conditions (11.66), (11.67) are
satisfied.
Next consider the criterion (11.47). Let the term yeN, p) be given by
yeN, p)
2pg(N)
(11.70a)
Problems
The choice g(N)
g(N)
451
00
g(N)/N
as N
~ 00
(11.70b)
= N[y(N, P2)
(11.71)
It is then easily seen from (11.70b) that the consistency conditions (11.66), (11.67)
are satisfied.
II
Summary
Model structure determination and model validation are often very important steps in
system identification. For the determination of an appropriate model structure, it is
recommended to use a combination of statistical tests and plots of relevant signals. This
has been demonstrated in the chapter, where also the following tests have been
described:
.. Tests of whiteness of the prediction errors and of uncorrelatedness of the prediction
errors and the input (assuming that the prediction errors are available after the
parameter estimation phase.)
.. Tests for detecting a too complex model structure, for example by means of pole-zero
cancellations and/or singular information matrices. (The application of such tests is to
some extent dependent on the chosen class of model structures.)
II Tests on the values of the loss functions corresponding to different model structures.
(These tests require the use of a prediction error method.) The x2-test (sometimes also
called - improperly, in the dynamic case - the F-test) is of this type. It can be used to
test whether a decrease of the loss function corresponding to an increase of the model
structure is significant or not. The other tests of this type which were discussed utilize
structure-dependent terms to penalize the decrease of the loss function with increasing
model structure. These tests were shown to be closely related to the X2-test. By
properly selecting some user parameters in these tests, it is possible to obtain
consistent structure selection rules.
Problems
Problem 11.1 On the use of the cross-correlation test for the least squares method
Consider the ARX model
A(q-l)y(t)
where
B(q-I)U(t)
vet)
452
=
=
alq--l
b,q-l
+ ... +
+ ... +
Chapter 11
anaq-na
bnbq-nb
Assume that the least squares method is used to estimate the parameters a], ... , ana, b I ,
... , b nh . Show that
1, ... , nb
O.
Remark. Note that the problem illustrates Remarks 1 and 2 after (11.18). In particular
it is obvious that rElICt) and 1'",,(.) have different distributions.
BO(q-l)U(t)
e(t)
A o, Bo coprime
deg Ao
nao, deg Bo = nb o
deg A = na
E(t)
nao, deg B = nb
nb o
Assume that the system operates in open loop and that the input signal is persistently
exciting of order nb. Prove that the following results hold:
(a) The asymptotic loss function, EE2(t) , has a unique minimum.
(b) The information matrix is nonsingular.
(c) The estimated polynomials A and B are coprime.
Compare with the properties of ARMAX models, see Example 11.6.
Problem 11.3 Variance of the prediction error when future and past experimental
conditions differ
Consider the system
yet)
ayCt - 1) = bu(t - 1)
e(t)
Assume that the system is identified by the least squares method using a white noise
input u(t) of zero mean and variance 0 2 . Consider the one-step prediction error variance
as an assessment criterion. Evaluate the expected value of this assessment criterion
when u(t) is white noise of zero mean and variance a 2 0 2 (a being a scalar). Compare
with the result (11.40).
Remark. Note that in this case 'tjJ(t, So) =1= 'tjJp(t, 8 0 ) in (11.38), (11.39).
Problems
453
yet) + ay(t - 1)
bu(t - 1)
e(t)
ce(t - 1)
The minimum variance regulator based on the first-order ARMAX model above is
a - c
u(t) = - - - yet)
b
(i)
aoy(t - 1) = bou(t- 1)
eel)
coe(t - 1)
(ii)
Consider the closed loop system described by (i), (ii) and assume it is stable. Evaluate
the variance of the output. Show that it satisfies the property (11.35) of assessment
criteria W N. Also show that the minimum ofWN in this case is reached for many models.
Problem 11.5 Variance of the multi-step prediction error as an assessment criterion
(a) Consider an autoregressive process
AO(q-l)y(t)
= e(t)
= E(t)
of the process has been obtained with the least squares method. Let y(t + kl t)
denote the mean square optimal k-step predictor based on the model. Show that
when this predictor is applied to the (true) process, the prediction error variance is
W N = E[Fo(q-l)e(tW
+E
[Go(q:~(; ~(q-l)
e(t)
== Ao(q-l)Fo(q-l) + q-kGo(q-l)
== A(q-l)F(q-l) + q-kG(q-l)
deg Fo
deg F
= k -
EW~) = A2(1
".2 6)
+ a6) + 'N(4a
454
Chapter 11
e white noise
ce(t - 1)
Assume that this process is identified by the least squares method as an autoregressive
process
yet)
+ ... +
aty(t - 1)
any(t -- n)
-i>
E(t)
00).
(a) Find the prediction error variance EE2(t) for the model when n = 1, 2, 3. Compare
with the optimal prediction error variance based on the true system. Generalize this
comparison to an arbitrary value of n.
Hint. The variance corresponding to the model is given by
min
EE2(t) =
E[y(t)
aly(t - 1)
+ ... +
any(t -
n)f
(b) By what percentage is the prediction error variance deteriorated for n = 1, 2, 3 (as
compared to the optimal value) in the following two cases?
c
0.5
Case II: c
1.0
Case I:
aoy(t - 1) = e(t)
laol
< 1
E(t)
E(t)
Problems
./: yet) = bu(t - 1)
455
+ e(t)
where e(t) and u(t) are mutually independent zero-mean white noise sequences. The
variance of e(t) is 1,.2. Consider also the two model structures
AI: yet)
2:
y(t)
bluet - 1)
The estimates of the parameters of '~l and A 2 are obtained by using the least squares
method (which is identical to the PEM in this case).
0 2,
YN/",
(Ab-b
a )
for
'~2
= E {E[EAt(t, SAt W}
The inner expectation is with respect to e(t), and the outer one with respect to SAt.
Determine asymptotically (for N ~ (0) valid approximations for AAt, and A4't,
Show that the inequality AAt, ~ A4t', does not necessarily hold. Does this fact
invalidate the parsimony principle introduced in Complement Cl1.1?
Problem 11.9 On testing cross-correlations between residuals and input
It was shown in Example 11.2 that
dist
YNx, - - fiCO, 1)
where
X-c
fu('t)
-'-[1'-8(-0)=1'u-'-(-'-O)--,--]1""/2
Hence, for every L it holds asymptotically with 95 percent probability that Ix., I ~
1.96/V'N. By analogy with (11.9), it may be tempting to define and use the following test
quantity:
m
y gN
~1 X~+k =
N
fcCO)fu(O)
~1 f~u(L +
Ie)
instead of (11.18). Compare the test quantities y above and (11.9), Evaluate their means
and variances.
456
Chapter 11
Hint. First prove the following result using Lemma B.9. Let x ~ ,K(O, P) and set z =
xTQx. Then Ez = tr PQ, var(z) = 2 tr PQPQ. Note that z will in general not be X2
distributed.
Problem 11.10 Extension of the prediction error formula (11.40) to the multivariable case
Let {(t, 80 )} denote the white prediction errors of a multivariable system with (true)
parameters 8 0 , An estimate 8of 80 is obtained by using the optimal PEM (see (7.76) and
the subsequent discussion) as the minimizer of det{L:;:l (l, 8)T(t, 8)}, where N denotes
the number of data points. To assess the prediction performance of the model introduce
where the expectation is with respect to the future data used for assessment. Show that
the expectation of W,y(8) with respect to the past data used to determine 8, is
asymptotically (for N - ? (Xl) given by
EWN(8)
(det A)(1
+ piN)
(here A is the covariance matrix of (t, 8 0 ), and p = dim 8) provided the past
experimental conditions used to determine 8 and the future ones used to assess the
model have the same probability characteristics.
Bibliographical notes
Some general comparisons between different methods for model validation and model
order determination have been presented by van den Boom and van den Enden (1974),
Unbehauen and Gohring (1974), Soderstrom (1977), Andel et al. (1981), Jategaonkar
et al. (1982), Freeman (1985), Leontaritis and Billings (1987). Short tutorial papers on
the topic have been written by Bohlin (1987) and Soderstrom (1987). A rather extensive
commented list of references may be found in Stoica, Eykhoff et al. (1986). Lehmann
(1986) is a general text on statistical tests.
(Section 11.2) Tests on the correlation functions have been analyzed in a more general
setting by Bohlin (1971, 1978). Example 11.5 is taken from SMerstrom (1974). The
statistic (11.9) (the so-called 'portmanteau' statistic) was first analyzed by Box and Pierce
(1970). An extension to nonlinear models has been given by Billings and Woon (1986).
(Section 11.3) The conditions for singularity of the information matrix have been
examined in Soderstrom (1975a), Stoica and Soderstrom (1982e). Tests based on the
instrumental product matrix are described by WeUstead (1978), Wellstead and Rojas
(1982), Young et at. (1980), Soderstrom and Stoica (1983), Stoica (1981b, 1981c, 1983),
Fuchs (1987).
(Section 11.4) For additional results on the parsimony principle see Stoica et at.
(1985a), Stoica and Soderstn')m (1982f), and Complement Cll.1.
(Section 11.5) The FPE and AIC criteria were proposed by Davisson (1965) and
Akaike (1969,1971). See also Akaike (1981) and Butash and Davisson (1986) forfurther
discussions. For a statistical analysis of the AIC test, see Shibata (1976), Kashyap (1980).
Appendix All.l
457
The choice yeN, p) = P log N has appeared in Schwarz (1978), Rissanen (1978, 1979,
1982), Kashyap (1982), while Hannan and Quinn (1979), Hannan (1980,1981), Hannan
and Rissanen (1982) have suggested yeN, p) = 2 cp log (log N). Another more pragmatic
extension of the AIC has been proposed and analyzed in Bhansali and Downham (1977).
Appendix All.l
Analysis of tests on covariance functions
This appendix analyzes the statistical properties of the covariance function tests
introduced in Section 11.2. To get the problem in a more general setting consider
random vectors of the form
= - .- L z(t,
YN
(All.l.l)
8)(1, 8)
1=1
where z(t, 8) is an (m 11) vector that might depend on the estimated parameter vector 8
of dimension n8. The prediction errors evaluated at 8, i.e. the residuals, are denoted
(t, 8). For the autocovariance test the vector z(t, 8) consists of delayed residuals
, ((t-.l,8)
z(t, 8)
(A11.1.2a)
(1 - m,
8)
(U(t - ~1.:
z(t, 8)
8)
1). )
(Al1.1.2b)
u(t - 1} - m)
Note that other choices of z and E besides those above are possible. The elements of
z(t, 8) may for example contain delayed and filtered residuals or outputs. Moustakides
and Benveniste (1986), Basseville and Benveniste (1986) have used a test quantity of the
form (AI1.1.I) to detect changes in the system dynamics from a nominal model by using
delayed outputs in z(t).
In Section 11.2 the asymptotic properties of r were examined under the assumption
that
= 80 , As will be shown in this appendix, a small (asymptotically vanishing)
deviation of the estimate 8 from the true value 80 will in fact change the asymptotic
properties of the vector r.
In the analysis the following mild assumptions will be made.
AI: The parameter estimate e is obtained by the prediction error method. Then
asymptotically for large N (cf. (7.58))
1
8 - 80 = R;;l N
A
L
1=1
1jJ(t)e(t)
(AIl.1.3)
458
Chapter 11
where
1jJ(t) = _ (aE(t, 8) I
a8
)1'
(Al1.1.4)
0=0"
v'N
= _1_
v'N
= (1
[z(t, ( 0 ) +
aZ~8 8)1_
e-e o
t=l
(All. 1.5)
(=1
-R. R- 1)
ztp
tV
~- ~
v' N
(Z(t, ( 0
1jJ(t)
) e(t)
where
RztV
= Ez(t, 8o)1jJ1'(t)
(All.1.6)
(Al1.1.7)
where
(A11.1.8)
= J..,.2(R z - RZt/JR:;p 1 R'Ij'z)
and R z = Ez(t, 8o)z T(f, 80), R'Ij)z = R;~" and J..,.2 = Ee2(t). Note that P r differs from J..,.2Rn
which is the result that would be obtained by neglecting the deviation of 8 from 80
(cf. Section 11.2 for detailed derivations). A result of the form (AIL 1.8) for pure time
series models has also been obtained by McLeod (1978).
Next examine the test statistic
(AI 1.1.9)
which was derived in Section 11.2 under the idealized assumption 8 = 80 , In practice the
test quantity (A 11.1. 9) is often used (with J..,. 2 and R z replaced by estimates) instead of
the 'exact' X2 test quantity rTp; I r, which is more difficult to evaluate. It would then be
valuable to know how to set the threshold for the test based on (All.l.9). To answer that
question, introduce Xu through
(All. 1. 10)
Appendix All.1
459
In the idealized case where P, = ".2R z one would apparently have Xu = x~(m). For
(AlLL7), (AILL8) this holds true only if R Z 1j} = O. This condition is not satisfied with
the exception of the autocovariance test for an output error model:
yet)
model:
la I <
Ee 2 (t) = 1
(All.1.ll)
identified as a second-order autoregressive process. Let the z(t) vector consist of delayed
prediction errors, (cf. (AlL1.2a). Then
Rz = J
(1
0 ... 0) - (1
l --za2
0) -1
- a
= (1 0 ... 0) - (1 0) (
1
-a
-a
0
2
1- a
0 1 ... am -
0
2
a(l - a )
Hence the matrix P r is singUlar. A more direct way to realize this is to note that the
elements of the vector
(ZT(t)
..
= r;1I2
(All.1.l2)
f~
uV(O, J)
(A11.lol3)
460
Chapter 11
and
(All.l.14a)
where
(All.l.14b)
() =
=
det[s1 -- J]
det[s1 - 1
= (s -
det[s1 - p;i2Jp;.-1/2]
RZ1j,R;plR1/!zR;I]
l)tn de{1
+s
1 Rzt"R;plRtl,zR;l
(All.1.IS)
Here use has been made of the fact that J and TJT- 1 for some nonsingular matrix T,
have the same eigenvalues, and also the Corollary of Lemma A.S. From (All.1.1S) it
can be concluded that J has m - nO eigenvalues at s = 1. The remaining nO eigenvalues
satisfy
s(1)
1I2 R
1i2 ]
- 1 - s [RR-1R
R-1j'
1!'Z
Z
Z1i'
1j'
~ 1
~.
(All.1.16)
-1'
X ~ r
(I
0) _
m - nfl
dist
2(
r~x m
(Al1.1.17)
- nO )
(A.l1~
-------------------------
Remark. If R Z1I , has rank smaller than nO and this is known, then a tighter lower bound
on Xu can be derived. When rank R z1f, = q < nO, the matrix R;pli2R'fJzR;IRz1/!R;/2
which appears in (All.1.16) will be of rank q and will hence have nO - q eigenvalues in
the origin. Hence one can substitute m - nO by m - rank RZ1/! in (Al1.1.17), (AI1.1.I8).
III
Roughly speaking, the system parameters will determine where in the interval
(All.l.18) Xu will be situated. There is one important case where more can be said.
Consider a pure linear time series model (no exogenous input) and let z(t) consist of
delayed prediction errors (hence the autocorrelation test is examined). Split the vector
Rz
= A21
Rz'tj!
A2BT
R",
A2BBT + A2b.
where B is an (n8Im) constant matrix and 11b.11 tends to zero exponentially as m tends to
infinity, since the elements of ~ do so. From (All. 1.16),
s(J)
1 - s[(BBT
+ b.)-lBBT]
(A11.1.19)
= x~(m - n8)
00.
for m large
(AIL 1.20)
Such a result has been derived by Box and Pierce (1970) for ARIMA models using a
different approach. (See Section 12.3 for a discussion of such models) ..
Another case where (All.1.20) holds is the cross-correlation test for an output error
model. Then z(t, 8) is given by (Al1.1.2b) with t1 = O. In this case
Appendix Al1.2
Asymptotic distribution of the relative decrease in the criterion function
This appendix is devoted to proving the result (11.43).
Let 8L denote the true parameter vector in model structure j{i and let 8iv denote the
estimate, i = 1, 2. For both model structures 8b describes exactly the true system and
thus f(t, 8i) is the white noise process of the system. Hence
(Al1.2.1)
where vjy() denotes the loss function associated with Mi'
Next consider the following Taylor series expansion:
462
Chapter 11
(Al1.2.2)
Here use has been made of the fact that V;J(8hr) = () since ON minimizes VhrC8 i ). Note
also that the substitution above of v:0(Ohr) by V',~!(8h) has only a higher-order effect.
Now recall from (7.58) that
ON - 8 0 = - [V'~(eo)]-1[V1(e()]T
(Al1.2.3)
Using (Al1.2.1)-(Al1.2.3),
v1-
V~ = ~(e5
O~)TV::02(85)(85
O~)
V,0(85)[V::02(85)rl[v,0(85)]T
To continue it is necessary to exploit some relation between V1(8 1) and V~(82). Recall
that "It) is a subset of A 2 . One can therefore describe the parameter vectors in Aj by
appropriately constraining the parameter vectors in A 2 . For this purpose introduce the
function g:AJ -l> A2 (here Ai are viewed as vector sets) which is such that the model
A2 with parameters 8 2 = g(8 1 ) is identical to utt 1 with parameters 8 1 . Define the
derivative
(Al1.2.S)
85
= g(86)
V1(8 1) = V~(g(81
VJJ(8 1 ) = V!J(g(8 1S(6 1)
VI-}(oA) = v;J(oii)s(oA)
v'~l(oA)
(Al1.2.6)
= sT(oA)v::'?(8ii)s(ob)
V N = v;J(oii)
s(8b)
vA,.-
VFv =
=I
(Al1.2.7)
VN[P - S(STp-1S)-lST](VN)T
(All.2.8)
z = _1_ pl/2VN(VN)T
(All.2.9)
Set
V(2A)
where p1l2 is a matrix square root of P (i.e. P ~ pT/2p1l2). Then from (All.2.8),
(All.2.9)
dis!
Z --
(Al1.2.1O)
.%(0, I)
vA,.-
VFv =
~ 2~2 ZTp-T/2[p
_ S(STp- 1S)-lST]p-1I2Z
(AI1.2.ll)
=
where
A
P- TI2 S
tr[1 - A(ATA)-lAT]
(AIl.2.l2)
464
Chapter 11
1
. 2
~
V Kr - V~ _
(VN - V N ) ~ N
V~
-
dist
(AIlo2.13)
Complement Cll.l
A general form of the parsimony principle
The parsimony principle was used for quite a long time by workers dealing with empirical
model building. The principle says, roughly, that of two identifiable model structures
that fit certain data, the simpler one (that is, the structure containing the smaller number
of parameters) will on average give better accuracy (Box and Jenkins, 1976). Here we
show that indeed the parsimony principle holds if the two model structures under
consideration are hierarchical (i.e. one structure can be obtained by constraining the
other structure in some way), and if the parameter estimation method used is the PEM.
As shown in Stoica and Soderstrom (1982f) the parsimony principle does not necessarily
hold if the two assumptions introduced above (hierarchical model structures, and use of
the PEM) are relaxed; also, see Problem 11.8.
Let {..41(t, S.Al)} denote the prediction errors of the model ~4l(S~u). Here B.Al
denotes the finite-dimensional unknown parameter vector of the model. It is assumed
that B.4't belongs to an admissible set of parameters, denoted by D..41. The parameters 8..41
are estimated using the PEM
(CI1.lo1)
where N denotes the number of data points. Introduce the following assumption:
A: There exists a unique parameter vector in D~, say 8~, such that .Al (l, 8.?u)
= e(t)
where e(t) is zero-mean white noise with covariance matrix A > 0 (this assumption
essentially means that the true system belo~gs to ,At).
Then it follows from (7.59), (7.75), (7.76) that the asymptotic covariance matrix of the
estimation error V N(O..41 - 8.~) is given by
P~ = {E[C!....Al(t,
a8
B1JT
e=e.~
A-l aEdlCl,
a8
8)1
e=e~
}-l
(CI1.1.2)
Next we discuss how to assess the accuracy of a model structure </It. This aspect is
essential since it is required to compare model structures which may contain different
numbers of parameters. A fairly general measure of accuracy can be introduced in the
following way. Let W.u(8) be an assessment criterion as introduced in (11.35). It is
hence a scalar-valued function of 8 E D~, which is such that
Complement Cll.l
inf W/ft(8)
WJ(8~f)
465
(Cll.I.3)
8EDd{
Any norm of the prediction error covariance matrix is an example of such a function
WI(, but there are many other possible choices (see e.g. Goodwin and Payne, 1973;
Soderstrom et at., 1974). w(!141) can be used to express the accuracy of the model
.At ({tJ{). Clearly W.tI(8df ) depends on the realization from which fttl was determined.
To obtain an accuracy measure which is independent of the realization, introduce
(Cl1.1.4)
where the expectation is with respect to 8~. In the general case this measure cannot be
evaluated exactly. However, an asymptotically valid approximation of (Cll.1.4) can
readily be obtained (d. (11.37)). For N sufficiently large, a simple Taylor series
expansion around e.l~ gives
The second term in (Cll.1.5) is equal to zero, since according to (CI1.1.3) W4-{(8.(~)
O. Define
________
(C_1L L6]
where P is given by (Cl1.1.2).
Now, let .At 1 and ,At2 be two model structures, which satisfy assumption A.
be such that there exists a differentiable function
Furthermore, let .At 1,
g(8): Ddt,
-~
Ddt,
g(8) == Evll,(t, 8)
(C11.I.7)
(Note that the function gee) introduced above was also used in Appendix All.2.) That is
to say, any model in the set ~..ft I belongs also to the set ~t 2 which can be written as
(Cl1.1.8)
The model structures v1 and ,At2 satisfying (C11.1.8) will be called hierarchical or
nested. Clearly for hierarchical model structures the term WJt (8,lj{) in (Cll.1.6) takes
the same value. Hence, the criterion fJ{ can be used for comparing uti 1 and c2'
We can now state the main result.
466
Chapter 11
Lemma Cll.l.!
Consider the hierarchical model structures ./It 1 and ./It2 (with JI{ 1 c ./It2) the parameters
of which are estimated by the PEM (Cll.l.I). Then Jl{1 leads on the average to more
accurate results than Jl{2 does, in the sense that
(Cll.l~
Proof. Introduce the following Jacobian matrix:
Sf;;
[aga8(8)] Ie~(J~K,
8)1
o=OI1{(, -
a.4t,(t,
ag
g)1
g(J)~g(o'!.-,)s
(CIlolo1O)
Similarly, from
ae
and
(Cll.1. ll)
Since 8.~, = g(8:!il) and since W4t, achieves its minimum value at 8'~2' it follows that
aW4t,(g)!aglgc_g(H"It,) = O. Thus, from (CIl.Lll),
(CIlol.12)
It follows from (Cll.1.2), (Cll.lo6), (Cllolo1O) and (CI1.1.12) that the following
inequality must be proved:
(Cl1.l.13)
where
H ~
- WI!
.4t, (8vIf, )
p f;; ?4t,
Complement Cll.l
o~
467
tr[HP] -- tr[(STHS)(STp-1S)-1]
= tr{H[P - S(STp-1S)-IST]}
(Cll.1.14)
P - S(STp-1S)-lST
e;T{I - (e;-TS)(STe;-le;-TS)-l(STe;-I)}e;
e;T[I - A(A1A)-lAT]e;
where A = e;-Ts. Since the matrix 1- A(A 1A)-lA T is idempotent, it follows that Q is
nonnegative definite (see Section A.S of Appendix A). Thus, Q can be factored as
Q = rTr. From this observation and (Cll.l.14) the result follows immediately:
o=
tr HQ
tr
HrTr =
tr
r HrT ;: ;: 0
III
The parsimony result established above is quite strong. The class of models to which it
applies is fairly general. The class of accuracy criteria considered is also fairly general.
The experimental conditions which are implicit in PJi (the experimental condition during
the estimation stage) and in W:ft (the experimental condition used for assessment of the
model structure) were minimally constrained. Furthermore, these two experimental
conditions may be different. That is to say, it is allowed to estimate the parameters using
some experimental conditions, and then to assess the model obtained on another set of
experimental conditions.
Chapter 12
12.1 Introduction
The purpose of this chapter is to give some guidelines for performing identification in
practice. The chapter is organized around the following issues:
..
..
..
..
Note that for obvious reasons the user has to choose ff'before the experiment. After the
experiment and the data acquisition have been done the user can still try different model
structures and identification methods.
468
Section 12.2
fJe
469
points (see Property 3 in Section 5.4). If a signal is chosen which has strictly positive
spectral density for all frequencies, this will guarantee persistent excitation of a sufficient
order.
The discussion so far has concerned the form of the input. It is also of relevance to
discuss its amplitude. When choosing the amplitude, the user must bear in mind the
following points:
.. There may be constraints on how much variation of the signals (for example, u(t) and
yet)) is allowed during the experiment. For safety or economic reasons it may not be
possible to introduce too large fluctuations in the process .
.. There are also other constraints on the input amplitude. Suppose one wishes to
estimate the parameters of a linear model. In practice most processes are nonlinear
and a linear model can hence only be an approximation. A linearization of the nonlinear dynamics will be valid only in some region. To estimate the parameters of the
linearized model too large an amplitude of the input should not be chosen. On the
other hand, it could be of great value to make a second experiment with a larger
amplitude to test the linearity of the model, i.e. to investigate over which region the
linearized model can be regarded as appropriate .
.. On the other hand, there are reasons for using a large input amplitude. One can expect
that the accuracy of the estimates will improve when the input amplitude is increased.
This is natural since by increasing u(t), the signal-to-noise ratio will increase and the
disturbances will playa smaller role.
A further important comment on the choice of input must be made. In practice the
user can hardly assume that the true process is linear and of finite order. Identification
must therefore be considered as a method of model approximation. As shown in Chapter 2, the resulting model will then depend on the experimental condition. A simple but
general rule can therefore be formulated: choose the experimental condition that resembles the condition under which the model will be used in the future. Expressed
more formally: let the input have its major energy in the frequency band which is of
interest for the intended application of the model. This general rule is illustrated in the
following simple example.
Example 12.1 Influence of the input signal on model pe~formance
Consider the system
yet)
+ ay(t
- 1) = bu(t - 1)
+ e(t)
(12.1)
where e(t) is white noise with variance A2 . Also consider the following two experimental
conditions:
ff1 : u(t) is white noise of variance 0 2
ff2 : u(t) is a step function with amplitude
The first input has an even frequency content, while the second is of extremely low
frequency. Note that both inputs have the same power.
The asymptotic covariance matrix of the LS estimates of the parameters a and b is
given by, (cf. (7.66
470
Chapter 12
A2
N
d)
Ey2(t)
-Ey(t)u(t)
.-EY(t)U(t))--l ('; P
Eu 2 (t)
- e
(12.2)
Consider first the (noise-free) step response of the system. The continuous time
counterpart of (12.1) is (omitting the noise eel~
y + ay
== ~u
(12.3)
where
a=-..!.log(-a)
h
~ == - (1
(12.4)
and where h is the sampling interval. The step response is easily deduced from (12.3). It
is given by
(12.5)
Next examine the variance of the step response (12.5), due to the deviation
(d - a b - b)T. Denoting the step responses associated with the system and the model
by y(t, 0) and yet, 8), respectively,
, = yet, 0)
yet, 0)
a y(t,
+ [ ae
J'
0) (0 - 0)
and
J [,
. J[aoa yet, 8) JT
, = [ fie
a yet, 0) E (0 - 8)(8 - 0) r
var[y(t, 8)]
= [:0 yet, 0)
JPe[!e y(t, 0)
(12.6)
The covariance matrix P e can be evaluated as in (7.68) (see also Examples 2.3 and 2.4).
Figure 12.1 shows the results of a numerical evaluation of the step response and its
accuracy given by (12.6) for the two inputs ffl and ff2 It is apparent from the figure that
ff2 gives the best accuracy for large time intervals. The static gain is more accurately
estimated for this input. On the other hand, ffj gives a much more accurate step
response for small and medium-sized time intervals.
Next consider the frequency response. Here
be iw
'iw -
+ ae
be iOJ
elm,.
.
iO} = (1
iUl)2 [b(l + ae"O) - b(l + deW]
+ ae
+ ae
.
=
(1
e im
.
IW
+ ae iw)2 (-be
1 + ae 1m )
(d - a)
A
b- b
Section 12.2
75
100
75
100
FIGURE 12.1 Step responses of the models obtained using white noise (gel> upper
part) and a step function (R"2, lower part) as input signals. For each part the curves
shown are the exact step responses and the response one standard deviation. The
parameter values are 0 2 = Eu 2(t) = 1, ,,2 = Ee2(t) = 1, a = -0.95, b = 1, N = 100.
bei(J)
bei(J) 12
. - 1 + ae"o
.
Q(ill ) teE
1 + aeJ(J)
(12.7)
This criterion is plotted as a function of 0) in Figure 12.2 for the two inputs gel and ge2'
Note that the low-frequency input ge2 gives a good accuracy of the very low-frequency
model behavior (Q(O)) is small for 0) = 0), while the medium and high-frequency be
havior of the model (as expressed by Q (0) )) is best for gel'
Several attempts have been made in the literature to provide a satisfactory solution to the
problem of choosing optimal experimental conditions. It was seen in Chapter 10 that it
can sometimes pay to use a closed loop experiment if the accuracy must be optimized
under constrained output variance. If it is required that the process operates in open loop
then the optimal input can often be synthesized as a sum of sinusoids. The number of
sinusoids is equal to or greater than the order of the model. The optimal choice of the
472
Chapter 12
Q(W)
10-- 1
-t-----.-----.-------.------.------..------r......J
o
it
FIGURE 12.2 The functions Q(w) for u(t) white noise (gel) and step (gez). The
parameter values are OZ = EuZ(t) = 1, /,.2 = Ee2 (t) = 1, a = -0.95, b = 1, N = 100.
Note the logarithmic scale.
amplitudes and frequencies of the sinusoidal input signal is not, however, an easy matter.
This optimal choice will depend on the unknown parameters of the system to be
identified. The topic of optimal input design will not be pursued here; refer to Chapter 10
(which contains several simple results on this problem) and its bibliographical notes.
Next consider the choice of sampling interval. Several issues should then be taken into
account:
Pre filtering of data is often necessary to avoid aliasing (folding the spectral density).
Analog filters should be used prior to the sampling. The bandwidth of the filter should
be somewhat smaller than the sampling frequency. The filter should for low and
medium-sized frequencies have a constant gain and a phase close to zero in order not
to distort the signal unnecessarily. For high frequencies the gain should drop quickly.
For a filter designed in this way, high-frequency disturbances in the data are filtered
out. This will decrease the aliasing effect and may also increase the signal-to-noise
ratio. Output signals should always be considered for prefiltering. In case the input
signal is not in a sampled form (held constant over the sampling intervals), it may be
useful to prefilter it as well otherwise the high-frequency variations of the input can
cause a deterioration of the identification result (see Problem 12.15) .
Assume that the total time interval for an identification experiment is fixed. Then it
Section 12.2
may be useful to sample the record at a high sampling rate, since then more measurements from the system are collected.
Assume that the total number of data points is fixed. The sampling interval must then
be chosen by a trade-off. If it is very large, the data will contain very little information
about the high-frequency dynamics. If it is very small, the disturbances may have a
relatively large influence. Furthermore, in this last case, the sampled data may contain
little information on the low-frequency dynamics of the system.
As a rough rule of thumb one can say that the sampling interval should be taken as
10% of the settling time of a step response. It may often be much worse to select the
sampling interval too large than too small (see Problem 12.'14 for a simple illustration).
Very short sampling intervals will often give practical problems: all poles cluster
around the point z = 1 in the complex plane and the model determination becomes
very sensitive. A system with a pole excess of two or more becomes nonminimum
phase when sampled (very) fast (see Astrom et al., 1984). This will cause special
problems when designing regulators.
Having collected the data, the user often has to perform some precomputations. It is
for example advisable to perform some filtering to reduce the effect of noise.
As already stated, high-frequency noise can cause trouble if it is not filtered out before
sampling the signals (due to the aliasing effect of sampling). The remedy is to use analog
lowpass filters before the signals are sampled. The filters should be designed so that the
high-frequency content of the signals above half the sampling frequency is well damped
and the low-frequency content (the interesting part) is not very much affected.
As a precomputation it is common practice to filter the recorded data with a discretetime filter. The case when both the input and the output are filtered with the same filter
can be given some interesting interpretation. Consider a SISO model structure
+ H(q-l; 8)e(t)
(12.8)
E2(t)
= J-r\q-l;
=
(12.10)
Let El(t) denote the prediction error for the model (12.8). Then
E2(t) ,= F(q-l)Ej(t)
(12.11)
Let the unknown parameters 8 of the model be determined by the PEM. Then, for a
large number of data points, 8 is determined by minimizing
(12.12a)
in the case of unfiltered data, and
474
Chapter 12
(12.12b)
in the filtered data case. Here <PE,(W; 8) denotes the spectral density of l(t). Thus the
flexibility obtained by the introduction of F(q-l) can be used to choose the frequency
bands in which the model should be a good approximation of the system. This is quite
useful in practice where the system is always more complex than the model, and its
'global' approximation is never possible.
yet)
1.0u(t - 1)
e(t)
0.7e(t - 1)
(12.13)
= x(t) + m
was simulated, generating u(t) as a PRBS of amplitude 1 and e(t) as white noise of zero
mean and unit variance. The number of data points was N = 1000. The number m
accounts for the mean value of the output. The prediction error method was applied for
two cases m = 0 and m = 10. The results are shown in Table 12.1 and Figure 12.3. (The
plots show the first 100 data points.)
TABLE 12.1a Parameter estimates for Example 12.2, m = 0
Model order
Parameter
True value
Estimated value
-0.8
1.0
0.7
-0.784
0.991
0.703
b
c
Parameter
True value
Estimated value
a
b
c
-0.8
1.0
0.7
-0.980
1.076
0.618
al
a2
bl
b2
-1.8
0.8
J.()
-1.0
-0.3
-0.7
-1.788
0.788
1.002
-0.988
-0.315
-0.622
Cl
C2
10
-10+-----------------r-------______~
50
100
(a)
50
100
(b)
FIGURE 12.3 Input (upper part), output (1, lower part) and model output (2, lower
part) for Example 12.2. (a) m = O. (b) m = 10, model order = 1 (c) m == 10, model
order = 2. (d) Prediction errors for m = 0 (upper), m = 10, first-order model
(middle), and m == 10, second-order model (lower).
50
100
(c)
()
-4+-__________.______. -______________-;
50
10
o
50
(d)
100
Section 12.3
477
It is clear from the numerical results that good estimates are obtained for a zero
mean output (m = 0), whereas m =1= 0 gives rise to a substantial bias. Note also the large
spike in the residuals for t = 1, when m = 10.
The true system is of first order. The second-order model obtained for m = 10 has an
interesting property. The estimated polynomials (with coefficients given in Table 12.1b)
can be written as
A(q-l)
B(q-I)
(1 - q-l)1.002 + 0.0l5q--2
C(q-I)
(12.14)
The small second-order terms can be ignored and the following model results:
(l - 0.788q--l)(1 - q-l)y(t)
1.002(1 - q-l)U(t - 1)
+ (1 + 0.685q-l)(1 - q-I)e(t)
(12.15)
A theoretical justification can be given for the above result for the second-order model.
The system (12.13) with m = 10 can be written as
+ 0.8q-2)y(t)
(l.Oq-l - l.Oq-2)U(t)
+ (1 - O.3q-l - O.7q-2)e(t)
(12.16)
The 'true values' given in Table 12.1b were in fact found in this way.
The model (12.14) (or (12.15) is a good approximation of the above second-order
description of the system. Nevertheless, as shown in Figure 12.3c, the model output is
far from the system output. This seemingly paradoxical behavior may be explained as
follows. The second-order model gives no indication on the mean value m of yet). When
its output was computed the initial values were set to zero. However, the 'correct' initial
values are equal to the (unknown) mean m. Since the model has a pole very close to 1,
the wrong initialization is not forgotten, which makes its output look like Figure 12.3c
(which is quite similar to Figure 12.3b). Note that the spikes which occur at t = 1 in the
middle and lower parts of Figure 12.3d can also be explained by the wrong initialization
with zero of the LS predictor.
By introducing the difference or delta operator
A = 1 _ q-l
(12.17a)
l.OAu(t - 1)
+ (1 + O.7q-l)Ae(t)
(12.17b)
The implications of writing the model in this way, as compared to the original representation (12.13) are the following:
.. The model describes the relation between differenced data rather than between the
original input and output.
478
Chapter 12
.. The constant level m disappears. Instead the initial value (at t = 0) of (12.17b) that
corresponds to the integrating mode, ~ = 1 - q-l will determine the level of the
output y(t) for t > O.
lID
Remark 1. In what follows an integrated process, denoted by
1
yet) = ~ u(t)
(12.18a)
will mean
(12.18b)
As a consequence
y(t) = yeO) +
2:
(12. 18c)
u(k)
k=l
~ u(t) =
2:
u(k)
k=-oo
would not be suitable, since even with u(t) a stationary process, yet) will not have tinite
variance for any finite t.
II1II
Remark 2. Consider a general ARMAX model for differenced data. It can be written
as
(12.19a)
To model a possible drift disturbance on the output, add a second white noise term, v(t),
to the right-hand side of (12.19a). Then
A(q-l)~y(t) = B(q-l)~U(t)
C(q-l)~e(t)
+ vet)
(12.19b)
Note that the part of the output corresponding to vet) is 1![A(q-l)~]V(t), which is a
nonstationary process. The variance of this term will increase (approximately linearly
with time, assuming the integration starts at t = O. When A(q-l) = 1, such a process is
often called a random walk (cf. Problem 12.13).
Next one can apply spectral factorization to the two noise terms in (12.19b). Note that
both terms, i.e. C(q-l)~e(t) and vet), are stationary processes. Hence the spectral
density of the sum can be obtained using a single noise source, say C(q-l)W(t) where
C(q--I) is a polynomial with all zeros outside the unit circle and is given by
C(Z)C(Z-l)A~, =
(12.19c)
(cf. Appendix A6.1). (A~ is the variance of the white noise w(t).)
Hence the following model is obtained, which is equivalent to (12.19b):
A(q-l)~y(t)
= B(q-l)~U(t) +
C(q-l)W(t)
(12.19d)
Section 12.3
479
(12.1ge)
Such models are often called CARIMA (controlled autoregressive integrated moving
average) models or ARIMAX. The disturbance term, C(q-l)/[A(q-l)(l - q-l)]W(t), is
called an ARIMA process. Its variance increases without bounds as t increases. Such
models are often used in econometric applications, where the time series frequently are
nonstationary. They constitute a natural way to describe the effect of drifts and other
nonstationary disturbances, within a stochastic framework. Note that even if they are
nonstationary due to the integrator in the dynamics, the associated predictors are
..
asymptotically stable since C(z) has all zeros strictly outside the unit circle.
The mean values (or their effect on the output) can be estimated in two ways.
Fit a polynomial trend
(12.20)
to the output by linear regression and similarly for the input. Then compute new
(,detrended') data by
yet)
= yet) - y*(t)
aCt)
= /,let) - u*(t)
and thereafter apply an identification method to yet), U(t). If the degree r in (12.20) is
chosen as zero, this procedure means simply that the arithmetic mean values
1
y*
2: yet)
1=1
1
u* = N
2: u(t)
1=1
480
Chapter 12
are computed and subtracted from the original data. (See Problem 12.5 for an
illustration of this approach.) Note that (12.20) for r > 0 can also mode! some drift in
the data, not only a constant mean .
Estimate the means during the parameter estimation phase. To be able to do so a
model structure must be used that contains an explicit parameter for describing the
effect of some possibly nonzero mean values on yet). The model will have the form
yet)
(12.21)
where m(8) is one (or ny in the multivariable case) of the elements in the parameter
vector 8. This alternative is illustrated in Problem 12.3-12.5. The model (12.21) can
be extended by replacing the constant m(8) by a polynomial in t, where the coefficients
depend on 8.
Handling nonzero means with stochastic models
In this approach a model structure is used where the noise filter has a pole in z
means that the model structure is
yet)
1. This
(12.22a)
where H(q-l; 8) has a pole in q-l = 1. This is equivalent to saying that the output
disturbance is modeled as an ARIMA model, which means that
H(q-l. 8) = H(q-l; 8)
,
1- q 1
(12.22b)
for some filter H(q-\ 8). The model can therefore be written as
(12.22c)
(cL (12.19d, e. Therefore, one can compute the differenced data and then use the
model structure (12.22c) which does not contain a parameter m(8).
Assume that the true system satisfies
(12.23)
where m accounts for the nonzero mean of the output. The prediction error becomes
e(t, 8)
(12.24)
It is dearly seen that m does not influence the prediction error e(t, 8). However, e(t, 8) =
e(t) cannot hold unless H (q-\ 8) = HsCq-')L1. This means that the noise filter must have
a zero in z = 1. Such a case should be avoided, if at all possible, since the predictor will
not then be stable but be significantly dependent on unknown initial values. (The
parameter vector 8 no longer belongs to the set 0), (6.2).)
Note that in Example 12.2 by increasing the model order both C(q--l; 8) and
H(q-I; 8) got approximately a pole and a zero in z = 1. Even if it is possible to get rid of
Section 12.4
481
nonzero mean values in this way it cannot be recommended since the additional poles
and zeros in z = 1 will cause problems.
yet)
G,.(q-l)U(t) +
7~q~J e(t)
(12.25)
Here the disturbance is a nonstationary process (in fact an ARIMA process). The
variance of yet) will increase without bound as t increases.
Consider first the situation where the system (12.25) is identified using an ARMAX
model
(12.26a)
To get a correct model it is required that
CCq-l)
HsCq-l)
(12.26b)
A(q--l) = 1 _ q-l
which implies that ~ = 1 - q -1 must be a factor of both A (q -\) and B (q -1). This should
be avoided for several reasons. First, the model is not parsimonious, since it will contain
unnecessarily many free parameters, and will hence give a degraded accuracy. Second,
and perhaps more important, such models are often not feasible to use. For example, in
pole placement design of regulators, an approximately common zero at z = 1 for the
A- and B-polynomial will make the design problem very ill-conditioned.
A more appealing approach is to identify the system (12.25) using a model structure of
the form (12.22). The prediction error is then given by
c:(c, 8)
= H-1(q-l;
=
8)[~y(t)
- C(q-l;
8)~u(t)]
8)]~u(t)
(12.27)
+ H-l(q--l; 8)~ft.(q-l)e(t)
where it is assumed that H--I(q-l; 8) is asymptotically stable. Here one would like to
cancel the factor ~ that appears both in numerator and denominator. This must be done
with some care since a pole at z := 1 (which is on the stability boundary) can cause a
nondecaying transient (which in this case would be a constant). Note however that
(1/~)e(t) can be written as (cf. (12.18))
~ e(t) =
Xo
2: e(i)
i=l
where
Xo
(12.28a)
482
Chapter 12
= Ll [XO +
~ e(i)]
(12.28b)
e(t)
(12.29)
+ [j--l(q-l; 8)H\.(q-l)e(t)
is a stationary process and the PEM loss function is well behaved.
1.0u(t - 1)
+ 0.2e(t -
2)
where u(t) is a PRBS of amplitude 1 and e(t) a zero mean white noise process with unit
varIance.
The number of data points was taken as N = 1000. The system was identified using a
prediction error method within an ARMAX model structure
(12.30)
The polynomials in (12.30) are all assumed to have degree n. This degree was varied
from 1 to 3. The results obtained are summarized in Tables 12.2-12.4 and Figure 12.4.
It is obvious from Tables 12.2-12.4 and the various plots that a second-order model
should be chosen for these data. The test quantity for the F-test comparing the secondand the third-order models is 4.6. The 5% significance level gives a threshold of 7.8l.
Hence a second-order model should be chosen. Observe from Table 12.3 that for n = 3
the polynomials A, Band C have approximately a common zero (z = -0.6). Note from
Table 12.2 and Figure 12.4a, third part, that the estimated second-order model gives a
very good description of the true system.
III
A
yet) - O.8y(t - I)
= 1.0u(t -
1)
+ eel) + O.7e(t - 1)
Loss
1713
Parameter
True value
-0.822
0.613
0.389
a
b
c
526.6
-1.5
0.7
at
a2
bl
1.0
b2
0.5
-1.0
0.2
CI
C2
524.3
Estimated value
-1.502
0.703
0.938
0.550
-0.975
0.159
-0.834
-0.291
0.462
0.942
1.151
0.420
-(J.308
-0.455
0.069
al
az
a3
bl
b2
b3
CI
C2
C3
TABLE 12.3 Zeros of the true and the estimated polynomials, Example 12.3
Polynomial
True system
n=l
A
0.750
B
C
-0.500
0.276
0.724
iO.371
Estimated models
n = 2
0.822
0.751
-0.389
-0.586
0.207
0.768
iO.373
0.748
-0.662
-0.611
-0.615
0.144
0.779
= 3
iO.373
iO.270
TABLE 12.4 Test quantities for Example 12.3. The distributions for
the correlation tests refer to the approximate analysis of
Examples 11.1 and 11.2
Model
order
Test quantity
n = 1
changes of sign
in {E(l)}
test on r,(t) = 0
,= 1, ... ,10
test on r"u(') = 0
,= 2, . ",11
test on rw(t) = 0
t = -9, ... ,0
n = 2
changes of sign
in {E(t)}
test on riLl = ()
t = 1, ... , 10
test on 'u,(') = 0
t = 3, ... ,12
test on rw (') = 0
,= --9, ... ,0
Numerical
value
Distributions
(under null
hypothesis)
95%
[99'7'0]
confidence
level
459
,/V(500,250)
212.4
X2 (1O)
(468,530)
[(459,540) ]
18.3 [23.2]
461.4
X2(1O)
18.3 [23.2]
3.0
X2(10)
18.3 [23.2]
A.r( 500,250)
514
12.2
X2(1O)
(468,530)
[(459,540) 1
18.3 [23.2]
4.8
X2(1O)
18.3 [23.2]
4.7
X2(1O)
18.3 [23.2]
484
Chapter 12
where u(t) is a PRES of amplitude 1 and e(t) white noise of zero mean and unit
variance. The number of data points was N = 1000. The system was identified using the
least squares method in the model structure
with
A(q-I) = 1
Ulq-I
+ ... +
+ .,. +
B(q-I) = b1q-1
unq-n
b ll q-Il
50
15
-15
________________. -______________
50
100
~
15
-15~
________________r -____________
50
100
~
15
-15~
______________'__- r______________~
50
100
(a)
FIGURE 12.4(a) Input (first part). output (1) and model output (2) for n = 1 (second
part), n = 2 (third part) and n = 3 (fourth part). Example 12.3. The first 100 data
points are plotted.
f,(,)
rw (')
r,(O)
0.75
\1[1',(0)1',,(0) 1
0.75
0.5
0.5
0.25
0.25
-0.25
5
-0.25
-10
10
10
(b)
FIGURE 12.4(b) Correlation functions for the first-order model, Example 12.3.
Lefthand curve: the normalized covariance function re(T,) of the residuals E(t). Righthand curve: the normalized cross-covariance function fw(T,) between E(t) and u(t).
fE(T)
'w(t)
1',.(0)
\I[7,(0)f u(O)}
0.8
0.8
0.6
0.6
0.4 t
0.4
0.2
0.2
......--.
()
r--..
'V
()
'"'"
V'
()
/'
t
10
-10
(c)
........,
V" v
t
10
486
Chapter 12
The model order was varied from n = 1 to n = 5. The results obtained are summarized in
Tables 12.5, 12.6 and Figures 12.5, 12.6.
When searching for the most appropriate model order the result will depend on how
the model fit is assessed.
First compare the deterministic parts of the models (i.e. [B(q-l)!A(q-I)]U(t). The
model outputs do not differ very much when the model order is increased from 1 to 5.
Indeed, Figure 12.5 illustrates that the estimated transfer functions B(q-l)!A(q-l), for
all n, have similar pole-zero 'configurations to that of the true system. Hence, as long as
only the deterministic part of the model is of importance, it is sufficient to choose a firstorder model. However, note from Table 12.5 that the obtained model is slightly biased.
Hence, the estimated first-order model may not be sufficiently accurate if it is used for
other types of input signals.
If one also considers the stochastic part of the model (i.e. [lIA(q-I)]e(t) then the
situation changes. This part of the model must be considered when evaluating the prediction ability of the modeL The test quantities given in Table 12.6 are all based on the
stochastic part of the modeL Most of them indicate that a fourth-order model would be
adequate.
Model order
Loss
7R6
Parameter
True value
Estimated value
-O.R
1.0
-O.86R
0.917
b
2
603
al
a2
hI
h2
-1.306
0.454
0.925
-0.541
551
al
a2
a,
bl
-1.446
0.R36
--0.266
0.935
-0.673
0.358
b2
h,
4
533.1
al
a2
a,
a4
bl
h2
b,
h4
5
530.0
ill
a2
a]
a4
as
hi
h2
h]
h4
hs
-1.499
0.992
-0.530
0.169
0.93R
-0.729
0.469
-0.207
-1.513
1.035
-0.606
0.2Rl
-0.067
0.938
-0.743
0.501
-0.25R
0.100
Section 12.4
Model order
Loss
F-test
Degrees of
freedom
95%
confidence
interval
10
(467,531)
787
356
257
15.3
438
108
13.1
305
2
603
94
551
482
50.2
4.1
494
22.1
4.9
502
20.0
3.3
34
533.1
5.8
530.0
To illustrate how well the different models describe the stochastic part of the system,
the noise spectral densities are plotted in Figure 12.7.
It can be seen from Figure 12.7 that the fourth-order model gives a closer fit than the
first-order model to the true spectral density of the stochastic part of the system. III
yet)
(12.31)
can in fact cover cases with a time delay. To see this write
G(q-l; 8)
gi(8)q-i
(12.32a)
i=1
where the sequence {g;(8)} is the discrete time impulse response. If there should be a
delay of k (k ~ 1) sampling intervals in the model, it is sufficient to require that the
parametrization satisfies
g;(8) == 0
i = 1, 2, ... , (k - 1)
(12.32b)
n = 1
n = 2
n = 4
n = 3
.I.
n = 5
o
-l+L~~ML~~~~~~~-L~~LU~L-~
50
100
-10+----------------~1----------------_4
50
100
FIGURE 12.6 Input (upper part), output (1, lower part) and model output (2, lower
part) for model order n = 4, Example 12.4.
Section 12.5
Time delays
489
100...,..----------------------------.,
<p(w)
50
True system
FIGURE 12.7 Noise spectral density 11 + O.7e iw I 2/(2nll - O.8e i w[2) of the true system
and the estimated spectra ~2/(2nIA(eii'w)12) of the first- and fourth-order models.
With this observation it can be concluded that the general theory that has been
developed (concerning consistency and asymptotic distribution of the parameter
estimates, identifiability under closed loop operation, etc.) will hold also for models with
time delays.
If the constraint (12.32b) is not imposed on the model parametrization, the theory still
applies since ../ E .At under weak conditions. The critical point is to not consider a
parametrized model having a larger time delay than the true one. In such a case . f no
longer belongs to .At and the theory collapses.
The following example illustrates how to cope with time delays from a practical point
of view.
Example 12.5 Treating time delays for ARMAX models
Assume that estimates are required of the parameters of the following model:
(12.33)
where
=
B(q-l) =
C(q-l) =
A(q-l)
b1q-l
+ .. , + bnq-n
Cjq-l
k being an integer, k
?;
+ .. , + cnq-n
1. By writing the model as
490
Chapter 12
q--(k-l)U(t)
U(t
+ 1)
k, k
+ 1,
+ 2, ... , N
2. Next estimate the parameters of the ARMAX model applying a standard method (i.e.
one designed to work with unit time delay) using u(t) as input signal instead of u(t).
Note that one could alternatively shift the output sequence as follows:
I'. Compute yet) = yet + k - 1)
t = 1,2 ... , N - k + 1
2'. Estimate the ARMAX model parameters using yet) as output instead of yet).
Such a procedure is clearly not limited to ARMAX models. It may be repeated for
various values of k. The determination of k can be made using the same methods as those
used for determination of the model order. Various methods for order estimation were
III
examined above; see also Chapter 11 for a detailed discussion.
It should be noted that some care should be exercised when scanning the delay k and the
model order n simultaneously. Indeed, variation of both k and n may lead to non-nested
model structures which cannot be compared by using the procedures described in Section
11.5 (such as the F-test and the AIC). To give an example of such a case, note that the
model structures (12.33) corresponding to {k = 1, n = I} and to {k = 2, n = 2} are not
nested. In practice it is preferable to compare the model structures corresponding to a
fixed value of n and to varying k. Note that such structures are nested, the one
corresponding to a given k being included in those corresponding to smaller values of k.
= C(q-I;
(12.34)
(12.35)
and possibly also their gradient with respect to 8. In order to compute (1, 8), ... ,
E(N, 8) from the measured data y(l), u(l), ... , yeN), u(N) using (12.35) some initial
conditions for this difference equation are needed. One can then proceed in at least
two ways:
.. Set the initial values to zero. If a priori information is available then other more
appropriate values can be used .
.. Include the unknown initial values in the parameter vector.
A further, but often more complicated, possibility is to find the exact ML estimate (see
Initial conditions
Section 12.6
491
Complement C7.7). The following example illustrates the two possibilities above for an
ARMAX model.
Example 12.6 Initial conditions for ARMAX models
Consider the computation of the prediction errors for an ARMAX model. They are
given by the difference equation
(12.36)
For convenience let an the polynomials A(q-l), B(q-I) and C(q-I) have degree n. The
computation of E(t, 8) can then be done as follows:
E(t, 8)
(12.37)
where
e=(a[ ... an
E(t - 1, e)
...
E(t - n, e
Cj . . . Cn)T
When proceeding in this way, cp(l) is needed to compute 10(1, e) but this vector contains
unknown elements. The first possibility would be to set
yet) = 0
u(t) = 0
E(t, e)
for
t ~
The second possibility is to include the unknown values needed in the parameter
vector, which would give
8'
8 = (y(O)
(~)
(12.38a)
... y( -n
(12.38b)
This makes the new parameter vector 8' of dimension 6n. This vector is estimated by
minimizing the usual loss function (sum of squared prediction errors).
The second possibility above is conceptually straightforward but leads to unnecessary
computational complications. Since the computation of the prediction errors is needed as
an intermediate step during an optimization with respect to e (or 8') it is of paramount
importance not to increase the dimension of the parameter vector more than necessary.
Now regard (12.36) as a dynamic system with u(t) and y(t) as inputs and E(t, 8) as
output. This system is clearly of order n. Therefore it should be sufficient to include only
n initial values in the parameter vector, _which makes a significant reduction as compared
to the 3n additional values entered in 8 (12.38b).
One way to achieve the aforementioned reduction would be to extend e with the first
n prediction errors in the criterion. Consider therefore the modified criterion
N
Vee')
2:
t~n+l
2(t, 8)
(12.39)
492
Chapter 12
where
8
(1, 8)
8'=
(12.40)
fen, 8)
-Cl
x(t + 1) =
-CII
(t, 8)
(1
bi
1
yet)
(12.41)
b ll
() ... 0) x(t)
u(t) +
x(t) -
+ yet)
8'-
( 8)
x(I)
(12.42)
which is of dimension 4n. It is clearly seen from (12.41) how to compute (1, 8), ... ,
(N, 8) from the data y(1), u(I), ... , yeN), u(N) for any value of 8'. When implementing (12.41) one should, of course, make use of the structure of the companion
matrix in order to reduce the computational effort. There is no need to perform multiplication with elements that are known a priori to be zero. When using (12.41), (12.42)
the parameter estimates are determined by minimizing the criterion
N
(12.43)
t~1
In contrast to (12.39) the first n prediction errors 1'(1, W), ... , fen, 8') are now included
in the criterion.
II1II
Finally, it should be noted that the initial conditions cannot be consistently estimated
using a PEM or another estimation method (see Astrom and Bohlin, 1965). See also
Problem 12.11. This is in fact quite expected since the consistency is an asymptotic
property, and the initial conditions are immaterial asymptotically for stable systems.
However, in practice one always processes a finite (sometimes rather small) number of
samples, and inclusion of the initial conditions in the unknown parameter vector may
improve substantially the performance of the estimated modeL
Section 12.8
Note that here the choice of model structure is implicitly involved, since the methods
above are tied to certain types of model structure.
In summary, the user must make a trade-off between required accuracy and computational effort when choosing the identification method. This must be done with the
purpose of the identification in mind. In practice other factors will influence the choice,
such as previous experience with various methods, available software, etc.
494
Chapter 12
used there is a potential risk that the algorithm is stuck at a local minimum. Example 11.5
demonstrated how this can happen in practice. Note that this type of problem is linked to
the use of prediction error methods. It does not apply to instrumental variable methods.
It is hard to analyze the possible existence of local minima theoretically. Only a few
general results are available. They all hold asymptotically, i.e. the number of data, N, is
assumed to be large. It is also assumed that the true system is included in the model
structure. The following results are known:
.. For a scalar ARMA process
all minima are global. There is a unique minimum if the model order is correctly
chosen. This result is proved in Complement C7.6 .
.. For a multivariable MA process
yet)
C(q-l)e(t)
there is a unique minimum if the signal-to-noise ratio is (very) high, and several local
minima if this ratio is (very) low.
For the SISO output error model
8(q-l)
yet)
all minima are global if the input signal is white noise. This result is proved in
Complement C7.5.
In all the above cases the global minimum of the loss function corresponds to the true
system.
If the minimization algorithm is stuck at a local minimum a bad model may be
obtained, as was illustrated in Example ll.5. There it was also illustrated how the misfit
of such a model can be detected. If it is found that a model is providing a poor description
of the data one normally should try a larger model structure. However, if there is reason
to believe that the model corresponds to a nonglobal minimum, one can try to make a
new optimization of the loss function using another starting point for the numerical
search routine.
The practical experience of how often certain PEM algorithms for various model
structures are likely to be stuck at local minima is rather limited. It may be said,
however, that with standard optimization methods (for example, a variable metric
method or a Gauss-- Newton method; see (7.87)) the global minimum is usually found for
an ARMAX model, while convergence to a false local minimum occasionally may occur
for the output error model
Section 12.9
yet)
Robustness
B(q-l)
F(q-l) u(t)
495
+ e(t)
When the loss function is expected to be severely multimodal (as, for example, is the
PEM loss function associated with the sinusoidal parameter estimation, Stoica, Moses
et at., 1986), an alternative to accurate initialization is using a special global optimization
algorithm such as the simulated annealing (see, for example, Sharman and Durrani,
1988). For this type of algorithms, the probability of finding the global minimum is close
to one even in complicated multimodal environments.
12.9 Robustness
When recording experimental data occasional large measurement errors may occur.
Such errors can be caused by disturbances, conversion failures, etc. The corresponding
abnormal data points are called outliers. If no specific action is taken, the outliers will
influence the estimated model considerably. The outliers tend to appear as spikes in the
sequence of prediction errors {ECt, O)}, and will hence give large contributions to the
loss function. This effect is illustrated in the following two examples.
Example 12.7 Effect of outliers on an ARMAX model
Consider an ARMAX process
A(q-l)y(t) = B(q-l)U(t)
+ C(q--I)e(t)
(12.44a)
where e(t) is white noise of zero mean and variance A? Assume that there are
occasionally some errors on the output, so that one is effectively measuring not yCt) but
z(t)
= y (t) + v(t)
(12.44b)
'* s .
These assumptions imply that vet) is white noise. From (12.44a, b),
A(q-l)Z(t)
B(q-l)U(t)
+ C(q-l)e(t) + A(q-l)V(t)
(12.44c)
= B(q-l)U(t) +
C(q--I)W(t)
(12.44d)
where w(t) is white noise of variance A~v, and C(q-l) and A~v are given by
(12.44e)
If the system (12.44a, b) is identified using an ARMAX model then (asymptotically, for
an infinite number of data points) this will give the model (12.44d). Note that A(q-I) and
496
Chapter 12
B(q~l)
remain unchanged. Further, from (12.44e), when 0 2 _____ 0 (the effect of outliers
tends to zero) then C(z) = C(z), i.e. the true noise description is found. Similarly, when
0 2 _____ 00 (the outliers dominate the disturbances in (12.44a)), then C(z) = A(z) and the
III
filter H(q~l) = C(q~l)/A(q~l) = 1 (as intuitively expected).
Example 12.8 Effect of outliers on an ARX model
The effect of outliers on an ARX model is more complex to analyze. The reason is that
an ARX system with outliers can no longer be described exactly within the class of ARX
models. Consider for illustration the system
yet) + ayCt - 1)
= bu(t - 1) + e(t)
z(t) = yet) + vet)
(12.45a)
with u(t), e(t) and v(t) mutually uncorrelated white noise sequences of zero means and
variances o~, A~, A~ respectively. Let the system be identified with the LS method in the
model structure
z(t) + az(t - 1)
bu(t - 1) + E(t)
(12.45b)
( b~')
(EZ 2 (t)
-Ez(t)u(t)
= (EZ2(t)
-EZ(t)U(t)~I(-EZ(t + l)Z(t)
Ez(t + l)u(t)
Eu 2 (t)
(~)~l(-EZ(t +2 1)z(t)
bou
0u
a=
b=
band
aEy~~ ~ aa
Ey2(t) +
(12.45c)
A~ ~
The scalar a satisfies 0 :s: a :s: 1. Specifically it varies monotonically with },~, tends to 0
when A~ tends to infinity, and to 1 when A~ tends to zero. Next examine how the noise
filter lJ(q~l) differs from 1. It will be shown that in the present case, for all frequencies
l_.~ _
I__
1+
(:/eIUl
11 :s: I1 +1.
ae
HD
-11
(12.4Sd)
This means that in the presence of outliers the estimated noise filter lIA(q~l) is closer to
1 than the true noise filter lIA(q--l). The inequality (12.45d) is proved by the following
equivalences:
a 2 11 + ae i <U12
o :s:
o :s:
:s: 11 + aae i "'12 <? a 2 (1 + a2 + 2a cos w) :s: 1 + a2a 2 + 2aa cos w <?
1 - a 2 + 2aa(1 - a)cos
(1 - a)
(J) <?
<?
+ 2a(1 + a cos w)
III
Section 12.9
Robustness
497
There are several ways to cope with outliers in the data; for example:
.. Test of outliers and adjustment of erroneous data .
.. Use of a modified loss function.
In the first approach a model is fitted to the data without any special action. Then
the residuals ECt,
of the obtained model are plotted. Possible spikes in the sequence
{E(l,
are detected. If IE(t, 8) I is abnormally large then the corresponding output y (t)
is modified. A simple modification could be to take
e)
e)}
yet) :=
.y(tlt -
e)
1,
The data string obtained as above is used to get an improved model by making a new
parameter estimation.
For explanation of the second approach (the possibility of using a modified loss
function) consider single output systems. Then the usual criterion
N
Vee)
2: E2(t, e)
(12.46)
can be shown to be (asymptotically) optimal if and only if the disturbances are Gaussian
distributed (cf. below).
Under weak assumptions the ML estimate is optimal. More exactly, it is asymptotically statistically efficient. This estimate maximizes the log-likelihood function
log L(e)
Using the model assumption that {E(t, (1)} is a white noise sequence with probability
density function f(E(t)), it can be shown by applying Bayes' rule that
N
log L(e) =
2: Jog f(E(t
t=
(12.47)
Vee)
= -
2: log f(E(t
t=
(12.48)
feE)
1
= -----.
V(2n)A
and hence
-logf(E(t, e
This means that in the Gaussian case, the optimal criterion (12.48) reduces to (12.46). If
there are outliers in the data, then feE) is likely to decrease more slowly with lEI than
498
Chapter 12
in the Gaussian case. This means that for the optimal choice, (12.48), of the loss function
the large values of IE(t, 8)1 are less penalized than in' (12.46). In other words, in (12.48)
the large prediction errors have less influence on the loss function than they have in
(12.46). There are many ad hoc ways to achieve this qualitatively by modifying (12.46)
(note that the function/( E) in (12.48) is normally unavailable. For example, one can take
N
2: t(E(t, 8))
V(8) =
(12.49a)
t=l
with
(12.49b)
or
E2
{
I () =
a2
if
if
2 ~ a 2
2
>
a2
(12.49c)
Note that for both choices (12.49b) and (12.49c) there is a parameter a to be chosen by
the user. This parameter should be set to a value given by an expected amplitude of the
prediction errors.
Sometimes the user can have useful a priori information for choosing the parameter a.
If this is not the case it has to be estimated in some way.
One possibility is to perform a first off-line identification using the loss function
(12.46). An examination of the obtained residuals {E(t, 8)}, including plots as well as
computation of the variance, may give useful information on how to choose a. Then a
second off-line estimation has to be carried out based on a modified criterion using the
determined value of a.
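The two-pass procedure just outlined is easy to prototype. The sketch below is only an illustration (not from the book): the simulated regression data, the choice a = 3·std(residuals) and the use of a general-purpose optimizer are assumptions made here. It fits a model by ordinary LS, chooses a from the first-pass residuals, and then minimizes the saturated criterion (12.49a), (12.49c).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulated linear regression data with a few outliers (illustrative only)
N = 200
phi = np.column_stack([rng.normal(size=N), rng.normal(size=N)])
theta_true = np.array([1.0, -0.5])
y = phi @ theta_true + 0.1 * rng.normal(size=N)
y[rng.choice(N, size=5, replace=False)] += 5.0          # outliers

# Pass 1: ordinary LS, criterion (12.46)
theta_ls, *_ = np.linalg.lstsq(phi, y, rcond=None)
resid = y - phi @ theta_ls
a = 3.0 * resid.std()                                    # threshold from residual spread

# Pass 2: minimize the saturated criterion (12.49a) with l() as in (12.49c)
def V(theta):
    eps2 = (y - phi @ theta) ** 2
    return np.sum(np.minimum(eps2, a ** 2))              # large residuals penalized by a^2 only

theta_rob = minimize(V, theta_ls, method="Nelder-Mead").x
print("LS estimate:     ", theta_ls)
print("robust estimate: ", theta_rob)
```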
Another alternative is to choose a in an adaptive manner using on-line identification.
To see how this can be done the following example considers a linear regression model
for ease of illustration.
Example 12.9 On-line robustification
Consider the model

    y(t) = φᵀ(t)θ + ε(t)        (12.50a)

and the weighted least squares criterion

    V_t(θ) = Σ_{s=1}^{t} β(s) ε²(s, θ)        (12.50b)
The choice of the weights ~(s) is discussed later. Paralleling the calculations in Section
9.2, it can be shown that the minimizer O(t) of (l2.50b) can be computed recursively in t
using the following algorithm:
    θ̂(t) = θ̂(t − 1) + K(t)ε(t)

    K(t) = P(t − 1)φ(t) / [1/β(t) + φᵀ(t)P(t − 1)φ(t)]

    P(t) = P(t − 1) − P(t − 1)φ(t)φᵀ(t)P(t − 1) / [1/β(t) + φᵀ(t)P(t − 1)φ(t)]        (12.50c)

    V_t(θ̂(t)) = V_{t−1}(θ̂(t − 1)) + ε²(t) / [1/β(t) + φᵀ(t)P(t − 1)φ(t)]        (12.50d)
Note that

    λ̂_t = [V_t(θ̂(t))/t]^{1/2}        (12.50e)

can serve as an on-line estimate of the residual standard deviation, and the weights can be chosen as

    β(t) = 1                    if |ε(t)| ≤ γ λ̂_t
    β(t) = (γ λ̂_t)² / ε²(t)     if |ε(t)| > γ λ̂_t        (12.50f)

so that β(t)ε²(t) = min(ε²(t), γ²λ̂_t²), in analogy with (12.49c).
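A minimal sketch of the resulting on-line algorithm, written directly from the weighted criterion (12.50b) with the saturating weights described above. Since (12.50c)-(12.50f) were partly reconstructed here, the update equations below should be read as an illustration of the idea rather than as the book's exact algorithm; the threshold factor γ and the initialization of P are arbitrary choices.

```python
import numpy as np

def robust_rls(y, Phi, gamma=3.0, p0=100.0):
    """Weighted recursive LS in which the weight beta(t) saturates the
    influence of large prediction errors, cf. (12.49c)/(12.50f)."""
    n = Phi.shape[1]
    theta = np.zeros(n)
    P = p0 * np.eye(n)
    V = 0.0                                           # running loss value, cf. (12.50d)
    for t in range(len(y)):
        phi = Phi[t]
        eps = y[t] - phi @ theta                      # prediction error
        lam = np.sqrt(V / t) if t > 0 else abs(eps)   # scale estimate, cf. (12.50e)
        thr = gamma * lam
        beta = 1.0 if abs(eps) <= thr else (thr / abs(eps)) ** 2   # saturating weight
        denom = 1.0 / beta + phi @ P @ phi
        K = P @ phi / denom                           # gain, cf. (12.50c)
        theta = theta + K * eps
        P = P - np.outer(P @ phi, phi @ P) / denom
        V = V + eps ** 2 / denom                      # loss update, cf. (12.50d)
    return theta

# Example: linear regression with occasional outliers
rng = np.random.default_rng(1)
N = 500
Phi = rng.normal(size=(N, 2))
y = Phi @ np.array([0.8, -0.3]) + 0.1 * rng.normal(size=N)
y[::50] += 4.0
print(robust_rls(y, Phi))
```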
12.10 Model verification
h(t) - Yo
YI(t) - Yo
(12.51a)
and take
(12.51b)
II
II
as an index for the degree of nonlinearity. For a (noise-free) linear system 'Y] = 0,
whereas for a nonlinear system 'Y] > 0.
Test of time invariance. A convenient way of testing time invariance of a system is
to use data from two different experiments. (This may be achieved by dividing the
recorded data into two parts.) The parameter estimates are determined in the usual
way from the first data set. Then the model output is computed for the second set of
data using the parameter estimates obtained from the first data set. If the process is
time invariant the model should 'explain' the process data equally well for both data
sets. This procedure is sometimes called cross-checking. Note that it is used not only
for checking time invariance: cross-checking may be used for the more general
purpose of determining the structure of a model. When used for this purpose it is
known as 'cross-validation'. The basic idea can be stated as follows: determine the
models (having the structures under study) that fit the first data set, and of these select
the one which best fits the second data set. The FPE and AIC criteria may be interpreted as cross-validation procedures for assessing a given model structure (as was
shown in Section 11.5; see also Stoica, Eykhoff et al., 1986).
Test for the existence of feedback. In Chapter 11 it was shown how to test the
hypothesis
(12.52)
Such a test can be used to detect possible feedback.
Assume that u (t) is determined by (causal) feedback from the output Y (t) and that
the residual E(t) is a good estimate of the white noise which drives the disturbances.
Then the input u(t) at time t will in general be dependent on past residuals but
independent of future values of the residuals. This means that

    E ε(t + τ)u(t) = 0        for τ > 0

and in general

    E ε(t + τ)u(t) ≠ 0        for some τ ≤ 0

if there is feedback.
Testing for or detection of feedback can also be done by the following method due to Caines and Chan (1975). Use the joint input-output method described in Section 10.5 to estimate the system. Let the innovation model be given by (10.42), (10.43), so that
    z(t) = ( y(t) )
           ( u(t) )

is described by

    z(t) = H(q⁻¹; θ)e(t)        Ee(t)eᵀ(s) = Λ δ_{t,s},   Λ block diagonal        (12.53)

with H(0; θ) = I. With G denoting the open-loop system, H the noise filter, and the (possible) feedback law written as u(t) = −F(q⁻¹)y(t) + K(q⁻¹)w(t), w(t) being an external signal, the true innovation filter and innovation covariance become

    H₀ = ( (I + GF)⁻¹H        G(I + FG)⁻¹K )
         ( −F(I + GF)⁻¹H      (I + FG)⁻¹K  )

    Λ = ( Λ_e   0  )
        (  0   Λ_w )        (12.54)
Clearly, the lower left block of H₀ satisfies

    (H₀)₂₁ ≡ 0    ⟺    F = 0

To apply this approach it is sufficient to estimate the spectral density of z(t) with some parametric method and then form the innovation filter Ĥ(q⁻¹; θ̂_N) based on the parametrization used. By using hypothesis testing as described in Chapter 11 it can then be tested whether the corresponding block of Ĥ is zero.
• The package should have some commands for performing nonparametric identification (for example spectral analysis and correlation analysis).
• The package should have some commands for performing parametric identification (for example the prediction error method and the instrumental variable method) linked to various model structures.
• The package should have some commands for model validation (for example some tests on residual correlations).
• The package should have some commands for handling of models. This includes simulation as well as transforming a model from one representation to another (for example transformation between transfer function, state space form, and frequency function).
Problems
Problem 12.1 Step response of a simple sampled-data system
Determine a formula for the step response of (12.1) (omitting e(t)). Compare the result with (12.5) at the sampling points.
yet) + aoYCt -
1)
yet)
bou(t - 1)
yet) +
my
mu
u(t) = u(t)
e(t)
where u(t) and e(t) are mutually uncorrelated white noise sequences with zero means and
variances 0 2 and 1..2 , respectively.
The system is to be identified with the least squares method.
(a) Determine the asymptotic LS estimates of a and b in the model structure
,At]:
yet) + ay(t - 1)
bu(t - 1)
+ E(t)
Show that consistent estimates are obtained if and only if my and m" satisfy a certain
condition. Give an interpretation of this condition.
(b) Suppose that the LS method is applied to the model structure
+ ay(t - 1)
J/t2: yet)
bu(t - 1)
+ c + E(t)
a, b and c.
.VCt) + aY(l - 1)
bu(t - 1)
+ e(t)
y(t) = y(t) + m
where u(t) and e(t) are mutually independent white noise sequences of zero means and
variances ()"2 and 1..2 , respectively.
(a) Assume that the system is identified with the least squares method using the model
structure
yet) + ay(t - 1)
bu(t - 1)
[1
e(t)
Y=
IV 2.:
y(t)
1=1
is computed and subtracted from the output. Then the least squares method is
applied to the model structure
(y(t) _.
+ a(y(t - 1) - y)
bu(t - 1)
e(t)
a~y(t
b~u(t
- 1) =
- 1)
+ E(t)
Derive the asymptotic parameter estimates and show that they are biased.
(d) Assume that an instrumental variable method is applied to the model structure
~y(t)
a~y(t
- 1)
= b~u(t
- 1)
+ vet)
z(t)
(u(t - 1)
u(t - 2))T
Show that the parameter estimates are consistent and find their asymptotic
covariance matrix. (Note the correlation function of the noise v(t).)
(e) Compare the covariance matrices found in parts (a), (b) and (d).
Problem 12.5 Linear regression with nonzero mean data
Consider the 'regression' model
y (t)
t = 1,2, ...
(i)
where yet) and ep(t) are given for t = 1,2, ... , N, and where the unknown parameters 8
and a are to be estimated. The scalar-valued parameter a was introduced in (i) to cover
the possible case of nonzero mean residuals (see, for example Problem 12.3). Let
1
Y=
IV
2: yet)
qJ =
1=1
yet)
R =
ep(t)
1=1
epCt)
yet) - y
IV 2:
ep(t) - qJ
1
ep(t)Cf"(t)
t=l
r =
IV L
(Ht)y(t)
1=1
and
rnTe'
0.=),"-''1''
Show that the LS estimates of e and a in (i) are given by
result.
(ii)
(l - aq-l)y(t)
e(t)
where la I < 1 and {e(t)} is white noise of zero mean and variance )".2 and the two following representations of yet):
y(l )(t) =
2:
i=O
aie(t - i)
Problems
505
and
xCt + 1)
= ax(t) + e(t +
y(2)(t) = x(t)
1)
t?
x(O) being a random variable of finite variance and which is uncorrelated with e(t).
Show that the difference y(1l(t) - yC 2 l(t) has a variance that decays exponentially to
zero as t tends to infinity.
Problem 12.7 Accuracy of PEM and hypothesis testing for an ARMA(1, 1) process
Consider the ARMA model

    y(t) + ay(t − 1) = ε(t) + cε(t − 1)

where a and c have been estimated with a PEM applied to a time series y(t). Assume that the data is an ARMA(1, 1) process

    y(t) + a₀y(t − 1) = e(t) + c₀e(t − 1)        |a₀| < 1,  |c₀| < 1,  Ee(t)e(s) = λ²δ_{t,s}

(a) What is the (asymptotic) variance of â − ĉ? Use the answer to derive a test for the hypothesis a₀ = c₀.
(b) Suggest some alternative ways to test the hypothesis a₀ = c₀.
Hint. Observe that for a₀ = c₀, y(t) is a white process.
(c) Suppose a₀ = −0.707, c₀ = 0. How many data points are needed to make the standard deviation of â equal to 0.01?
Problem 12.8 A weighted recursive least squares method
Prove the results (12.50c, d).
Problem 12.9 The controller form state space realization of an ARMAX system
Verify that (12.41) is a state space realization of the ARMAX equation (12.36).
Hint. Use the following readily verified identity:
(1
where
Problem 12.10 Gradient of the loss function with respect to initial values
Consider the approach (12.41), (12.42) for computing the prediction errors. Derive the
derivative of the loss function
N
V(8') =
2:
f:2(t,
8')
t=l
506
e(t)
Chapter 12
lei
ce(t - 1)
Assume that c is known and the prediction errors are determined as in (12.41):
x(t
+ 1)
--cx(t) - cy
E(t)
x(t)
t =
+ yet)
1,2, ... , N
Let the initial value x(l) be determined as the minimizing element of the loss function
(a) Determine the estimate x(l) as a function of the data {y(t)};;:', and c.
(b) x(1) is an estimate of e(1)- y(l) ,= -ce(O). Evaluate the mean square error
E[x(l) + ce(O)f
and show that W does not tend to zero as N tends to infinity. Hence x(l) is not a
consistent estimate of the initial value.
Problem 12.12 Choice of the input signal for accurate estimation of static gain
Consider the system
y(t) + aoy(t - 1)
bou(t
1)
+ e(t)
Ee(t)e(s)
= ,,20 t, s
,
s=
A
bu(t - 1) + c(t)
ao) can be estimated as
----
1 + ii
02
(Note that in both cases Eu 2 (t) =0 0 2 .) Which case will give the smallest variance"
Evaluate the variances numerically for
ao
-0.9, bo = 1, '),
= 1,
0 =
.~ _ S =~__ ~
1 + &
1 + ao
__ ---bo(a - ao) +
-'
(1
(b -
,
a)(1
bo)(l
ao)
ao)
=-
8-
ao)
,5 can be
y(t)
1 __ -q--f e(t)
yeO)
where e(t) is white noise with zero mean and unit variance. Show that
var(y(t))
q-l)~l
y(t)
(1 _
yeO)
y(--l)
aq 1) e(t)
lat < 1
I
where e(t) is white noise with zero mean and unit variance. Show that the variance of
y(t) for large t is given by
t
var(y(t = (1 +
Hint. First show that yet)
where
1- (-a)k+1
hk
= --1 + a
dx
--axdt
+ dw
> 0)
(i)
which is a special case of (6.37), (6.38). In 0), w is a Wiener process with incrementa!
variance Edw 2 = rdt, and a is the unknown parameter to be estimated. Let x be observed
at equidistant sampling points. The discrete-time observations of x satisfy the following
difference equation (see (6.39)):
x(t + 1)
where
ax(t)
e(t)
t =
1,2, ...
(ii)
508
Chapter 12
and where h denotes the sampling interval. Note that to simplify the notation of (ii) h is
taken as the time unit.
The parameter a of (ii) is estimated by the LS method. From the estimate a of a one
can estimate a as
a=
-llog(a)
h
In
the
Rernark. See Astr()m (1969) for a further discussion of the choice of sampling interval.
Problem 12.15 Effects on parameter estimates of input variations during the sampling
interval
Consider a first-order continuous time system given by
dV + ay
-~-
dt
= ~u
(ex > 0)
(i)
Assume that the input is a continuous time stationary stochastic process given by
du
-+ yu
dt
(y
> 0)
Oi)
yet)
+ ay(t-
h) = bu(t -- h)
t =
h, 2h, ...
yet) + ay(t -- h)
= bu(t -. h)
+ (t)
Derive the asymptotic estimates of a and b and compare with the parameters of the
model in (a). How will the parameter y influence the result?
Hint. The system (i), (ii) can be written as
x = Fx -t- v
where
Problems
x(t)
(y(t)
u(t)
(-Uo
F =
~)
-y
(~)
Since vet) is white noise, the covariance function of the state vector for
given by
Ex(t + T)XT(t)
where P
509
?: 0 is
= efTp
Bibliographical notes
Some papers dealing in general terms with practical aspects of system identification are
Astrom (1980), Isermann (1980) and Bohlin (1987). Both the role of prefiltering the data
prior to parameter estimation and the approximation induced by using a model structure
not covering the true system can be analyzed in the frequency domain; see Ljung
(1985a), Wahlberg and Ljung (1986), Wahlberg (1987).
(Section 12.2) Optimal experimental conditions have been studied in depth by Mehra
(1974,1976), Goodwin and Payne (1977), Zarrop (1979a, b). Some aspects on the choice
of sampling interval have been discussed by Sinha and Puthenpura (1985).
(Section 12.3) For the use of ARIMA models, especially in econometric applications,
see Box and Jenkins (1976), Granger and Newbold (1977).
(Section 12.5) Multivariate systems may often have different time delays in the different channels from input to output. Vector difference equations which can accommodate this case of different time delays are discussed, for example, by Janssen (1987a, 1988).
(Section 12.7) The possibility of performing a number of passes through the data with an on-line algorithm as an off-line method has been described by Ljung (1982), Ljung and Soderstrom (1983), Solbrand et al. (1985). It has frequently been used for instrumental variable identification; see e.g. Young (1968, 1976).
(Section 12.8) Proofs of the results on local minima have appeared in Astrom and
Soderstrom (1974) (see also Complement C7.6), ARMA processes; Stoica and Soderstrom (1982g), multivariable MA processes; Soderstrom (1974), the 'GLS' structure;
Soderstrom and Stoica (1982) (see also Complement C7.5), output error models. Stoica
and Soderstrom (1984) contains some further results of this type related to k-step
prediction models for ARMA processes.
(Section 12.9) Robust methods that are less sensitive to outliers have been treated for linear regressions by Huber (1973), for off-line methods by Ljung (1978), and for on-line algorithms by Polyak and Tsypkin (1980), Ljung and Soderstrom (1983) and Tsypkin (1987).
(Section 12.10) The problem of testing time invariance can be viewed as a form of fault detection (test for a change of the system dynamics). See Willsky (1976) and
Basseville and Benveniste (1986) for some surveys of this field. The problem of testing
for the existence of feedback has been treated by, e.g., Bohlin (1971), Caines and Chan
(1975, 1976).
510
Chapter 12
Appendix A
    [A + BCD]{A⁻¹ − A⁻¹B[C⁻¹ + DA⁻¹B]⁻¹DA⁻¹}
        = I + BCDA⁻¹ − [B + BCDA⁻¹B][C⁻¹ + DA⁻¹B]⁻¹DA⁻¹
        = I + BCDA⁻¹ − BC[C⁻¹ + DA⁻¹B][C⁻¹ + DA⁻¹B]⁻¹DA⁻¹
        = I + BCDA⁻¹ − BCDA⁻¹ = I

which proves (A.1).
For some remarks on the historical background of this result, see Kailath (1980).
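The lemma is easily checked numerically; the following sketch (an illustration with randomly generated, well-conditioned matrices) compares the two sides of (A.1) using numpy.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 2
A = rng.normal(size=(n, n)) + n * np.eye(n)   # keep A well conditioned
B = rng.normal(size=(n, m))
C = rng.normal(size=(m, m)) + m * np.eye(m)
D = rng.normal(size=(m, n))

lhs = np.linalg.inv(A + B @ C @ D)
Ainv = np.linalg.inv(A)
rhs = Ainv - Ainv @ B @ np.linalg.inv(np.linalg.inv(C) + D @ Ainv @ B) @ D @ Ainv
print(np.allclose(lhs, rhs))    # True
```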
The matrix inversion lemma is closely related to the following result on the inverse of
partitioned matrices.
Lemma A.2
Consider the matrix
(~ ~)
(A.2)
where A and C are square matrices. Assume that A and C - DA ~j 13 are nonsingular.
Then
512
Appendix A
(A.3)
(A
D
0) + (.
(D~-l ~) + (
B) { O
(A -O
I
C
A I-1 B) (C - DA-1B)-1(-DA- 1
C - DA-IB
(D~-I ~) + C)(-DA-
I)}
)(C - DA-1B)-1(-DA-- 1 1)
J)
G~)
II
(0o 0)' + (
C- 1
-C-ID
)(A _ BC-1D)-I(I
-BC- I )
(A.4)
(A.S)
II
The next result concerns the rank and positive (semi) definiteness of certain symmetric
matrices.
Lemma A.3
Consider the symmetric matrix
(:T
~)
(A.6)
rank S
= m
(A.7)
SectionA.l
where Xl is (n 11) and X2 is (m 11). If S is positive definite this form is stricti y posi tive for all
(x T xI) =1= O. If S is positive semidefinite the quadratic form is positive (;::::0) for all
(xT xI). Set
X2
+ C-'BTXl
(xT
(xT
xI)s
(:~)
= 0
    x₁ᵀ(A − BC⁻¹Bᵀ)x₁ = 0

Thus the solutions to the original equation span a subspace of dimension n − rank(A − BC⁻¹Bᵀ). Equating the two expressions for the dimension of the subspace describing the solution gives

    n + m − rank S = n − rank(A − BC⁻¹Bᵀ)
which trivially reduces to the first equality in (A.7). The second equality is proved in a
l1li
similar fashion.
Corollary. Let A and B be two positive definite matrices and assume
A;::::B
Then
A-I :os; B- 1
Proof. Set
(~ B~l)
514
Appendix A
A - (B-1)-1
B- 1
A-I
[Ez 2(t)z Tet)] - l [ EZ2(t)zICt)][ EZ1 (t)z i'(t)]-l ~ [Ez 1(t)z T(t)
(A.S)
(A.9)
Proof. Clearly,
T
E ( Z1(t) (Zj(t)
Z2(t)
T
Z2(t
j'I,
(Zll
Z12)
~T
Z 12 Z22
~O
ZI/
Zll - Z12Z22IZT2
Set M
III
SectionA.l
where A and D are square matrices. Assuming the inverses occurring below exist,
det P
(A. 10)
Proof. A proof will be given of the first equality only, since the second one can be
shown similarly. Assume that A is nonsingular. Then
det P
{pG
det
det(A
C
A;lB)}
1)
D - CA- B
II
    det(Iₙ + AB) = det(I_m + BA)        (A.11)
Proof. Set
s=
I"
--B
det S
det(Im
BA) = det(In
AB)
II
Corollary 2. Let P be a symmetric (nln) matrix and let Pi denote its upper left (iIi)
block. Then the following properties are equivalent:
"*
Proof. First consider the implication (i) => (ii). Clearly P > 0 => X TpX > 0 all x O. By
specializing to x-vectors where all but the first i components are zero it follows that
Pi> O. This implies that all eigenvalues, say 'A/PJi=l, of Pi are strictly positive. Hence
i
det Pi =
Il P'j(P
i )}
> 0
1, ... , n
j=l
Next consider the implication Oi) => (i). Clearly PI = det PI> O. The result will follow
if it can be shown that
Pk
=> P k + 1
>
516
(P~
b
Appendix A
b)
e
PHI
> 0, which
II
The final lemma in this section is a result on Cholesky factors of banded block matrices.
Lemma A.6
Let P be a symmetric, banded, positive definite matrix
RI
sF
p=
51
R2
sT
S2
R,
0
(A.12)
S",-1
S;~'_I
Rill
where all Si are lower triangular matrices. Then there is a banded Cholesky factor,
i.e. a matrix L of the form
LI
QI
o
L2
Q2
L =
L]
(A.l3)
o
with all Li lower triangular and nonsingular and a!l Qi upper triangular matrices, and
where L obeys
(A.14)
Proof. First block-diagonalize P as follows. Let Al = R I . Clearly Al > 0. Next set
o
J
SectionA.l
L'l. j
L'l. 2
S2
si
R3
Sm~l
where
L'l. 2
sT L'I.]ISj
R2 ~~
Since P > 0 it holds that L'l. 2 > O. Proceeding in this way for k
I
1, ... , m - 1 with
o
I
(block row k
-S1L'1. k 1 I
+ 1)
o
I
one gets
By construction all the matrices {L'l. k } are hence symmetric and positive definite.
Next the banded Cholesky factor will be constructed. First standard Cholesky factorization of {L'l.d is performed: Introduce lower triangular matrices {Ld satisfying
k = 1, ... , m
Clearly all the {Ld matrices are nonsingular by construction. Then take L as in (A.13)
with
k=l, ... , m - l
Since Lk, Sk and Lkl are lower triangular, it follows that Qk are upper triangular. The
matrix L so constructed is therefore a banded, lower triangular matrix. It remains to
verify that L is a Cholesky factor, i.e. LLT = P. Evaluating the blocks of LLT it is easily
found that
(LLT) I I
= LILi =
L'I.,
P jj
Sl-lL~~L~\Si-l
S;r:-1L'l.i~-\Si-l
+ LiLT
+ L'l. i = Ri
Pii
2, ... , m
518
Appendix A
l
(LLT)i+j,i =
0 =
f>
Pi+j,i
i = 1, ... , m -
1, ... , m .- 1
rJ
Ll
.
\ 0
The Euclidean spaces ℛᵐ and ℛⁿ can be described as the direct sums

    ℛᵐ = 𝒩(A) ⊕ ℛ(Aᵀ)        ℛⁿ = ℛ(A) ⊕ 𝒩(Aᵀ)        (A.15)
GD
X =
C)
G)
SectionA.2
519
In this case the matrix A is singular and has rank 1. Its nullspace, JY(A), is spanned by
the vector (1 -If while the orthogonal complement is spanned by (1 I)T. Examine
now how the above vector x can be written as a linear combination of these two vectors:
which gives u]
U2
= 0.5. Hence
In a similar way it is found that the range g( (A) is spanned by the vector (1 2f and
its orthogonal complement, g( (A )-1, by the vector (2 -I?'. To decompose
the given vector b, examine
which gives ~l
0.6, ~2
0.2. Hence
( ' 0.4 )
-0.2
Lemma A.7
For any (n|m) matrix A,

    ℛ(A)⊥ = 𝒩(Aᵀ)        (A.16)

Proof. The result follows from the chain of equivalences

    x ∈ ℛ(A)⊥  ⟺  xᵀz = 0, ∀z ∈ ℛ(A)  ⟺  xᵀAy = 0, ∀y ∈ ℛᵐ
              ⟺  yᵀAᵀx = 0, ∀y ∈ ℛᵐ  ⟺  Aᵀx = 0  ⟺  x ∈ 𝒩(Aᵀ)        ■
Lemma A.8
The restriction of the linear transformation A: ℛᵐ → ℛⁿ to ℛ(Aᵀ) → ℛ(A) is one-to-one and has an inverse.

Remark. It follows that the spaces ℛ(A) and ℛ(Aᵀ) have the same dimension. This dimension is equal to the rank of the matrix A. In particular,

    rank A = rank Aᵀ

Next consider the linear system of equations

    Ax = b        (A.17)
Lemma A.9
Consider the system (A.17).
= b] +
b2
Here AX2 and b l belong to g?, (A) while b2 belongs to JV(A 1) = g?, (A )..L. Clearly
solutions exist if and only if b2 = O. Hence it is required that bEg?, (A). Since b is
arbitrary in g?,n, for a solution to exist we must have 01; (A) = 0?", i.e. rank A = n (see
remark to Lemma A.S).
According to Lemma A.S, the solution is unique if and only if Xl = O. This means
precisely that c/V(A) = {O}, which is equivalent to rank A = m. (See remark to
Lemma A.S.)
III
The pseudoinverse (also called the Moore- Penrose generalized inverse) of A is defined
as follows.
Definition A.I
The pseudoinverse A t of A is a linear transformation At: 0? n ~ g?, m such that
(i) X
(ii) x
E
E
III
Remark. A! is uniquely defined by the above relations. This follows since it is defined
for all x E 01;n (see (A.15. Also the first relation makes perfect sense according to
Lemma A.S. Note that from the definition it follows that
Section A.2
all x
= gil
521
(A).1. and of
E [l!ln
which can be used as a definition of A-I for nonsingular square matrices. In fact, if A
is square and nonsingular the above definition of A t easily gives At = A -I, so the
pseudoinverse then becomes the usual inverse.
..
The description (A.15) of ℛᵐ and ℛⁿ as direct sums and the properties of the pseudoinverse are illustrated in Figure A.1.
Now consider linear systems of equations. Since they may not have any exact solution a least squares solution will be considered. Introduce for this purpose the notation

    V(x) = ‖Ax − b‖² = (Ax − b)ᵀ(Ax − b)        (A.18)

and consider the vector

    x̂ = A†b        (A.19)
Lemma A.IO
X2) -
522
Appendix A
= m.
Then
Ai = (A'lA)-IA T
(A.20)
= AT(AA1)-1
(A.21)
Proof. It will be shown that (A.20), (A.21) satisfy Definition A.1. For part (i),
AlA = (ATA)-IA'lA = 1
= 0 and
hence
(A TA)-IA 1'x = 0
which completes the proof of (A.20). For part (ii), note that an arbitrary element x of
g7l (A 1) can be written as x = ATy. Hence
AIAx = AT(AAT)-IAATy = ATy = X
II
Lemma A.12
(i)
(ii)
(iii)
(iv)
+ X2
E g71m.
= AiAx2 = Xz
Then
= b l + b2
E g7l n.
Then
= AAib l = b l
II
Lemma A.13
Let A be an (n|m) matrix. Then A can be factored as

    A = UΣVᵀ        (A.22)

where U is an (n|n) orthogonal matrix, V an (m|m) orthogonal matrix, and

    Σ = ( D  0 )
        ( 0  0 )        (A.23a)

    D = diag(σ₁, ..., σ_r)        σ₁ ≥ σ₂ ≥ ... ≥ σ_r > 0        (A.23b, c)

and where r satisfies 0 ≤ r ≤ min(n, m).
Proof. The matrix ATA is clearly nonnegative definite. Hence all its eigenvalues are
. . or zero. D eno t e th em b Y 01,
2 02,
2 ... , Om
2 were
h
posItIve
01 >
-- 02 >
-- ... 0,. > 0 -- 01'+1 -... = Om' Let v 1, V2, . . . , V m be a set of orthonormal eigenvectors corresponding to these
eigenvalues and set
2
AVi=OiVi
ATAv;=O
i=l, ... ,r
i=r+l, ... ,m
(A.24a)
A T AV2 = 0
(A.24b)
which gives
AV2 = 0
(A.24c)
= AV1D- 1
(A.2Sa)
according to (A.24a). Hence the columns of U1 are orthonormal vectors. Now 'expand'
this set of orthonormal vectors by taking U as an (nln) orthogonal matrix
524
Appendix A
Then
(A.25b)
Next consider
=L
using (A.25a), (A.24a), (A.24c), (A.25b), (A.23a). Premultiplying this relation by V
and postmultiplying it by VT finally gives (A.22).
II
Definition A.2
The factorization (A.22) is called the singular value decomposition of the matrix A. The
elements OJ, ... , Or and the possibly additional min(m, n) - r zero diagonal elements of
L are called the singular values of A.
II
Remark. It follows by the construction that the singular value decomposition (A.22)
can be written as
(A.26)
The matrices on the right-hand side have the following dimensions: VI is (nlr), D is (rlr)
(and nonsingular), V(' is (rim).
II
Using the singular value decomposition one can easily find the pseudoinverse. This is
done in the following lemma.
Lemma A.14
Let A be an (nlm) matrix with a singular value decomposition given by (A.22)-(A.23).
Then the pseudoinverse is given by

    A† = VΣ†Uᵀ        (A.27a)

where

    Σ† = ( D⁻¹  0 )
         ( 0    0 )        (A.27b)
Proof. The proof is based on the characterization of Arb as the uniquely given shortest
Section A.2
vector that minimizes Vex) = IIAx - bl1 2 (cf. Lemma A.lO). Now set Y = V 1 'x, c = UTb.
Since multiplication with an orthogonal matrix does not change the norm of a vector,
= IlLY - cl1 2 =
~ {OiYi - Ci}2
i=1
+ ~
cT
;=r+1
= c/o;
1, ... , r
Yi arbitrary, i = r + 1, ... , m
Since
Yi
IIYII
Ilxll,
y, must be characterized by
i=r+1, ... ,m
which means
y=
LtC
Vy
Since also i
VLtC = VLtUTb
to
be
Remark. Using the singular value decomposition (SVD) one can give an interpretation
of the spaces fiCA), ~(AT), etc. Change the basis in ~n by rotation using U and
similarly rotate the basis in '?Am using V T . The transformation A is thereafter given by
the diagonal matrix L. The number r of positive singular values is equal to the rank of A.
In the transformed ~m space, the first r components correspond to the subspace :?l (AT)
while the last m - r refer to the nullspace fiCA). Similarly in the transformed space
~ n, the first r components describe ~ (A) while the last n - r are due to .A/(AT) ...
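The construction of Lemma A.14 is essentially what numerical libraries implement. The following sketch (an illustration) computes the SVD of a rank-deficient matrix, forms A† as in (A.27), and checks the result against numpy's built-in pseudoinverse and the minimum-norm least squares characterization of (A.19).

```python
import numpy as np

rng = np.random.default_rng(0)
# A rank-2 matrix of size 5x4
A = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 4))
b = rng.normal(size=5)

U, s, Vt = np.linalg.svd(A)               # A = U diag(s) V^T, singular values decreasing
r = np.sum(s > 1e-10)                     # numerical rank
Sigma_pinv = np.zeros((A.shape[1], A.shape[0]))
Sigma_pinv[:r, :r] = np.diag(1.0 / s[:r])
A_pinv = Vt.T @ Sigma_pinv @ U.T          # (A.27a)

print(np.allclose(A_pinv, np.linalg.pinv(A)))          # True
x = A_pinv @ b                            # minimum-norm least squares solution, cf. (A.19)
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(A @ x, A @ x_ls))                    # same residual
print(np.linalg.norm(x) <= np.linalg.norm(x_ls) + 1e-12)
```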
An alternative characterization of the pseudoinverse is provided by the following lemma.
Lemma A.IS
Let A be a given (n 1m) matrix. Consider the following equations, where X is an (m In)
matrix:
AXA = A
XAX=X
(AX)T
AX
(XA)'f
XA
(A.28)
At,
ULV T , and
Appendix A
LIT = L
flY=Y
yTLT = LY
(A.29)
Next partition L as
where D is square and nonsingular (cf. (A.23a. If A is square and nonsingular all the 0
blocks will disappear. If A is rectangular and of full rank two of the 0 blocks will
disappear. Now partition Y as:
Y12)
Yn
Using these partitioned forms it is easy to deduce that (A.29) becomes
(D~lD ~) = (~ ~)
(A.30a)
(A.30b)
(A.30c)
DYi'I) = (Yu D
( DY~
o
0
Y21 D
00)
(A.30d)
(cf. (A.27a.
Remark 1. The result can also be proved by the use of the geometric properties of the
pseudoinverse as given in Definition A.1, Figure A.1 and Lemma A.12.
..
Remark 2. The relations (A.28) are sometimes used as an alternative definition of the
pseudoinverse .
..
A Householder transformation matrix is given by

    Q = I − 2wwᵀ        (A.31)

where w is a vector of unit length, ‖w‖ = 1.
Lemma A.16
A Householder matrix is symmetric and orthogonal. The transformation x
geometrically a reflection with respect to the plane perpendicular to w.
Qx means
=Q
It is orthogonal since
QQT = Q2
= (I
1 - 4wwT
+ 4WWTWW T = 1
Qx = x - 2W(WTX)
A geometric illustration is given in Figure A.2.
FIGURE A.2 Illustration of the reflection property (Qx is the reflection of x with respect to the plane perpendicular to w).
528
Appendix A
o
c
s
1
Q=
-s
(A. 32)
c
SectionA.3
The next lemmas show how the elementary orthogonal matrices can be chosen so that a
transformed vector gets a simple form.
Lemma A.IS
Let x be an arbitrary vector. Then there is a Householder transformation such that
(A.33)
Ilxll.
Here, A =
(1) (1 -
"0
= QA
..
= A
2wt)
.
-2W1W2
..
0-2W1Wn
Therefore
2, ... , n
Further, A is given by
xTx
so A =
Ilxll.
X'QTQX
1.'(1 0 ... 0)
In particular,
Wj
(i) ~
A'
Lemma A.I9
Let x be an arbitrary vector of dimension n. Then there exists an orthogonal matrix of the
form
(A. 34)
with {Q i} being Given's rotation matrices such that
530
Appendix A
(A.35)
Here, A =
Ilxll.
Xl
* O. Then for
Ql
choose i
= 1,
j = n and take
[xT + X~]1/2
which gives
c
1
1
For Q2 take i
1, j
-.I'
and
= n -
Xii)
= [X\I)2
+ x~12W/2
Then
[X (I)2
1
X(I)2]1/2
n--I
X2
1
-s
= 1, j =
(k-I)
XI
+ 1 - k and
J
giving
X(k-I)
C = -[-;-(k-;---'I-;-')2~+
(k - I )2] 112
XI
Xj
[ XI2
X~-k+1
+
X2
Xn--k
...
+ Xn2]112
o
o
Section A.3
For k
n - 1,
The stated result now follows since the product of orthogonal matrices is an orthogonal
matrix.
0. (Should X = 0, {Qi} can be chosen arbitrarily.)
Assume next that Xl = 0, but x
Say Xk 0. Then first permute elements 1 and k of X by taking i = 1, j = k, and using the
previous rules to obtain c and s. When thereafter successively zeroing the elements in the
vector x, note that there is already a zero in the kth element. This means that X can be
transformed to the form (A.35) still in (n -- 1) steps. (This case illustrates the principle of
using pivot elements.)
III
The two previous lemmas will now be extended to the matrix case, effectively describing
the QR method.
Lemma A.20
Let A be an (n|m) matrix, n ≥ m. Then there exists an orthogonal matrix Q such that

    QA = ( R )
         ( 0 )        (A.36)

where R is an (m|m)-dimensional upper triangular matrix and 0 is an ((n − m)|m)-dimensional zero matrix.
QrnQrn-l ... Ql
Let
First choose QI (using for example Lemma A.18 or Lemma A.19) such that
which gives
QJA ~ (all) ... a~)
Then choose Q2 so that it leaves the first component unchanged but else transform a~[) as
Appendix A
o
Proceeding in this way,
a (l)
II
QA
( I)
(I)
a 21
..
ami
(2)
a22
...
(2)
am2
a0n)
rnfn
o
which completes the proof.
II1II
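Lemma A.20 is the basis of the QR method for least squares problems (see Section A.4). The sketch below (an illustration) triangularizes an overdetermined system with numpy's QR routine and solves the resulting triangular system.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
n, m = 8, 3
A = rng.normal(size=(n, m))
b = rng.normal(size=n)

Q, R = np.linalg.qr(A)                      # 'reduced' QR: A = Q R, R (m|m) upper triangular
x = solve_triangular(R, Q.T @ b)            # LS solution from the triangular system
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # True
```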
Ilxll will
IIAII ~
sup IIAxl1
nCo Ilxll
(A. 38)
II
IIAxl1
IIAllllxl1
all x
(A.39)
The next lemma establishes that the introduced norm satisfies the usual properties
((i)-(iii) below) of a norm.
Lemma A.21
Let A, Band C be matrices of dimensions (n 1m), (nlm), (mlp) respectively and let A be a
scalar. Then
(i)
(ii)
(A.40a)
(A.40b)
SectionA.4
(iii)
(iv)
533
(A.40c)
(A.40d)
IIA
BII -,
- :~~
IIAx + Bxll
Ilxll
::::; sup
#0
IIAxl1 +
Ilxll
6
=
sup
x*o
lJ!3xll
Ilxll
IIAII + IIBII
III
Lemma A.22
IIAII
= 1.
1 by Definition A.S.
III
The following lemma shows the relationship between the singular values of a matrix and
its norm as previously defined. Some related results will also be presented. In what
follows A will be an (n 1m )-dimensional matrix with singular values
01 ;?: 02 ;?: ... ;?: Or
> 0=
0r+l
= ... =
0min(m.n)
Lemma A.23
The matrix norms of A and A t are given by
IIAII = 01
IIA til = 1/O
(A.41a)
(A.41b)
IIAII
IIAxl1
:~~ Txlr
sup
y*O
IILyl1
IlVyll
:~~
= sup
y*()
IluLV'IXII
Ilxll
IILyl1
Ilyll
VTx.
By Definition A.S,
534
Appendix A
Note that the equalities hold if Yl = 1 and Yi = for i = 2, ... , m. The norm of A is thus
equal to the largest singular value of A. Since At = V}:t U T , the pseudoinverse A t has
singular values
lIo b
... ,
lion 0, ... , 0
The largest of these is liar which then must be the norm of A'.
Lemma A.24
Let A be an (nlm) matrix of rank m and let x
Ilxll
Ilxll
~
;::=
oillbil
am Ilbll
(A.42a)
(A.42b)
Proof. The first inequality, (A.42a), is immediate from (A.41a) and (A.39). To verify
(A.42b) consider the following calculation with c = VTb:
..
Definition A.6
The condition number of a matrix A is given by

    C(A) = ‖A‖ ‖A†‖        (A.43)
The next topic concerns the effect of rounding errors on the solution of linear systems of
equations. Consider a linear system of equations
    Ax = b        (A.45)

where A is an (n|n) nonsingular matrix. When solving (A.45) on a finite word length machine, rounding errors inherently affect the computations. In such a case one cannot expect to find the exact solution x of (A.45). All that can be hoped for is to get the exact solution of a (fictitious) system of equations obtained by slightly perturbing A and b in (A.45). Denote the perturbed system of equations by

    (A + δA)(x + δx) = b + δb        (A.46)
Lemma A.25
Consider (A.45) and (A.46) and assume that

    C(A)‖δA‖/‖A‖ < 1        (A.47)

Then

    ‖δx‖/‖x‖ ≤ [C(A)/(1 − C(A)‖δA‖/‖A‖)] {‖δA‖/‖A‖ + ‖δb‖/‖b‖}        (A.48)
Ox
= A ~lob - A-1oA(x
+ ox)
Iloxll
~
~
and hence
(1 -
IIA .-l/l/loAII)lIox/l
IIA -lll{IIObII +
1-
IIA -l/l/lOAII
/loAIII~I/)
(A.49a)
now)
= 1 - C(A)lIoAII/IIAIl
~ IIAIIIIA~lllnf~l" + /l1~n
which easily can be reformulated as (A.48).
II
In loose terms, one can say that the relative errors in A and b can be magnified by a
factor of C(A) in the solution to (A.46). Note that the condition number is given by the
problem, while the errors IloAIl and IIObII depend on the algorithm used to solve the linear
system of equations. For numerically stable algorithms which include partial pivoting the
relative errors of A and b are typically of the form
536
IlbA11 =
liA1f
Appendix A
Ylf(n)
where Yl is the machine precision and f(n) is a function that grows moderately with the
order n. In particular, the product YlC(A) gives an indication of how many significant
digits can be expected in the solution. For example, if Yl = 10- 7 and etA) = 10 then
YlC(A) = 10- 6 , and 5 to 6 significant digits can be expected; but if C(A) is equal to 104
then YlC(A) == 10- 3 and only 2 or 3 significant digits can be expected.
The following result gives a basis for a preliminary discussion of how to cope with the
more general problem of a least squares solution to an inconsistent system of equations
of the form (A.45).
Lemma A.26
Let A be a matrix with condition number C(A) and Q an orthogonal matrix. Then

    C(AᵀA) = [C(A)]²        (A.50a)
    C(QA) = C(A)        (A.50b)

Proof. Let A = UΣVᵀ. Then AᵀA = VΣᵀΣVᵀ. The singular values of AᵀA are thus σ₁², σ₂², .... The result (A.50a) now follows from (A.44). Since QU is an orthogonal matrix the singular value decomposition of QA is given by

    QA = (QU)ΣVᵀ

Hence, QA and A have the same singular values and (A.50b) follows immediately.        ■
Lemma A.26 can be used to illustrate a key rule in determining the LS solution of an
overdetermined system of linear equations. This rule states that instead of forming and
solving the corresponding so-called normal equations, it is much better from the point of
view of numerical accuracy to use a QR method directly on the original system. For
simplicity of illustration, consider the system of equations (A.45), where A is square and
nonsingular,
Ax
(A51)
Using a standard method for solving (A.51) the errors are magnified by a factor C(A) (see (A.48)). If instead the normal equations

    AᵀAx = Aᵀb        (A.52)

are solved then the errors are magnified by a factor [C(A)]² (cf. (A.50a)). There is thus a considerable loss in accuracy when numerically solving (A.52) instead of (A.51). On the other hand if the QR method is applied then instead of (A.51) the equation
QAx = Qb
is solved, with Q orthogonal. In view of (A.50b) the error is then magnified only by a
factor of C(A).
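The difference between solving (A.51) directly and forming the normal equations (A.52) is easy to demonstrate numerically. The sketch below is only an illustration; the polynomial test problem is an arbitrary choice, and the QR solution is typically (though not always dramatically) more accurate.

```python
import numpy as np

# A mildly ill-conditioned least squares problem
n, m = 50, 6
t = np.linspace(0, 1, n)
A = np.vander(t, m, increasing=True)        # polynomial basis: columns 1, t, ..., t^(m-1)
x_true = np.ones(m)
b = A @ x_true

print("C(A)     =", np.linalg.cond(A))
print("C(A^T A) =", np.linalg.cond(A.T @ A))   # roughly the square of C(A), cf. (A.50a)

# Normal equations versus QR
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)
print("normal-equation error:", np.linalg.norm(x_ne - x_true))
print("QR error:             ", np.linalg.norm(x_qr - x_true))
```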
In the discussion so far, A in (A.51) has been square and nonsingular. The general case
of a rectangular and possibly rank-deficient matrix is more complicated. Recall that
Lemma A.25 is applicable only for square nonsingular matrices. The following result
applies in the general case.
Lemma A.27
Let A be an (n 1m )-dimensional matrix of rank m. Also let the perturbed matrix A
have rank m. Consider
+ oA
(A.53a)
x + Ox = (A + oA) t (b + Ob)
(A.53b)
Then
I ox II
W.;::;
C(A)
{[
~
lib -- Axil] lIoAIl
1 - C(A)lIoAIliIiAIl
1 + C(A) IIAllllxll
(A.54)
IIbll IIObII}
+ IIAllllxl1 "Wf
Proof. See Lawson and Hanson (1974).
II
Remark. Note that for A square (and nonsingular), lib - Axil = 0 and IIAllllxll ~
Hence (A.54) can in this case be simplified to (A.4S).
Ilbli.
II
= P.
II
(A.55)
538
Appendix A
(A.56)
Proof. Let Abe an arbitrary eigenvalue of P and denote the corresponding eigenvector
bye. Then Pe = Ae. Using P = p2 gives
Ae
= Pe
p 2e
= P(Ae)
= APe = A2e
Since e =I=- it follows that A(A - 1) = 0, which proves part (i). Let P have eigenvalues AJ,
... , An. The rank of P is equal to the number of nonzero eigenvalues. However,
n
tr P
Ai
is, in view of part (i), also equal to this number. This proves part (ii). Part (iii) then
follows easily since one can always diagonalize a symmetric matrix. To prove part (iv)
let x be an arbitrary vector in f?ll n. It can then be uniquely decomposed as x = x I + X2
where XI lies in the nullspace of P, JV(P), and X2lies in the range of P, f?ll (P) (d. Lemma
A.7). The components XI and X2 are orthogonal (xT X2 = 0). Thus PXl = 0 andx2 = pz for
some vector z. Then Px = PXI + PX2 = PX2 = p 2 z = pz = X2, which shows that P is the
orthogonal projector of f?ll n onto f?ll (P).
II
Example A.2 An orthogonal projector
Let F be an (n|r)-dimensional matrix, n ≥ r, of full rank r. Consider

    P = Iₙ − F(FᵀF)⁻¹Fᵀ        (A.57)

Then P² = P, and

    tr P = tr Iₙ − tr F(FᵀF)⁻¹Fᵀ = n − tr (FᵀF)(FᵀF)⁻¹ = n − tr I_r = n − r

This shows that P has rank n − r. P can in fact be interpreted as the orthogonal projector of ℛⁿ on the null space 𝒩(Fᵀ). This can be shown as follows. Let x ∈ 𝒩(Fᵀ). Then Px = x − F(FᵀF)⁻¹Fᵀx = x. Also if x ∈ ℛ(F) (which is the orthogonal complement to 𝒩(Fᵀ); see Lemma A.7), say x = Fz, then Px = x − F(FᵀF)⁻¹FᵀFz = x − Fz = 0.        ■
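The projector of Example A.2 and the idempotent-matrix properties discussed above are easily verified numerically, as in the following sketch (an illustration with a randomly generated F).

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 6, 2
F = rng.normal(size=(n, r))                         # full column rank (almost surely)
P = np.eye(n) - F @ np.linalg.inv(F.T @ F) @ F.T    # (A.57)

print(np.allclose(P @ P, P))                        # idempotent
print(np.isclose(np.trace(P), n - r))               # tr P = rank P = n - r
x = rng.normal(size=n)
print(np.allclose(F.T @ (P @ x), 0.0))              # P x lies in the nullspace of F^T
print(np.allclose(P @ (F @ rng.normal(size=r)), 0.0))   # the range of F is annihilated
```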
SectionA.5
r-m
= In - F1 (FTFI )-IFT
P2 = In - F(FTF)-IFT
PI
(A.59)
PI - P2
=
=
0)
vCn~r
0,. V
0
un-'
(A. 60)
0)
Ir- m o
0
Om
VT
(A.61)
Proof. The diagonalization of P2 in (A.60) follows from Lemma A.28 part (iii). Let
V
(VI
V2 )
ll-r
The matrices VI and V 2 are not unique. Let VI and V 2 denote two matrices whose
columns are the orthonormal eigenvectors of P2 associated with A = 1 and, respectively,
A = O. Then set VI = VI and V 2 = V 2 W where W is some arbitrary orthogonal matrix.
Clearly (A.60) holds for this choice. The matrix W will be selected later in the proof.
First note that
P2(I - PI)
= [I - F(FTF)-lFT]Fl(F}Flr-lF}
F2)(~~~1 ~~
,2122
(FT FI)-IFT
21
")--lFT
(F'I F2 ) ( ~)(F} F1 )-lFT
= F1 (F TF
1 1
"1'-
P2 (P 1
Pz) = P2 P j
P~ = P2 P I
P2 = -P2 U - PI) = 0
and
P 1 P2
(PI - PZ )2
= (P2 P )T = pi = P2
= PI - P I P2 - P2 P I +
j
P~ = PI - P I P2
P2 P I + P2
PI -- P2
540
Appendix A
rank(P, - P 2 )
tr(P, - P2 ) = tr PI - tr P2
(n - m) - (n - r)
r - m
Now set
(A.62)
where the matrices A, B, C have the dimensions
A: n - r)l(n - r)),
B: n - r)lr)
C: (rlr).
and
VTCV
(1 10- 111
r-m
)}
r - m
Om}
m
P~ = U
-
(1 0) (0 0) (1 0)
0
VI
UT
=U
(0 0)
()
UT
III
ao
al ...
alia
+ nb)l(na +
nb is defined as
0
} nh cow,
J'(A, B)
an
al ...
bo b l ... b nb
ana
(A.63)
0
} na rows
bo b l ... b nb
l1li
Sylvester matrices
Section A.6
541
Lemma A.30
The rank of the Sylvester matrix 𝒮(A, B) is given by

    rank 𝒮(A, B) = na + nb − n        (A.64)

where n is the number of common zeros of the polynomials A(z) and B(z).
=0
(A.65)
Let
B(z)A(z) + A(z)B(z) == 0
(A. 66)
A(z) == Ao(z)L(z)
B(z) == BoCz)L(z)
L(z) == loz" +
+ ... + In
[IZ,,-l
= na - n, deg Bo
nb - n.
B(z)Ao(z) == -A(z)Bo(z)
Since both sides must have the same zeros, it follows that the general solution can be
written as
A(z) == Ao(z)M(z)
B(z) == -Bo(z)M(z)
where
M(z)
mtz,,-l
+ ... +
mn
has arbitrary coefficients. This means that x lies in an n-dimensional subspace. However,
this subspace is A/'(./T(A, B)) (cf. (A.65)), and thus its dimension must be equal to
na + nb - rank '/(A, B) (cf. Section A.2). This proves (A.64).
III
= 0, rank '/(A, B)
Corollary 2. Consider polynomials A(z) and B(z) as in Definition A.I, and the
following matrix of dimension na + nb + 2k)i(na + nb + k), where k ~ 0:
Appendix A
} nb
ao
.7(A, B) =
b o bl
al'"
k rows
ana
(A.67)
b nb
} na + k rows
o
Then
rank .7(A, B) = na + nb + k - n
(A.68)
13(z) = zkB(z)
Clearly A (z) and 13(z) have precisely n + k common zeros (k of them are 0cated at
z = 0). Next note that
'/(A, B) :::: (.7(A, B)
0)
(na
+ nb + 2k) - (k + n)
+ nb + k - n
III
(A.69)
Lemma A.31
Assume that A, B, C and D are matrices of compatible dimensions. Then
SectionA.7
    (A ⊗ B)(C ⊗ D) = AC ⊗ BD        (A.70)
Lemma A.32
Let A and B be nonsingular matrices. Then
    (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹        (A.71)

Lemma A.33
Let A and B be two matrices. Then

    (A ⊗ B)ᵀ = Aᵀ ⊗ Bᵀ        (A.72)
Proof. By Definition A.9,
The following definition introduces the notation vec(A) for the vector obtained by
stacking the columns of matrix A.
Definition A.IO
Let A be an (min) matrix, and let ai denote its ith column.
544
Appendix A
II
Lemma A.34
Let A, Band C be matrices of compatible dimensions. Then
    vec(ABC) = (Cᵀ ⊗ A) vec(B)        (A.73)
ABc2'"
Ci
ABcn )
Further,
Clj)
b)
m ( :.
Cmj
= [cJ
2: cijAbi
i=1
@ AJ vec(B)
II
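The identity (A.73) is convenient to verify numerically, as in the following sketch (an illustration; note that vec(·) stacks the columns, so column-major ordering is used).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))
C = rng.normal(size=(2, 5))

def vec(M):
    return M.reshape(-1, order="F")       # stack the columns, cf. Definition A.10

lhs = vec(A @ B @ C)
rhs = np.kron(C.T, A) @ vec(B)            # (A.73)
print(np.allclose(lhs, rhs))              # True
```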
= 1, i
1,
Section A.8
(D
Ai)
~ log Ai
where {Ai} are the eigenvalues of S, achieves its maximum value for S = 1. Now, for
all positive A,
log A ~ A-I
with equality if and only if A = 1. Thus
log det S
tr S - n = 0
S =
(~ ~T)
[S-IJI1 ~ 1
with equality if and only if 1\1
can prove that
= O.
S,
etc., one
[S-IL ~ 1, i = 1, ... , n
with equality if and only if S
(iii) It follows from (ii) that
= 1.
But
Thus
Amax(S-I) ~ 1
with equality if and only if S
1.
II
Appendix A
Bibliographical notes
Most of the material in this appendix is standard. See for example Pearson (1974), Strang (1976), Graybill (1983) or Golub and van Loan (1983) for a further discussion of matrices. The method of orthogonal triangularization (the QR method) is discussed in Stewart (1973). Its application to identification problems has been discussed, for example, by Peterka and Smuk (1969), and Strejc (1980).
Appendix B
x* (as n
00)
l.
Xn
Xn
--?
v1/(m, P)
II
x* w.p. 1
X"
--?
x* in probability
=?
x"
--?
x* in distribution
"f4
Xn
--?
x* in mean square
There follow some ergodicity results. The following definition will be used.
Definition B.2
Let x(t) be a stationary stochastic process. It is said to be ergodic with respect to its first- and second-order moments if

    (1/N) Σ_{t=1}^{N} x(t) → Ex(t)        (B.1)
    (1/N) Σ_{t=1}^{N} x(t + τ)x(t) → Ex(t + τ)x(t)

as N → ∞.        ■
Ergodicity results are often used when analyzing identification methods. The reason is
that many identification methods can be phrased in terms of the sample means and
covariances of the data. When the number of data points tends to infinity it would then
be very desirable for the analysis if one could substitute the limits by expectations as in
(B.I). It will be shown that this is possible under weak assumptions. The following result
is useful as a tool for establishing some ergodicity results.
Lemma B.l
Assume that x(t) is a discrete time stationary process with finite variance. If the covariance function r_x(τ) → 0 as |τ| → ∞, then

    (1/N) Σ_{t=1}^{N} x(t) → Ex(t)        (N → ∞)        (B.2)
ZI(t)
G(q-I)el(t)
Z2(t)
= H(q-l)e2(t)
(B.3)
where
G(q-I) =
H(q-I) =
2: giq-i
2: g; <
i=O
i=O
2: hiq-i
2: h; <
i=O
i=O
and
e(t) = (e l
(t)
e2(t)
00
(B.4)
00
Section B.l
549
1
N
2: Zj(t)Z2(t)
00
-?
EZ j(t)Z2(t) =
Q(Jj(J2
1=1
2:
(B.6)
gihi
i=O
Proof. Define
x(t) = Z 1(t)Z2(t)
fhen x(t) is a stationary stochastic process with mean value
;=0 j=O
i=O
The idea now is, of course, to apply Lemma B.1. For this purpose one must find the
covariance function rAT) of xCt) and verify that it tends to zero as T tends to infinity.
,\Jow
rAT)
Q2(JT(J~Oi))k.l
(JT(J~Oi.k+'tOj.I+'t
(2Q2
+ l)(JT(J~]Oi.jOk.lOi.k+'t
550
Appendix B
oio~
+
[i gkgk+~] [i hlhl+~J
k=O
1=0
+ [!! - (2g2 +
l)oio~{~ g;h;gi+Thi+T]
1200
100
I~o
00
~ ~o gk~
gkgk+"[
6) gk ~
00
gJ+, =
L ~ 00.
00
gJ
0,
00
which proves that l.k=Ogkgk+, ~ as L tends to infinity. In similar ways it can be proved
that the remaining sums also tend to zero.
It has thus been established that rxCL) ~ 0, ILl ~ 00. The result (B.6) follows from
II
Lemma B.l.
The next result is a variant of the central limit theorem due to Ljung (1977c).
Lemma B.3
Consider

    x_N = (1/√N) Σ_{t=1}^{N} z(t)        (B.7)
z(t) = <p(t)v(t)
(B.8)
In (B.8), <p(t) is a matrix and vet) a vector. The entries of <p(t) and vet) are stationary, possibly correlated, ARMA processes with zero means and underlying white
noise sequences with finite fourth-order moments. The elements of <p(t) may also contain a bounded deterministic term.
Assume that the limit

    P = lim_{N→∞} E x_N x_Nᵀ        (B.9)

exists. Then

    x_N → 𝒩(0, P)  in distribution        (B.10)        ■
Section B.l
551
Lemma B.4
Let {xn} be a sequence of random variables that converges in distribution to F(x). Let
{An} be a sequence of random square matrices that converges in probability to a
nonsingular matrix A, and {b n } a sequence of random vectors that converges in probability to b. Define
(B.ll)
Then Yn converges in distribution to F(A-1(y - b.
Proof. The lemma is a trivial extension to the multivariable case of the scalar result,
III
given for example by Chung (1968, p. 85) and Cramer (1946, p. 245).
The lemma can be specialized to the case when F(x) corresponds to a Gaussian distribution, as follows.
J
x
-00
1
1
(2Jt)mI2(det(p112
exp [ -21x ,Tp- x
= det A
X
JY
-00
'J dx'i
x=A-'(y-h)
(2Jt)mI2(det(p112
= J~oo
b)T(APAT)-l(y' - b) dy'
III
In (B.ll) the new sequence Yn is an affine transformation of x". For rational functions
there is another result that concerns convergence in probability. It is given in the
following and is often referred to as Slutsky's lemma.
Lemma B.S
Let {x n } be a sequence of random vectors that converges in probability to a constant x.
Let fO be a rational function and suppose that f(x) is finite. Then f(x n ) converges in
probability to f(x).
III
The probability density function of a Gaussian random vector x ∼ 𝒩(m, P) of dimension n is

    f(x) = [1/((2π)^{n/2}(det P)^{1/2})] exp[−½(x − m)ᵀP⁻¹(x − m)]        (B.12)
The following lemma shows that the parameters m and P are the mean and the
covariance matrix, respectively.
Lemma B.6
Assume that x - JV(m, P). Then
Ex = m
(B.13)
-Jq/in .) =
Ex --
1
m (2nt12
J[+
.o/l"
P lI2 y] exp
1 P 1/2
dy + (2nt12
[I
Y
-"2 YT]
(det P) lI2 dy
dy
=m
The second integral above is zero by symmetry. (This can be seen by making the variable
substitution y ~ -y). Also
E(x - m)(x - m)T =
fq/i"
fq/i
(x - m)(x - m)Tf(x)dx
p1I2yyTp1/2
-Jq/in
Qkl -
[IT]
1
YkYt (2nt12
exp -Zy Y dy
g pl/2Qpl/2
Section B.2
553
L (2~)'"
yl
ex r [
-~Yl]dY, [&i
Yi ]
The first integral on the right-hand side gives the variance of a ~(o, 1) distributed
random variable, while the other integrands are precisely the pdf of a v,V(O, 1)
distributed random variable. Hence Qkk = 1, which implies Q = I, and the stated result
(B.13) follows.
III
Remark 1. The moment generating function evaluated at z = iw is the Fourier transform of the probability density function. From uniqueness results on Fourier transforms
it follows that every probability density function has a unique moment generating
function. Thus, if two random variables have the same moment generating function, they
will also have the same probability density function.
III
Remark 2. The name 'moment generating function' is easy to explain for scalar
processes. The expected value E[Xk] is called the kth-order moment. By series
expansion, (B.14) gives
cp(z)
[i ~:
XkJ
(B.1S)
k=()
Thus
(icp~)1
az z=o
E[Xk].
Appendix B
[~o ~! xl J
(Z1
[1 +
=E
(ZjXj
Z2X 2)
~! (ZjXj
Z2X2)2
;!
(ZIXj
= ~zlziE(XIXi).
EXlxi
+
E
Z2X2)3
[;!
+ ... ]
3 (ZjXd(Z2X2)2]
ZIZ~ in the series expansion of cp(z). As a further example, consider EXjX2X3X4 and
set Z
(ZI
cp(z)
Z2
Z3
Z4).
Then
= E[ ... + ~(ZIXj + Z2 X2 +
Ex j X2X3X4
Z3 X 3
Z4 X 4)4
+ ...
The following lemma gives the moment generating function for a Gaussian random
variable.
Lemma B.7
Assume that x
~ <-if/em,
cp(z) = exp[zTm +
~ZTpZJ
(B.16)
cp(z)
= f271H exp[zTxlf(x)dx
=
.0/1"
(2 )fl!2(d1
Jt
x exp[zTm
exp[zTm
et
1 -- m - pz) TP -I(x - m - pz )] dx
p)1I2 exp [ -2(x
~ZTpZ J
~ZTpZ J
Pz, P).
Lemma B.8
Assume that x ∼ 𝒩(m, P) and set y = Ax + b for constant A and b of appropriate dimensions. Then y ∼ 𝒩(Am + b, APAᵀ).
Section B.2
555
= exp[zTb]E[exp[(ATz)TxJ]
= exp[zTb]cpAATz)
+ iZTAPATZ]
= exp[zTb]explzTAm
~z T(APAT)Z]
exp [zT(Am + b) +
By the unique correspondence between mgfs and probability density functions, it follows
II1II
from Lemma B.7 that y ~ .%(Am + b, APA T).
Lemma B.9
Let the scalar random variables x₁, x₂, x₃, x₄ be jointly Gaussian with zero mean values. Then

    E x₁x₂x₃x₄ = (Ex₁x₂)(Ex₃x₄) + (Ex₁x₃)(Ex₂x₄) + (Ex₁x₄)(Ex₂x₃)        (B.17a)
Proof. Let e[f(z)] denote the coefficient of ZlZ2Z3Z4 in the series expansion of some
function fez). According to Remark 3 of Definition B.4, the moment EX1X2X3X4 is
-e[cp(z)]. Set Pij = Ex;x;. Then from Lemma B.7
nXtX2X3X4
= e[cp(z)]
-e[;!
zTPz
f]
Proof. Set y;
Hence
EXj X2X3 X4
Xl,
II1II
= Xi - mi
and Pij
= 0,
Ex;. Then
(P 12 P34
556
(P 14
mlm2)(P34
mlm4)(P23
m3 m4)
Appendix B
(P 13
mj m 3)(P24
m2 m4)
Remark. An extension of Lemma B.9 and its corollary to the case of complex
matrix-valued Gaussian random variables {xd 1=1 has been presented by Janssen and
Stoica (1987).
II
Lemma B.1O
Uncorrelated Gaussian random variables are independent.
j .(x ) =
(2n )"12 [
fA Pii J
II
1/2
exp [ --1
2
p..
i= 1
(Xi - mJ
fA (2nP1iJ1I2 exp [1
-2
Pii
11
~n (Xi - mi)2J
II
2]
fA fJXi)
n
where fi(xJ is the probability density function for the variable Xi' This proves the lemma.
Remark 1. Note that a different proof of the above lemma can be found in Section B.6
(see Corollary to Lemma B.I7).
II
Remark 2. An alternative proof of Lemma B.6 can now be made as follows. Set
-- m). By Lemma B.8, Y ~ ./V(O, 1). lIenee by Lemma B.lO, Yi ~ ,/f~O, 1)
and different Yi: s are independent. Since EYi =, 0, var(Yi) = 1, it holds that Ey = 0,
cov(y) = 1. Finally since x = m + p I/2y it follows that Ex = m, cov(x) = P.
II
= p- 1I2 (X
IS
said to be X2
II
Lemma B.n
Let x ~ x2 (n). Then its probability density function is
(B.I8)
Section B.2
Proof. Set x
as follows
= y"ly where y
.
d
f(x)
= dx P(y y ~ x)
557
dJ
1
. (2Jt)"12 exp
dx
[IT']
-::? y dy
yTy""x
where Sn is the area of the unit sphere in rJJl n. Continuing the calculations,
f (x )
Sn
JVx
(2Jt)nI2 dx
exp
o
[1
-2 r 2J r n-1 dr
S 1 [1]_
1)/2
III
Lemma 8.12
Let x ∼ 𝒩(m, P) be of dimension n. Then

    (x − m)ᵀP⁻¹(x − m) ∼ χ²(n)        (B.19)

Proof. Set z = P^{−1/2}(x − m). According to Lemma B.8, z ∼ 𝒩(0, I). Furthermore, (x − m)ᵀP⁻¹(x − m) = zᵀz and (B.19) follows from Definition B.5.        ■
Lemma 8.13
Let x be an n-dimensional Gaussian vector, x ~ ~.y(O, 1) and let P be an (n In)dimensional idempotent matrix of rank r. Then x T Px is x2 (r) distributed.
Proof. By Lemma A.28 one can write
P = UTJ\U
_ (Ir 0)
J\ --
XTpX = XTUTJ\UX
Y T J\y
2.: yf
;=1
III
558
Appendix B
Definition B.6
Let x₁ ∼ χ²(n₁) and x₂ ∼ χ²(n₂) be independent. Then (x₁/n₁)/(x₂/n₂) is said to be F distributed with n₁ and n₂ degrees of freedom, written as F(n₁, n₂).        ■
Lemma 8.14
Let Y ~ F(nt, n2)' Then
as n2
(B.20)
--'> 00
Proof. Set
where Xl> X2, nl and n2 are as in Definition B.6. According to Lemma B.2, X2/n2 --'> 1
as fl2 ~ 00, with probability one. Hence, according to Lemma BA, in the limit,
dis!
nlY - -
Xl ~
X (flt).
III
Remark. The consequence of (B.20) is that for large values of n2, the F(nt, fl2)
distribution can be approximated with the X2(fll) distribution, normalized by nt.
III
This section ends with some numerical values pertaining to the χ² distribution. Define χ²_α(n) by

    P(x > χ²_α(n)) = α        where x ∼ χ²(n)

This quantity is useful when testing statistical hypotheses (see Chapters 4, 11 and 12). Table B.1 gives some numerical values of χ²_α(n) as a function of α and n.
As a rule of thumb one may approximately take

    χ²_{0.05}(n) ≈ n + √(8n)
It is not difficult to justify this approximation. According to the central limit theorem, the
X2 distributed random variable X ~~?=lY7 (see Definition B.5) is approximately (for large
enough fl) Gaussian distributed with mean
n
Ex =
>. EYT =
.:....J
i~
fl
and variance
n
E(x - n? = E
fl
2: 2: YTyJ
- fl2
;=1 j=l
2: 2: (EYT)(EYT) + 2: Ey; ;= j j= I
(Fj
;=1
n2 = n(n - 1)
3n - n 2
2n
TABLE B.1 Values of χ²_α(n)

     n    α = 0.05   α = 0.025   α = 0.01   α = 0.005
     1      3.84       5.02        6.63       7.88
     2      5.99       7.38        9.21      10.6
     3      7.81       9.35       11.3       12.8
     4      9.49      11.1        13.3       14.9
     5     11.1       12.8        15.1       16.7
     6     12.6       14.4        16.8       18.5
     7     14.1       16.0        18.5       20.3
     8     15.5       17.5        20.1       22.0
     9     16.9       19.0        21.7       23.6
    10     18.3       20.5        23.2       25.2
    11     19.7       21.9        24.7       26.8
    12     21.0       23.3        26.2       28.3
    13     22.4       24.7        27.7       29.8
    14     23.7       26.1        29.1       31.3
    15     25.0       27.5        30.6       32.8
    16     26.3       28.8        32.0       34.3
    17     27.6       30.2        33.4       35.7
    18     28.9       31.5        34.8       37.2
    19     30.1       32.9        36.2       38.6
    20     31.4       34.2        37.6       40.0
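The tabulated quantiles and the rule of thumb are easy to reproduce with standard software; the following sketch (an illustration using scipy's chi-squared quantile function) prints χ²_{0.05}(n) next to the approximation n + √(8n).

```python
import numpy as np
from scipy.stats import chi2

for n in (1, 5, 10, 20):
    exact = chi2.ppf(0.95, df=n)          # chi^2_{0.05}(n): P(x > value) = 0.05
    approx = n + np.sqrt(8 * n)           # rule of thumb
    print(f"n = {n:2d}:  exact = {exact:5.2f},  n + sqrt(8n) = {approx:5.2f}")
```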
8MAP =
(B.2l)
III
560
Appendix B
= p(xI8)p(8)
(B.22)
p(x)
The conditional pdf p(xI8), with x fixed, is called the likelihood function. It can be
interpreted as giving a measure of the plausibility of the data under different parameters.
To evaluate p(8Ix) we also need the 'prior pdf' of the parameters, p(8). While p(xI8) can
be derived relatively easily for a variety of situations, the choice of p(8) is a controversial
topic of the MAP approach to estimation (see e.g. Peterka, 1981, for a discussion).
Finally, the pdf p(x) which also occurs in (B.22) may be evaluated as a marginal
distribution
p(x) =
Jo/Idime
p(x, 8)dEl
p(x, 8)
= p(xI8)p(8)
Note that while p(x) is needed to evaluate p(8Ix), it is not necessary for determining
(since only the numerator of (B.22) depends on 8).
The maximum likelihood (ML) approach to 8-estimation is conceptually different
from the MAP approach. Now x is treated as a random variable and e as unknown but
fixed parameters.
SMAP
Definition B.8
The ML estimate of 8 is
SML
(B.23)
Thus within the ML approach the value of 8 is chosen which makes the data most
plausible as measured by the likelihood function. Note that in cases where the prior pdf
has a small influence on the a posteriori pdf, one may expect that SML is close to SMAP'
~n fact,
noninformative priors, i.e. for p(8) = constant, it follows from (B.22) that
8 ML = 8 MAP .
tor
Lemma B.15
Let θ̂(x) be an unbiased estimator of θ, based on data x with likelihood function L(x, θ). Then, under weak regularity conditions,

    cov(θ̂) ≥ [E (∂ log L/∂θ)ᵀ (∂ log L/∂θ)]⁻¹ = −[E ∂² log L/∂θ²]⁻¹        (B.24)
Section B.4
(integration over
e=
(B.25)
L(x, e)dx
8)dx
(B.26)
(J}l n,
f 8(x)L(x,
o=
I
f (x, e)dx f a
f 8(x)~~(x, f 8(x) a
~~
8)dx
a log a~(x,
8)
(B.27)
= E[8(x) - eJ {J
log a~(x, 8)
(B.28)
a log
ae
~)
i
E[8(x) - e][8(x) x
8rr -
{E[8(X) _ eJ a l~~ L}
E[8(x) - e][8(x) _
8]T >~
0=
f a210ga~(X'
e)
L(x,
e)d\:
L)T(a
log
a8
f (a
e)-)L(X,
8)dx
or
_ Ea2 log L. = E (a
a8 2
log
ae
L)
I
III
562
Appendix B
cov(e) ~ r l
The right-hand side of (B.24) is called the Cramer-Rao lower bound.
E8
= y(8)
(B.30)
Then
cov(8)
>:
a8
(B.31)
08
The equality in (B.24) is still applicable. This result is easily obtained by generalizing the
proof of Lemma B .15. When doing this it is necessary to substitute (B .26) by
y(8)
8(x)L(x, 8)dx
and (B.28) by
ay(8) = E[8( ) _ 8]
a8
x
a log Lex,
a8
8)
II
<D8
+e
(B.32)
where e is Gaussian distributed with zero mean and covariance matrix ",.21. Let N be the
dimension of Y. Assume that 8 and ",.2 are unknown. The Cramer- Rao lower bound can
be calculated for any unbiased estimates of these quantities.
In this case
Section B.4
-2/...1(
2Y
N
N
2
- <p8) T (Y - <p8) - 2"
log 2lt - 2" log /...
a L(Y, 8, /...2)
a/...21og
a2
a a
a2
2
a(/...2)2Iog
L(Y, 8, /...)
1
T
/...2 (Y -- <PEl) <P
1 T
-/...2<P <p
= --
1(
)T
/...4 Y -- <p8 <P
1 (Y - <p8
. )T( Y - <p8 ) + 2/...4
N
/...6
Thus
~<pT<p
,,2
N
2,,-4
var(/...-)
2A,4
It can now be seen that the least squares estimate (4.7) is efficient. In fact, note from the
above expression for a logUa8 that under the stated assumptions the LS estimate of 8
coincides with the ML estimate.
Consider the unbiased estimate S2 of /...2 as introduced in Lemma 4.2,
Appendix B
S2
(N - n)2E(s2?
2: 2: 2: 2: eiPijejekPkle/
i=l j=l k=l 1=1
2: 2: 2: L
k
= [
Pii
= [(tr p)2
( 2) _
var s
Pkk + 2
+2
tr p]!-.4
= N - n,
(N - n)2 + 2(N - n)
(N _ n)2
~~
PikPkil',4
it follows that
14 _
14 _
I\.
I\.
N - n
This is slightly larger than the Cramer- Rao lower bound. The ratio is [!-.4/(N - n )]/[!-.4/N]
= N/(N - n), which becomes close to one for large N. Thus the estimate S2 is
asymptotically efficient.
Example B.2 Cramer- Rao lower bound for a linear regression with correlated residuals
Consider now the more general case of a linear regression equation with correlated
errors,
<1>0 + e
~ ~;fI(O,
R) with R known
(B.33)
The aim is to find the Cramer- Rao lower bound on the covariance matrix of any
unbiased estimator of e. Simple calculations give
L(Y, 0)
log L(Y, 0)
<l>8)TR-l(y - <1>0)
1
TIN
1
= -2(Y
- <1>0) R- (Y- <1>8) - 2" log 2n: - :2 log(det R)
a2
T-I
<I>
Section B.5
565
expression above for 0 log L(Y, 8)/08 that under the stated assumptions the ML estimate
of 8 is identical to the Markov estimate (4.17), (4.18).
III
(B.34)
where x is some function of y. One can regard Q(x) as a matrix-valued measure of the
extent of fluctuations of x around x = x(y). The following lemma shows how to choose x
so as to 'minimize' Q(x).
Lemma "8.16
Consider the criterion Q(x) and let
x
Then
x denote
E[xIY]
x is
(B.35)
Q(x) ~ Q(x)
for all
(B.36)
= x,
= (x -
= (x
xx
T -
xxT
x.
III
It follows from (B.36) that the conditional expectation x minimizes any scalar-valued
monotonically nondecreasing function of Q, such as tr Q or det Q. Moreover, it can be
shown that under certain conditions on the distribution p (x Iy), x minimizes many other
loss functions of the form E[J(x - x)ly]' for some scalar-valued function fO (Meditch,
1969). There are also, however, 'reasonable' loss functions which are not minimized by
X. To illustrate this fact, let dim x = 1 and define
arg
min
x
E[lx - xlly]
F(x)
E[!x - xlly]
f:oo (x -
f~oo Ix -
x)p(xly)dx +
xlp(xly)dx
{.o (x -
x}p(xly)dx
566
Appendix B
Now recall the following formula for the derivative of integrals whose limits and
integrand depend on some parameter y:
dd {JV(Y) f(x,
y
y)dx
} = JV(Y) af(xa; y ) dx +
u(y)
v'(y)f(v(y), y) - u'(y)f(u(y), y)
u(y)
pIlei) = 2p(xly)
x is uniquely defined
by
where
Rxy ) > 0
Ry
Then the conditional distribution of x given y, p (x Iy), is Gaussian with mean
i(B.37)
and covariance matrix
(B.38)
Proof.
Section B.6
567
p(x, y)
= (2n)<nx+ny)/2(det R)1/2
(B.39)
RxyR-;;l) R
I
(!J 0)
0)
= (Rx - RxyR-;;lRyx
--Ry Ryx l O R y
(B.40)
(B.4l)
and that
(B.42)
x - X)T
P- 1
(y - y?) ( 0
0) ( =;A)
R-;;l
Note that here (B.37) has been used as a definition of X. Inserting (B.4l) and (B.42) into
(B.39) gives
p(x, y) =
y)TR-;;l(y - y)
J
(B.43)
xyrp-l(x - X)]
Since the first factor in (B.43) is recognized as p(y), it follows that the second must be
p(xly). However, this factor is easily recognized as .AI(x, P). This observation concludes
the proof.
III
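Formulas (B.37) and (B.38) can be illustrated by simulation: draw jointly Gaussian samples, keep those whose y-component is close to a fixed value, and compare the empirical conditional mean and variance with the formulas. The sketch below is only an illustration; the scalar covariance numbers are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Joint covariance of (x, y), both scalars here for simplicity
Rx, Rxy, Ry = 2.0, 0.8, 1.0
R = np.array([[Rx, Rxy], [Rxy, Ry]])
mean = np.array([1.0, -1.0])

samples = rng.multivariate_normal(mean, R, size=200_000)
y0 = 0.5
mask = np.abs(samples[:, 1] - y0) < 0.02          # condition on y close to y0

x_hat = mean[0] + Rxy / Ry * (y0 - mean[1])       # (B.37)
P = Rx - Rxy / Ry * Rxy                           # (B.38)
print("empirical E[x|y]:", samples[mask, 0].mean(), " formula:", x_hat)
print("empirical var:   ", samples[mask, 0].var(),  " formula:", P)
```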
P=
It follows from (B.37) and Section B.5 that under the Gaussian hypothesis the MVE of x
given that y has occurred, x = E[xl y], is an affine function of y. This is not necessarily
true if x and yare not jointly Gaussian distributed. Thus, under the Gaussian hypothesis
the linear MVE (or the BLUE; see Chapter 4) is identical to the general MVE. As a
Appendix B
simple illustration of this fact the results of Complement C4.2 will be derived again using
the formula for conditional expectation introduced above. For this purpose, reformulate
the problem of Complement C4.2 to include it in the class of problems considered above.
Let 0 and y be two random vectors related by
o= 0 +
101
Y = <PO +
102
To determine the MVE 0 = E[OI y] of 0, assume further that 0 and yare Gaussian
distributed. Thus
It follows from Lemma B.16 and (B.37) that the MVE of 0 is given by
8=
E[OIY] =
8+
p<pT(<pp<p T
S)-l(y - <P8)
p=
P -- p<pT(S
<pp<p 1 )-1<pp
These are precisely the results (C4.2.4), (C4.2.S) obtained using the BLUE theory.
As a further application of the fact that the BLUE and MVE coincide under the Gaussian hypothesis, consider the following simple derivation of the expression of the BLUE for linear combinations of unknown parameters. Let θ denote the unknown parameter vector (which is regarded as a random vector) and let Y denote the vector of data from which information about θ is to be obtained. The aim is to determine the BLUE of Aθ, where A is a constant matrix. Since for any distribution of the data

MVE(Aθ) = E[Aθ|Y] = A E[θ|Y] = A·MVE(θ)    (B.44)

the BLUE of Aθ under the Gaussian hypothesis is simply A times the BLUE of θ.
This property is used in the following section when deriving the Kalman-Bucy filter
equations.
x_{k+1} = A_k x_k + w_k    (B.45a)
y_k = C_k x_k + v_k    (B.45b)

where A_k, C_k are (nonrandom) matrices of appropriate dimensions, and

E w_k w_p^T = Q_k δ_{k,p}
E v_k v_p^T = R_k δ_{k,p}    (B.45c-e)
E w_k v_p^T = 0
First consider the problem of determining the BLUE of x_k given the new measurement y_k. Denote this estimate by x̃_k. This problem is exactly of the type treated in Complement C4.2 and reanalyzed at the end of Section B.6. Thus, it follows for example from (C4.2.4), or (B.37), that x̃_k is given by

x̃_k = x̂_k + P_k C_k^T (R_k + C_k P_k C_k^T)^{-1} (y_k - C_k x̂_k)

Next consider the problem of determining the BLUE of x_{k+1}. Since w_k is uncorrelated with {y_k, y_{k-1}, ...} and, therefore, with (x̃_k - x_k), it readily follows from (B.44) and the state equation (B.45a) that the BLUE of x_{k+1} is given by

x̂_{k+1} = A_k x̃_k
Thus, the BLUE x̂_k of x_k and its covariance matrix P_k obey the following recursive equations, called the Kalman-Bucy filter:

x̂_{k+1} = A_k x̂_k + A_k Γ_k (y_k - C_k x̂_k)

Γ_k = P_k C_k^T (R_k + C_k P_k C_k^T)^{-1}

P_{k+1} = A_k (P_k - Γ_k C_k P_k) A_k^T + Q_k
Finally, recall from the discussion at the end of the previous section that under the
Gaussian hypothesis the BLUE is identical to the general MVE.
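A direct implementation of these recursions is straightforward. The sketch below is a minimal illustration (not from the text) with time-invariant, arbitrarily chosen matrices, run on simulated data.

```python
import numpy as np

def kalman_filter(y, A, C, Q, R, x0, P0):
    """One-step-ahead Kalman-Bucy predictor:
    x_{k+1} = A x_k + A*Gamma_k*(y_k - C x_k),  P_{k+1} = A(P_k - Gamma_k C P_k)A' + Q."""
    x_hat, P = x0, P0
    estimates = []
    for yk in y:
        Gamma = P @ C.T @ np.linalg.inv(R + C @ P @ C.T)
        x_hat = A @ x_hat + A @ Gamma @ (yk - C @ x_hat)
        P = A @ (P - Gamma @ C @ P) @ A.T + Q
        estimates.append(x_hat)
    return np.array(estimates), P

# assumed scalar example: x_{k+1} = 0.9 x_k + w_k, y_k = x_k + v_k
rng = np.random.default_rng(0)
A, C = np.array([[0.9]]), np.array([[1.0]])
Q, R = np.array([[0.1]]), np.array([[1.0]])
x, y = np.zeros((200, 1)), np.zeros((200, 1))
for k in range(1, 200):
    x[k] = A @ x[k-1] + rng.normal(scale=np.sqrt(Q[0, 0]))
    y[k] = C @ x[k] + rng.normal(scale=np.sqrt(R[0, 0]))
x_pred, P_final = kalman_filter(y, A, C, Q, R, x0=np.zeros(1), P0=np.eye(1))
print(P_final)   # steady-state prediction error covariance
```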
Let

r̂_k = (1/N) Σ_{t=1}^{N-k} y(t) y(t + k),    ρ̂_k = r̂_k / r̂_0

be sample estimators of r_k and ρ_k. Assume that r_k (and ρ_k) converge exponentially to zero as k → ∞ (which is not a restrictive assumption once the stationarity of y(t) is accepted). Since many identification techniques (such as the related methods of correlation, least squares and instrumental variables) can be seen as maps {sample covariances or correlations} → {parameter estimates}, it is important to know the properties of the sample estimators {r̂_k} and {ρ̂_k} (in particular their accuracy).
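For concreteness, a minimal computational sketch of these estimators (illustrative only; the white-noise test data are an arbitrary choice):

```python
import numpy as np

def sample_covariances(y, max_lag):
    """r_hat[k] = (1/N) * sum_{t=1}^{N-k} y(t) y(t+k), k = 0, ..., max_lag,
    and the corresponding sample correlations rho_hat[k] = r_hat[k]/r_hat[0]."""
    y = np.asarray(y, dtype=float)
    N = len(y)
    r_hat = np.array([np.dot(y[: N - k], y[k:]) / N for k in range(max_lag + 1)])
    return r_hat, r_hat / r_hat[0]

rng = np.random.default_rng(0)
r_hat, rho_hat = sample_covariances(rng.normal(size=1000), max_lag=5)
print(rho_hat)   # close to [1, 0, 0, 0, 0, 0] for white noise
```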
This section will derive the asymptotic variances-covariances

σ_{kp} = lim_{N→∞} N E(r̂_k - r_k)(r̂_p - r_p),    0 ≤ k, p < ∞

From {σ_{kp}} one can directly obtain the variances-covariances of the sample correlations,

s_{kp} = lim_{N→∞} N E(ρ̂_k - ρ_k)(ρ̂_p - ρ_p),    0 ≤ k, p < ∞

Indeed, since ρ̂_k = r̂_k/r̂_0, it follows that

s_{kp} = [σ_{kp} - ρ_k σ_{0p} - ρ_p σ_{0k} + ρ_k ρ_p σ_{00}] / r_0²    (B.46)
In the following paragraphs formulas will be derived for σ_{kp} and s_{kp}, firstly under the hypothesis that y(t) is Gaussian distributed. Next the Gaussian hypothesis will be relaxed but y(t) will be restricted to the class of linear processes. The formula for σ_{kp} will change slightly, while the formula for s_{kp} will remain unchanged. For ARMA processes (which are special cases of linear processes) compact formulas will be given for σ_{kp} and s_{kp}. Finally, some examples will show how the explicit expressions for {σ_{kp}} and {s_{kp}} can be used to assess the accuracy of {r̂_k, ρ̂_k} (which may be quite poor in some cases) and also to establish the accuracy properties of some estimation methods based on {r̂_k} or {ρ̂_k}.
Gaussian processes

Under the Gaussian assumption it holds by Lemma B.9 that

E y(t)y(t + k)y(s)y(s + p) = r_k r_p + r_{t-s} r_{t-s+k-p} + r_{t-s-p} r_{t-s+k}    (B.47)
Using (B.47), a straightforward calculation gives

N E(r̂_k - r_k)(r̂_p - r_p) = (1/N) Σ_{t=1}^{N-k} Σ_{s=1}^{N-p} [r_{t-s} r_{t-s+k-p} + r_{t-s-p} r_{t-s+k}] + ...
                          = Σ_{|τ|<N} (1 - |τ|/N) [r_τ r_{τ+k-p} + r_{τ-p} r_{τ+k}] + ...    (B.48)

where the neglected terms tend to zero as N → ∞. Since r_τ converges exponentially to zero as τ → ∞, it follows that the term in (B.48) which contains |τ|/N tends to zero as N → ∞ (cf. the calculations in Appendix A8.1). Thus,

σ_{kp} = Σ_{τ=-∞}^{∞} [r_τ r_{τ+k-p} + r_{τ-p} r_{τ+k}]    (B.49)

and, inserting (B.49) into (B.46),

s_{kp} = Σ_{τ=-∞}^{∞} [ρ_τ ρ_{τ+k-p} + ρ_{τ-p} ρ_{τ+k} - 2ρ_k ρ_τ ρ_{τ+p} - 2ρ_p ρ_τ ρ_{τ+k} + 2ρ_k ρ_p ρ_τ²]    (B.50)
The expressions (B.49) and (B.50) for σ_{kp} and s_{kp} are the desired results in the Gaussian case.
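The sums in (B.49) and (B.50) are easy to evaluate numerically once the correlation sequence is known, truncating at a lag beyond which ρ_τ is negligible. A minimal sketch (illustrative only), assuming the true correlations are available as a function of the lag:

```python
import numpy as np

def bartlett_skp(rho, k, p, max_lag=200):
    """s_kp from (B.50), truncating the infinite sum at |tau| <= max_lag.
    `rho` maps an integer lag to the true correlation rho_tau."""
    s = 0.0
    for tau in range(-max_lag, max_lag + 1):
        s += (rho(tau) * rho(tau + k - p) + rho(tau - p) * rho(tau + k)
              - 2 * rho(k) * rho(tau) * rho(tau + p)
              - 2 * rho(p) * rho(tau) * rho(tau + k)
              + 2 * rho(k) * rho(p) * rho(tau) ** 2)
    return s

# AR(1) check: rho_tau = a^|tau|; s_11 = 1 - a^2 (cf. the AR(1) example later in this section)
a = 0.9
print(bartlett_skp(lambda tau: a ** abs(tau), k=1, p=1))   # ~ 0.19
```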
Linear processes

Let y(t) be the linear process

y(t) = Σ_{k=0}^{∞} h_k e(t - k)    (B.51a)
E e(t) = 0,    E e(t)e(s) = λ² δ_{t,s}    (B.51b)

Assume that h_k converges exponentially to zero as k → ∞. Since

r_k = λ² Σ_{i=0}^{∞} h_i h_{i+k}    (B.52)

the assumption on {h_k} guarantees that r_k converges exponentially to zero as k → ∞. No assumption is made in the following about the distribution of y(t).
A key result in establishing (B.49) (and (B.50)) was the identity (B.47) for Gaussian
random variables. For non-Gaussian variables, (B.47) does not necessarily hold. In the
present case,
E y(t)y(t + k)y(s)y(s + p) = Σ_i Σ_j Σ_l Σ_m h_i h_j h_l h_m E e(t - i)e(t + k - j)e(s - l)e(s + p - m)    (B.53)

where

E e(t - i)e(t + k - j)e(s - l)e(s + p - m)
  = λ⁴ [δ_{i, j-k} δ_{l, m-p} + δ_{t-i, s-l} δ_{t+k-j, s+p-m} + δ_{t-i, s+p-m} δ_{t+k-j, s-l}]
    + (μ - 3λ⁴) δ_{t-i, t+k-j, s-l, s+p-m}    (B.54)

where μ = E e⁴(t) and the last δ equals one when all four time arguments coincide and zero otherwise. Hence

E y(t)y(t + k)y(s)y(s + p) = r_k r_p + r_{t-s} r_{t-s+k-p} + r_{t-s-p} r_{t-s+k} + (μ - 3λ⁴) γ_{s,t}    (B.55a)

where

γ_{s,t} = Σ_{i=0}^{∞} h_i h_{i+k} h_{s-t+i} h_{s-t+i+p}    (B.55b)

and where h_k = 0 for k < 0 (by convention). Note that (B.55) reduces to (B.47) since for Gaussian variables μ = 3λ⁴. From (B.53)-(B.55) it follows similarly to (B.48), (B.49) that

σ_{kp} = Σ_{τ=-∞}^{∞} [r_τ r_{τ+k-p} + r_{τ-p} r_{τ+k}] + (μ - 3λ⁴) Σ_{τ=-∞}^{∞} Σ_{i=0}^{∞} h_i h_{i+k} h_{τ+i} h_{τ+i+p}    (B.56)
The double sum in (B.56) is well defined: since h_k converges exponentially to zero (say |h_k| ≤ c|a|^k for some |a| < 1),

Σ_{τ=-∞}^{∞} Σ_{i=0}^{∞} |h_i h_{i+k} h_{τ+i} h_{τ+i+p}| ≤ const Σ_{τ=-∞}^{∞} |a|^{|τ|} < ∞

Moreover, setting j = τ + i and recalling that h_k = 0 for k < 0,

λ⁴ Σ_{τ=-∞}^{∞} Σ_{i=0}^{∞} h_i h_{i+k} h_{τ+i} h_{τ+i+p} = λ⁴ Σ_i Σ_j (h_i h_{i+k})(h_j h_{j+p}) = r_k r_p    (B.58)

Hence (B.56) can be written as

σ_{kp} = Σ_{τ=-∞}^{∞} [r_τ r_{τ+k-p} + r_{τ-p} r_{τ+k}] + (μ/λ⁴ - 3) r_k r_p    (B.59)

When (B.59) is inserted into (B.46), the terms containing (μ/λ⁴ - 3) cancel. Thus the formula (B.50) for s_{kp} continues to hold in the case of possibly non-Gaussian linear processes.
ARMA processes

From the standpoint of applications, the expressions (B.49) or (B.59) for σ_{kp} are not very convenient due to the infinite sums which in practical computations must be truncated. For ARMA processes a more compact formula for σ_{kp} can be derived, which is more convenient than (B.59) for computations. Therefore, assume that y(t) is an ARMA process given by (B.51), where

Σ_{k=0}^{∞} h_k q^{-k} = C(q^{-1})/A(q^{-1})    (B.60)

with C(q^{-1}) and A(q^{-1}) monic polynomials in q^{-1}. Define

β_k = Σ_{τ=-∞}^{∞} r_τ r_{τ+k}    (B.61)

Using {β_k}, (B.59) can be written compactly as

σ_{kp} = β_{k-p} + β_{k+p} + (μ/λ⁴ - 3) r_k r_p

The sequence {β_k} can be evaluated exactly: by Parseval's relation, β_k = 2π ∫_{-π}^{π} φ²(ω) e^{ikω} dω, where φ(ω) is the spectral density of y(t); thus {β_k} is the covariance sequence associated with the 'squared' ARMA spectrum and can be obtained by standard ARMA covariance evaluation (cf. (B.62)).
As an illustration, consider the first-order autoregressive process

y(t) = [1/(1 - a q^{-1})] e(t),    |a| < 1

where to simplify the notation it is assumed that the variance of the white noise sequence {e(t)} is given by

E e²(t) = 1 - a²

This in turn implies that the covariances of y(t) are simply given by

r_k = a^k,    k = 0, 1, ...

Thus, to assess the accuracy of r̂_k (assuming e(t) Gaussian),

σ_{kk} = β_0 + β_{2k}

must be evaluated. Now β_k is given by (cf. (B.61))

β_k = Σ_{τ=-∞}^{∞} a^{|τ|} a^{|τ+k|} = a^k [(k + 1) - (k - 1)a²] / (1 - a²),    k ≥ 0

which gives

σ_{kk} = [1 + a² + a^{2k}((2k + 1) - (2k - 1)a²)] / (1 - a²)

It can be seen that for |a| close to 1, σ_{kk} may take very large values. This implies that the sample covariances will have quite poor accuracy for borderline stationary autoregressions.    ■
The above expression for var(â) will now be derived using the results of this section. From (B.46),

var(â) = var(ρ̂_1) = s_{11}/N    (B.64)

For the present process r_0 = 1 and ρ_k = r_k = a^k, so that

σ_{00} = 2β_0 = 2(1 + a²)/(1 - a²)
σ_{10} = σ_{01} = 2β_1 = 4a/(1 - a²)    (B.65)
σ_{11} = β_0 + β_2 = (1 + 4a² - a⁴)/(1 - a²)

Hence

N var(â) = s_{11} = σ_{11} - 2ρ_1 σ_{01} + ρ_1² σ_{00}
         = (1 + 4a² - a⁴)/(1 - a²) - 8a²/(1 - a²) + 2a²(1 + a²)/(1 - a²)
         = (1 - 2a² + a⁴)/(1 - a²) = 1 - a²

which agrees with (B.63). It is interesting to note that while the estimates r̂_0 and r̂_1 have poor accuracy for |a| close to 1, the estimate â obtained from r̂_0 and r̂_1 has quite good accuracy. The reason is that the estimates r̂_0 and r̂_1 become highly correlated when |a| → 1.    ■
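This asymptotic variance is easy to check by simulation. A minimal sketch (illustrative only; the AR(1) coefficient, sample size and number of runs are arbitrary choices):

```python
import numpy as np

def ar1_rho1_estimates(a, N, runs, rng):
    """Estimate a by rho_hat_1 = r_hat_1/r_hat_0 on independent AR(1) realizations."""
    est = np.empty(runs)
    for i in range(runs):
        e = rng.normal(scale=np.sqrt(1 - a**2), size=N)
        y = np.empty(N)
        y[0] = e[0]
        for t in range(1, N):
            y[t] = a * y[t-1] + e[t]
        est[i] = np.dot(y[:-1], y[1:]) / np.dot(y, y)
    return est

rng = np.random.default_rng(0)
a, N = 0.9, 1000
est = ar1_rho1_estimates(a, N, runs=500, rng=rng)
print(N * est.var(), 1 - a**2)   # both should be about 0.19
```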
Assume now that m independent estimates θ̂^i of a parameter vector θ_0 are available, each (approximately) Gaussian distributed:

θ̂^i ∼ N(θ_0, P/N),    i = 1, ..., m    (B.66)

where N denotes the number of samples. Natural estimates of θ_0 and P can be formed as

θ̄ = (1/m) Σ_{i=1}^{m} θ̂^i    (B.67a)

P̂ = α_m Σ_{i=1}^{m} N (θ̂^i - θ̄)(θ̂^i - θ̄)^T    (B.67b)

The estimate θ̄ is precisely the arithmetic mean. In (B.67b) α_m is a scale factor, which typically is taken as 1/m or 1/(m - 1). According to (B.66) it follows directly that

θ̄ ∼ N(θ_0, P/(Nm))    (B.68)

which also shows the mean and the covariance matrix of the estimate θ̄.
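A minimal sketch of (B.67) (illustrative only; the dimensions and the 'true' P below are arbitrary assumptions):

```python
import numpy as np

def monte_carlo_summary(theta_estimates, N, alpha=None):
    """theta_bar (B.67a) and P_hat (B.67b) from m estimates, each N(theta0, P/N).
    Default scale factor alpha_m = 1/(m-1), which makes P_hat unbiased (cf. (B.70))."""
    theta_estimates = np.asarray(theta_estimates)   # shape (m, n)
    m = theta_estimates.shape[0]
    alpha = 1.0 / (m - 1) if alpha is None else alpha
    theta_bar = theta_estimates.mean(axis=0)
    D = theta_estimates - theta_bar
    P_hat = alpha * N * (D.T @ D)
    return theta_bar, P_hat

# simulated example: m estimates drawn from N(theta0, P/N)
rng = np.random.default_rng(0)
theta0, P, N, m = np.array([1.0, -0.5]), np.array([[2.0, 0.3], [0.3, 1.0]]), 500, 100
estimates = rng.multivariate_normal(theta0, P / N, size=m)
theta_bar, P_hat = monte_carlo_summary(estimates, N)
print(theta_bar)   # close to theta0
print(P_hat)       # close to P
```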
The analysis of P̂ requires longer calculations. The mean value of P̂ is straightforward to find:

E P̂ = α_m N Σ_{i=1}^{m} E[θ̂^i θ̂^{iT} - θ̂^i θ̄^T - θ̄ θ̂^{iT} + θ̄ θ̄^T]
    = α_m N [m(θ_0 θ_0^T + P/N) - m(θ_0 θ_0^T + P/(Nm))]

which gives

E P̂ = α_m (m - 1) P    (B.69)

The estimate P̂ is hence unbiased if α_m is chosen as

α_m = 1/(m - 1)    (B.70)
To assess the accuracy of P̂, consider

cov(P̂_{ij}, P̂_{kl}) = E P̂_{ij} P̂_{kl} - (E P̂_{ij})(E P̂_{kl})    (B.71)

where

E P̂_{ij} P̂_{kl} = α_m² N² E Σ_{μ=1}^{m} Σ_{ν=1}^{m} (θ̂^μ_i - θ̄_i)(θ̂^μ_j - θ̄_j)(θ̂^ν_k - θ̄_k)(θ̂^ν_l - θ̄_l)

Since the estimates are Gaussian distributed, the fourth moments factor as

E[(θ̂^μ_i - θ̄_i)(θ̂^μ_j - θ̄_j)(θ̂^ν_k - θ̄_k)(θ̂^ν_l - θ̄_l)]
  = [E(θ̂^μ_i - θ̄_i)(θ̂^μ_j - θ̄_j)][E(θ̂^ν_k - θ̄_k)(θ̂^ν_l - θ̄_l)]
  + [E(θ̂^μ_i - θ̄_i)(θ̂^ν_k - θ̄_k)][E(θ̂^μ_j - θ̄_j)(θ̂^ν_l - θ̄_l)]
  + [E(θ̂^μ_i - θ̄_i)(θ̂^ν_l - θ̄_l)][E(θ̂^μ_j - θ̄_j)(θ̂^ν_k - θ̄_k)]    (B.72)

Furthermore, using (B.66),

E(θ̂^μ_i - θ̄_i)(θ̂^ν_j - θ̄_j) = E θ̂^μ_i θ̂^ν_j - E θ̄_i θ̂^ν_j - E θ̂^μ_i θ̄_j + E θ̄_i θ̄_j = (P_{ij}/N)(δ_{μν} - 1/m)    (B.73)

Combining (B.72) and (B.73),

E P̂_{ij} P̂_{kl} = α_m² Σ_{μ=1}^{m} Σ_{ν=1}^{m} {P_{ij} P_{kl}(1 - 1/m)² + P_{ik} P_{jl}(δ_{μν} - 1/m)² + P_{il} P_{jk}(δ_{μν} - 1/m)²}
              = α_m² [(m - 1)² P_{ij} P_{kl} + (m - 1)(P_{ik} P_{jl} + P_{il} P_{jk})]    (B.74)

Inserting this result into (B.71), and using (B.69) for E P̂_{ij}, gives

cov(P̂_{ij}, P̂_{kl}) = α_m² (m - 1)(P_{ik} P_{jl} + P_{il} P_{jk})    (B.75)

In particular,

var(P̂_{ij}) = (P_{ij}² + P_{ii} P_{jj}) α_m² (m - 1)    (B.76a)
var(P̂_{ii}) = 2 P_{ii}² α_m² (m - 1)    (B.76b)

The relative mean square error of a diagonal element is therefore, using also (B.69),

ξ ≜ E[P̂_{ii} - P_{ii}]² / P_{ii}² = [α_m(m - 1) - 1]² + 2α_m²(m - 1)    (B.77)

For the unbiased choice (B.70), α_m = 1/(m - 1), this becomes

ξ = 2/(m - 1)    (B.78)

The minimal value

ξ_min = min_{α_m} ξ = 2/(m + 1)    (B.79a)

is obtained for

α_m = 1/(m + 1)    (B.79b)
For large values of m the difference between (B.78) and (B.79a) is only marginal. If m is large it is to be expected that approximately (cf. the central limit theorem)

(P̂_{ii} - P_{ii})/P_{ii} ∼ N(0, ξ)    (B.80)

Hence, with the choice (B.70), a 95% confidence interval for the relative error of P̂_{ii} is given approximately by

|P̂_{ii} - P_{ii}|/P_{ii} ≤ 1.96·√2/√m = 2.77/√m    (B.81)

Note that the exact distribution of P̂ for finite m is Wishart (cf. Rao, 1973). Such a distribution can be seen as an extension of the χ² distribution to the multivariable case.
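For example (an illustrative number, not from the text): with m = 100 Monte Carlo runs, (B.81) gives a 95% relative accuracy of about 2.77/√100 ≈ 0.28, so the diagonal elements of P̂ are then reliable only to within roughly ±28%.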
Bibliographical notes
For general results in probability theory, see Cramer (1946), Chung (1968), and for
statistical inference, Kendall and Stuart (1961), Rao (1973), and Lindgren (1976).
Lemma B.2 is adapted from Soderstrom (1975b). See also Hannan (1970) and Ljung
(1985c) for further results on ergodicity (also called strong laws of large numbers) and for
a thorough treatment of the central limit theorems.
Peterka (1981) presents a maximum a posteriori (or Bayesian) approach to identification and system parameter estimation.
The books Astrom (1970), Anderson and Moore (1979), Meditch (1969) contain deep
studies of stochastic dynamic systems. In particular, results such as those presented in
Sections B.5-B.7 are discussed in detail. Original work on the Kalman filter appeared in
Kalman (1960) and Kalman and Bucy (1961).
For the original derivation of the results on the asymptotic variances of sample
covariances, see Bartlett (1946, 1966). For a further reading on this topic, consult
Anderson (1971), Box and Jenkins (1976), Brillinger (1981) and Hannan (1970).
REFERENCES
Akaike, H. (1969). Fitting autoregressive models for prediction, Ann. Inst. Statist. Math., Vol. 21, pp. 243-247.
Akaike, H. (1971). Information theory and an extension of the maximum likelihood principle, 2nd International Symposium on Information Theory, Tsahkadsor, Armenian SSR. Also published in Supplement to Problems of Control and Information Theory, pp. 267-281, 1973.
Akaike, H. (1981). Modern development of statistical methods. In P. Eykhoff (ed.), Trends and Progress in System Identification. Pergamon Press, Oxford.
Alexander, T.S. (1986). Adaptive Signal Processing. Springer Verlag, Berlin.
Andel, J., M.G. Perez and A.I. Negrao (1981). Estimating the dimension of a linear model, Kybernetika, Vol. 17, pp. 514-525.
Anderson, B.D.O. (1982). Exponential convergence and persistent excitation, Proc. 21st IEEE Conference on Decision and Control, Orlando.
Anderson, B.D.O. (1985). Identification of scalar errors-in-variables models with dynamics, Automatica, Vol. 21, pp. 709-716.
Anderson, B.D.O. and M. Deistler (1984). Identifiability in dynamic errors-in-variables models, Journal of Time Series Analysis, Vol. 5, pp. 1-13.
Anderson, B.D.O. and M.R. Gevers (1982). Identifiability of linear stochastic systems operating under linear feedback, Automatica, Vol. 18, pp. 195-213.
Anderson, B.D.O. and J.B. Moore (1979). Optimal Filtering. Prentice Hall, Inc., Englewood
Cliffs.
Andersson, P. (1983). Adaptive forgetting through multiple models and adaptive control of car dynamics, Report LIU-TEK-LIC 1983:10, Department of Electrical Engineering, Linkoping University, Sweden.
Anderson, T.W. (1971). Statistical Analysis of Time Series. Wiley, New York.
Ansley, C.F. (1979). An algorithm for the exact likelihood of a mixed autoregressive-moving average process, Biometrika, Vol. 66, pp. 59-65.
Aoki, M. (1987). State Space Modelling of Time Series. Springer Verlag, Berlin.
Aris, R. (1978). Mathematical Modelling Techniques. Pitman, London.
Astrom, K.J. (1968). Lectures on the identification problem - the least squares method, Report 6806, Division of Automatic Control, Lund Institute of Technology, Sweden.
Astrom, K.J. (1969). On the choice of sampling rates in parametric identification of time series, Information Sciences, Vol. 1, pp. 273-278.
Astrom, K.J. (1970). Introduction to Stochastic Control Theory. Academic Press, New York.
Astrom, K.J. (1975). Lectures on system identification, chapter 3: Frequency response analysis, Report 7504, Department of Automatic Control, Lund Institute of Technology, Sweden.
Astrom, K.J. (1980). Maximum likelihood and prediction error methods, Automatica, Vol. 16, pp. 551-574.
Astrom, K.J. (1983a). Computer aided modelling, analysis and design of control systems - a perspective, IEEE Control Systems Magazine, Vol. 3, pp. 4-15.
Astrom, K.J. (1983b). Theory and applications of adaptive control - a survey, Automatica, Vol. 19, pp. 471-486.
Astrom, K.J. (1987). Adaptive feedback control, Proc. IEEE, Vol. 75, pp. 185-217.
Astrom, K.J. and T. Bohlin (1965). Numerical identification of linear dynamic systems from
Finigan, B.M. and I.H. Rowe (1974). Strongly consistent parameter estimation by the introduction
of strong instrumental variables, IEEE Transactions on Automatic Control, Vol. AC-19, pp.
825-830.
Fortescue, T.R., L.S. Kershenbaum and B.E. Ydstie (1981). Implementation of self-tuning
regulators with variable forgetting factors, Automatica, Vol. 17, pp. 831-835.
Freeman, T.G. (1985). Selecting the best linear transfer function model, Automatica, Vol. 21,
pp. 361-370.
Friedlander, B. (1982). Lattice filters for adaptive processing, Proc. IEEE, Vol. 70, pp. 829-867.
Friedlander, B. (1983). A lattice algorithm for factoring the spectrum of a moving-average process,
IEEE Transactions on Automatic Control, Vol. AC-28, pp. 1051-1055.
Friedlander, B. (1984). The overdetermined recursive instrumental variable method, IEEE
Transactions on Automatic Control, Vol. AC-29, pp. 353-356.
Fuchs, J.J. (1987). ARMA order estimation via matrix perturbation theory, IEEE Transactions on
Automatic Control, Vol. AC-32, pp. 358-361.
Furht, B.P. (1973). New estimator for the identification of dynamic processes, IBK Report, Institut
Boris Kidric Vinca, Belgrade, Yugoslavia.
Furuta, K., S. Hatakeyama and H. Kominami (1981). Structural identification and software
package for linear multivariable systems, Automatica, Vol. 17, pp. 755-762.
Gardner, G., A.C. Harvey and G.D.A. Phillips (1980). An algorithm for exact maximum
likelihood estimation of autoregressive moving average models by means of Kalman filtering,
Appl. Stat., Vol. 29, pp. 311-322.
Gauss, K.F. (1809). Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium. Reprinted translation: Theory of the motion of the heavenly bodies moving about the sun
in conic sections. Dover, New York.
Gertler, J. and Cs. Banyasz (1974). A recursive (on-line) maximum likelihood identification
method, IEEE Transactions on Automatic Control, Vol. AC-19, pp. 816-820.
Gevers, M.R. (1986). ARMA models, their Kronecker indices and their McMillan degree,
International Journal of Control, Vol. 43, pp. 1745-1761.
Gevers, M.R. and B.D.O. Anderson (1981). Representation of jointly stationary stochastic
feedback processes, International Journal of Control, Vol. 33, pp. 777-809.
Gevers, M.R. and B.D.O. Anderson (1982). On jointly stationary feedback-free stochastic
processes, IEEE Transactions on Automatic Control, Vol. AC-27, pp. 431-436.
Gevers, M. and L. Ljung (1986). Optimal experiment designs with respect to the intended model
application, Automatica, Vol. 22, pp. 543-554.
Gevers, M. and V. Wertz (1984). Uniquely identifiable state-space and ARMA parametrizations
for multivariable linear systems, Automatica, Vol. 20, pp. 333-347.
Gevers, M. and V. Wertz (1987a). Parametrization issues in system identification, Proc. IFAC 10th
World Congress, Munich.
Gevers, M. and V. Wertz (1987b). Techniques for the selection of identifiable parametrizations for
multivariable linear systems. In C.T. Leondes (ed.), Control and Dynamic Systems, Vol. 26:
System Identification and Adaptive Control- Advances in Theory and Applications. Academic
Press, New York.
Gill, P., W. Murray and M.H. Wright (1981). Practical Optimization. Academic Press, London.
Glover, K. (1984). All optimal Hankel-norm approximations of linear multivariable systems and
their L∞-error bounds, International Journal of Control, Vol. 39, pp. 1115-1193.
Glover, K. (1987). Identification: frequency-domain methods. In M. Singh (ed.), Systems and
Control Encyclopedia. Pergamon, Oxford.
Gnedenko, B.V. (1963). The Theory of Probability. Chelsea, New York.
Godfrey, K.R. (1983). Compartmental Models and Their Application. Academic Press, New York.
Godfrey, K.R. and J.J. DiStefano, III (1985). Identifiability of model parameters, Proc. 7th
IFAC/IFORS Symposium on Identification and System Parameter Estimation, York.
Godolphin, E.J. and J.M. Unwin (1983). Evaluation of the covariance matrix for the maximum
likelihood estimator of a Gaussian autoregressive moving average process, Biometrika, Vol. 70,
pp. 279-284.
Gohberg, I.C. and G. Heinig (1974). Inversion of finite Toeplitz matrices with entries being
elements from a non-commutative algebra, Rev. Roumaine Math. Pures Appl., Vol. XIX,
pp. 623-665.
Golomb, S.W. (1967). Shift Register Sequences. Holden-Day, San Francisco.
Golub, G.H. and C.F. van Loan (1983) Matrix Computations. North Oxford Academic, Oxford.
Goodwin, G.C. (1987). Identification: experiment design. In M. Singh (ed.), Systems and Control
Encyclopedia. Pergamon, Oxford.
Goodwin, G.c. and R.L. Payne (1973). Design and characterization of optimal test signals for
linear single input-single output parameter estimation, Proc. 3rd IFAC Symposium on
Identification and System Parameter Estimation, the Hague.
Goodwin, G.c. and R.L. Payne (1977). Dynamic System Identification: Experiment Design and
Data Analysis. Academic Press, New York.
Goodwin, G.C., P.J. Ramadge and P.E. Caines (1980). Discrete time multi-variable adaptive
control, IEEE Transactions on Automatic Control, Vol. AC-25, pp. 449-456.
Goodwin, G.c. and K.S. Sin (1984). Adaptive Filtering, Prediction and Control. Prentice Hall,
Englewood Cliffs.
de Gooijer, J. and P. Stoica (1987). A min-max optimal instrumental variable estimation method
for multivariable linear systems, Technical Report No. 8712, Faculty of Economics, University
of Amsterdam.
Granger, C.W.J. and P. Newbold (1977). Forecasting Economic Time Series. Academic Press,
New York.
Graybill, F.A. (1983). Matrices with Applications in Statistics (2nd edn). Wadsworth International
Group, Belmont, California.
Grenander, U. and M. Rosenblatt (1956). Statistical Analysis of Stationary Time Series. Almqvist
och Wiksell, Stockholm, Sweden.
Grenander, U. and G. Szego (1958). Toeplitz Forms and Their Applications. University of
California Press, Berkeley, California.
Gueguen, C. and L.L. Scharf (1980). Exact maximum likelihood identification of ARMA models:
a signal processing perspective, Report ONR 36, Department Electrical Engineering, Colorado
State University, Fort Collins.
Guidorzi, R. (1975). Canonical structures in the identification of multivariable systems,
Automatica, Vol. 11, pp. 361-374.
Guidorzi, R.P. (1981). Invariants and canonical forms for systems structural and parametric
identification, Automatica, Vol. 17, pp. 117-133.
Gustavsson, I., L. Ljung and T. Soderstrom (1977). Identification of processes in closed loop:
identifiability and accuracy aspects, Automatica, Vol. 13, pp. 59-75.
Gustavsson, I., L. Ljung and T. Soderstrom (1981). Choice and effect of different feedback
configurations. In P. Eykhoff (ed.), Trends and Progress in System Identification. Pergamon,
Oxford.
Haber, R. (1985). Nonlinearity test for dynamic processes, Proc. 7th IFAC/IFORS Symposium on
Identification and System Parameter Estimation, York.
Haber, R. and L. Keviczky (1976). Identification of nonlinear dynamic systems, Proc. 4th IFAC
Symposium on Identification and System Parameter Estimation, Tbilisi.
Hagglund, T. (1984). Adaptive control of systems subject to large parameter changes, Proc. IFAC
9th World Congress, Budapest.
Hajdasinski, A.K., P. Eykhoff, A.A.M. Damen and A.J.W. van den Boom (1982). The choice
and use of different model sets for system identification, Proc. 6th IFAC Symposium on Identification and System Parameter Estimation, Washington DC.
Hannan, E.J. (1969). The identification of vector mixed autoregressive moving average systems,
Biometrika, Vol. 56, pp. 222-225.
Jenkins, G.M. and D.G. Watts (1969). Spectral Analysis and Its Applications. Holden-Day, San
Francisco.
Jennrich, R.I. (1969). Asymptotic properties of non-linear least squares estimators, Annals of
Mathematical Statistics, Vol. 40, pp. 633-643.
Jones, R.H. (1980). Maximum likelihood fitting of ARMA models to time series with missing
observations, Technometrics, Vol. 22, pp. 389-395.
Kabaila, P. (1983). On output-error methods for system identification, IEEE Transactions on
Automatic Control, Vol. AC-28, pp. 12-23.
Kailath, T. (1980). Linear Systems. Prentice Hall, Englewood Cliffs.
Kailath, T., A. Vieira and M. Morf (1978). Inverses of Toeplitz operators, innovations, and
orthogonal polynomials, SIAM Review, Vol. 20, pp. 1006-1019.
Kalman, R.E. (1960). A new approach to linear filtering and prediction problems, Transactions
ASME, Journal of Basic Engineering, Series D, Vol. 82, pp. 342-345.
Kalman, R.E. and R.S. Bucy (1961). New results in linear filtering and prediction theory,
Transactions ASME, Journal of Basic Engineering, Series D, Vol. 83, pp. 95-108.
Karlsson, E. and M.H. Hayes (1986). ARMA modeling of time varying systems with lattice filters,
Proc. IEEE Conference on Acoustics, Speech and Signal Processing, Tokyo.
Karlsson, E. and M.H. Hayes (1987). Least squares ARMA modeling of linear time-varying
systems: lattice filter structures and fast RLS algorithms, IEEE Transactions on Acoustics,
Speech and Signal Processing, Vol. ASSP-35, pp. 994-1014.
Kashyap, R.L. (1980). Inconsistency of the AIC rule for estimating the order of AR models, IEEE
Transactions on Automatic Control, Vol. AC-25, pp. 996-998.
Kashyap, R.L. (1982). Optimal choice of AR and MA parts in autoregressive moving average
models, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-4, pp. 99-103.
Kashyap, R.L. and A.R. Rao (1976). Dynamic Stochastic Models from Empirical Data. Academic
Press, New York.
Kendall, M.G. and S. Stuart (1961). The Advanced Theory of Statistics, Vol. II. Griffin, London.
Keviczky, L. and Cs. Banyasz (1976). Some new results on multiple input-multiple output
identification methods, Proc. 4th IFAC Symposium on Identification and System Parameter
Estimation, Tbilisi.
Kubrusly, C.S. (1977). Distributed parameter system identification. A survey, International
Journal of Control, Vol. 26, pp. 509-535.
Kucera, V. (1972). The discrete Riccati equation of optimal control, Kybernetika, Vol. 8,
pp. 430-447.
Kucera, V. (1979). Discrete Linear Control. Wiley, Chichester.
Kumar, R. (1985). A fast algorithm for solving a Toeplitz system of equations, IEEE Transactions
on Acoustics, Speech and Signal Processing, Vol. ASSP-33, pp. 254-267.
Kushner, H.J. and R. Kumar (1982), Convergence and rate of convergence of a recursive
identification and adaptive control method which uses truncated estimators, IEEE Transactions
on Automatic Control, Vol. AC-27, pp. 775-782.
Lai, T.L. and C.Z. Wei (1986). On the concept of excitation in least squares identification and
adaptive control, Stochastics, Vol. 16, pp. 227-254.
Landau, I.D. (1979). Adaptive Control. The Model Reference Approach. Marcel Dekker, New
York.
Lawson, C.L. and R.J. Hanson (1974). Solving Least Squares Problems. Prentice Hall, Englewood
Cliffs.
Lee, D.T.L., B. Friedlander and M. Morf (1982). Recursive ladder algorithms for ARMA
modelling, IEEE Transactions on Automatic Control, Vol. AC-29, pp. 753-764.
Lehmann, E.L. (1986). Testing Statistical Hypotheses (2nd edn). Wiley, New York.
Leondes, C.T. (ed.) (1987). Control and Dynamic Systems, Vol. 25-27: System Identification and
Adaptive Control - Advances in Theory and Applications. Academic Press, New York.
Leontaritis, I.J. and S.A. Billings (1985). Input-output parametric models for non-linear systems,
International Journal of Control, Vol. 41, pp. 303-344.
Leontaritis, I.J. and S.A. Billings (1987). Model selection and validation methods for non-linear
systems, International Journal of Control, Vol. 45, pp. 311-341.
Levinson, N. (1947). The Wiener RMS (root-mean-square) error criterion in filter design and
prediction, J. Math. Phys., Vol. 25, pp. 261-278. Also in N. Wiener, Extrapolation, Interpolation and Smoothing of Stationary Time Series. Wiley, New York, 1949.
Lindgren, B.W. (1976). Statistical Theory. MacMillan, New York.
Ljung, G.M. and G.E.P. Box (1979). The likelihood function of stationary autoregressive-moving
average models, Biometrika, Vol. 66, pp. 265-270.
Ljung, L. (1971). Characterization of the concept of 'persistently exciting' in the frequency
domain, Report 7Jl9, Division of Automatic Control, Lund Institute of Technology, Sweden.
Ljung, L. (1976). On the consistency of prediction error identification methods. In R.K. Mehra
and D.G. Lainiotis (eds), System Identification - Advances and Case Studies. Academic Press,
New York.
Ljung, L. (1977a). On positive real transfer functions and the convergence of some recursive
schemes, IEEE Transactions on Automatic Control, Vol. AC-22, pp. 539-551.
Ljung, L. (1977b). Analysis of recursive stochastic algorithms, IEEE Transactions on Automatic
Control, Vol. AC-22, pp. 551-575.
Ljung, L. (1977c). Some limit results for functionals of stochastic processes, Report LiTH-ISY-I0167, Department of Electrical Engineering, Linkoping University, Sweden.
Ljung, L. (1978). Convergence analysis of parametric identification methods, IEEE Transactions
on Automatic Control, Vol. AC-23, pp. 770-783.
Ljung, L. (1980). Asymptotic gain and search direction for recursive identification algorithms,
Proc. 19th IEEE Conference on Decision and Control, Albuquerque.
Ljung, L. (1981). Analysis of a general recursive prediction error identification algorithm,
Automatica, Vol. 17, pp. 89-100.
Ljung, L. (1982). Recursive identification methods for off-line identification problems, Proc. 6th
IFAC Symposium on Identification and System Parameter Estimation, Washington DC.
Ljung, L. (1984). Analysis of stochastic gradient algorithms for linear regression problems, IEEE
Transactions on Information Theory, Vol. IT-30, pp. 151-160.
Ljung, L. (1985a). Asymptotic variance expressions for identified black-box transfer function
models, IEEE Transactions on Automatic Control, Vol. AC-30, pp. 834-844.
Ljung, L. (1985b). Estimation of transfer functions, Automatica, Vol. 21, pp. 677-696.
Ljung, L. (1985c). A non-probabilistic framework for signal spectra, Proc. 24th IEEE Conference
on Decision and Control, Fort Lauderdale.
Ljung, L. (1986). System Identification Toolbox - User's Guide. The Mathworks, Sherborn, Mass.
Ljung, L. (1987). System Identification: Theory for the User. Prentice Hall, Englewood Cliffs.
Ljung, L. and P.E. Caines (1979). Asymptotic normality of prediction error estimation for
approximate system models, Stochastics, Vol. 3, pp. 29-46.
Ljung, L. and K. Glover (1981). Frequency domain versus time domain methods in system
identification, Automatica, Vol. 17, pp. 71-86.
Ljung, L. and J. Rissanen (1976). On canonical forms, parameter identifiability and the concept of
complexity, Proc. 4th IFAC Symposium on Identification and System Parameter Estimation,
Tbilisi.
Ljung, L. and T. Soderstrom (1983). Theory and Practice of Recursive Identification. MIT Press,
Cambridge, Mass.
Ljung, S. (1983). Fast algorithms for integral equations and least squares identification problems.
Doctoral dissertation, Department of Electrical Engineering, Linkoping University, Sweden.
Ljung S. and L. Ljung (1985). Error propagation properties of recursive least-squares adaptation
algorithms, Automatica, Vol. 21, pp. 157-167.
Macchi, O. (ed.) (1984). IEEE Transactions on Information Theory, Special Issue on Linear
Adaptive Filtering, Vol. IT-30, no. 2.
McLeod, A.I. (1978). On the distribution of residual autocorrelations in Box-Jenkins models,
Journal Royal Statistical Society, Series B, Vol. 40, pp. 296-302.
Marcus-Roberts, H. and M. Thompson (eds) (1976). Life Science Models. Springer Verlag, New
York.
Mayne, D.Q. (1967). A method for estimating discrete time transfer functions, Advances of
Control, 2nd UKAC Control Convention, University of Bristol.
Mayne, D.Q. and F. Firoozan (1982). Linear identification of ARMA processes, Automatica,
Vol. 18, pp. 461-466.
Meditch, J.S. (1969). Stochastic Optimal Linear Estimation and Control. McGraw-Hill, New York.
Mehra, R.K. (1974). Optimal input signals for parameter estimation in dynamic systems - A
survey and new results, IEEE Transactions on Automatic Control, Vol. AC-19, pp. 753-768.
Mehra, R.K. (1976). Synthesis of optimal inputs for multiinput-multioutput systems with process
noise. In R.K. Mehra and D.G. Lainiotis (eds), System Identification - Advances and Case
Studies. Academic Press, New York.
Mehra, R. K. (1979). Nonlinear system identification - selected survey and recent trends, Proc. 5th
IFAC Symposium on Identification and System Parameter Estimation, Darmstadt.
Mehra, R.K. (1981). Choice of input signals. In P. Eykhoff (ed.), Trends and Progress in System
Identification. Pergamon, Oxford.
Mehra, R.K. and D.G. Lainiotis (eds) (1976). System Identification - Advances and Case Studies.
Academic Press, New York.
Mendel, J.M. (1973). Discrete Techniques of Parameter Estimation: The Equation Error Formulation. Marcel Dekker, New York.
Merchant, G.A. and T.W. Parks (1982). Efficient solution of a Toeplitz-plus-Hankel coefficient
matrix system of equations, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol.
ASSP-30, pp. 40-44.
Moore, J.B. and G. Ledwich (1980). Multivariable adaptive parameter and state estimators with
convergence analysis, Journal of the Australian Mathematical Society, Vol. 21, pp. 176-197.
Morgan, B.J.T. (1984). Elements of Simulation. Chapman and Hall, London.
Moustakides, G. and A. Benveniste (1986). Detecting changes in the AR parameters of a
nonstationary ARMA process, IEEE Transactions on Information Theory, Vol. IT-30,
pp. 137-155.
Nehorai, A. and M. Morf (1984). Recursive identification algorithms for right matrix fraction
description models, IEEE Transactions on Automatic Control, Vol. AC-29, pp. 1103-1106.
Nehorai, A. and M. Morf (1985). A unified derivation for fast estimation algorithms by the
conjugate direction method, Linear Algebra and Its Applications, Vol. 72, pp. 119-143.
Nehorai, A. and P. Stoica (1988). Adaptive algorithms for constrained ARMA signals in the
presence of noise, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-36, pp. 1282-1291. Also Proc. IEEE Conference on Acoustics, Speech and Signal Processing
(ICASSP) , Dallas.
Newbold, P. (1974). The exact likelihood function for a mixed autoregressive-moving average
process, Biometrika, Vol. 61, pp. 423-426.
Ng, T.S., G.c. Goodwin and B.D.O. Anderson (1977). Identifiability of linear dynamic systems
operating in closed-loop, Automatica, Vol. 13, pp. 477-485.
Nguyen, V.V. and E.F. Wood (1982). Review and unification of linear identifiability concepts,
SIAM Review, Vol. 24, pp. 34-51.
Nicholson, H. (ed.) (1980). Modelling of Dynamical Systems, Vols 1 and 2. Peregrinus, Stevenage.
Norton, J.P. (1986). An Introduction to Identification. Academic Press, New York.
Olçer, S., B. Egardt and M. Morf (1986). Convergence analysis of ladder algorithms for AR
and ARMA models, Automatica, Vol. 22, pp. 345-354.
Oppenheim, A.V. and R.W. Schafer (1975). Digital Signal Processing. Prentice Hall, Englewood
Cliffs.
Oppenheim, A.V. and A.S. Willsky (1983). Signals and Systems. Prentice Hall, Englewood Cliffs.
van Overbeek, A.J.M. and L. Ljung (1982). On-line structure selection for multivariable statespace models, Automatica, Vol. 18, pp. 529-543.
Panuska, V. (1968). A stochastic approximation method for identification of linear systems using
adaptive filtering, Joint Automatic Control Conference, Ann Arbor.
Pazman, A. (1986). Foundation of Optimum Experimental Design. D. Reidel, Dordrecht.
Pearson, C.E. (ed.) (1974). Handbook of Applied Mathematics. Van Nostrand Reinhold, New
York.
Peterka, V. (1975). A square root filter for real time multivariate regression, Kybernetika, Vol. 11,
pp. 53-67.
Peterka, V. (1981). Bayesian approach to system identification. In P. Eykhoff (ed.), Trends and
Progress in System Identification. Pergamon, Oxford.
Peterka, V. and K. Smuk (1969). On-line estimation of dynamic parameters from input-output
data, 4th IFAC Congress, Warsaw.
Peterson, W.W. and E.J. Weldon, Jr (1972). Error-Correcting Codes (2nd edn). MIT Press,
Cambridge, Mass.
Polis, M.P. (1982). The distributed system parameter identification problem: A survey of recent
results, Proc. 3rd IFAC Symposium on Control of Distributed Parameter Systems, Toulouse.
Polis, M.P. and R.E. Goodson (1976). Parameter identification in distributed systems: a
synthesizing overview, Proceedings IEEE, Vol. 64, pp. 45-61.
Polyak, B.T. and Ya. Z. Tsypkin (1980). Robust identification, Automatica, Vol. 16, pp. 53-63.
Porat, B. and B. Friedlander (1985). Asymptotic accuracy of ARMA parameter estimation
methods based on sample covariances, Proc. 7th IFAClIFORS Symposium on Identification and
System Parameter Estimation, York.
Porat, B. and B. Friedlander (1986). Bounds on the accuracy of Gaussian ARMA parameter
estimation methods based on sample covariances, IEEE Transactions on Automatic Control,
Vol. AC-31, pp. 579-582.
Porat, B., B. Friedlander and M. Morf (1982). Square root covariance ladder algorithms, IEEE
Transactions on Automatic Control, Vol. AC-27, pp. 813-829.
Potter, J.E. (1963). New statistical formulas, Memo 40, Instrumentation Laboratory, MIT,
Cambridge, Mass.
Priestley, M.B. (1982). Spectral Analysis and Time Series. Academic Press, London.
Rabiner, L.R. and B. Gold (1975). Theory and Applications of Digital Signal Processing. Prentice
Hall, Englewood Cliffs.
Rake, H. (1980). Step response and frequency response methods, Automatica, Vol. 16, pp. 519-526.
Rake, H. (1987). Identification: transient- and frequency-response methods. In M. Singh (ed.),
Systems and Control Encyclopedia. Pergamon, Oxford.
Rao, C.R. (1973). Linear Statistical Inference and Its Applications. Wiley, New York.
Ratkowsky, D.A. (1983). Nonlinear Regression Modelling. Marcel Dekker, New York.
Reiersol, O. (1941). Confluence analysis by means of lag moments and other methods of
confluence analysis, Econometrica, Vol. 9, pp. 1-23.
Rissanen, J. (1974). Basis of invariants and canonical forms for linear dynamic systems, Automatica, Vol. 10, pp. 175-182.
Rissanen, J. (1978). Modeling by shortest data description, Automatica, Vol. 14, pp. 465-471.
Rissanen, J. (1979). Shortest data description and consistency of order estimates in ARMA
processes, International Symposium on Systems Optimization and Analysis (eds A. Bensoussan
and J.L. Lions), pp. 92-98.
Rissanen, J. (1982). Estimation of structure by minimum description length, Circuits, Systems and
Signal Processing, Vol. 1, pp. 395-406.
Rogers, G.S. (1980). Matrix Derivatives. Marcel Dekker, New York.
Rowe, LH. (1970). A bootstrap method for the statistical estimation of model parameters,
Werner, H.J. (1985). More on BLU estimation in regression models with possibly singular
covariances, Linear Algebra and Its Applications, Vol. 64, pp. 207-214.
Wetherill, G.B., P. Duncombe, K. Kenward, J. Köllerström, S.R. Paul and B.J. Vowden (1986).
Regression Analysis with Applications. Chapman and Hall, London.
Whittle, P. (1953). The analysis of multiple stationary time series, Journal Royal Statistical
Society, Vol. 15, pp. 125-139.
Whittle, P. (1963). On the fitting of multivariate autoregressions and the approximate canonical
factorization of a spectral density matrix, Biometrika, Vol. 50, pp. 129-134.
Widrow, B. and S.D. Stearns (1985). Adaptive Signal Processing. Prentice Hall, Englewood Cliffs.
Wieslander, J. (1976). IDPAC - User's guide, Report TFRT-3099, Department of Automatic
Control, Lund Institute of Technology, Sweden.
Wiggins, R.A. and E.A. Robinson (1966). Recursive solution to the multichannel filtering
problem, Journal Geophysical Research, Vol. 70, pp. 1885-1891.
Willsky, A. (1976). A survey of several failure detection methods, Automatica, Vol. 12, pp. 601-611.
Wilson, G. T. (1969). Factorization of the covariance generating function of a pure moving average
process, SIAM Journal Numerical Analysis, Vol. 6, pp. 1-8.
Wong, K.Y. and E. Polak (1967). Identification of linear discrete time systems using the instrumental variable approach, IEEE Transactions on Automatic Control, Vol. AC-12, pp. 707-718.
Young, P.C. (1965). On a weighted steepest descent method of process parameter estimation,
Report, Cambridge University, Engineering Laboratory, Cambridge.
Young, P.C. (1968). The use of linear regression and related procedures for the identification of
dynamic processes, Proc. 7th IEEE Symposium on Adaptive Processes, UCLA, Los Angeles.
Young, P.C. (1970). An instrumental variable method for real-time identification of a noisy
process, Automatica, Vol. 6, pp. 271-287.
Young, P.C. (1976). Some observations on instrumental variable methods of time series analysis,
International Journal of Control, Vol. 23, pp. 593-612.
Young, P. (1981). Parameter estimation for continuous-time models: a survey, Automatica, Vol.
17, pp. 23-39.
Young, P. (1982). A computer program for general recursive time-series analysis, Proc. 6th IFAC
Symposium on Identification and System Parameter Estimation, Washington DC.
Young, P. (1984). Recursive Estimation and Time-Series Analysis. Springer Verlag, Berlin.
Young, P.C. and A.J. Jakeman (1979). Refined instrumental variable methods of recursive time-series analysis. Part I: Single input, single output systems, International Journal of Control,
Vol. 29, pp. 1-30.
Young, P.C., A.J. Jakeman and R. McMurtrie (1980). An instrumental variable method for model
order identification, Automatica, Vol. 16, pp. 281-294.
Zadeh, L.A. (1962). From circuit theory to systems theory, Proc. IRE, Vol. 50, pp. 856-865.
Zadeh, L.A. and E. Polak (1969). System Theory. McGraw-Hill, New York.
Zarrop, M.B. (1979a). A Chebyshev system approach to optimal input design, IEEE Transactions
on Automatic Control, Vol. AC-24, pp. 687-698.
Zarrop, M.B. (1979b). Optimal Experiment Design for Dynamic System Identification. (Lecture
Notes in Control and Information Sciences no. 21). Springer Verlag, Berlin.
Zarrop, M. (1983). Variable forgetting factors in parameter estimation, Automatica, Vol. 19,
pp. 295-298.
van Zee, G.A. and O.H. Bosgra (1982). Gradient computation in prediction error identification of linear discrete-time systems, IEEE Transactions on Automatic Control, Vol. AC-27,
pp. 738-739.
Zohar, S. (1969). Toeplitz matrix inversion: the algorithm of W.F. Trench, J. ACM, Vol. 16,
pp. 592-601.
Zohar, S. (1974). The solution of a Toeplitz set of linear equations, J. ACM, Vol. 21, pp. 272-278.
Zygmund, A. (1968). Trigonometric Series (rev. edn). Cambridge University Press, Cambridge.
N+l
< mse(e 2) =
IV
2.1
mse(e l)
2.2
6,,2
3,,2
var(a) = N(N + 1)(2N + 1) "" N0
2.3
Eft
= rt
E0 1
=
A)
---;:;'2
var (ft)
=~
N
N-l
N
E0 2
--0
var ( 01
N - 1 2
= 2 ---;:;'20
var(02)
mse(02)
mse
=N~
(A)
2N - 1 2
= ~o
1 02
> mse(ol)
01
2.7
3.4
fv(k) = 1
3.8
var[h(k)] =
3.9
~2 a 2(-a)k
1
b2
IV ~:dl +
Na/2i
UN(w) = { 0
= 2nj/N. j
QJ
;:3
*' n
4.1
a=
A
N(N _ 1) [2(2N
+ l)So
iw
6Sd
Case (ii)
a=
2N 1+ 1 So
3
b = N(N + 1)(2N + 1)SI
6
[N+l
J
= N(N + 1)(2N
+ 1) - -2-So + Sl
b=
N(N +
1~(2N + 1) S't
2'A2(2N - 1)
N(N + 1)
_ .'
(e) N - 3. (8 - 8)
T(36 146)'(8 -
'
8
N -_ 8..(8
- 8) T ( 36
4.2
36)"
204
(8 - 8) 0<
~ 0.2
,
12
In .At] var(b) = N(N2 _ 1)
"
In .At2 var(b)
4.4
8) 0<
~ 0.2
(N
8= (<P?<PI)-I<P?YI
2
-lJ
)l2=<P28
Y -
<p8
(Yl)
o
- (<PI
<P2
0 ) ( 8)
-[ Y2
4.5
Hint. Use
2: t
= O(Nk+l/(k
1, N large
1=1
4.6
4.7
TN,
2 ~
Hint. <P <P = 21 8 = tV ~ <p(t)y(t)
The set of minimum points = the set of solutions to the normal equations. This set is given
by
a
) a an arbitrary parameter
( ,L; y(t)IN - a
1=1
lu(t)1 =
/3, t = 1,
= 2~ ~
y(t)(i)
... , N
= B(BTB)-ld
X2 = XI
X3 = B(M + BTB)-Id -+ XI
as
4.10 Hint. Take F = (q,TR-J<fl)-l<flTq,
4.9
XI
i'l -+ 0
4.12
N T~N ITlr(t)
N
.)"
1 T~N
and that ( N
ret)
<I:>IR-1<I:> ~ 1 as N ~
5.2
5.3
2:
Hint. Cp =
a2
4J
(Ok1,P
00.
+ OM-kj,p)
J=l
5.4
I(11 I ~ 1, 2Qt - 1 ~ Q2 ~ 1
5.5
The contour of the area for an MA(2) process is given by the curves
Ql
-Z
Qz =
Ql -
Q2
=Z
Qf = 4Q2(1 - 2(2)
For an AR(2) process: IQ11 ~ 1, 2QI - 1 ~
Q2 ~
a 2(1 - 2ai'i
5.6
(a) reT) =
. Hint. Use the relation Ell = E[E(u/y)]
a2
1 -- (1 - 2a)2
(b) <j>(w) = 2Jt 1 + (1 - 2af - 2(1 - 2a) cos w
5.7
5.8
2T
1: = 0, ... , n - 1; ret)
n
(c) u(t) is pe of order n.
hi ~
= C(q-I)e(t)
and set ~j
= -ret -- n)
hi
n-k
2:
j=O
M\+k
~1 Irkl ~ z(~o
11
...
f3 )
(0
nIl
l)(~O)
:
~11
Use the result XTQX ~ Amax(Q)xTx which holds for a symmetric matrix Q. When Q has
the structure Q = ee T - /, where e = [1 . .. 1], its eigenvalues can be calculated exactly by
using Corollary 1 of Lemma A.S.
5.9
6.1
- 2
az = - 1 - al
az = -1
~ aj ~
- 2 ~ al ~ 0
a]
= eel) - O.Se(t -
6.2
yet) - 0.8y(t - 1)
6.3
5Jt
1)
Ee 2 (t)
0.8
BT =
6.5
6.6
= (1 0...
yet + 1 - n)f
6.4
all
nc,.
ar2'
6.7
G(
-1.
q,
Khq~ H(
K) =
1 - q-I
A(K) = K 2 rh 3 (2
-1.
q,
K) = 1 + (2 - V3)q-1
(1 _ q I?
+ V3)/6
Kh(l2 +
1
7.1
y(t + lit) = -1
+ 05~
.)q
_1))
V12
+-V12
iy(t)
Hint. Compute (y(w) and make a spectral factorization to get the model yet) = vet)
0.5v(t - 1), Ev(t)v(s) = 40,.s '
7.3
7.4
UE~~. = _
D(q-I)
.
Ubi
ceq l)p(q I) u(t - I)
UE(t, e) _ _ _
1_
UCi
( _ .)
ce((-I) E t
Uc(t, ~ _ _1_ (
.)
ud i - D(q-I) E t - I
Uc(t, e) _ D(q-l)B(q--I) (
.)
ufi
- ceq l)p2 (q 1) U t - I
7.7
7.8
7.9
V(X(k+l)
V(X(k)
= -UgfSgk
+ O(a2) with
+ K(e)cee)tlB(e)
U2 V
= 2
USiUSj
1=1
uECt, e) uECt, e)
USi
uSj
UE(t, S) _ _
1_ ( _.)
uai
- ceq-I) y t
I
!JE(t, e) =
UCi
--=l.._ E(t
ceq-I)
U2 ECt, 8) = 0
UaiUaj
U2 ECt, 8) = 0
ua;ubj
i
1=1
E(t, e) U2 ECt, S)
USi US ;
UE(t, 8) _ ~ ( _ .)
Ubi
- ceq-I) u t i
- i 8)
+2
gk = V'(X(k)T.
+ O(IILlQI1 2)
a2E(!,
8)
-1
a2E(t,
.
.)
1- ]
= CZ(q-l) Y t -
oa1ac,
8) _
abiocj
1
(
.
")
CZ(q-l) u t - t - ]
aCiOCj
7.11
( -a)k-tcc -
yet +
kit) = -
where
Ici <
a\
cq-T- yet)
1+
1 is uniquely given by
==
I-,~
If the processes are not Gaussian this is still the best linear predictor (in the mean
square sense).
7.12
8-
80
""
[EtpT(t, 80N(t, 80 Wl
f"
f"
+
V OE =
7.14
~f:::W<Pu(W)dW
IA(eiWWI G(e iw )_
J"
(a) NP I
_"
G(e l.,")
(1 ~ a~
~2
0 2 (1
1 0
NP2 = (
ail
b o(1 - aB)
+ a~)
bo(1 - ail)
0
0
1-,2/02
0
b6(l - a5)
FTP-1F
+ 1-,2/(J2
FTQF)
T 2
F QF
F QPzQF
~ 0
7.18
1 - a2
(a) Apply a PEM to the model structure given in part (b). Alternatively, note that yet) is
an ARMA process, AO(q-l)y(t) = CO(q-l)W(t), for some Co, and use a PEM for
ARMA estimation.
(b) Let C(q-I) and I-,~v with C stable be uniquely determined from 8 by
C(Z)C(Z-I)I-,~
== A(z)A(z-l)A; +
A~
Then E(t, 8) = [A(q-l)/C(q-l)]y(/). Now the vector (al ... an f3A~ 131-,;) and 8 give
the same V(8) for all ~. Remedy: Set A~ = 1 during the optimization and compute the
correct ~-value from min V(8). As an alternative set = (a1 ... anP2) , pZ = A~IA; and
=
A~,). Minimize V w.r.t. and set ~~ = min V/N, giving
Finally 8 can be
computed by a I-to-l transformation form
e (e,
e.
e.
A(q-1)
(y(t)
u(t)
B(q-l)
A = ri, B = E, if = 1,
ii = O.
7.22
E[A(q-I){A(~-l) y(t)}
x(
yet -
A( 1.1)
q
A(;-=T)
8.2
B(q-l){A(~ __ l) U(t)}]
i)
i = 1, ... , n
u(t - i)
For the extended IV estimate the weighting matrix changes from Q to TTQT.
1o 0)
8.3
'*
8.4
8.6
8 LS
8.8
nb being preferable.
= (-C/I(2r:?1 + A2)
81v =
8.7
m; =
(-1)
(1 +
1
A2
P IV =
b 20 2
popt
(a)
IV
e2
--be
A2
= b 20 2
.
_
(c) Hmt. P
-be)
b 2 (1 + e 2 )
(1 - a 2)(1 - aef
-be(1 _ a 2 )(1 - ae)
),.4(e - a)2(1 - a 2 )
A2(e _ af]
b2a2 [b 20 2 +
(1 --
ae)
.-be (1 - ae
-be)
8.9
Hint. Show that JV(EZ(t)ZT(t c ./V(RT), and that this implies rank R ~ rank EZ(t)ZT(t).
8.n
Set
P1
rk
= Ey(t)y(t - k) and
fk =
fo
= 2:
rl
9.3
(a) p-I(l)
+ cp(t)q?(t)
AP-I(t - 1)
Set)
9.6
-1 <
9.12
[Atp-l(O)
C1
< 1, -1 <
C2
CI -
< 1
C2
(a) Hint. First show the following. Let h be an arbitrary vector and let <p(w) be the spectral
density of hTBTcp(t). Then
T -
h Mh
flt
1
Re C(e iOJ ) <p(w)dw ;:,: O.
-It
CI
9.13
C2
= -1,
C2
1- a2 (1 ~ a2
L(8 o) = - - (c- a)
aCt)
1=cac
_c__
1 - a2
+ ac 2
)
2c
(1 - a 2)(1 - ac)
Q(t)
ret)
Hint. Use Corollary 1 of Lemma A.7.
cpT(t
9.16
10.1
.
Hint.
= (u(t)
1)
(J N
'\jJ T(l,
1 ['
= N 2a 8(0) - 8
( ) N
( ') _
a
var a -
,
(b) N varU)
10.2
= 2 -
= -1, A2 = - -1---1-ac.
with eigenvalues Al
9.15
= -1
(a) var(b) =
(b) Ii = bk, j
give b1,.
12
f'.
J2
PCt)
T(t)
r
R(t)
A2 ~ 2,,-2
+ N"ia
6:1 k
(1 - a 2 - 2abf)02 + FA 2
aZ(b202 + A2)
A2(l - a 2)
= b 2 (b 2a2 +75
A2(l - b 2k 2)
~z + k 2 A2
= b.
A2
(c) var(h l ) = 2
o
> var(b)
A2
[k2(1 - b2k 2 )
1]
var(b 2) = (l + k 2)2. b 2a 2 + A2 + 02
A
> var(b)
var(b.,,) = var(b)
10.3
The system is identifiable in case (c) but not in cases (a), (b).
10.4
o~
g=02+0Zg
A
!:;
(b) a =
~O (ao
/..,2.
A = 2:[26(1 + a2 ) + 2a 2]112
w = arccos [6(1 _
10.7
10.8
;2) _a2]
aq
parameters.
2
/..,2 [ . .
2b 20 2 + /..,2]
11.3
EWN = /..,
11.4
2()
,2
'2(co--ao+b ok)2
E yt=II.+1I.
2
.
1 - (ao - bok)
+ N
+ (a... -
1) b202
/..,2
a-c
b
'=--
2)
+ ao +
/..,2(aij -- a 2f
1
?
- ao
11.6
+ ali) + 1
C 2n + 2
/..,2 2f(a5 - ao
a2 ,
N var(aj) = 1.
(a) N var(a) = J (b) Hint. Use Lemma A.2.
(c) Hint. Choose the assessment criterion
W.At(f) =
-~(aT
0)(8 - 80)(8 -
11.8
(a) P j =
~~]
1 -- C 2n + 2
aT + a2)2 +
80)T(~)
var(y) == 2m
Ey = m
+4
L (m -
J=I
2(")
j) r~(oJ)
ru
11.10
Hint. Use
a2~NI
a
(!=(!"
12.3
12.4
"),.2
(2 -b)
(e)
p(a}
P(h)
<
P(d}
L~2
12.6
var[y<l)(t) - y<2}(t)] = a 21
12.7
12.10
~
axel)
h
were
= 2
f(t e')
aEet, 8')
aEet, 8')
axel)
.
12.11
'"-1'
a 2 + var(x(o]
dist
X2 (1)
N var(S) = (1
,,2
[,
1-
a5
+ ao)2 s- bfP2 +
f...2
1]
02
= 1050
il'2 gives
f...2
N vaT(S)
= 2"
(1
o
+ ao )2
= 100
a 2 e 2ah - 1
(ahf
=N
_e- uh , b
= Q(l
a
- e- ah )
AUTHOR INDEX
Ahlen, A., 226, 358, 509
Alexander, T.S., 357
Akaike, H., 232, 444, 456
Andel, J., 456
Anderson, B.D.O., 120, 149, 158,
169,171,174,177,226,258,
259,412,576
Anderson, T.W., 129,200,292,
576
Andersson, P., 358
Ansley, e.F., 250
Aoki, M., 129
Aris, R., 8
Astrom, K.J., 8, 54, 83, 129, 149, 158, 159, 171, 226, 231, 248, 325, 357, 418, 473, 493, 508, 509, 510, 576
Bai, E.W., 120
Balakrishna, S., 456
Bamberger, W., 357
Banks, H.T., 7
Banyasz, Cs., 238, 358
Bartlett, M.S., 92, 576
Basseville, M., 357,457,509
Baur, U.,357
Bendat, J.S., 55
Benveniste, A., 285, 357, 358,
457,509
Bergland, G.D., 55
Bhansali, R.J., 291, 457
Bierman, G.J., 358
Billings, S.A. 7, 171,456
Blundell, A.J., 8
Bohlin, T., 129, 171,226,325,
358,456,492,509
Bollen, R., 510
van den Boom, A.J. W., 171,456,
509
Bosgra, O.H., 226
Borisson, U., 357
Borrie, M.S., 8
Box, G.E.P., 226, 250, 438, 456,
461,464,509,576
Brillinger, D.R., 45, 46, 55,129,
576
Bucy, R.S., 579
Burghes, D.N., 8
Butash, T.C., 456
Caines, P.E., 8, 226, 228, 357, 500, 509
Carroll, R.J., 171
Chan, C.W., 500, 509
Chatfield, C., 291
Morf,M.,171,254,294,358,374
Morgan, BJ.T., 97
Moses, R., 297,298,495
Moustakides, G., 457
Murray, W., 226, 259
Negrao, A.I., 456
Nehorai, A., 171,223,231,232,
294,305,351
Newbold, P., 250, 509
Ng, TS., 412
Nguyen, V.V., 171
Nicholson, H., 8
Norton, J.P., 8
Olçer, S., 358
Oppenheim, A.V., 129, 130
van Overbeek, A.J.M., 171
Samson, C., 358
Saridis, G.N., 357
Sastry, S., 120
Schafer, R.W., 129, 130
Scharf, L.L., 294
Schmid, Ch., 510
Schnabel, R.B., 226, 259
Schwarz, G., 457
Schwarze, G., 54
Seborg, D.E., 357
Shah, S.L., 357
Sharf, L.L., 250
Sharman, K.C., 495
Shibata, R., 449, 456
Siebert, H., 357
Sin, K.S., 120, 177,357,358,412
Sinha, N.K., 509
Smith, H., 83
Smuk, K., 546
Soderstrom, T., 8, 31, 90, 92, 95,
135,162,170, 171, 182, 183,
224,225,226,229,231,234,
239,247,248,259,265,268,
276,284,285,289,290,294,
296,297,303,304,310,344,
350,357,358,412,416,420.
433,437,449,456,464,465,
495,500,509,576
Solbrand, G., 226, 358, 510
Solo, V., 358
de Sousa, CE., 177
Stearns, S.D., 247, 349, 357
Steiglitz, K., 225
Sternby, J., 472
Stewart, G.W., 546
Stoica, P., 8, 90, 92, 95,122,123,
129,135,170,171,182,183,
223,224,225,229,231,232,
234,239,247,248,259,265,
268,276,284,285,289,290,
294,296,297,298,303,304,
305,310,351,358,412,437,
449,456,464,494,499,510,556
Strang, G., 546
Strejc, V., 546
Stuart, S., 579
Szego, G., 135
Thompson, M., 8
Treichler, J.R., 357
Trench, W.F., 298
Trulsson, E., 285, 412
Tsypkin, Ya.Z., 509
Tukey, J.W., 55
Tyssø, A., 510
Unbehauen, H., 8, 456, 510
Unwin, J.M., 254
Verbruggen, H.B., 129
Vieira, A., 254, 298
Vinter, R.B., 8
Vowden, B.J., 83
SUBJECT INDEX
Accuracy: of closed loop systems, 401; of correlation
analysis, 58; of instrumental variable estimates,
268; of Monte Carlo analysis, 576; of noise variance
estimate, 224; of prediction error estimate, 205
adaptive systems, 120, 320
Akaike's information criterion (AIC), 442, 445, 446
aliasing, 473
a posteriori probability density function, 559
approximate ML, 333
approximation for RPEM, 329
approximation models, 7, 21, 28, 204, 228
ARARX model, 154, 169
ARMA covariance evaluation, 251
ARMAX model, 149, 162, 192, 196,213,215,222,
282,331,333,346,409,416,433,478,482,489,
490,494
assessment criterion, 438, 464
asymptotic distribution, 205, 240, 268, 285, 424, 427,
428,441,458,461,547,550
asymptotic estimates, 203
asymptotically best consistent estimate, 91
autocorrelation test, 423, 457
autoregressive (AR) model, 127, 151, 199,223,253,
289,310,361,373,446,574,575
autoregressive integrated moving average (ARIMA)
model, 461, 479, 481
autoregressive moving average (ARMA) model, 97,
103,122,125,151,177,207,215,229,247,249,
257,288,355,493,573
backward prediction, 312, 368
balance equation, 4
Bartlett's formula, 571
basic assumptions, 202, 264, 383
Bayes estimation, 560
best linear unbiased estimate (BLUE) 67, 82, 83, 88,
567,569
bias, 18,562
binary polynomial, 140
BLUE for singular residual covariance, 88
BLUE under linear constraints, 83
canonical parametrization, 155, 156
central limit theorem, 550
changes of sign test, 428
chi-two (χ²) distribution, 556
Cholesky factorization, 259, 292, 314, 516
clock period, 113
closed loop operation, 25, 381, 453, 499
comparison of model structures, 440
computational aspects, 74,86,181,211,231,274,
277,292,298,310,350
computer packages, 501
condition number, 135,534
conditional expectation, 565
matrices: Cholesky factorization (continued)
generalized inverse, 520;
Hankel matrix, 181;
idempotent matrix, 463, 467, 537, 557;
inversion lemma, 323, 330, 511;
Kronecker product, 131,233,466,542;
Moore-Penrose generalized inverse, 520;
norm, 532;
null space, 518;
orthogonal complement, 518;
orthogonal matrix, 522, 527, 528, 533, 539;
orthogonal projector, 522, 538;
orthogonal triangularization, 75, 278, 527;
QR method, 75, 527;
partitioned matrix, 511;
positive definite, 120, 183, 297, 512, 515, 516, 544;
pseudoinverse, 63, 89, 518, 520, 524;
range, 518, 538;
rank, 512, 537,541;
singular values, 518, 524, 533;
Sylvester matrix, 134,246,249,307,540;
Toeplitz matrix, 181,250,294,298;
trace, 66;
unitary matrix, 131;
Vandermonde matrix, 131;
vectorization operator, 543
matrix fraction description (MFD), 182
matrix inversion lemma, 323, 330, 511
maximum a posteriori estimation, 559
maximum length PRBS, 138, 143
maximum likelihood (ML) estimation, 198,249,256,
496,559
mean value, 479
minimum variance estimation, 565, 567
minimum variance regulator, 404, 409, 418
model, 3,146
model approximation, 7, 28
model dimension determination, 71
model order, 71, 148, 152, 422
model output, 15,423
model parametrization, 146
model structure, 9, 29,146,148,188,260,275,278,
482
model structure determination, 422, 440, 482
model validation, 29, 422, 499
model verification, 499
modeling, 1, 146
modified spectral analysis, 385
modulo-two addition 137
moment generating function, 553
Monte Carlo analysis, 269, 446, 576
moving average (MA) model, 127, 151,258,291,494
multistep optimal IVM, 276
multistep prediction, 229, 453
multivariable system, 154, 155, 165,226,228,264,
277,373,384
nested model structures, 87, 278
Newton-Raphson algorithm, 212, 217, 218; for
spectral factorization, 181
noise-free part of regressor vector, 265
noisy input data, 258, 281
nonlinear models, 7
nonlinear regression, 91
nonparametric methods, 10,28,32
nonparametric models, 9
nonstationary process, 180,473,479,481,506
nonzero mean data, 473, 503
normal equations, 75,79
null hypothesis, 423
numerical accuracy, 532
on-line robustification, 498
open loop operation, 264, 403
optimal accuracy, 209
optimal experimental condition, 401, 411, 471, 544
optimal IV method, 272
optimal loss function, 497
optimal prediction, 192,216,229
optimization algorithms, 212
ordinary differential equation approach, 343
orthogonal projection, 64, 522, 538
orthogonal triangularization, 75, 278, 527
oscillator, 34
outliers, 495, 497
output error method, 198,239,409
output error model structure, 153
overdetermined linear equations, 62, 262, 278
overparametrization, 162,433,459
parameter identifiability, 167,204,388,416
parameter vector, 9, 14,60,148
parametric method, 28
parametric models, 9
Parseval's formula, 52
parsimony principle, 85,161,438,464
partial correlation coefficient, 313
periodic signals, 129
periodogram, 45
persistent excitation, 28, 117, 133,202,264,383,434,
469
pole-zero cancellation, 433
polynomial trend, 60, 77, 78, 479
polynomial trend model determination, 79
portmanteau statistics, 461
positive realness condition, 348, 352, 355
practical aspects, 348, 468
precomputations, 473
prediction, 21, 192,229
prediction error, 22,188,362
prediction error method:
asymptotic distribution, 205, 240;
basis for parsimony principle, 464;
closed loop operation, 387, 416;
comparison with IV, 276, 286;
computational aspects, 211;
consistency, 186, 203, 221;
description, 188;
indirect algorithm, 220;
optimal accuracy, 209;
recursive algorithm, 328, 345;
relation to LS, GLS and OE methods, 198,233,
236,239;
relation to ML method, 198;
statistical efficiency, 209;
underparametrized models, 209