Backpropagation Through Time: What It
Does and How to Do It
PAUL J. WERBOS
Backpropagation is now the most widely used tool in the field
of artificial neural networks. At the core of backpropagation is a
method for calculating derivatives exactly and efficiently in any
large system made up of elementary subsystems or calculations
which are represented by known, differentiable functions; thus,
backpropagation has many applications which do not involve
neural networks as such.
This paper first reviews basic backpropagation, a simple method
which is now being widely used in areas like pattern recognition
and fault diagnosis. Next, it presents the basic equations for backpropagation through time, and discusses applications to areas like
pattern recognition involving dynamic systems, systems identification, and control. Finally, it describes further extensions of this
method, to deal with systems other than neural networks, systems
involving simultaneous equations or true recurrent networks, and
other practical issues which arise with this method. Pseudocode is
provided to clarify the algorithms. The chain rule for ordered derivatives, the theorem which underlies backpropagation, is briefly
discussed.
I. INTRODUCTION
Backpropagation through time is a very powerful tool, with applications to pattern recognition,
dynamic modeling, sensitivity analysis, and the control of systems over time, among others. It can
be applied to neural networks, to econometric models, to fuzzy logic structures, to fluid dynamics
models, and to almost any system built up from elementary subsystems or calculations. The one
serious constraint is that the elementary subsystems must be represented by functions known to the
user, functions which are both continuous and differentiable (i.e., possess derivatives). For
example, the first practical application of backpropagation was for estimating a dynamic model to
predict nationalism and social communications in 1974 [1].
Unfortunately, the most general formulation of backpropagation can only be used by those who are
willing to work out the mathematics of their particular application. This paper will mainly
describe a simpler version of backpropagation, which can be translated into computer code and
applied directly by neural network users.
Section II will review the simplest and most widely used form of backpropagation, which may be
called "basic backpropagation." The concepts here will already be familiar to those who have read
the paper by Rumelhart, Hinton, and Williams [2] in the seminal book Parallel Distributed
Processing, which played a pivotal role in the development of the field. (That book also
acknowledged the prior work of Parker [3] and Le Cun [4], and the pivotal role of Charles Smith of
the Systems Development Foundation.) This section will use new notation which adds a bit of
generality and makes it easier to go on to complex applications in a rigorous manner. (The new
notation may seem unnecessary to some, but for those who have to apply backpropagation to complex
systems, it is essential.)
Section III will use the same notation to describe backpropagation through time. Backpropagation
through time has been applied to concrete problems by a number of authors, including, at least,
Watrous and Shastri [5], Sawai and Waibel et al. [6], Nguyen and Widrow [7], Jordan [8], Kawato [9],
Elman and Zipser, Narendra [10], and myself [1], [11], [12], [15]. Section IV will discuss what is
missing in this simplified discussion, and how to do better.
At its core, backpropagation is simply an efficient and exact method for calculating all the
derivatives of a single target quantity (such as pattern classification error) with respect to a
large set of input quantities (such as the parameters or weights in a classification rule).
Backpropagation through time extends this method so that it applies to dynamic systems. This allows
one to calculate the derivatives needed when optimizing an iterative analysis procedure, a neural
network with memory, or a control system which maximizes performance over time.

Manuscript received September 12, 1989; revised March 15, 1990.
The author is with the National Science Foundation, 1800 G St. NW, Washington, DC 20550.
IEEE Log Number 9039172.
U.S. Government work not protected by U.S. copyright
PROCEEDINGS OF THE IEEE, VOL. 78, NO. 10, OCTOBER 1990

II. BASIC BACKPROPAGATION
A. The Supervised Learning Problem
Basic backpropagation is currently the most popular method for performing the supervised learning
task, which is symbolized in Fig. 1.
In supervised learning, we try to adapt an artificial neural network so that its actual outputs (Ŷ)
come close to some target outputs (Y) for a training set which contains T patterns. The goal is to
adapt the parameters of the network so that it performs well for patterns from outside the training
set.
The main use of supervised learning today lies in pattern recognition work. For example, suppose
that we are trying to build a neural network which can learn to recognize handwritten ZIP codes.
(AT&T has actually done this [13], although the details are beyond the scope of this paper.) We
assume that we already have a camera and preprocessor which can digitize the image, locate the five
digits, and provide a 19 x 20 grid of ones and zeros representing the image of each digit. We want
the neural network to input the 19 x 20 image, and output a classification; for example, we might
ask the network to output four binary digits which, taken together, identify which decimal digit is
being observed.
Before adapting the parameters of the neural network, one must first obtain a training database of
actual handwritten digits and correct classifications. Suppose, for example, that this database
contains 2000 examples of handwritten digits. In that case, T = 2000. We may give each example a
label t between 1 and 2000. For each sample t, we have a record of the input pattern and the correct
classification. Each input pattern consists of 380 numbers, which may be viewed as a vector with 380
components; we may call this vector X(t). The desired classification consists of four numbers,
which may be treated as a vector Y(t). The actual output of the network will be Ŷ(t), which may
differ from the desired output Y(t), especially in the period before the network has been adapted.
To solve the supervised learning problem, there are two steps:
We must specify the "topology" (connections and equations) for a network which inputs X(t) and
outputs a four-component vector Ŷ(t), an approximation to Y(t). The relation between the inputs and
outputs must depend on a set of weights (parameters) W which can be adjusted.
We must specify a "learning rule," a procedure for adjusting the weights W so as to make the actual
outputs Ŷ(t) approximate the desired outputs Y(t).
Basic backpropagation is currently the most popular learning rule used in supervised learning. It
is generally used with a very simple network design, to be described in the next section, but the
same approach can be used with any network of differentiable functions, as will be discussed in
Section IV.
Even when we use a simple network design, the vectors X(t) and Y(t) need not be made of ones and
zeros. They can be made up of any values which the network is capable of inputting and outputting.
Let us denote the components of X(t) as X_1(t) ... X_m(t) so that there are m inputs to the network.
Let us denote the components of Y(t) as Y_1(t) ... Y_n(t) so that we have n outputs. Throughout this
paper, the components of a vector will be represented by the same letter as the vector itself, in
the same case; this convention turns out to be convenient because x(t) will represent a different
vector, very closely related to X(t).
Fig. 1 illustrates the supervised learning task in the general case. Given a history of
X(1) ... X(T) and Y(1) ... Y(T), we want to find a mapping from X to Y which will perform well when
we encounter new vectors X outside the training set. The index "t" may be interpreted either as a
time index or as a pattern number index; however, this section will not assume that the order of
patterns is meaningful.
Fig. 1. Schematic of the supervised learning task.

B. Simple Feedforward Networks
Before we specify a learning rule, we have to define exactly how the outputs of a neural net depend
on its inputs and weights. In basic backpropagation, we assume the following logic:
$$ x_i = X_i, \quad 1 \le i \le m \tag{1} $$
$$ \mathrm{net}_i = \sum_{j=1}^{i-1} W_{ij} x_j, \quad m < i \le N+n \tag{2} $$
$$ x_i = s(\mathrm{net}_i), \quad m < i \le N+n \tag{3} $$
$$ \hat{Y}_i = x_{i+N}, \quad 1 \le i \le n \tag{4} $$
where the function s in (3) is usually the following sigmoidal function:
$$ s(z) = 1/(1 + e^{-z}) \tag{5} $$
and where N is a constant which can be any integer you choose as long as it is no less than m. The
value of N + n decides how many neurons are in the network (if we include inputs as neurons).
Intuitively, net_i represents the total level of voltage exciting a neuron, and x_i represents the
intensity of the resulting output from the neuron. (x_i is sometimes called the "activation level"
of the neuron.) It is conventional to assume that there is a threshold or constant weight W_i0 added
to the right side of (2); however, we can achieve the same effect by assuming that one of the inputs
(such as X_m) is always 1.
The significance of these equations is illustrated in Fig. 2. There are N + n circles, representing
all of the neurons in the network, including the input neurons. The first m circles are really just
copies of the inputs X_1 ... X_m; they are included as part of the vector x only as a way of
simplifying the notation. Every other neuron in the network, such as neuron number i, which
calculates net_i and x_i, takes input from every cell which precedes it in the network. Even the
last output cell, which generates Ŷ_n, takes input from other output cells, such as the one which
outputs Ŷ_{n-1}.
Fig. 2. Network design for basic backpropagation.
In neural network terminology, this network is "fully connected" in the extreme. As a practical
matter, it is usually desirable to limit the connections between neurons. This can be done by
simply fixing some of the weights W_ij to zero so that they drop out of all calculations. For
example, most researchers prefer to use "layered" networks, in which all connection weights W_ij are
zeroed out, except for those going from one "layer" (subset of neurons) to the next layer. In
general, one may zero out as many or as few of the weights as one likes, based on one's
understanding of individual applications. For those who first begin this work, it is conventional
to define only three layers: an input layer, a "hidden" layer, and an output layer. This section
will assume the full range of allowed connections, simply for the sake of generality.
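As a rough illustration of how zeroing weights turns the fully connected design into a layered one, the following Python sketch builds a 0/1 mask over the weights W_ij of (2). The function name layered_mask and the single hidden layer are assumptions made for this example only; the text numbers neurons from 1, while the code numbers them from 0.

```python
def layered_mask(m, hidden, n):
    """Return mask[i][j] = 1 where W[i][j] is kept in a three-layer network.

    Neurons are ordered as in the text: m inputs, then `hidden` hidden
    cells, then n outputs, so N = m + hidden and there are N + n neurons.
    """
    N = m + hidden
    total = N + n
    mask = [[0] * total for _ in range(total)]
    for i in range(m, N):            # hidden cells read only from the inputs
        for j in range(m):
            mask[i][j] = 1
    for i in range(N, total):        # output cells read only from hidden cells
        for j in range(m, N):
            mask[i][j] = 1
    return mask


mask = layered_mask(m=3, hidden=2, n=1)
kept = sum(sum(row) for row in mask)   # number of weights left free to adapt
```

With 3 inputs, 2 hidden cells, and 1 output, the layered mask keeps 8 weights, whereas the fully connected design of (2) would allow 3 + 4 + 5 = 12.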
In computer code, we could represent this network as a Fortran subroutine (assuming a Fortran which
distinguishes upper case from lower case):
      SUBROUTINE NET(X, W, x, Yhat)
      REAL X(m), W(N+n, N+n), x(N+n), Yhat(n), net
      INTEGER i, j, m, n, N
C     First insert the inputs, as per equation (1)
      DO 1 i = 1, m
1     x(i) = X(i)
C     Next implement (2) and (3) together for each value of i
      DO 1000 i = m+1, N+n
C       calculate net_i as a running sum, based on (2)
        net = 0
        DO 10 j = 1, i-1
10      net = net + W(i, j)*x(j)
C       finally, calculate x_i based on (3) and (5)
1000  x(i) = 1/(1+exp(-net))
C     Finally, copy over the outputs, as per (4)
      DO 2000 i = 1, n
2000  Yhat(i) = x(i+N)
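For readers who prefer a language that can be run directly, the subroutine above might be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the name net_forward, the nested-list weight layout, and the toy sizes are choices made for this example, and indices run from 0 rather than 1.

```python
import math


def net_forward(X, W, m, N, n):
    """Forward pass for equations (1)-(5).

    X : list of m inputs.
    W : (N+n) x (N+n) nested list; only W[i][j] with j < i is used.
    Returns (x, Yhat): all N+n activations and the n outputs.
    """
    total = N + n
    x = list(X) + [0.0] * (total - m)                      # equation (1)
    for i in range(m, total):
        net = sum(W[i][j] * x[j] for j in range(i))        # equation (2)
        x[i] = 1.0 / (1.0 + math.exp(-net))                # equations (3), (5)
    Yhat = x[N:]                                           # equation (4)
    return x, Yhat


# Toy usage: 2 inputs, 1 hidden cell (N = 3), 1 output, all weights zero.
m, N, n = 2, 3, 1
W = [[0.0] * (N + n) for _ in range(N + n)]
x, Yhat = net_forward([1.0, 0.5], W, m, N, n)
```

With every weight at zero, each active neuron sees net_i = 0 and therefore outputs s(0) = 0.5, which gives a quick sanity check on the indexing.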
In the pseudocode, note that X and W are technically the inputs to the subroutine, while x and Yhat
are the outputs. Yhat is usually regarded as "the" output of the network, but x may also have its
uses outside of the subroutine proper, as will be seen in the next section.

C. Adapting the Network: Approach
In basic backpropagation, we choose the weights W_ij so as to minimize square error over the
training set:
$$ E = \sum_{t=1}^{T} E(t) = \sum_{t=1}^{T} \sum_{i=1}^{n} (1/2)(\hat{Y}_i(t) - Y_i(t))^2. \tag{6} $$
This is simply a special case of the well-known method of least squares, used very often in
statistics, econometrics, and engineering; the uniqueness of backpropagation lies in how this
expression is minimized. The approach used here is illustrated in Fig. 3.
Fig. 3. Basic backpropagation (in pattern learning).
In basic backpropagation, we start with arbitrary values for the weights W. (It is usual to choose
random numbers in the range from -0.1 to 0.1, but it may be better to guess the weights based on
prior information, in cases where prior information is available.) Next, we calculate the outputs
Ŷ(t) and the errors E(t) for that set of weights. Then we calculate the derivatives of E with
respect to all of the weights; this is indicated by the dotted lines in Fig. 3. If increasing a
given weight would lead to more error, we adjust that weight downwards. If increasing a weight leads
to less error, we adjust it upwards. After adjusting all the weights up or down, we start all over,
and keep on going through this process until the weights and the error settle down. (Some
researchers iterate until the error is close to zero; however, if the number of training patterns
exceeds the number of weights in the network, as recommended by studies on generalization, it may
not be possible for the error to reach zero.) The uniqueness of backpropagation lies in the method
used to calculate the derivatives exactly for all of the weights in only one pass through the
system.

D. Calculating Derivatives: Theoretical Background
Many papers on backpropagation suggest that we need only use the conventional chain rule for partial
derivatives to calculate the derivatives of E with respect to all of the weights. Under certain
conditions, this can be a rigorous approach, but its generality is limited, and it requires great
care with the side conditions (which are rarely spelled out); calculations of this sort can easily
become confused and erroneous when networks and applications grow complex. Even when using (7)
below, it is a good idea to test one's gradient calculations using explicit perturbations in order
to be sure that there is no bug in one's code.
When the idea of backpropagation was first presented to the Harvard faculty in 1972, they expressed
legitimate concern about the validity of the rather complex calculations involved. To deal with
this problem, I proved a new chain rule for ordered derivatives:
$$ \frac{\partial^{+} TARGET}{\partial z_i} = \frac{\partial TARGET}{\partial z_i} + \sum_{j>i} \frac{\partial^{+} TARGET}{\partial z_j} * \frac{\partial z_j}{\partial z_i} \tag{7} $$
where the derivatives with the superscript represent ordered derivatives, and the derivatives
without superscripts represent ordinary partial derivatives. This chain rule is valid only for
ordered systems where the values to be calculated can be calculated one by one (if necessary) in the
order z_1, z_2, ..., z_n, TARGET. The simple partial derivatives represent the direct impact of z_i
on z_j through the system equation which determines z_j. The ordered derivative represents the total
impact of z_i on TARGET, accounting for both the direct and indirect effects. For example, suppose
that we had a simple system governed by the following two equations, in order:
$$ z_2 = 4 * z_1 $$
$$ z_3 = 3 * z_1 + 5 * z_2. $$
The "simple" partial derivative of z_3 with respect to z_1 (the direct effect) is 3; to calculate
the simple effect, we only look at the equation which determines z_3. However, the ordered
derivative of z_3 with respect to z_1 is 23 (that is, 3 + 5 * 4) because of the indirect impact by
way of z_2. The simple partial derivative measures what happens when we increase z_1 (e.g., by 1, in
this example) and assume that everything else (like z_2) in the equation which determines z_3
remains constant. The ordered derivative measures what happens when we increase z_1, and also
recalculate all other quantities, like z_2, which are later than z_1 in the causal ordering we
impose on the system.
This chain rule provides a straightforward, plodding, "linear" recipe for how to calculate the
derivatives of a given TARGET variable with respect to all of the inputs (and parameters) of an
ordered differentiable system in only one pass through the system. This paper will not explain this
chain rule in detail since lengthy tutorials have been published elsewhere [1], [11]. But there is
one point worth noting: because we are calculating ordered derivatives of one target variable, we
can use a simpler notation, a notation which works out to be easier to use in complex practical
examples [11]. We can write the ordered derivative of the TARGET with respect to z_i as "F_z_i,"
which may be described as "the feedback to z_i." In basic backpropagation, the TARGET variable of
interest is the error E. This changes the appearance of our chain rule in that case to
$$ F\_z_i = \frac{\partial E}{\partial z_i} + \sum_{j>i} F\_z_j * \frac{\partial z_j}{\partial z_i}. \tag{8} $$
For purposes of debugging, one can calculate the true value of any ordered derivative simply by
perturbing z_i at the point in the program where z_i is calculated; this is particularly useful when
applying backpropagation to a complex network of functions other than neural networks.

E. Adapting the Network: Equations
For a given set of weights W, it is easy to use (1)-(6) to calculate Ŷ(t) and E(t) for each pattern
t. The trick is in how we then calculate the derivatives.
Let us use the prefix "F_" to indicate the ordered derivative of E with respect to whatever variable
the "F_" precedes. Thus, for example,
$$ F\_\hat{Y}_i(t) = \frac{\partial E}{\partial \hat{Y}_i(t)} = \hat{Y}_i(t) - Y_i(t) \tag{9} $$
which follows simply by differentiating (6). By the chain rule for ordered derivatives as expressed
in (8),
$$ F\_x_i(t) = F\_\hat{Y}_{i-N}(t) + \sum_{j=i+1}^{N+n} W_{ji} * F\_\mathrm{net}_j(t), \quad i = N+n, \ldots, m+1 \tag{10} $$
$$ F\_\mathrm{net}_i(t) = s'(\mathrm{net}_i) * F\_x_i(t), \quad i = N+n, \ldots, m+1 \tag{11} $$
$$ F\_W_{ij} = \sum_{t=1}^{T} F\_\mathrm{net}_i(t) * x_j(t) \tag{12} $$
where s' is the derivative of s(z) as defined in (5) and F_Ŷ_k is assumed to be zero for k ≤ 0. Note
how (10) requires us to run backwards through the network in order to calculate the derivatives, as
illustrated in Fig. 4; this backwards propagation of information is what gives backpropagation its
name. A little calculus and algebra, starting from (5), shows us that
$$ s'(z) = s(z) * (1 - s(z)), \tag{13} $$
which we can use when we implement (11). Finally, to adapt the weights, the usual method is to set
$$ \mathrm{New}\ W_{ij} = W_{ij} - \mathrm{learning\_rate} * F\_W_{ij} \tag{14} $$
where the learning_rate is some small constant chosen on an ad hoc basis. (The usual procedure is to
make it as large as possible, up to 1, until the error starts to diverge; however, there are more
analytic procedures available [11].)
Fig. 4. Backwards flow of derivative calculation.

F. Adapting the Network: Code
The key part of basic backpropagation, (10)-(13), may be coded up into a "dual" subroutine, as
follows.
      SUBROUTINE F_NET(F_Yhat, W, x, F_W)
      REAL F_Yhat(n), W(N+n, N+n), x(N+n),
     &     F_W(N+n, N+n), F_net(N+n), F_x(N+n)
      INTEGER i, j, n, m, N
C     Initialize equation (10)
      DO 1 i = 1, N
1     F_x(i) = 0
      DO 2 i = 1, n
2     F_x(i+N) = F_Yhat(i)
C     RUN THROUGH (10)-(12) AS A SET, FOR i RUNNING BACKWARDS
      DO 1000 i = N+n, m+1, -1
C       complete (10) for the current value of i
C       (modify "DO 10" if needed to be sure nothing is done if i = N+n)
        DO 10 j = i+1, N+n
10      F_x(i) = F_x(i) + W(j, i)*F_net(j)
C       next implement (11), exploiting (13)
        F_net(i) = F_x(i)*x(i)*(1.-x(i))
C       then implement (12) for the current value of i
        DO 12 j = 1, i-1
12      F_W(i, j) = F_net(i)*x(j)
1000  CONTINUE
Note that the array F_W is the only output of this subroutine.
Equation (14) represented "batch learning," in which weights are adjusted only after all T patterns
are processed. It is more common to use pattern learning, in which the weights are continually
updated after each observation. Pattern learning may be represented as follows:
C     PATTERN LEARNING
      DO 1000 pass_number = 1, maximum_passes
      DO 100 t = 1, T
        CALL NET(X(t), W, x, Yhat)
C       Next implement equation (9)
        DO 9 i = 1, n
9       F_Yhat(i) = Yhat(i) - Y(t, i)
C       Next implement (10)-(12)
        CALL F_NET(F_Yhat, W, x, F_W)
C       Next implement (14)
C       Note how weights are updated within the "DO 100" loop.
        DO 14 i = m+1, N+n
        DO 14 j = 1, i-1
14      W(i, j) = W(i, j) - learning_rate*F_W(i, j)
100   CONTINUE
1000  CONTINUE
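The dual subroutine, together with the advice to test gradients by explicit perturbation, can be sketched in Python as follows. This is a minimal illustration rather than a tested port of the pseudocode: the function names forward, error, and backward, the random toy network, and the step size eps are all assumptions made for the example, and indices run from 0.

```python
import math
import random


def forward(X, W, m, N, n):
    """Equations (1)-(5): returns the full activation vector x."""
    x = list(X) + [0.0] * (N + n - m)
    for i in range(m, N + n):
        net = sum(W[i][j] * x[j] for j in range(i))
        x[i] = 1.0 / (1.0 + math.exp(-net))
    return x


def error(X, Y, W, m, N, n):
    """Equation (6) for a single pattern."""
    x = forward(X, W, m, N, n)
    return 0.5 * sum((x[N + k] - Y[k]) ** 2 for k in range(n))


def backward(X, Y, W, m, N, n):
    """Equations (9)-(13): returns F_W for one pattern."""
    total = N + n
    x = forward(X, W, m, N, n)
    F_x = [0.0] * total
    for k in range(n):
        F_x[N + k] = x[N + k] - Y[k]                          # equation (9)
    F_net = [0.0] * total
    F_W = [[0.0] * total for _ in range(total)]
    for i in range(total - 1, m - 1, -1):                     # i runs backwards
        F_x[i] += sum(W[j][i] * F_net[j] for j in range(i + 1, total))  # (10)
        F_net[i] = F_x[i] * x[i] * (1.0 - x[i])               # (11) with (13)
        for j in range(i):
            F_W[i][j] = F_net[i] * x[j]                       # (12), one pattern
    return F_W


# Compare F_W with central differences obtained by perturbing each weight.
random.seed(0)
m, N, n = 2, 4, 2
total = N + n
W = [[random.uniform(-0.1, 0.1) if (i >= m and j < i) else 0.0
      for j in range(total)] for i in range(total)]
X, Y = [0.3, -0.7], [1.0, 0.0]
F_W = backward(X, Y, W, m, N, n)
eps, max_gap = 1e-6, 0.0
for i in range(m, total):
    for j in range(i):
        W[i][j] += eps
        e_plus = error(X, Y, W, m, N, n)
        W[i][j] -= 2 * eps
        e_minus = error(X, Y, W, m, N, n)
        W[i][j] += eps                                        # restore the weight
        numeric = (e_plus - e_minus) / (2 * eps)
        max_gap = max(max_gap, abs(numeric - F_W[i][j]))
```

If the backward pass is coded correctly, max_gap should be far smaller than the derivatives themselves; a large gap usually points to an indexing error in (10) or (12).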
The key point in pattern learning is that the weights W are adjusted in response to the current
vector F_W, which only depends on the current pattern t; the weights are adjusted after each pattern
is processed. (In batch learning, by contrast, the weights are adjusted only after the "DO 100" loop
is completed.)
In practice, maximum_passes is usually set to an enormous number; the loop is exited only when a
test of convergence is passed, a test of error size or weight change which can be injected easily
into the loop. True real-time learning is like pattern learning, but with only one pass through the
data and no memory of earlier times t. (The equations above could be implemented easily enough as a
real-time learning scheme; however, this will not be true for backpropagation through time.) The
term "on-line learning" is sometimes used to represent a situation which could be pattern learning
or could be real-time learning. Most people using basic backpropagation now use pattern learning
rather than real-time learning because, with their data sets, many passes through the data are
needed to ensure convergence of the weights.
The reader should be warned that I have not actually tested the code here. It is presented simply as
a way of explaining more precisely the preceding ideas. The C implementations which I have worked
with have been less transparent, and harder to debug, in part because of the absence of range
checking in that language. It is often argued that people "who know what they are doing" do not need
range checking and the like; however, people who think they never make mistakes should probably not
be writing this kind of code. With neural network code, especially, good diagnostics and tests are
very important because bugs can lead to slow convergence and oscillation, problems which are hard
to track down, and are easily misattributed to the algorithm in use. If one must use a language
without range checking, it is extremely important to maintain a version of the code which is highly
transparent and safe, however inefficient it may be, for diagnostic purposes.

III. BACKPROPAGATION THROUGH TIME
A. Background
Backpropagation through time, like basic backpropagation, is used most often in pattern recognition
today. Therefore, this section will focus on such applications, using notation like that of the
previous section. See Section IV for other applications.
In some applications, such as speech recognition or submarine detection, our classification at time
t will be more accurate if we can account for what we saw at earlier times. Even though the training
set still fits the same format as above, we want to use a more powerful class of networks to do the
classification; we want the output of the network at time t to account for variables at earlier
times (as in Fig. 5).
Fig. 5. Generalized network design with time lags.
The Introduction cited a number of examples where such "memory" of previous time periods is very
important. For example, it is easier to recognize moving objects if our network accounts for changes
in the scene from the time t - 1 to time t, which requires memory of time t - 1. Many of the best
pattern recognition algorithms involve a kind of "relaxation" approach where the representation of
the world at time t is based on an adjustment of the representation at time t - 1; this requires
memory of the internal network variables for time t - 1. (Even Kalman filtering requires such a
representation.)
B. Example of a Recurrent Network
Backpropagation can be applied to any system with a well-defined order of calculations, even if
those calculations depend on past calculations within the network itself. For the sake of
generality, I will show how this works for the network design shown in Fig. 5 where every neuron is
potentially allowed to input values from any of the neurons at the two previous time periods
(including, of course, the input neurons). To avoid excess clutter, Fig. 5 shows the hidden and
output sections of the network (parallel to Fig. 2) only for time T, but they are present at other
times as well.
To translate this network into a mathematical system, we can simply replace (2) above by
$$ \mathrm{net}_i(t) = \sum_{j=1}^{i-1} W_{ij} x_j(t) + \sum_{j=1}^{N+n} W'_{ij} x_j(t-1) + \sum_{j=1}^{N+n} W''_{ij} x_j(t-2). \tag{15} $$
Again, we can simply fix some of the weights to be zero, if we so choose, in order to simplify the
network. In most applications today, the W'' weights are fixed to zero (i.e., erased from all
formulas), and all the W' weights are fixed to zero as well, except for W'_ii. This is done in part
for the sake of parsimony, and in part for historical reasons. (The "time-delay neural networks" of
Watrous and Shastri [5] assumed that special case.) Here, I deliberately include extra terms for the
sake of generality. I allow for the fact that all active neurons (neurons other than input neurons)
can be allowed to input the outputs of any other neurons if there is a time lag in the connection.
The weights W' and W'' are the weights on those time-lagged connections between neurons. [Lags of
more than two periods are also easy to manage; they are treated just as one would expect from
seeing how we handle lag-two terms, as a special case of (7).]
These equations could be embodied in a subroutine:
      SUBROUTINE NET2(X(t), W, W', W'', x(t-2), x(t-1), x(t), Yhat),
which is programmed just like the subroutine NET, with the modifications one would expect from (15).
The output arrays are x(t) and Yhat.
When we call this subroutine for the first time, at t = 1, we face a minor technical problem: there
is no value for x(-1) or x(0), both of which we need as inputs. In principle, we can use any values
we wish to choose; the choice of x(-1) and x(0) is essentially part of the definition of our
network. Most people simply set these vectors to zero, and argue that their network will start out
with a blank slate in classifying whatever dynamic pattern is at hand, both in the training set and
in later applications. (Statisticians have been known to treat these vectors as weights, in effect,
to be adapted along with the other weights in the network. This works fine in the training set, but
opens up questions of what to do when one applies the network to new data.)
In this section, I will assume that the data run from an initial time t = 1 through to a final time
t = T, which plays a crucial role in the derivative calculations. Section IV will show how this
assumption can be relaxed somewhat.
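To make the time-lagged design concrete, the forward calculation of (15) over a whole sequence might be sketched in Python as below. The name net2_forward, the arguments W1 and W2 (standing in for W' and W''), and the toy network are assumptions made for this example; x(0) and x(-1) are set to zero, following the common choice described above.

```python
import math


def net2_forward(X_seq, W, W1, W2, m, N, n):
    """Forward pass for equation (15) over times t = 1..T.

    X_seq     : list of T input vectors, each of length m.
    W, W1, W2 : (N+n) x (N+n) nested lists for lags 0, 1, and 2.
    Returns (xs, Yhats), the activations and outputs at each time.
    """
    total = N + n
    prev2 = [0.0] * total              # x(t-2), starts as x(-1) = 0
    prev1 = [0.0] * total              # x(t-1), starts as x(0) = 0
    xs, Yhats = [], []
    for X in X_seq:
        x = list(X) + [0.0] * (total - m)
        for i in range(m, total):
            net = sum(W[i][j] * x[j] for j in range(i))
            net += sum(W1[i][j] * prev1[j] + W2[i][j] * prev2[j]
                       for j in range(total))
            x[i] = 1.0 / (1.0 + math.exp(-net))
        xs.append(x)
        Yhats.append(x[N:])
        prev2, prev1 = prev1, x        # shift the memory by one period
    return xs, Yhats


# Toy usage: one input, one output neuron whose only weight is W'_ii = 2.
m, N, n = 1, 1, 1
zero = lambda: [[0.0, 0.0], [0.0, 0.0]]
W, W1, W2 = zero(), zero(), zero()
W1[1][1] = 2.0
xs, Yhats = net2_forward([[0.0], [0.0]], W, W1, W2, m, N, n)
```

At t = 1 the neuron has no memory to draw on and outputs s(0) = 0.5; at t = 2 it sees its own previous output through W'_ii, so net = 2 * 0.5 = 1 and the output is s(1).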
First, to calculate the derivatives, we need a new subroutine, dual to NET2.
SUBROUTINE F_NET2(F_Yhat, W, W , W ” , x, F-net,
F-net’, F-net“, F-W, F - W , F - W )
REAL F-Yhath), W(N+n, N+n),
W(N+n,N+n), W”(N+n, N+n)
REAL x(N+ n),F-net(N n),F-net’(N + n ) ,
F-net”(N n )
REAL F- W(N n, N + n),F-W(N n, N + n),
F- W”(N n, N n ) ,F- x (N n )
INTEGER i, j , n, rn, N
Initialize equation (16)
DO l i = l , N
1
F-x(i) =O.
D O 2 i = l,n
2
F-x(i+ N)=F-Yhat(i)
RUN THROUGH (16), (IIAND
), (12) AS A SET,
RUNNING BACKWARDS
DO 1000 i=N+n,rn+l,-I
first complete (16)
DO 161 j = i+l,N+n
161
F-x(i)=F-x(i)+ W ( j,i)*F-net( j )
D O 162; = rn+l,N+n
162
F- x (i) =F- x ( i ) W’( j , i ) * F-n et ’( j )
W”(j,i)*F-net”( j )
next implement (11)
F-net(i) =F-x(i)*x(i)*(l - x ( i ) )
implement (12), (17), and (18)
(as running sums)
DO 1 2 j = 1,;-I
12
F-W(i,j)=F-W(i,j)
F-net(i)*x ( j )
DO 1718 j =I,N+n
F-W(i,j)=F-W(i,j)
F-net’(i )*x ( ;
1718
F-W(i, j )=F-W(i, j )
+F-net”(i)*x(j)
1000
CONTINUE
+
+
C
zyxwvu
+
+
+
+
+
zyxwvut
zyxwvuts
C. Adapting the Network: Equations
C
C
zyxwvuts
zyxwvutsrq
C
To calculate the derivatives of F-W,,, we use the same
equations as before, except that (IO) i s replaced by
+
N+”
F-x,(t) = F-)ir-N(t)
+ 1 =c, + l
N+n
+ I=rn+l
W,;
W,
* F-net,(t)
C
C
* F-net,(t + 1)
zyxwvu
N+n
+ / = cm + l
’
+
W;
* F-net,(t + 2).
(16)
Once again, if one wants to fix the W” terms to zero, one
can simply delete the rightmost term.
Notice that this equation makes it impossible for U S to
calculate F-x,(t) and F-net,(t) until after F-net,(t
1) and
2) are already known; therefore, we can only use
F-net,(t
this equation by proceeding backwards in time, calculating
F-net for time T, and then working our way backwards to
time 1.
To adapt this network, of course, we need to calculate
F-W:, and F-W; as well as F-W,,:
+
+
+
+
T
c F-net,(t + 1) * x,(t)
Notice that the last two DO loops have been set up to
perform running sums, to simplify what follows.
Finally, we may adapt the weights as follows, by batch
learning, where I use theabbreviation x(i,), to represent the
vector formed by x(i,j) across all j .
zyxwvutsrqponmlk
F-W;, =
(17)
I l l
r
F-W;
=
t=l
F-net,(t
+ 2) * x,(t).
(18)
+
+
In all of these calculations, F-net(T
1) and F-net(T
2)
should be treated as zero. For programming convenience,
Iwill later define quantities like F-net;(t) = F-net,(t I), but
this i s purely a convenience; the subscript “ i ” and the time
argument are enough to identify which derivative i s being
represented. (In other words, net,(f) represents a specific
quantity z, as in (8), and F-net,(t) represents the ordered
derivative of E with respect to that quantity.)
C
C
+
D. Adapting the Network: Code
To fully understand the meaning and implications of
these equations, it may help to run through a simple (hypothetical) implementation.
WERBOS: BACKPROPACATION THROUGH TIME
C
C
REAL x(-l:T,N+n),Yhat(T,n)
/ (2*(N+n)) * 0.01
DATA x(O,),X(-I,)
D O 1000 pass-number=l, maximum-passes
First calculate outputs and errors in
a forward pass
D O 100t=I,T
CALL NET2(X(t),W, W’, W , x (t- 21,
100
x(t-l),x(t,),Yhat(t,))
Initialize the running sums to 0 and
set F-net(T), F-netfT+I) to 0
DO 200 i = m + l , N + n
F-net’(i)=O.
F-net”(i)=O.
D O 199j = l,N+n
F-W(i, j ) = O .
F - W ( i , j ) =O.
F- W ( i ,j ) =O.
199
CONTl N UE
200
zyxwv
1555
I
zyxwvutsrqponm
zyxwvutsrqpo
zyxwvutsrq
zyxwvutsrq
zyxwvutsrq
zyxwvutsrq
C NEXT CALCULATE THE DERIVATIVES IN A SWEEP
BACKWARDS THROUGH TIME
DO 500 t = T,l,-I
C
First, calculate the errors at the
C
current time t
DO 410; = 1,n
410
F-Yhat(i)= Yhat(t,i)- Y ( t ,i)
C
Next, calculate F-net(t) for time t and
C
update the F-W running sums
call F_NET2(F_Yhat,W,w',W,X(t,),
F-net, +net', F-net", F-W,
F-W', F-W")
C
Move F-net(t+l) to F-net(t), in effect,
C
to prepare for a new t value
DO 420 i = m + l , N + n
F-net"(i) =F-net'(i)
420
F-net'(i)= F-net(i)
500
CONTINUE
C FINALLY, UPDATE THE WEIGHTS BY
STEEPEST DESCENT
DO 999 i = m+l,N+n
DO 998 j = I,N+n
W(i, j ) = W(i, j )learning-rate*F-W(i, j )
W ( i ,j)=W'(i,j)learning-rate*F-W(i, j )
998
W ( i , j)-learningrate*
F - W ( i , j);
999
CONTl N UE
1000
CONTINUE
Once again, note that we have to go backwards in time
in order to get the required derivatives. (There are ways to
do these calculations in forward time, but exact results
require the calculation of an entire lacobian matrix, which
i s far more expensive with large networks.) For backpropagation through time, the natural way to adapt the network
i s in one big batch. Also note that we need to store a lot of
intermediate information (which is inconsistent with realtimeadaptation).This storagecan be reduced by clever programming if w' and w" are sparse, but it cannot be eliminated altogether.
In using backpropagation through time, we usually need to use much smaller learning rates than we do in basic backpropagation, if we use steepest descent at all. In my experience [20], it may also help to start out by fixing the W' weights to zero (or to 1 when we want to force memory) in an initial phase of adaptation, and slowly free them up.
In some applications, we may not really care about errors in classification at all times t. In speech recognition, for example, we may only care about errors at the end of a word or phoneme; we usually do output a preliminary classification before the phoneme has been finished, but we usually do not care about the accuracy of that preliminary classification. In such cases, we may simply set F_Yhat to zero in the times we do not care about. To be more sophisticated, we may replace (6) by a more precise model of what we do care about; whatever we choose, it should be simple to replace (9) and the F_Yhat loop accordingly.
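As a small illustration of setting F_Yhat to zero at the times we do not care about, the following Python sketch builds the error derivatives for a whole sequence from a list of flags. The function name f_yhat_sequence and the care list are hypothetical; the square-error form Yhat - Y is the one used in the F_Yhat loop above.

```python
def f_yhat_sequence(Yhat, Y, care):
    """Return F_Yhat(t, i) for every t; care[t] is True where the error matters."""
    return [[(yh - y) if care[t] else 0.0 for yh, y in zip(Yhat[t], Y[t])]
            for t in range(len(Y))]

# Example: only the classification at the final time step is scored.
Yhat = [[0.6, 0.4], [0.7, 0.3], [0.9, 0.1]]
Y = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
F_Yhat = f_yhat_sequence(Yhat, Y, care=[False, False, True])
```

The backward sweep itself is unchanged; the zeroed rows simply contribute no direct error at those times, while feedback from later times still flows through them.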
IV. EXTENSIONS OF THE METHOD
Backpropagation through time is a very general method, with many extensions. This section will try to describe the most important of these extensions.
A. Use of Other Networks
The network shown in (1)-(5) is a very simple, basic network. Backpropagation can be used to adapt a wide variety of other networks, including networks representing econometric models, systems of simultaneous equations, etc. Naturally, when one writes computer programs to implement a different kind of network, one must either describe which alternative network one chooses or else put options into the program to give the user this choice.
In the neural network field, users are often given a choice of network "topology." This simply means that they are asked to declare which subset of the possible weights/connections will actually be used. Every weight removed from (15) should be removed from (16) as well, along with (12) and (14) (or whichever apply to that weight); therefore, simplifying the network by removing weights simplifies all the other calculations as well. (Mathematically, this is the same as fixing these weights to zero.) Typically, people will remove an entire block of weights, such that the limits of the sums in our equations are all shrunk.
In a truly brain-like network, each neuron [in (15)] will only receive input from a small number of other cells. Neuroscientists do not agree on how many inputs are typical; some cite numbers on the order of 100 inputs per cell, while others quote 10 000. In any case, all of these estimates are small compared to the billions of cells present. To implement this kind of network efficiently on a conventional computer, one would use a linked list or a list of offsets to represent the connections actually implemented for each cell; the same strategy can be used to implement the backwards calculations and keep the connection costs low. Similar tricks are possible in parallel computers of all types.
Many researchers are interested in devising ways to automatically make and break connections so that users will not have to specify all this information in advance [20]. The research on topology is hard to summarize since it is a mixture of normal science, sophisticated epistemology, and extensive ad hoc experimentation; however, the paper by Guyon et al. [13] is an excellent example of what works in practice.
Even in the neural network field, many programmers try to avoid the calculation of the exponential in (5). Depending on what kind of processor one has available, this calculation can multiply run times by a significant factor.
In the first published paper which discussed backpropagation at length as a way to adapt neural networks [14], I proposed the use of an artificial neuron ("continuous logic unit," CLU) based on
$$ s(z) = \begin{cases} 0, & z < 0 \\ z, & 0 < z < 1 \\ 1, & z > 1. \end{cases} $$
This leads to a very simple derivative as well. Unfortunately, the second derivatives of this function are not well behaved, which can affect the efficiency of some applications. Still, many programmers are now using piecewise linear approximations to (5), along with lookup tables, which can work relatively well in some applications. In earlier experiments, I have also found good uses for a Taylor series approximation:
$$ s(z) = \begin{cases} 1/(1 - z + 0.5 z^2), & z < 0 \\ 1 - 1/(1 + z + 0.5 z^2), & z > 0. \end{cases} $$
In a similar spirit, it is common to speed up learning by "stretching out" s(z) so that it goes from -1 to 1 instead of 0 to 1.
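For concreteness, the CLU and a stretched sigmoid might be coded as below. This is a sketch, not code from [14]: the function names are hypothetical, and the particular stretching 2 s(z) - 1 is an assumption (it is one simple rescaling that runs from -1 to 1, and it happens to equal tanh(z/2)).

```python
import math

def logistic(z):
    """The usual sigmoid, which requires one exponential per call."""
    return 1.0 / (1.0 + math.exp(-z))

def clu(z):
    """Continuous logic unit: 0 below 0, z between 0 and 1, 1 above 1."""
    return min(1.0, max(0.0, z))

def clu_derivative(z):
    """Slope is 1 inside (0, 1) and 0 outside; it jumps at the two kinks."""
    return 1.0 if 0.0 < z < 1.0 else 0.0

def stretched(z):
    """Logistic rescaled to run from -1 to 1 (assumed form: 2*s(z) - 1)."""
    return 2.0 * logistic(z) - 1.0
```

The jump in clu_derivative at z = 0 and z = 1 is the reason the second derivatives of the CLU are not well behaved.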
Backpropagation can also be used without using neural networks at all. For example, it can be used to adapt a network consisting entirely of user-specified functions, representing something like an econometric model. In that case, the way one proceeds depends on who one is programming for and what kind of model one has.
If one is programming for oneself and the model consists of a sequence of equations which can be invoked one after the other, then one should consider the tutorial paper [11], which also contains a more rigorous definition of what these "F_x" derivatives really mean and a proof of the chain rule for ordered derivatives. If one is developing a tool for others, then one might set it up to look like a standard econometric package (like SAS or Troll) where the user of the system types in the equations of his or her model; the backpropagation would go inside the package as a way to speed up these calculations, and would mostly be transparent to the user. If one's model consists of a set of simultaneous equations which need to be solved at each time, then one must use more complicated procedures [15]; in neural network terms, one would call this a "doubly recurrent network." (The methods of Pineda [16] and Almeida [17] are special cases of this situation.)
Pearlmutter [18] and Williams [19] have described alternative methods, designed to achieve results similar to those of backpropagation through time, using a different computational strategy. For example, the Williams-Zipser method is a special case of the "conventional perturbation" equation cited in [14], which rejected this as a neural network method on the grounds that its computational costs scale as the square of the network size; however, the method does yield exact derivatives with a time-forward calculation.
Supervised learning problems or forecasting problems which involve memory can also be translated into control problems [15, p. 352], [20], which allows the use of adaptive critic methods, to be discussed in the next section. Normally, this would yield only an approximate solution (or approximate derivatives), but it would also allow time-forward real-time learning. If the network itself contains calculation noise (due to hardware limitations), the adaptive critic approach might even be more robust than backpropagation through time because it is based on mathematics which allow for the presence of noise.
B. Applications Other Than Supervised Learning
Backpropagation through time can also be used in two other major applications: neuroidentification and neurocontrol. (For applications to sensitivity analysis, see [14] and [15].)
In neuroidentification, we try to do with neural nets what econometricians do with forecasting models. (Engineers would call this the identification problem or the problem of identifying dynamic systems. Statisticians refer to it as the problem of estimating stochastic time-series models.) Our training set consists of vectors X(t) and u(t), not X(t) and Y(t). Usually, X(t) represents a set of observations of the external (sic) world, and u(t) represents a set of actions that we had control over (such as the settings of motors or actuators). The combination of X(t) and u(t) is input to the network at each time t. Our target, at time t, is the vector X(t + 1).
We could easily build a network to input these inputs, and aim at these targets. We could simply collect the inputs and targets into the format of Section II, and then use basic backpropagation. But basic backpropagation contains no "memory." The forecast of X(t + 1) would depend on X(t), but not on previous time periods. If human beings worked like this, then they would be unable to predict that a ball might roll out the far side of a table after rolling down under the near side; as soon as the ball disappeared from sight [from the current vector X(t)], they would have no way of accounting for its existence. (Harold Szu has presented a more interesting example of this same effect: if a tiger chased after such a memoryless person, the person would forget about the tiger after first turning to run away. Natural selection has eliminated such people.) Backpropagation through time permits more powerful networks, which do have a "memory," for use in the same setup.
Even this approach to the neuroidentification problem has its limitations. Like the usual methods of econometrics [15], it may lead to forecasts which hold up poorly over multiple time periods. It does not properly identify where the noise comes from. It does not permit real-time adaptation. In an earlier paper [20], I have described some ideas for overcoming these limitations, but more research is needed. The first phase of Kawato's cascade method [9] for controlling a robot arm is an identification phase, which is more robust over time, and which uses backpropagation through time in a different way; it is a special case of the "pure robust method," which also worked well in the earliest applications which I studied [1], [20].
After we have solved the problem of identifying a dynamic system, we are then ready to move on to controlling that system.
In neurocontrol, we often start out with a model or network which describes the system or plant we are trying to control. Our problem is to adapt a second network, the action network, which inputs X(t) and outputs the control u(t). (In actuality, we can allow the action network to "see" or input the entire vector x(t) calculated by the model network; this allows it to account for memories such as the recent appearance of a tiger.) Usually, we want to adapt the action network so as to maximize some measure of performance or utility U(X, t) summed over time. Performance measures used in past applications have included everything from the energy used to move a robot arm [8], [9] through to net profits received by the gas industry [11]. Typically, we are given a set of possible initial states X(1), and asked to train the action network so as to maximize the sum of utility from time 1 to a final time T.
To solve this problem using backpropagation through time, we simply calculate the derivatives of our performance measure with respect to all of the weights in the action network. "Backpropagation" refers to how we calculate the derivatives, not to anything involving pattern recognition or error. We then adapt the weights according to these derivatives, as in (12), except that the sign of the adjustment term is now positive (because we are maximizing rather than minimizing).
The easiest way to implement this approach is to merge the utility function, the model network, and the action network into one big network. We can then construct the dual to this entire network, as described in 1974 [1] and illustrated in my recent tutorial [11]. However, if we wish to keep the three component networks distinct, then the bookkeeping becomes more complicated. The basic idea is illustrated in Fig. 6, which maps exactly into the approach used by Nguyen and Widrow [7] and by Jordan [8].
Fig. 6. Backpropagating utility through time. (Dashed lines represent derivative calculations.)
Instead of working with a single subroutine, NET, we now need three subroutines:
      UTILITY(X; t; x''; U)
      MODEL(X(t), u(t); x(t); X(t + 1))
      ACTION(x(t); W; x'(t); u(t)).
In each of these subroutines, the two arguments on the right are technically outputs, and the argument on the far right is what we usually think of as the output of the network. We need to know the full vector x produced inside the model network so that the action network can "see" important memories. The action network does not need to have its own internal memory, but we need to save its internal state (x') so that we can later calculate derivatives. For simplicity, I will assume that MODEL does not contain any lag-two memory terms (i.e., W'' weights). The primes after the x's indicate that we are looking at the internal states of different networks; they are unrelated to the primes representing lagged values, discussed in Section III, which we will also need in what follows.
To use backpropagation through time, we need to construct dual subroutines for all three of these subroutines:
      F_UTILITY(x''; t; F_X)
      F_MODEL(F_net', F_X(t + 1); x(t); F_net, F_u)
      F_ACTION(F_u; x'(t); F_W).
The outputs of these subroutines are the arguments on the far right (including F_net), which are represented by the broken lines in Fig. 6. The subroutine F_UTILITY simply reports out the derivatives of U(x, t) with respect to the variables X_i. The subroutine F_MODEL is like the earlier subroutine F_NET2, except that we need to output F_u instead of derivatives to weights. (Again, we are adapting only the action network here.) The subroutine F_ACTION is virtually identical to the old subroutine F_NET, except that we need to calculate F_W as a running sum (as we did in F_NET2).
Of these three subroutines, F_MODEL is by far the most complex. Therefore, it may help to consider some possible code.
      SUBROUTINE F_MODEL(F_net',F_X,x,F_net,F_u)
C     The weights inside this subroutine are those
C     used in MODEL, analogous to those in NET2, and are
C     unrelated to the weights in ACTION
      REAL F_net'(N+n),F_X(n),x(N+n),
           F_net(N+n),F_u(p),F_x(N+n)
      INTEGER i,j,n,m,N,p
      DO 1 i=1,N
1       F_x(i)=0.
      DO 2 i=1,n
2       F_x(i+N)=F_X(i)
      DO 1000 i=N+n,1,-1
        DO 910 j=i+1,N+n
910       F_x(i)=F_x(i)+W(j,i)*F_net(j)
        DO 920 j=m+1,N+n
920       F_x(i)=F_x(i)+W'(j,i)*F_net'(j)
1000    F_net(i)=F_x(i)*x(i)*(1-x(i))
      DO 2000 i=1,p
2000    F_u(i)=F_x(n+i)
The last small DO loop here assumes that u(t) was part of the input vector to the original subroutine MODEL, inserted into the slots between x(n + 1) and x(m). Again, a good programmer could easily compress all this; my goal here is only to illustrate the mathematics.
Finally, in order to adapt the action network, we go through multiple passes, each starting from one of the starting values of X(1). In each pass, we call ACTION and then MODEL, one after the other, until we have built up a stream of forecasts from time 1 up to time T. Then, for each time t going backwards from T to 1, we call the UTILITY subroutine, then F_UTILITY, then F_MODEL, and then F_ACTION. At the end of the pass, we have the correct array of derivatives F_W, which we can then use to adjust the weights of the action network.
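The pass structure just described can be sketched in Python for a deliberately tiny case. Everything specific here is an assumption made for illustration: the state X(t) is a scalar, the plant MODEL is a known linear map X(t+1) = A X(t) + B u(t), the action network is the linear rule u(t) = W[0] X(t) + W[1], and the utility is U = -X^2. The comments mark which dual subroutine each line would belong to.

```python
A, B = 0.9, 0.5          # hypothetical linear plant: X(t+1) = A*X(t) + B*u(t)

def action(X, W):        # ACTION: u(t) = W[0]*X(t) + W[1]
    return W[0] * X + W[1]

def model(X, u):         # MODEL
    return A * X + B * u

def utility(X):          # UTILITY: U = -X^2, largest when X is held at zero
    return -X * X

def total_utility(W, X1, T):
    """Sum of U(X(t)) for t = 1..T, starting from X(1) = X1."""
    X, total = X1, 0.0
    for t in range(1, T + 1):
        total += utility(X)
        if t < T:
            X = model(X, action(X, W))
    return total

def utility_gradient(W, X1, T):
    """Forward pass storing X(1..T), then a backward sweep accumulating F_W."""
    X = [None, X1]                       # 1-based: X[t] is the state at time t
    for t in range(1, T):
        X.append(model(X[t], action(X[t], W)))
    F_W = [0.0, 0.0]
    F_X_next = 0.0                       # derivative w.r.t. X(t+1); zero past T
    for t in range(T, 0, -1):
        F_X = -2.0 * X[t]                # F_UTILITY: direct dU/dX(t)
        if t < T:
            F_u = B * F_X_next           # F_MODEL: feedback to the action u(t)
            F_X += A * F_X_next          # F_MODEL: feedback to X(t) directly
            F_X += W[0] * F_u            # F_ACTION: feedback to X(t) through u(t)
            F_W[0] += F_u * X[t]         # F_ACTION: running sums for the weights
            F_W[1] += F_u
        F_X_next = F_X
    return F_W

def adapt(W, X1, T, learning_rate, passes):
    """Steepest ascent: the adjustment term is added because we maximize."""
    for _ in range(passes):
        F_W = utility_gradient(W, X1, T)
        W = [w + learning_rate * g for w, g in zip(W, F_W)]
    return W
```

With several starting values of X(1), one would run a pass of this kind from each of them in turn.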
In general, backpropagation through time has the advantage of being relatively quick and exact. That is why I chose it for my natural gas application [11]. However, it cannot account for noise in the process to be controlled. To account for noise in maximizing an arbitrary utility function, we must rely on adaptive critic methods [21]. Adaptive critic methods do not require backpropagation through time in any form, and are therefore suitable for true real-time learning. There are other forms of neurocontrol as well [21] which are not based on maximizing a utility function.
C. Handling Strings of Data
In most of the examples above, I assumed that the training data form one long time series, from t equals 1 to t equals T. Thus, in adapting the weights, I always assumed batch learning (except in the code in Section II); the weights were always adapted after a complete set of derivatives was calculated, based on a complete pass through all the data. Mechanically, one could use pattern learning in the backwards pass through time; however, this would lead to a host of problems, and it is difficult to see what it would gain.
Data in the real world are often somewhere between the two extremes represented by Sections II and III. Instead of having a set of unrelated patterns or one continuous time series, we often have a set of time series or strings. For example, in speech recognition, our training set may consist of a set of strings, each consisting of one word or one sentence. In robotics, our training set may consist of a set of strings, where each string represents one experiment with a robot.
In these situations, we can apply backpropagation through time to a single string of data at a time. For each string, we can calculate complete derivatives and update the weights. Then we can go on to the next string. This is like pattern learning, in that the weights are updated incrementally before the entire data set is studied. It requires intermediate storage for only one string at a time. To speed things up even further, we might adapt the net in stages, initially fixing certain weights (like W_jj) to zero or one.
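The control flow of string learning is simple enough to show directly. In this hedged Python sketch, string_learning and gradient_for_string are hypothetical names; gradient_for_string stands in for one complete forward and backward sweep through time over a single string. To keep the example self-contained, a one-weight linear fit takes the place of a real network.

```python
def string_learning(strings, weights, gradient_for_string, learning_rate, sweeps=1):
    """Update the weights once per string rather than once per whole data set."""
    for _ in range(sweeps):
        for string in strings:
            # complete derivatives for this string only
            F_W = gradient_for_string(weights, string)
            weights = [w - learning_rate * g for w, g in zip(weights, F_W)]
    return weights

def toy_gradient(weights, string):
    """Stand-in for backpropagation through time: fit y = w*x by square error."""
    w = weights[0]
    return [sum((w * x - y) * x for x, y in string)]

strings = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]   # two short strings
fitted = string_learning(strings, [0.0], toy_gradient, 0.05, sweeps=50)
```

Only one string's intermediate results need to be held at a time, which is the storage advantage noted above.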
Nevertheless, string learning is not the same thing as real-time learning. To solve problems in neuroidentification and supervised learning, the only consistent way to have internal memory terms and to avoid backpropagation through time is to use adaptive critics in a supporting role [15]. That alternative is complex, inexact, and relatively expensive for these applications; it may be unavoidable for true real-time systems like the human brain, but it would probably be better to live with string learning and focus on other challenges in neuroidentification for the time being.
D. Speeding Up Convergence
For those who are familiar with numerical analysis and optimization, it goes without saying that steepest descent, as in (12), is a very inefficient method.
There is a huge literature in the neural network field on how to speed up backpropagation. For example, Fahlman and Touretzky of Carnegie-Mellon have compiled and tested a variety of intuitive insights which can speed up convergence a hundredfold. Their benchmark problems may be very useful in evaluating other methods which claim to do the same. A few authors have copied simple methods from the field of numerical analysis, such as quasi-Newton methods (BFGS) and Polak-Ribiere conjugate gradients; however, the former works only on small problems (a hundred or so weights) [22], while the latter works well only with batch learning and very careful line searches. The need for careful line searches is discussed in the literature [23], but I have found it to be unusually important when working with large problems, including simulated linear mappings.
In my own work, I have used Shanno's more recent conjugate gradient method with batch learning; for a dense training set (made up of distinctly different patterns) this method worked better than anything else I tried, including pattern learning methods [12]. Many researchers have used approximate Newton's methods, without saying that they are using an approximation; however, an exact Newton's method can also be implemented in O(N) storage, and has worked reasonably well in early tests [12]. Shanno has reported new breakthroughs in function minimization which may perform still better [24]. Still, there is clearly a lot of room for improvement through further research.
Needless to say, it can be much easier to converge to a set of weights which do not minimize error or which assume a simpler network; methods of that sort are also popular, but are useful only when they clearly fit the application at hand for identifiable reasons.
E. Miscellaneous Issues
Minimizing square error and maximizing likelihood are often taken for granted as fundamental principles in large parts of engineering; however, there is a large literature on alternative approaches [12], both in neural network theory and in robust statistics.
These literatures are beyond the scope of this paper, but a few related points may be worth noting. For example, instead of minimizing square error, we could minimize the 1.5 power of error; all of the operations above still go through. We can minimize E of (5) plus some constant k times the sum of squares of the weights; as k goes to infinity and the network is made linear, this converges to Kohonen's pseudoinverse method, a common form of associative memory. Statisticians like Dempster and Efron have argued that the linear form of this approach can be better than the usual least squares methods; their arguments capture the essential insight that people can forecast by analogy to historical precedent, instead of forecasting by a comprehensive model or network. Presumably, an ideal network would bring together both kinds of forecasting [12], [20].
Many authors worry a lot about local minima. In using backpropagation through time in robust estimation, I found it important to keep the "memory" weights near zero at first, and free them up gradually in order to minimize problems. When T is much larger than m (as statisticians recommend for good generalization), local minima are probably a lot less serious than rumor has it. Still, with T larger than m, it is very easy to construct local minima. Consider the example with m = 2 shown in Table 1.
Table 1 Training Set for Local Minima
      t      X(t)      Y(t)
      1      0 1       .1
      2      1 0       .1
      3      1 1       .9
The error for each of the patterns can be plotted as a contour map as a function of the two weights w1 and w2. (For this simple example, no threshold term is assumed.) Each map is made up of straight contours, defining a fairly sharp trough about a central line. The three central lines for the three patterns form a triangle, the vertices of which correspond roughly to the local minima. Even when T is much larger than m, conflicts like this can exist within the training set. Again, however, this may not be an overwhelming problem in practical applications [19].
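The alternative error measures mentioned earlier in this section change only the starting point of the backward sweep. The Python sketch below is a hedged illustration with hypothetical function names: the first function gives the derivative that would replace Yhat - Y when the 1.5 power of the absolute error is minimized, and the second adds the derivative of k times the sum of squared weights to an existing array of weight derivatives.

```python
def f_yhat_power(yhat, y, power=1.5):
    """Derivative of |yhat - y|**power with respect to yhat."""
    e = yhat - y
    if e == 0.0:
        return 0.0
    sign = 1.0 if e > 0 else -1.0
    return power * sign * abs(e) ** (power - 1.0)

def add_weight_penalty(F_W, W, k):
    """Derivative of E + k * sum(W**2): add 2*k*W to each error derivative."""
    return [g + 2.0 * k * w for g, w in zip(F_W, W)]
```

Everything downstream of these derivatives (the sweep backwards through time and the weight update) proceeds as before.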
V. SUMMARY
Backpropagation through time can be applied to many different categories of dynamical systems: neural networks, feedforward systems of equations, systems with time lags, systems with instantaneous feedback between variables (as in ordinary differential equations or simultaneous equation models), and so on. The derivatives which it calculates can be used in pattern recognition, in systems identification, and in stochastic and deterministic control. This paper has presented the key equations of backpropagation, as applied to neural networks of varying degrees of complexity. It has also discussed other papers which elaborate on the extensions of this method to more general applications and some of the tradeoffs involved.
REFERENCES
[1] P. Werbos, "Beyond regression: New tools for prediction and analysis in the behavioral sciences," Ph.D. dissertation, Committee on Appl. Math., Harvard Univ., Cambridge, MA, Nov. 1974.
[2] D. Rumelhart, G. Hinton, and R. Williams, "Learning internal representations by error propagation," in D. Rumelhart and J. McClelland, Eds., Parallel Distributed Processing, Vol. 1. Cambridge, MA: M.I.T. Press, 1986.
[3] D. B. Parker, "Learning-logic," M.I.T. Cen. Computational Res. Economics Management Sci., Cambridge, MA, TR-47, 1985.
[4] Y. Le Cun, "Une procedure d'apprentissage pour reseau a seuil assymetrique," in Proc. Cognitiva '85, Paris, France, June 1985, pp. 599-604.
[5] R. Watrous and L. Shastri, "Learning phonetic features using connectionist networks: an experiment in speech recognition," in Proc. 1st IEEE Int. Conf. Neural Networks, June 1987.
[6] H. Sawai, A. Waibel, P. Haffner, M. Miyatake, and K. Shikano, "Parallelism, hierarchy, scaling in time-delay neural networks for spotting Japanese phonemes/CV-syllables," in Proc. IEEE Int. Joint Conf. Neural Networks, June 1989.
[7] D. Nguyen and B. Widrow, "The truck backer-upper: An example of self-learning in neural networks," in W. T. Miller, R. Sutton, and P. Werbos, Eds., Neural Networks for Robotics and Control. Cambridge, MA: M.I.T. Press, 1990.
[8] M. Jordan, "Generic constraints on underspecified target trajectories," in Proc. IEEE Int. Joint Conf. Neural Networks, June 1989.
[9] M. Kawato, "Computational schemes and neural network models for formation and control of multijoint arm trajectory," in W. T. Miller, R. Sutton, and P. Werbos, Eds., Neural Networks for Robotics and Control. Cambridge, MA: M.I.T. Press, 1990.
[10] K. Narendra, "Adaptive control using neural networks," in W. T. Miller, R. Sutton, and P. Werbos, Eds., Neural Networks for Robotics and Control. Cambridge, MA: M.I.T. Press, 1990.
[11] P. Werbos, "Maximizing long-term gas industry profits in two minutes in Lotus using neural network methods," IEEE Trans. Syst., Man, Cybern., Mar./Apr. 1989.
[12] -, "Backpropagation: Past and future," in Proc. 2nd IEEE Int. Conf. Neural Networks, June 1988. The transcript of the talk and slides, available from the author, are more introductory in nature and more comprehensive in some respects.
[13] I. Guyon, I. Poujaud, L. Personnaz, G. Dreyfus, J. Denker, and Y. Le Cun, "Comparing different neural network architectures for classifying handwritten digits," in Proc. IEEE Int. Joint Conf. Neural Networks, June 1989.
[14] P. Werbos, "Applications of advances in nonlinear sensitivity analysis," in R. Drenick and F. Kozin, Eds., Systems Modeling and Optimization: Proc. 10th IFIP Conf. (1981). New York: Springer-Verlag, 1982.
[15] -, "Generalization of backpropagation with application to a recurrent gas market model," Neural Networks, Oct. 1988.
[16] F. J. Pineda, "Generalization of backpropagation to recurrent and higher order networks," in Proc. IEEE Conf. Neural Inform. Processing Syst., 1987.
[17] L. B. Almeida, "A learning rule for asynchronous perceptrons with feedback in a combinatorial environment," in Proc. 1st IEEE Int. Conf. Neural Networks, 1987.
[18] B. A. Pearlmutter, "Learning state space trajectories in recurrent neural networks," in Proc. Int. Joint Conf. Neural Networks, June 1989.
[19] R. Williams, "Adaptive state representation and estimation using recurrent connectionist networks," in W. T. Miller, R. Sutton, and P. Werbos, Eds., Neural Networks for Robotics and Control. Cambridge, MA: M.I.T. Press, 1990.
[20] P. Werbos, "Learning how the world works: Specifications for predictive networks in robots and brains," in Proc. 1987 IEEE Int. Conf. Syst., Man, Cybern., 1987.
[21] -, "Consistency of HDP applied to a simple reinforcement learning problem," Neural Networks, Mar. 1990.
[22] J. Dennis and R. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Englewood Cliffs, NJ: Prentice-Hall, 1983.
[23] D. Shanno, "Conjugate-gradient methods with inexact searches," Math. Oper. Res., vol. 3, Aug. 1978.
[24] -, "Recent advances in numerical techniques for large-scale optimization," in W. T. Miller, R. Sutton, and P. Werbos, Eds., Neural Networks for Robotics and Control. Cambridge, MA: M.I.T. Press, 1990.
Paul J. Werbos received degrees from Harvard University, Cambridge, MA, and the London School of Economics, London, England, that emphasized mathematical physics, international political economy, and economics. He developed backpropagation for the Ph.D. degree in applied mathematics at Harvard.
He is currently Program Director for Neuroengineering and Emerging Technology Initiation at the National Science Foundation (NSF) and Secretary of the International Neural Network Society. While an Assistant Professor at the University of Maryland, he developed advanced adaptive critic designs for neurocontrol. Before joining the NSF in 1989, he worked nine years at the Energy Information Administration (EIA) of DOE, where he variously held lead responsibility for evaluating long-range forecasts (under Charles Smith), and for building models of industrial, transportation, and commercial demand, and natural gas supply using backpropagation as one among several methods. In previous years, he was Regional Director and Washington representative of the L-5 Society, a predecessor to the National Space Society, and an organizer of the Global Futures Roundtable. He has worked on occasion with the National Space Society, the Global Tomorrow Coalition, the Stanford Energy Modeling Forum, and Adelphi Friends Meeting. He also retains an active interest in fuel cells for transportation and in the foundations of physics.
PROCEEDINGS OF THE IEEE, VOL. 78, NO. 10, OCTOBER 1990