ANN - Ch2-Adaline and Madaline
Neuron model:
$y = f(w^T x)$
2.1.2 Steepest Descent
The graph of $\xi(w) = \langle d_k^2 \rangle + w^T R w - 2 p^T w$
is a paraboloid.
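To make the quadratic form concrete, the following minimal sketch checks numerically that it equals the mean squared error. It assumes the usual definitions R = <x x^T> and p = <d x> (averages over the training set), which are not stated in the text above; the data and the names `X`, `D`, `xi`, and `mse` are hypothetical.

```python
# Hypothetical toy training set: input vectors X and desired outputs D.
X = [[1.0, 2.0], [2.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]
D = [1.0, -1.0, 0.5, 0.0]
N = len(X)
n = len(X[0])

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Assumed definitions: R = <x x^T>, p = <d x>, d2 = <d^2>.
R = [[sum(x[i] * x[j] for x in X) / N for j in range(n)] for i in range(n)]
p = [sum(d * x[i] for x, d in zip(X, D)) / N for i in range(n)]
d2 = sum(d * d for d in D) / N

def xi(w):
    """Quadratic form <d^2> + w^T R w - 2 p^T w."""
    Rw = [dot(row, w) for row in R]
    return d2 + dot(w, Rw) - 2.0 * dot(p, w)

def mse(w):
    """Mean squared error <(d - w^T x)^2>, computed directly from the data."""
    return sum((d - dot(w, x)) ** 2 for x, d in zip(X, D)) / N

w_test = [0.3, -0.7]
print(xi(w_test), mse(w_test))  # the two values should agree
```

Because the surface is quadratic in w with a positive semi-definite R, it has a single bowl shape, which is what makes the descent methods below applicable.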
Steps: 1. Initialize the weight values $w(t_0)$.
2. Determine the steepest descent direction:
$-\nabla_w \xi(w(t)) = -\frac{d\,\xi(w(t))}{d\,w(t)} = 2(p - R\,w(t))$
Let $\Delta w(t) = -\nabla_w \xi(w(t)) = 2(p - R\,w(t))$.
3. Modify the weight values:
$w(t+1) = w(t) + \mu\,\Delta w(t)$, $\mu$: step size
4. Repeat steps 2~3.
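The steps above can be sketched in a few lines. This is an illustrative sketch only: the matrix `R`, the vector `p`, the step size `mu`, and the fixed iteration count are hypothetical choices, and the step size is written as `mu` by assumption.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def steepest_descent(R, p, w0, mu, steps):
    """Batch steepest descent on xi(w) = <d^2> + w^T R w - 2 p^T w."""
    w = list(w0)                                   # step 1: initial weights
    for _ in range(steps):                         # step 4: repeat steps 2 and 3
        Rw = [dot(row, w) for row in R]
        dw = [2.0 * (pi - rwi) for pi, rwi in zip(p, Rw)]   # step 2: 2(p - R w)
        w = [wi + mu * dwi for wi, dwi in zip(w, dw)]       # step 3: update
    return w

# Hypothetical correlation matrix and cross-correlation vector.
R = [[2.0, 0.5], [0.5, 1.0]]
p = [1.0, 0.5]
w = steepest_descent(R, p, [0.0, 0.0], mu=0.1, steps=500)
print(w)
```

For this choice of `R` and `p` the minimizer satisfies R w = p, so the result can be checked by substituting `w` back into that equation.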
Advantage: no calculation of $R^{-1}$ is required.
Drawbacks: i) p and R still need to be calculated;
ii) steepest descent is a batch training method.
2.1.3 Stochastic Gradient Descent
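Approximate $-\nabla_w \xi(w(t)) = 2(p - R\,w(t))$ by randomly
selecting one training example at a time:
1. Apply an input vector $x_k$.
2. $\varepsilon_k^2(t) = (d_k - y_k)^2 = (d_k - w^T(t)\,x_k)^2$
3. $\nabla_w \xi(w(t)) \approx \nabla_w \varepsilon_k^2(t) = \frac{\partial \varepsilon_k^2(t)}{\partial w} = -2(d_k - w^T(t)\,x_k)\,x_k = -2\,\varepsilon_k(t)\,x_k$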
○ Practical Considerations:
(a) No. of training vectors, (b) Stopping criteria
(c) Initial weights, (d) Step size
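A minimal sketch of the resulting per-example update follows. It assumes the weight change is the negative of the single-example gradient scaled by a step size `mu`, i.e. w(t+1) = w(t) + 2 mu eps_k(t) x_k; the data, the zero initial weights, and the fixed epoch budget used as a stopping criterion are hypothetical choices for the practical considerations listed above.

```python
import random

def lms_train(samples, mu, epochs, seed=0):
    """Stochastic gradient descent, one randomly ordered example at a time."""
    rng = random.Random(seed)
    n = len(samples[0][0])
    w = [0.0] * n                          # (c) initial weights: zeros (assumption)
    for _ in range(epochs):                # (b) stopping criterion: fixed epoch budget
        order = list(samples)
        rng.shuffle(order)                 # random selection of training examples
        for x, d in order:
            y = sum(wi * xi for wi, xi in zip(w, x))       # y_k = w^T x_k
            err = d - y                                    # eps_k = d_k - y_k
            w = [wi + 2.0 * mu * err * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical noiseless data produced by the target weights below.
target = [0.5, -1.0]
data_rng = random.Random(1)
samples = []
for _ in range(50):                        # (a) number of training vectors
    x = [data_rng.uniform(-1.0, 1.0), data_rng.uniform(-1.0, 1.0)]
    samples.append((x, sum(t * xi for t, xi in zip(target, x))))

w = lms_train(samples, mu=0.05, epochs=200)   # (d) step size
print(w)
```

With noisy data the weights would keep fluctuating around the optimum, which is one reason the step size and stopping criterion need care in practice.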
2.1.4 Conjugate Gradient Descent
-- Drawback: can only minimize quadratic functions,
e.g., $f(w) = \frac{1}{2} w^T A w - b^T w + c$
• Adequate for our error function
$\xi(w) = \langle d_k^2 \rangle + w^T R w - 2 p^T w$
Advantage: guaranteed to find the optimum solution
in at most n iterations, where n is the size of the matrix A.
• The size of the correlation matrix R is the dimension
of the input vectors x.
A-Conjugate Vectors:
Let $A_{n \times n}$ be a square, symmetric, positive-definite matrix.
The vectors in $S = \{s(0), s(1), \ldots, s(n-1)\}$ are A-conjugate if
$s^T(i)\,A\,s(j) = 0, \ \forall i \neq j$.
* If A = I (identity matrix), conjugacy = orthogonality.
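As a small illustration of the definition, the sketch below builds a pair of A-conjugate vectors for a hypothetical 2x2 matrix using one Gram-Schmidt step in the A inner product (a construction chosen here for illustration, not taken from the text), and shows that the pair is not orthogonal in the ordinary sense.

```python
def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def a_inner(u, A, v):
    """Return u^T A v."""
    return sum(ui * avi for ui, avi in zip(u, mat_vec(A, v)))

A = [[2.0, 1.0], [1.0, 3.0]]       # hypothetical symmetric positive-definite matrix
I = [[1.0, 0.0], [0.0, 1.0]]

s0 = [1.0, 0.0]
e1 = [0.0, 1.0]
# Remove from e1 its A-projection onto s0 so that s0^T A s1 = 0.
coef = a_inner(s0, A, e1) / a_inner(s0, A, s0)
s1 = [e - coef * s for e, s in zip(e1, s0)]

print("s1 =", s1)
print("s0^T A s1 =", a_inner(s0, A, s1))   # zero: A-conjugate
print("s0^T s1   =", a_inner(s0, I, s1))   # nonzero: not orthogonal
```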
• The conjugate-direction method finds the w that
minimizes f(w) through $w(i+1) = w(i) + \eta(i)\,s(i)$,
$i = 0, \ldots, n-1$, where w(0) is an arbitrary initial vector and
$\eta(i)$ is determined by $\min_{\eta} f(w(i) + \eta\,s(i))$.
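Since $f(w) = \frac{1}{2} w^T A w - b^T w + c$,
$f(w(i) + \eta s(i)) = \frac{1}{2}(w(i) + \eta s(i))^T A\,(w(i) + \eta s(i)) - b^T (w(i) + \eta s(i)) + c$
Let $\frac{d}{d\eta} f(w(i) + \eta s(i)) = 0$:
$\frac{d}{d\eta} f(w(i) + \eta s(i)) = \frac{d}{d\eta} \{ \frac{1}{2} [ w(i)^T A w(i) + \eta\, s(i)^T A w(i) + \eta\, w(i)^T A s(i) + \eta^2 s(i)^T A s(i) ] - b^T w(i) - \eta\, b^T s(i) + c \}$
$= \frac{1}{2} [ s(i)^T A w(i) + w(i)^T A s(i) + 2\eta\, s(i)^T A s(i) ] - b^T s(i) = 0$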
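As a worked continuation of this condition, the step size can be solved for explicitly. The sketch below writes η(i) for the step size in the update w(i+1) = w(i) + η(i) s(i) and uses only the symmetry of A, so that s(i)^T A w(i) = w(i)^T A s(i); the closed form is the standard result and is offered here as a sketch rather than as a formula taken from the slides.

```latex
\begin{aligned}
0 &= \tfrac{1}{2}\left[ s(i)^T A\,w(i) + w(i)^T A\,s(i) + 2\eta\, s(i)^T A\,s(i) \right] - b^T s(i) \\
  &= s(i)^T A\,w(i) + \eta\, s(i)^T A\,s(i) - s(i)^T b \\
\Rightarrow\quad \eta(i) &= \frac{s(i)^T \left( b - A\,w(i) \right)}{s(i)^T A\,s(i)}
\end{aligned}
```

The denominator is strictly positive for any nonzero s(i) because A is positive-definite, so the step size is always well defined.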
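Define $r(i) = -\nabla f(w)|_{w = w(i)} = b - A\,w(i)$
Let $s(i) = r(i) + \beta(i)\,s(i-1), \ i = 1, \ldots, n-1$   (A)
$s(0) = r(0) = b - A\,w(0)$
To determine $\beta(i)$, multiply (A) by $s^T(i-1)\,A$:
$s^T(i-1)\,A\,s(i) = s^T(i-1)\,A\,(r(i) + \beta(i)\,s(i-1))$   (B)
Since s(i-1) and s(i) are A-conjugate, the left-hand side of (B) is zero, so
$\beta(i) = -\frac{s^T(i-1)\,A\,r(i)}{s^T(i-1)\,A\,s(i-1)}$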
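Putting the pieces together, here is a minimal sketch of the iteration for a general quadratic f(w) = (1/2) w^T A w - b^T w + c. It assumes the exact line-search step eta = s^T r / (s^T A s), which follows from setting the derivative along s to zero, and a direction coefficient beta carrying the minus sign required by A-conjugacy; the matrix `A`, vector `b`, and starting point are hypothetical.

```python
def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def conjugate_gradient(A, b, w0):
    """Minimize 0.5 w^T A w - b^T w + c for symmetric positive-definite A."""
    w = list(w0)
    r = [bi - awi for bi, awi in zip(b, mat_vec(A, w))]    # r(0) = b - A w(0)
    s = list(r)                                            # s(0) = r(0)
    for _ in range(len(b)):                                # at most n iterations
        As = mat_vec(A, s)
        sAs = dot(s, As)
        if sAs == 0.0:                                     # s = 0: already at the minimum
            break
        eta = dot(s, r) / sAs                              # exact line search along s
        w = [wi + eta * si for wi, si in zip(w, s)]
        r = [bi - awi for bi, awi in zip(b, mat_vec(A, w))]
        beta = -dot(As, r) / sAs                           # s^T A r = (A s)^T r by symmetry
        s = [ri + beta * si for ri, si in zip(r, s)]
    return w

A = [[4.0, 1.0], [1.0, 3.0]]     # hypothetical symmetric positive-definite matrix
b = [1.0, 2.0]
w = conjugate_gradient(A, b, [2.0, 1.0])
print(w)
```

The returned vector should satisfy A w = b up to rounding, since the minimizer of the quadratic is the solution of that linear system.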
Summary: The conjugate-direction method for minimizing the error
function $\xi(w) = \langle d_k^2 \rangle + w^T R w - 2 p^T w$:
Let $w(i+1) = w(i) + \eta(i)\,s(i), \ i = 0, 1, \ldots, n-1$
w(0) is an arbitrary starting vector
$\eta(i) = \frac{p^T s(i) - \frac{1}{2}\left( s(i)^T R\,w(i) + w(i)^T R\,s(i) \right)}{s(i)^T R\,s(i)}$
$s(i) = r(i) + \beta(i)\,s(i-1), \quad s(0) = r(0) = 2(p - R\,w(0))$
$r(i) = 2(p - R\,w(i)), \quad \beta(i) = -\frac{s^T(i-1)\,R\,r(i)}{s^T(i-1)\,R\,s(i-1)}$
Example: A comparison of the convergence of gradient descent (green)
and conjugate gradient (red) for minimizing a quadratic function.
Conjugate gradient converges in at most n steps, where n is the size
of the matrix of the system (here n = 2).
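The comparison can be reproduced numerically with a short sketch on the error function xi(w) = <d^2> + w^T R w - 2 p^T w. Everything specific here is an assumption made for illustration: the 2x2 matrix `R`, the vector `p`, the tolerance, and the use of a fixed step size `mu` for gradient descent (the original figure may have used a different step rule). The conjugate gradient step uses eta = s^T r / (2 s^T R s) with r = 2(p - R w), which is the symmetric-R form of the step size.

```python
def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def residual(R, p, w):
    """r = -grad xi(w) = 2 (p - R w)."""
    return [2.0 * (pi - rwi) for pi, rwi in zip(p, mat_vec(R, w))]

def gradient_descent(R, p, w0, mu=0.1, tol=1e-8, max_steps=10000):
    w = list(w0)
    for t in range(max_steps):
        r = residual(R, p, w)
        if dot(r, r) ** 0.5 < tol:
            return w, t
        w = [wi + mu * ri for wi, ri in zip(w, r)]
    return w, max_steps

def conjugate_gradient(R, p, w0, tol=1e-8):
    w = list(w0)
    r = residual(R, p, w)
    s = list(r)                                   # s(0) = r(0)
    steps = 0
    for _ in range(len(p)):                       # at most n steps
        if dot(r, r) ** 0.5 < tol:
            break
        Rs = mat_vec(R, s)
        sRs = dot(s, Rs)
        eta = dot(s, r) / (2.0 * sRs)
        w = [wi + eta * si for wi, si in zip(w, s)]
        r = residual(R, p, w)
        beta = -dot(Rs, r) / sRs
        s = [ri + beta * si for ri, si in zip(r, s)]
        steps += 1
    return w, steps

R = [[3.0, 1.0], [1.0, 0.5]]     # hypothetical, poorly conditioned on purpose
p = [1.0, 0.0]                   # chosen so that the minimizer is w* = [1, -2]
w_gd, n_gd = gradient_descent(R, p, [0.0, 0.0])
w_cg, n_cg = conjugate_gradient(R, p, [0.0, 0.0])
print("gradient descent steps:  ", n_gd)
print("conjugate gradient steps:", n_cg)
```

The gap in step counts grows as R becomes more poorly conditioned, because fixed-step gradient descent is limited by the smallest eigenvalue of R while conjugate gradient is not.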
2.3. Applications
2.3.1 Predict Signal
An adaptive filter is used to model a plant. Inputs to the filter
are the same as those to the plant. The filter adjusts its weights
based on the difference between its output and the output of the plant.
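A minimal sketch of this arrangement is shown below. The plant is assumed, purely for illustration, to be a 3-tap FIR filter with the hypothetical coefficients `PLANT_TAPS`; the adaptive filter is an Adaline with the same number of taps, trained with the per-sample update w <- w + 2 mu err x on a random input signal.

```python
import random

PLANT_TAPS = [0.8, -0.4, 0.2]    # hypothetical "unknown" plant (3-tap FIR filter)

def plant(history):
    """Plant output for the most recent inputs (newest first)."""
    return sum(h * u for h, u in zip(PLANT_TAPS, history))

def identify_plant(n_taps, mu, n_samples, seed=0):
    rng = random.Random(seed)
    w = [0.0] * n_taps
    history = [0.0] * n_taps                 # tapped delay line, newest input first
    for _ in range(n_samples):
        u = rng.uniform(-1.0, 1.0)           # same input goes to plant and filter
        history = [u] + history[:-1]
        d = plant(history)                   # plant output acts as the desired signal
        y = sum(wi * ui for wi, ui in zip(w, history))   # filter output
        err = d - y                          # difference drives the weight update
        w = [wi + 2.0 * mu * err * ui for wi, ui in zip(w, history)]
    return w

w = identify_plant(n_taps=3, mu=0.05, n_samples=5000)
print(w)
```

After training, the filter weights approximate the plant coefficients, so the filter can stand in as a model of the plant for the same class of inputs.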
2.3.3. Echo Cancellation in Telephone Circuits
2.4 Madaline: Many Adalines
○ XOR function: this problem cannot be solved by an Adaline.
○ Many adalines can be joined in a
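The XOR limitation, and the way several Adalines can overcome it, can be illustrated with a small sketch. It uses bipolar inputs and outputs, a coarse grid search over the weights of a single unit (the grid only illustrates the limitation; the real argument is that XOR is not linearly separable), and a hand-wired two-layer network whose weights are hypothetical choices, not values from the text.

```python
from itertools import product

# Bipolar XOR: output +1 when the two inputs differ.
PATTERNS = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]

def sgn(v):
    return 1 if v >= 0 else -1

def adaline(w, x):
    """Bipolar output of one unit; w[0] is the bias weight."""
    return sgn(w[0] + w[1] * x[0] + w[2] * x[1])

def best_single_adaline(grid):
    """Largest number of XOR patterns any single unit on the grid gets right."""
    best = 0
    for w in product(grid, repeat=3):
        correct = sum(1 for x, d in PATTERNS if adaline(w, x) == d)
        best = max(best, correct)
    return best

def madaline(x):
    h1 = adaline((-1.0, 1.0, -1.0), x)     # fires only for x = (+1, -1)
    h2 = adaline((-1.0, -1.0, 1.0), x)     # fires only for x = (-1, +1)
    return adaline((1.0, 1.0, 1.0), (h1, h2))   # OR of the two hidden units

grid = [i * 0.5 for i in range(-4, 5)]     # weights from -2.0 to 2.0
print("best single Adaline:", best_single_adaline(grid), "of 4 patterns")
print("Madaline outputs:   ", [madaline(x) for x, _ in PATTERNS])
```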
2.4.2. Madaline Rule II (MRII)
○ Training algorithm – A trial-and-error procedure
with a minimum disturbance principle (those
nodes that can affect the output error while
incurring the least change in their weights
should have precedence in the learning process)
○ Procedure –
1. Input a training pattern
2. Count the number of incorrect values in the output layer
3. For all units on the output layer
3.1. Select the first previously unselected error node
whose analog output is closest to zero (the threshold)
(because this node can reverse its bipolar output with
the least change in its weights)
3.2. Change its weights so that the bipolar output of
the unit changes
3.3. Input the same training pattern
3.4. If the number of errors is reduced, accept the weight change;
otherwise, restore the original weights
4. Repeat Step 3 for the hidden layer.
5. For all units on the output layer
5.1. Select the previously unselected pair of units
whose outputs are closest to their thresholds
5.2. Apply a weight correction to both units, in
order to change their bipolar outputs
5.3. Input the same training pattern
5.4. If the number of errors is reduced, accept the correction;
otherwise, restore the original weights.
6. Repeat Step 5 for the hidden layer.
※ Steps 5 and 6 can be repeated with triplets,
quadruplets, or longer combinations of units
until satisfactory results are obtained.
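The single-unit trials of the procedure can be sketched as follows. This is a deliberately reduced illustration, not the full MRII: it assumes one hidden layer of Adalines feeding a fixed OR output unit, adapts only the hidden layer, and omits the pair and triplet trials. With a single output, "the number of errors is reduced" means the output becomes correct. The function names, the `margin` parameter, and the weight-change rule (the smallest change along the input that moves the analog sum just past zero) are assumptions made for the sketch.

```python
def sgn(v):
    return 1 if v >= 0 else -1

def forward(W, x):
    """Return the hidden analog sums and the bipolar output of a fixed OR unit."""
    xa = [1.0] + list(x)                              # prepend the bias input
    nets = [sum(wi * xi for wi, xi in zip(w, xa)) for w in W]
    out = 1 if any(sgn(v) == 1 for v in nets) else -1
    return nets, out

def single_unit_trials(W, x, d, margin=0.1):
    """Minimum-disturbance trials on one pattern; returns a new weight list."""
    W = [list(w) for w in W]                          # work on a copy
    xa = [1.0] + list(x)
    nets, out = forward(W, x)
    if out == d:                                      # no error: nothing to do
        return W
    # Try units in order of analog output closest to the threshold (zero).
    order = sorted(range(len(W)), key=lambda j: abs(nets[j]))
    for j in order:
        old = W[j]
        target = -sgn(nets[j]) * margin               # just past zero, other side
        step = (target - nets[j]) / sum(v * v for v in xa)
        W[j] = [wi + step * xi for wi, xi in zip(old, xa)]   # reverse unit j
        if forward(W, x)[1] == d:                     # error removed: accept
            return W
        W[j] = old                                    # otherwise restore weights
    return W

# Hypothetical initial weights for two hidden units (bias weight first).
W0 = [[-1.0, 0.5, 0.5], [-0.5, -1.0, 0.2]]
x, d = (1, -1), 1
W1 = single_unit_trials(W0, x, d)
print("output before:", forward(W0, x)[1], "after:", forward(W1, x)[1])
```

When no single reversal fixes the output (for example, when two hidden units both fire for a pattern whose target is -1), every trial is rejected and the weights are returned unchanged, which is the situation the pair and triplet trials are meant to handle.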
2.4.3. A Madaline for Translation–Invariant
Pattern Recognition
○ Relationships among the weight matrices of Adalines
○ Extension -- Multiple slabs with different key weight
matrices for discriminating more than two classes of