ANN - Ch2-Adaline and Madaline
Neuron model: $y = f(w^T x)$
2.1.2 Steepest Descent
The graph of $\varepsilon(w) = \overline{d_k^2} + w^T R w - 2 p^T w$ is a paraboloid.
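Where this quadratic form comes from (a short derivation, assuming the standard definitions $\overline{d_k^2} = E[d_k^2]$, $p = E[d_k x_k]$, and $R = E[x_k x_k^T]$):

```latex
\varepsilon(w) = E\!\left[(d_k - w^T x_k)^2\right]
             = E[d_k^2] - 2\,E[d_k x_k]^T w + w^T E[x_k x_k^T]\, w
             = \overline{d_k^2} - 2\,p^T w + w^T R w
```

Since $R$ is positive semidefinite, the surface is a paraboloid in $w$, with a unique minimum at $w^* = R^{-1} p$ when $R$ is invertible.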
Steps:
1. Initialize weight values $w(t_0)$.
2. Determine the steepest-descent direction:
$-\dfrac{d\,\varepsilon(w(t))}{d\,w(t)} = 2(p - R w(t))$
Let $\Delta w(t) = -\nabla_w \varepsilon(w(t)) = 2(p - R w(t))$.
3. Modify the weight values:
$w(t+1) = w(t) + \mu\, \Delta w(t)$, where $\mu$ is the step size.
4. Repeat steps 2~3.
No calculation of $R^{-1}$ is required.
Drawbacks:
i) Need to calculate p and R,
ii) Steepest descent is a batch training method.
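For concreteness, a minimal NumPy sketch of steps 1~4 (the synthetic data, step size, and variable names are illustrative assumptions, not from the slides). Note that it must form $p$ and $R$ from the whole batch, which is exactly drawback (i)~(ii):

```python
import numpy as np

# A minimal sketch of batch steepest descent for the Adaline MSE
# (synthetic data and all names are illustrative assumptions).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                # training inputs x_k
w_true = np.array([1.0, -2.0, 0.5])
d = X @ w_true + 0.1 * rng.normal(size=500)  # desired outputs d_k

R = (X.T @ X) / len(X)                       # correlation matrix R = E[x x^T]
p = (X.T @ d) / len(X)                       # cross-correlation p = E[d x]

mu = 0.05                                    # step size
w = np.zeros(3)                              # 1. initialize weights
for t in range(200):
    delta_w = 2 * (p - R @ w)                # 2. steepest-descent direction
    w = w + mu * delta_w                     # 3. modify weight values
    # 4. repeat
print(w)                                     # approaches w_true; no R^{-1} needed
```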
2.1.3 Stochastic Gradient Descent
Approximate $\Delta w = -\nabla_w \varepsilon(w(t)) = 2(p - R w(t))$ by randomly selecting one training example at a time:
1. Apply an input vector $x_k$.
2. $\varepsilon_k^2(t) = (d_k - y_k)^2 = (d_k - w^T(t)\, x_k)^2$
3. $\nabla_w \varepsilon(w(t)) \approx \dfrac{\partial \varepsilon_k^2(t)}{\partial w} = -2\,(d_k - w^T(t)\, x_k)\, x_k = -2\, \varepsilon_k(t)\, x_k$
4. $w(t+1) = w(t) + 2\mu\, \varepsilon_k(t)\, x_k$ (the LMS, or Widrow-Hoff, update).
○ Practical Considerations:
(a) Number of training vectors, (b) Stopping criteria
(c) Initial weights, (d) Step size
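A minimal sketch of the resulting LMS update, reusing the assumed synthetic setup from the earlier example (names and constants are illustrative). Unlike steepest descent, it never forms $p$ or $R$:

```python
import numpy as np

# Stochastic gradient descent (LMS): one randomly chosen example per step.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, -2.0, 0.5])
d = X @ w_true + 0.1 * rng.normal(size=500)

mu = 0.01                                    # step size
w = np.zeros(3)                              # initial weights
for t in range(5000):
    k = rng.integers(len(X))                 # 1. pick one training example
    err_k = d[k] - w @ X[k]                  # 2. eps_k(t) = d_k - w^T(t) x_k
    w = w + 2 * mu * err_k * X[k]            # 3.-4. LMS weight update
print(w)                                     # no p or R is ever computed
```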
2.1.4 Conjugate Gradient Descent
-- Drawback: can only minimize quadratic functions,
e.g., $f(w) = \frac{1}{2} w^T A w - b^T w + c$
• Adequate for our error function $\varepsilon(w) = \overline{d_k^2} + w^T R w - 2 p^T w$.
• Advantage: guaranteed to find the optimum solution in at most $n$ iterations, where $n$ is the size of the matrix $A$.
• The size of the correlation matrix $R$ is the dimension of the input vectors $x$.
A-Conjugate Vectors:
Let $A_{n \times n}$ be a square, symmetric, positive-definite matrix. The vectors $S = \{s(0), s(1), \ldots, s(n-1)\}$ are A-conjugate if
$s^T(i)\, A\, s(j) = 0, \quad i \neq j.$
* If A = I (the identity matrix), conjugacy = orthogonality.
• The conjugate-direction method finds the $w$ minimizing $f(w)$ through
$w(i+1) = w(i) + \alpha(i)\, s(i), \quad i = 0, \ldots, n-1,$
where $w(0)$ is an arbitrary initial vector and $\alpha(i)$ is determined by $\min_{\alpha} f(w(i) + \alpha\, s(i))$.
∵ $f(w) = \frac{1}{2} w^T A w - b^T w + c$,
$f(w(i) + \alpha s(i)) = \frac{1}{2} (w(i) + \alpha s(i))^T A\, (w(i) + \alpha s(i)) - b^T (w(i) + \alpha s(i)) + c$
Let $\frac{d}{d\alpha} f(w(i) + \alpha s(i)) = 0$:
$\frac{d}{d\alpha} f(w(i) + \alpha s(i)) = \frac{1}{2} \left( w^T(i) A s(i) + s^T(i) A w(i) + 2\alpha\, s^T(i) A s(i) \right) - b^T s(i)$
$= s^T(i) A w(i) + \alpha\, s^T(i) A s(i) - b^T s(i) = 0$ (using the symmetry of $A$).
Define $r(i) = -\nabla_w f(w) = b - A w(i)$; solving for $\alpha$ gives $\alpha(i) = \dfrac{s^T(i)\, r(i)}{s^T(i)\, A\, s(i)}$.
Let $s(i) = r(i) + \beta(i)\, s(i-1)$, $i = 1, \ldots, n-1$ (A), with $s(0) = r(0) = b - A w(0)$.
To determine $\beta(i)$, premultiply (A) by $s^T(i-1) A$:
$s^T(i-1)\, A\, s(i) = s^T(i-1)\, A\, (r(i) + \beta(i)\, s(i-1))$ (B)
A-conjugacy requires the left side of (B) to be zero, so
$\beta(i) = -\dfrac{s^T(i-1)\, A\, r(i)}{s^T(i-1)\, A\, s(i-1)}.$
Summary: the conjugate-direction method for the error function $\varepsilon(w) = \overline{d_k^2} + w^T R w - 2 p^T w$:
Let $w(i+1) = w(i) + \alpha(i)\, s(i)$, $i = 0, 1, \ldots, n-1$, where $w(0)$ is an arbitrary starting vector and
$\alpha(i) = \dfrac{2 p^T s(i) - \left( s^T(i) R w(i) + w^T(i) R s(i) \right)}{2\, s^T(i) R s(i)}$,
$s(i) = r(i) + \beta(i)\, s(i-1)$, with $s(0) = r(0) = 2(p - R w(0))$,
$r(i) = 2(p - R w(i))$, $\beta(i) = -\dfrac{s^T(i-1)\, R\, r(i)}{s^T(i-1)\, R\, s(i-1)}$.
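For concreteness, a minimal NumPy sketch of this summary, reusing the assumed synthetic setup from the earlier examples (all names are illustrative assumptions):

```python
import numpy as np

# Conjugate-direction minimization of eps(w) = mean(d^2) + w^T R w - 2 p^T w.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, -2.0, 0.5])
d = X @ w_true + 0.1 * rng.normal(size=500)
R = (X.T @ X) / len(X)
p = (X.T @ d) / len(X)

n = R.shape[0]                                # at most n iterations
w = np.zeros(n)                               # arbitrary starting vector w(0)
r = 2 * (p - R @ w)                           # r(0) = 2(p - R w(0))
s = r.copy()                                  # s(0) = r(0)
for i in range(n):
    alpha = (s @ r) / (2 * (s @ R @ s))       # line-search step alpha(i)
    w = w + alpha * s                         # w(i+1) = w(i) + alpha(i) s(i)
    r = 2 * (p - R @ w)                       # r(i+1) = 2(p - R w(i+1))
    beta = -(s @ R @ r) / (s @ R @ s)         # beta(i+1)
    s = r + beta * s                          # next R-conjugate direction
print(w)                                      # minimizer R^{-1} p after n steps
```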
Example: a comparison of the convergence of gradient descent (green) and conjugate gradient (red) for minimizing a quadratic function. Conjugate gradient converges in at most n steps, where n is the size of the system matrix (here n = 2).
2.3. Applications
2.3.1 Signal Prediction
An adaptive filter is used to model a plant. The inputs to the filter are the same as those to the plant, and the filter adjusts its weights based on the difference between its output and the output of the plant.
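A sketch of this configuration, assuming the unknown plant is a short FIR filter and adapting the model with the LMS rule from 2.1.3 (the plant, names, and constants are all illustrative assumptions):

```python
import numpy as np

# Adaptive filter modeling an unknown plant: both see the same input,
# and the filter weights are driven by (plant output - filter output).
rng = np.random.default_rng(0)
plant = np.array([0.8, -0.4, 0.2])        # assumed unknown FIR plant

n_taps, mu = 3, 0.05
w = np.zeros(n_taps)                      # adaptive filter weights
buf = np.zeros(n_taps)                    # sliding window of recent inputs
for t in range(5000):
    buf = np.roll(buf, 1)
    buf[0] = rng.normal()                 # common input to plant and filter
    d = plant @ buf                       # plant output (desired signal)
    y = w @ buf                           # filter output
    w = w + 2 * mu * (d - y) * buf        # LMS update from the difference
print(w)                                  # converges to the plant coefficients
```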
2.3.3. Echo Cancellation in Telephone Circuits
2.4 Madaline: Many Adalines
○ XOR function: this problem cannot be solved by a single Adaline.
○ Many Adalines can be joined in a layer (see the sketch below).
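As an illustration of why a layer helps, here is a minimal sketch in which two Adalines plus a third (output) Adaline realize XOR on bipolar inputs; the weights are hand-picked for illustration, not trained:

```python
import numpy as np

def adaline(x, w, b):
    """Bipolar Adaline: hard-limited linear combiner."""
    return 1 if w @ x + b > 0 else -1

# Hand-picked weights (illustrative assumptions, not trained):
# h1 fires only for (x1, x2) = (+1, -1), h2 only for (-1, +1),
# and the output unit ORs them, yielding XOR.
for x1 in (-1, 1):
    for x2 in (-1, 1):
        x = np.array([x1, x2], dtype=float)
        h1 = adaline(x, np.array([1.0, -1.0]), -1.0)
        h2 = adaline(x, np.array([-1.0, 1.0]), -1.0)
        y = adaline(np.array([h1, h2], dtype=float), np.array([1.0, 1.0]), 1.0)
        print(f"x = ({x1:+d}, {x2:+d})  ->  y = {y:+d}")  # +1 iff x1 != x2
```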
2.4.2. Madaline Rule II (MRII)
○ Training algorithm – a trial-and-error procedure with a minimum-disturbance principle: the nodes that can affect the output error while incurring the least change in their weights take precedence in the learning process.
○ Procedure –
1. Input a training pattern.
2. Count the number of incorrect values in the output layer.
3. For all units in the output layer:
3.1. Select the first previously unselected error node whose analog output is closest to the threshold (∵ this node can reverse its bipolar output with the least change in its weights).
3.2. Change its weights such that the bipolar output of the unit changes.
3.3. Input the same training pattern.
3.4. If the number of errors is reduced, accept the weight change; otherwise, restore the original weights.
4. Repeat Step 3 for the hidden layer.
5. For all units in the output layer:
5.1. Select the previously unselected pair of units whose outputs are closest to their thresholds.
5.2. Apply a weight correction to both units so as to change their bipolar outputs.
5.3. Input the same training pattern.
5.4. If the number of errors is reduced, accept the correction; otherwise, restore the original weights.
6. Repeat Step 5 for the hidden layer.
※ Steps 5 and 6 can be repeated with triplets, quadruplets, or longer combinations of units until satisfactory results are obtained.
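A toy, heavily simplified sketch in this spirit on the XOR problem. The network size, flip margin, acceptance test, and the small LMS step used on the output unit are all assumptions made for the sketch, not the exact MRII of the slides:

```python
import numpy as np

# MRII-flavored trial-and-error training of a 2-2-1 Madaline on bipolar XOR.
# (Trial-and-error training is not guaranteed to converge; rerun with
# another seed if needed.)
rng = np.random.default_rng(3)
X = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
D = np.array([-1, 1, 1, -1])                       # bipolar XOR targets

W1 = rng.normal(size=(2, 2)); b1 = rng.normal(size=2)  # hidden Adalines
W2 = rng.normal(size=2);      b2 = rng.normal()        # output Adaline

def sgn(v):
    return np.where(v >= 0, 1, -1)

def predict(x):
    h = sgn(W1 @ x + b1)
    return sgn(W2 @ h + b2)

def errors():
    return sum(int(predict(x) != d) for x, d in zip(X, D))

for epoch in range(300):
    for x, d in zip(X, D):
        if predict(x) == d:
            continue
        base = errors()
        nets = W1 @ x + b1
        for j in np.argsort(np.abs(nets)):         # least disturbance first
            old = (W1[j].copy(), b1[j])
            net = W1[j] @ x + b1[j]
            # Smallest weight change that pushes this unit's net input
            # just past the threshold, reversing its bipolar output.
            delta = (-net - 0.1 * sgn(net)) / (x @ x + 1.0)
            W1[j] = W1[j] + delta * x
            b1[j] = b1[j] + delta
            if errors() < base:
                base = errors()                    # accept the weight change
            else:
                W1[j], b1[j] = old                 # restore original weights
        h = sgn(W1 @ x + b1)                       # small LMS step on output
        err = d - (W2 @ h + b2)
        W2 = W2 + 0.1 * err * h
        b2 = b2 + 0.1 * err

print("outputs:", [int(predict(x)) for x in X], "targets:", list(D))
```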
2.4.3. A Madaline for Translation-Invariant Pattern Recognition
○ Relationships among the weight matrices of the Adalines
○ Extension -- Multiple slabs with different key weight matrices for discriminating more than two classes of patterns