ANN - Ch2-Adaline and Madaline


Adaline and Madaline

Adaline: Adaptive Linear Neuron
Madaline: Multiple Adaline
2.1 Adaline (Bernard Widrow, Stanford Univ.)
Neuron model:

    y = f(w^T x)

Adaline: a neuron model with a linear activation function,

    f(x) = x,  so that  y = f(w^T x) = w^T x
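For concreteness, a minimal sketch (not from the slides) of the Adaline output computation in Python; the function name and the bias-as-extra-input convention are assumptions:

```python
import numpy as np

def adaline_output(w, x):
    """Adaline: linear activation, so the output is simply y = w^T x."""
    return np.dot(w, x)

# Example with two inputs plus a bias term (x[-1] = 1).
w = np.array([0.5, -0.3, 0.1])
x = np.array([1.0, 2.0, 1.0])
print(adaline_output(w, x))   # 0.5*1 - 0.3*2 + 0.1*1 = 0.0
```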
2.1.1 Least Mean Square (LMS) Learning
◎ Input vectors: {x_1, x_2, ..., x_L}
   Ideal outputs: {d_1, d_2, ..., d_L}
   Actual outputs: {y_1, y_2, ..., y_L}

Mean square error:

    ξ = ⟨ε_k^2⟩ = (1/L) Σ_{k=1}^{L} ε_k^2,   where ε_k = d_k - y_k = d_k - w^T x_k

    ε_k^2 = d_k^2 + w^T x_k x_k^T w - 2 d_k x_k^T w    -- (2.4)

Let ξ = ⟨ε_k^2⟩, p = ⟨d_k x_k⟩, and R = ⟨x_k x_k^T⟩ (the correlation matrix). Then

    (2.4)  ⇒  ξ = ⟨d_k^2⟩ + w^T R w - 2 p^T w
Idea: w* = arg min_w ξ(w)

Let dξ(w)/dw = 2 R w - 2 p = 0. Obtain w* = R^{-1} p.

Practical difficulties of the analytical formula:
1. Large dimensions – R^{-1} is difficult to calculate.
2. Need to calculate the expected values p = ⟨d_k x_k⟩ and R = ⟨x_k x_k^T⟩ –
   lack of knowledge of the underlying probabilities.
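Where the expectations are replaced by sample averages over a finite training set, the analytical solution can be sketched as follows (an illustrative sketch, not from the slides; the function name and toy data are assumptions):

```python
import numpy as np

def lms_analytical(X, d):
    """Closed-form Adaline weights w* = R^{-1} p, with R and p
    estimated by sample averages over the training set.
    X: (L, n) matrix of input vectors, d: (L,) desired outputs."""
    L = X.shape[0]
    R = X.T @ X / L               # sample estimate of <x_k x_k^T>
    p = X.T @ d / L               # sample estimate of <d_k x_k>
    return np.linalg.solve(R, p)  # avoids forming R^{-1} explicitly

# Toy usage: learn d = 2*x1 - x2 from noisy samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
d = 2 * X[:, 0] - X[:, 1] + 0.01 * rng.normal(size=200)
print(lms_analytical(X, d))       # approximately [ 2., -1.]
```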
2.1.2 Steepest Descent

The graph of ξ(w) = ⟨d_k^2⟩ + w^T R w - 2 p^T w is a paraboloid.
Steps:
1. Initialize weight values w(t_0).
2. Determine the steepest descent direction:

    Δw(t) = -∇_w ξ(w(t)) = -dξ(w(t))/dw(t) = 2 (p - R w(t))

3. Modify weight values:

    w(t+1) = w(t) + μ Δw(t),   μ: step size

4. Repeat 2~3.

No calculation of R^{-1}.

Drawbacks:
i) Need to calculate p and R.
ii) Steepest descent is a batch training method.
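A minimal sketch of these steps, assuming R and p have already been estimated from the data (the function name, step size, and fixed iteration count are assumptions):

```python
import numpy as np

def steepest_descent(R, p, mu=0.05, n_iters=500):
    """Batch steepest descent on xi(w) = <d_k^2> + w^T R w - 2 p^T w."""
    w = np.zeros_like(p)              # step 1: initialize w(t0)
    for _ in range(n_iters):          # step 4: repeat steps 2~3
        delta_w = 2 * (p - R @ w)     # step 2: -grad xi(w) = 2(p - R w)
        w = w + mu * delta_w          # step 3: w(t+1) = w(t) + mu * delta_w(t)
    return w
```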
2.1.3 Stochastic Gradient Descent
Approximate Δw(t) = -∇_w ξ(w(t)) = 2 (p - R w(t)) by randomly
selecting one training example at a time:

1. Apply an input vector x_k.
2. ε_k^2(t) = (d_k - y_k)^2 = (d_k - w^T(t) x_k)^2
3. ∇_w ξ(w(t)) = ∇_w ⟨ε_k^2(t)⟩ ≈ ∇_w ε_k^2(t)
       = -2 (d_k - w^T(t) x_k) x_k = -2 ε_k(t) x_k
4. w(t+1) = w(t) + μ Δw(t) = w(t) + 2 μ ε_k(t) x_k
5. Repeat 1~4 with the next input vector.

No calculation of p and R.
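A hedged sketch of this per-example update applied directly to raw training data (the function name, learning rate, and epoch count are assumptions):

```python
import numpy as np

def lms_sgd(X, d, mu=0.01, n_epochs=20, seed=0):
    """Stochastic LMS: one randomly chosen example per update,
    w <- w + 2*mu*eps_k*x_k, with eps_k = d_k - w^T(t) x_k."""
    rng = np.random.default_rng(seed)
    L, n = X.shape
    w = np.zeros(n)
    for _ in range(n_epochs):
        for k in rng.permutation(L):     # visit examples in random order
            eps_k = d[k] - w @ X[k]      # error on the selected example
            w += 2 * mu * eps_k * X[k]   # step 4 of the procedure above
    return w
```

On the same kind of toy data as in the analytical sketch above, this should converge to approximately the same weights, up to a small misadjustment caused by the fixed step size.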
Drawback: time-consuming.
Improvement: the mini-batch training method.

○ Practical Considerations:
(a) No. of training vectors, (b) Stopping criteria
(c) Initial weights, (d) Step size
2.1.4 Conjugate Gradient Descent
-- Drawback: it can only minimize quadratic functions, e.g.,

    f(w) = (1/2) w^T A w + b^T w + c

   This is adequate for our error function

    ξ(w) = ⟨d_k^2⟩ + w^T R w - 2 p^T w

• Advantage: it is guaranteed to find the optimum solution in at most n
  iterations, where n is the size of the matrix A.
• The size of the correlation matrix R is the dimension of the input vectors x.
A-Conjugate Vectors:
Let A_{n×n} be a square, symmetric, positive-definite matrix.
S = {s(0), s(1), ..., s(n-1)} are A-conjugate vectors if

    s^T(i) A s(j) = 0,   i ≠ j

* If A = I (identity matrix), conjugacy = orthogonality.
• The conjugate-direction method for finding the w that minimizes f(w) uses

    w(i+1) = w(i) + η(i) s(i),   i = 0, ..., n-1

  where w(0) is an arbitrary initial vector and η(i) is determined by
  min_η f(w(i) + η s(i)).

Since f(w) = (1/2) w^T A w + b^T w + c,

    f(w(i) + η s(i)) = (1/2) (w(i) + η s(i))^T A (w(i) + η s(i))
                       + b^T (w(i) + η s(i)) + c

Let d/dη f(w(i) + η s(i)) = 0:

    d/dη f(w(i) + η s(i))
      = (1/2) [ s(i)^T A w(i) + w(i)^T A s(i) + 2 η s(i)^T A s(i) ] + b^T s(i) = 0

    ⇒  η(i) = - [ s(i)^T A w(i) + w(i)^T A s(i) + 2 b^T s(i) ] / [ 2 s(i)^T A s(i) ]
How to determine s(i)?

    f(w) = (1/2) w^T A w + b^T w + c  ⇒  ∇_w f(w) = A w + b

Define r(i) = -∇_w f(w(i)) = -(b + A w(i)).

Let s(i) = r(i) + β(i) s(i-1),  i = 1, ..., n-1    -- (A)
    s(0) = r(0) = -(b + A w(0))

To determine β(i), premultiply (A) by s^T(i-1) A:

    s^T(i-1) A s(i) = s^T(i-1) A ( r(i) + β(i) s(i-1) )    -- (B)

In order to be A-conjugate, s^T(i) A s(j) = 0 for i ≠ j, so

    (B)  ⇒  0 = s^T(i-1) A r(i) + β(i) s^T(i-1) A s(i-1)

    ⇒  β(i) = - s^T(i-1) A r(i) / s^T(i-1) A s(i-1)
Summary: The conjugate-direction method for minimizing the error function
ξ(w) = ⟨d_k^2⟩ + w^T R w - 2 p^T w:

    w(i+1) = w(i) + η(i) s(i),   i = 0, 1, ..., n-1
    w(0) is an arbitrary starting vector

    η(i) = [ 2 p^T s(i) - ( s(i)^T R w(i) + w(i)^T R s(i) ) ] / [ 2 s(i)^T R s(i) ]

    s(i) = r(i) + β(i) s(i-1),   s(0) = r(0) = 2 (p - R w(0))

    r(i) = 2 (p - R w(i)),   β(i) = - s^T(i-1) R r(i) / s^T(i-1) R s(i-1)
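A hedged NumPy sketch of this summarized iteration on the quadratic error surface (the function name and the toy R, p are assumptions):

```python
import numpy as np

def conjugate_direction(R, p, w0=None):
    """Conjugate-direction minimization of xi(w) = <d_k^2> + w^T R w - 2 p^T w,
    following the update rules summarized above (at most n iterations)."""
    n = len(p)
    w = np.zeros(n) if w0 is None else w0.astype(float)
    r = 2 * (p - R @ w)                  # r(0) = 2(p - R w(0))
    s = r.copy()                         # s(0) = r(0)
    for _ in range(n):
        eta = (2 * p @ s - (s @ R @ w + w @ R @ s)) / (2 * s @ R @ s)
        w = w + eta * s                  # w(i+1) = w(i) + eta(i) s(i)
        r = 2 * (p - R @ w)              # new residual r(i+1)
        beta = -(s @ R @ r) / (s @ R @ s)
        s = r + beta * s                 # s(i+1) = r(i+1) + beta(i+1) s(i)
    return w

# Toy check with n = 2: the result should equal R^{-1} p.
R = np.array([[2.0, 0.5], [0.5, 1.0]])
p = np.array([1.0, -1.0])
print(conjugate_direction(R, p), np.linalg.solve(R, p))
```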
Example: a comparison of the convergence of gradient descent (green) and
conjugate gradient (red) for minimizing a quadratic function. Conjugate
gradient converges in at most n steps, where n is the size of the matrix of
the system (here n = 2).
2.3. Applications
2.3.1 Predict Signal

An adaptive filter is trained to predict a signal. The input used to train
the filter is a delayed version of the actual signal; the expected (desired)
output is the current signal.
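As a hedged illustration of this setup (not from the slides), an Adaline fed from a tapped delay line can be trained with the stochastic LMS rule above to predict the current sample from delayed ones; all names and parameter values are assumptions:

```python
import numpy as np

def train_predictor(signal, taps=5, delay=1, mu=0.01, epochs=10):
    """LMS-trained Adaline predictor: inputs are `taps` samples delayed by
    `delay` steps, the desired output is the current sample."""
    w = np.zeros(taps)
    for _ in range(epochs):
        for t in range(taps + delay, len(signal)):
            x = signal[t - delay - taps + 1 : t - delay + 1]  # delayed window
            eps = signal[t] - w @ x                           # prediction error
            w += 2 * mu * eps * x                             # LMS update
    return w

# Toy usage: predict a sine wave one sample ahead.
s = np.sin(0.2 * np.arange(500))
w = train_predictor(s, taps=5, delay=1)
x = s[499 - 5:499]          # same delayed window used in training (delay = 1)
print(w @ x, s[499])        # prediction vs. actual sample (should be close)
```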
2.3.2 Reproduce Signal

An adaptive filter is used to model a plant. The inputs to the filter are the
same as those to the plant. The filter adjusts its weights based on the
difference between its output and the output of the plant.
2.3.3. Echo Cancellation in Telephone Circuits

s: outgoing voice,  r: incoming voice
n: noise (leakage of r)
y: the output of the filter, which mimics n
2.3.4. Adaptive Beam-Forming Antenna Arrays

Antenna: a spatial array of sensors that are directional in their reception
characteristics.
The adaptive filter learns to steer the antennae so that they can respond to
incoming signals no matter what their directions are, while reducing responses
to unwanted noise signals coming in from other directions.
2.4 Madaline: Many Adalines

○ XOR function: this problem cannot be solved by a single adaline.

Reason: w_1 x_1 + w_2 x_2 = θ specifies a line in the (x_1, x_2) plane.
The two neurons in the hidden layer provide two lines that separate the plane
into three regions. The two regions containing (0,0) and (1,1) are associated
with a network output of 0; the central region is associated with a network
output of 1.
There are many solution pairs of lines; the figure shows one solution. (A
hedged code sketch of one such fixed solution follows below.)
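A minimal sketch (not from the slides) of a madaline with two hidden adalines and one output adaline that realizes XOR in this way; the particular weights and thresholds are one assumed solution among many:

```python
import numpy as np

def adaline_unit(w, theta, x):
    """Threshold adaline: bipolar output sign(w^T x - theta)."""
    return 1 if np.dot(w, x) - theta > 0 else -1

def madaline_xor(x1, x2):
    """XOR via the two lines x1 + x2 = 0.5 and x1 + x2 = 1.5."""
    x = np.array([x1, x2], dtype=float)
    h1 = adaline_unit(np.array([1.0, 1.0]), 0.5, x)   # fires if x1 + x2 > 0.5
    h2 = adaline_unit(np.array([1.0, 1.0]), 1.5, x)   # fires if x1 + x2 > 1.5
    # Central region: h1 = +1 and h2 = -1  ->  network output 1, else 0.
    return 1 if adaline_unit(np.array([1.0, -1.0]), 1.0, np.array([h1, h2])) > 0 else 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, madaline_xor(a, b))   # prints the XOR truth table
```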
○ Many adalines can be joined in a layer.
2.4.2. Madaline Rule II (MRII)
○ Training algorithm: a trial-and-error procedure based on a minimum
  disturbance principle (those nodes that can affect the output error while
  incurring the least change in their weights should have precedence in the
  learning process).

○ Procedure (a hedged code sketch follows the note after Step 6 below):
1. Input a training pattern.
2. Count the number of incorrect values in the output layer.
3. For all units on the output layer:
   3.1. Select the first previously unselected error node whose analog output
        is closest to the threshold (because this node can reverse its bipolar
        output with the least change in its weights).
   3.2. Change its weights so that the bipolar output of the unit changes.
   3.3. Input the same training pattern.
   3.4. If the number of errors is reduced, accept the weight change;
        otherwise restore the original weights.
4. Repeat Step 3 for the hidden layer.
5. For all units on the output layer:
   5.1. Select the previously unselected pair of units whose outputs are
        closest to their thresholds.
   5.2. Apply a weight correction to both units in order to change their
        bipolar outputs.
   5.3. Input the same training pattern.
   5.4. If the number of errors is reduced, accept the correction; otherwise
        restore the original weights.
6. Repeat Step 5 for the hidden layer.
※ Steps 5 and 6 can be repeated with triplets, quadruplets, or longer
   combinations of units until satisfactory results are obtained.
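A hedged, simplified sketch of the single-unit part of this procedure for one training pattern (one hidden layer, bipolar units). Every name, the weight-flip rule, and the trial ordering are assumptions made for illustration, not the exact original specification:

```python
import numpy as np

def bipolar(z):
    """Bipolar hard limiter used by all units."""
    return np.where(z >= 0, 1.0, -1.0)

def forward(x_aug, W_h, W_o):
    """One pattern through a single-hidden-layer madaline (bias appended)."""
    h_aug = np.append(bipolar(x_aug @ W_h), 1.0)
    return bipolar(h_aug @ W_o), h_aug

def pattern_errors(x_aug, d, W_h, W_o):
    y, _ = forward(x_aug, W_h, W_o)
    return int(np.sum(y != d))

def flip_unit(W, j, inp, delta=0.1):
    """Minimum-disturbance trial: nudge column j just enough to reverse the
    sign of its net input for this pattern."""
    W_t = W.copy()
    net = inp @ W[:, j]
    s = 1.0 if net >= 0 else -1.0
    W_t[:, j] += (-net - delta * s) * inp / (inp @ inp)
    return W_t

def mrii_pattern(x_aug, d, W_h, W_o, delta=0.1):
    """Steps 1-4 of the procedure (single-unit trials) for one pattern."""
    errs = pattern_errors(x_aug, d, W_h, W_o)          # step 2
    _, h_aug = forward(x_aug, W_h, W_o)
    for j in np.argsort(np.abs(h_aug @ W_o)):          # step 3: output layer,
        if errs == 0:                                  # closest to threshold first
            break
        W_t = flip_unit(W_o, j, h_aug, delta)          # step 3.2
        e = pattern_errors(x_aug, d, W_h, W_t)         # step 3.3
        if e < errs:                                   # step 3.4
            W_o, errs = W_t, e
    for j in np.argsort(np.abs(x_aug @ W_h)):          # step 4: hidden layer
        if errs == 0:
            break
        W_t = flip_unit(W_h, j, x_aug, delta)
        e = pattern_errors(x_aug, d, W_t, W_o)
        if e < errs:
            W_h, errs = W_t, e
    return W_h, W_o
```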

The MRII learning rule assumes a network with only one hidden layer and
input/output patterns that are all bipolar vectors. For networks with more
hidden layers and general input/output patterns, the backpropagation learning
strategy, to be discussed later, can be employed.
2.4.3. A Madaline for Translation-Invariant Pattern Recognition

○ Relationships among the weight matrices of the Adalines

○ Extension: multiple slabs with different key weight matrices for
  discriminating more than two classes of patterns
