Perceptron and Adaline: Fig. 11.1 Single Layer Network With One Output and Two Inputs
11.1 INTRODUCTION
This chapter describes single layer neural networks, including some of the classical approaches to the neural computing and learning problem. In the first part of this chapter we discuss the representational power of single layer networks and their learning algorithms, and give some examples of using the networks. In the second part we discuss the representational limitations of single layer networks.
Two classical models are described in the first part of the chapter: the Perceptron, proposed by Rosenblatt, and the Adaline, presented by Widrow and Hoff.
A single layer feed-forward network consists of one or more output neurons o, each of which is
connected with a weighting factor wio to all of the inputs i. In the simplest case the network has only two
inputs and a single output, as sketched in Fig. 11.1 (we leave the output index o out). The input of the
neuron is the weighted sum of the inputs plus the bias term. The output of the network is formed by the
activation of the output neuron, which is some function of the input:
Fig. 11.1 Single layer network with one output and two inputs (inputs x1 and x2, weights w1 and w2, bias θ).
y = F(Σ_{i=1}^{2} w_i x_i + θ)    ...(11.1)
The activation function F can be linear so that we have a linear network, or non-linear. In this
section we consider the threshold (sgn) function:
F(s) = +1 if s > 0, −1 otherwise    ...(11.2)
The output of the network thus is either +1 or −1, depending on the input. The network can now be used for a classification task: it can decide whether an input pattern belongs to one of two classes. If the total input is positive, the pattern will be assigned to class +1; if the total input is negative, the sample will be assigned to class −1. The separation between the two classes in this case is a straight line, given by the equation:
by the equation:
w1 x1 + w2 x2 + θ = 0    ...(11.3)
The single layer network represents a linear discriminant function.
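The classification rule of equations (11.1)-(11.3) can be sketched in a few lines of code; the weight values below are illustrative, not taken from the text.

```python
def perceptron_output(x, w, theta):
    """Threshold (sgn) unit: +1 if the weighted sum plus bias is positive, else -1."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + theta
    return 1 if s > 0 else -1

# Illustrative weights: the line x1 + x2 - 1 = 0 separates the two classes.
w, theta = [1.0, 1.0], -1.0
print(perceptron_output([2.0, 2.0], w, theta))    # point above the line -> +1
print(perceptron_output([-1.0, -1.0], w, theta))  # point below the line -> -1
```

Any input on the positive side of the line w1 x1 + w2 x2 + θ = 0 is mapped to +1, any input on the other side to −1.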
A geometrical representation of the linear threshold neural network is given in Fig. 11.2. Equation
(11.3) can be written as
x2 = −(w1/w2) x1 − θ/w2    ...(11.4)
and we see that the weights determine the slope of the line and the bias determines the offset, i.e. how far the line is from the origin. Note that the weights can also be plotted in the input space: the weight vector is always perpendicular to the discriminant function.
Fig. 11.2 Geometric representation of the discriminant function and the weights (the weight vector (w1, w2) is perpendicular to the line, which lies at distance θ/||W|| from the origin).
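Equation (11.4) can be checked numerically. The weights below, w = (1, 2) and θ = −2 (the initial weights of Example 11.1), give the line x2 = −0.5 x1 + 1.

```python
def line_params(w1, w2, theta):
    """Slope and intercept of the decision line w1*x1 + w2*x2 + theta = 0,
    solved for x2 as in eq. (11.4); requires w2 != 0."""
    slope = -w1 / w2
    intercept = -theta / w2
    return slope, intercept

slope, intercept = line_params(1.0, 2.0, -2.0)
print(slope, intercept)  # -0.5 1.0
```

A point on this line, e.g. (2, 0), indeed satisfies w1 x1 + w2 x2 + θ = 2 + 0 − 2 = 0.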
Δθ = 0 if the perceptron responds correctly, d(x) otherwise    ...(11.7)
Now define cos α = (w · w*)/||w||, where w* is a weight vector that performs the desired mapping, normalised so that ||w*|| = 1.
When, according to the perceptron learning rule, connection weights are modified at a given input x, we know that Δw = d(x)x, and the weight after modification is w′ = w + Δw. From this it follows that:

w′ · w* = w · w* + d(x) w* · x
        = w · w* + sgn(w* · x) w* · x
        > w · w* + δ

||w′||² = ||w + d(x)x||²
        = w² + 2d(x) w · x + x²
        < w² + x²    (because d(x) = −sgn[w · x] when a modification occurs)
        = w² + M

Here δ = min_x |w* · x| > 0 and M = max_x ||x||².
After t modifications we have:

w(t) · w* > w · w* + tδ
||w(t)||² < w² + tM

such that

cos α(t) = (w* · w(t))/||w(t)||
         > (w · w* + tδ)/√(w² + tM)

lim_{t→∞} cos α(t) = lim_{t→∞} (δ/√M) √t = ∞, while cos α ≤ 1.
The conclusion is that there must be an upper limit tmax for t: the system modifies its connections only a limited number of times. In other words, after at most tmax modifications of the weights the perceptron correctly performs the mapping. tmax will be reached when cos α = 1. If we start with connections w = 0,

tmax = M/δ²    ...(11.8)
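The bound (11.8) can be illustrated on a small linearly separable set. The data points and the reference vector w* below are invented for the illustration; the bias is omitted and w starts at 0, as in the proof.

```python
import math

# Training set: (input, target), separable by w* = (1, 1)/sqrt(2).
data = [((1, 2), 1), ((2, 1), 1), ((-1, -2), -1), ((-2, -1), -1)]
w_star = (1 / math.sqrt(2), 1 / math.sqrt(2))

dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
M = max(dot(x, x) for x, _ in data)                 # max squared input norm
delta = min(abs(dot(w_star, x)) for x, _ in data)   # margin of w*

# Perceptron rule without bias, w = 0 at start; count weight modifications.
w, updates = [0.0, 0.0], 0
for _ in range(100):  # epochs; the loop exits once all patterns are correct
    changed = False
    for x, d in data:
        out = 1 if dot(w, x) > 0 else -1
        if out != d:
            w = [wi + d * xi for wi, xi in zip(w, x)]
            updates += 1
            changed = True
    if not changed:
        break

print(updates, M / delta ** 2)  # the update count never exceeds M / delta^2
```

For this set, M = 5 and δ = 3/√2, so tmax = M/δ² ≈ 1.1: the perceptron converges after a single weight modification.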
Example 11.1: A perceptron is initialized with the following weights: w1 = 1, w2 = 2, θ = −2. The perceptron learning rule is used to learn a correct discriminant function for a number of samples, sketched in Fig. 11.3.
Fig. 11.3 Training samples A, B, and C in the input space.
The first sample A, with values x = (0.5, 1.5) and target value d(x) = +1, is presented to the network. From equation (11.1) it can be calculated that the network output is +1, so no weights are adjusted. The same is the case for point B, with values x = (−0.5, 0.5) and target value d(x) = −1; the network output is negative, so no change. When presenting point C with values x = (0.5, 0.5), the network output will be −1, while the target value d(x) = +1.
According to the perceptron learning rule, the weight changes are:
Δw1 = 0.5, Δw2 = 0.5, Δθ = 1.
The new weights are now:
w1 = 1.5, w2 = 2.5, θ = −1,
and sample C is classified correctly.
In Fig. 11.3 the discriminant function before and after this weight update is shown.
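Example 11.1 can be traced step by step in code, assuming initial weights w = (1, 2), θ = −2 and samples A = (0.5, 1.5) with d = +1, B = (−0.5, 0.5) with d = −1, C = (0.5, 0.5) with d = +1, and the updates Δw = d(x)·x, Δθ = d(x) for misclassified samples.

```python
def output(x, w, theta):
    """Threshold unit of eq. (11.1)-(11.2)."""
    s = w[0] * x[0] + w[1] * x[1] + theta
    return 1 if s > 0 else -1

w, theta = [1.0, 2.0], -2.0
samples = [((0.5, 1.5), 1), ((-0.5, 0.5), -1), ((0.5, 0.5), 1)]  # A, B, C

for x, d in samples:
    if output(x, w, theta) != d:  # only C is misclassified
        w = [w[0] + d * x[0], w[1] + d * x[1]]
        theta += d

print(w, theta)  # [1.5, 2.5] -1.0
```

After the single update triggered by C, the network classifies all three samples correctly.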
An important generalisation of the perceptron training algorithm was presented by Widrow and Hoff as the least mean square (LMS) learning procedure, also known as the delta rule. The main functional difference with the perceptron training rule is the way the output of the system is used in the learning rule. The perceptron learning rule uses the output of the threshold function (either −1 or +1) for learning. The delta rule uses the net output without further mapping into output values −1 or +1.
The learning rule was applied to the adaptive linear element, also named Adaline, developed by
Widrow and Hoff. In a simple physical implementation (Fig. 11.4) this device consists of a set of
controllable resistors connected to a circuit, which can sum up currents caused by the input voltage
signals. Usually the central block, the summer, is also followed by a quantiser, which outputs either +1 or −1, depending on the polarity of the sum.
Although the adaptive process is here exemplified in a case when there is only one output, it may be
clear that a system with many parallel outputs is directly implementable by multiple units of the above
kind.
If the input conductances are denoted by wi, i = 0; 1,..., n, and the input and output signals by xi and
y, respectively, then the output of the central block is defined to be
Fig. 11.4 The Adaline: weighted inputs (w0 ... wn) are summed (Σ) and passed through a quantiser with output levels +1 and −1.
y = Σ_{i=1}^{n} w_i x_i + θ    ...(11.9)
where θ = w0. The purpose of this device is to yield a given value y = d^p at its output when the set of values x_i^p, i = 1, 2, ..., n, is applied at the inputs. The problem is to determine the coefficients w_i, i = 0, 1, ..., n, in such a way that the input-output response is correct for a large number of arbitrarily chosen signal sets. If an exact mapping is not possible, the average error must be minimised, for instance in the sense of least squares. An adaptive operation means that there exists a mechanism by which the w_i can be adjusted, usually iteratively, to attain the correct values. For the Adaline, Widrow introduced the delta rule to adjust the weights.
For a single layer network with an output unit with a linear activation function the output is simply
given by
y = Σ_j w_j x_j + θ    ...(11.10)
Such a simple network is able to represent a linear relationship between the value of the output unit
and the value of the input units. By thresholding the output value, a classifier can be constructed (such
as Adaline), but here we focus on the linear relationship and use the network for a function
approximation task. In high-dimensional input spaces the network represents a (hyper)plane, and clearly multiple output units may be defined as well.
Suppose we want to train the network such that a hyperplane is fitted as well as possible to a set of
training samples consisting of input values x^p and desired (or target) output values d^p. For every given input sample, the output of the network differs from the target value d^p by (d^p − y^p), where y^p is the actual output for this pattern. The delta rule now uses a cost- or error-function based on these differences to adjust the weights.
The error function, as indicated by the name least mean square, is the summed squared error. That
is, the total error E is defined to be
E = Σ_p E^p = ½ Σ_p (d^p − y^p)²    ...(11.11)
where the index p ranges over the set of input patterns and E p represents the error on pattern p. The LMS
procedure finds the values of all the weights that minimize the error function by a method called
gradient descent. The idea is to make a change in the weight proportional to the negative of the
derivative of the error as measured on the current pattern with respect to each weight:
Δ_p w_j = −γ ∂E^p/∂w_j    ...(11.12)

where γ is a constant of proportionality (the learning rate). The derivative is

∂E^p/∂w_j = (∂E^p/∂y^p) · (∂y^p/∂w_j)    ...(11.13)

Because of the linear units (eq. (11.10)),

∂y^p/∂w_j = x_j    ...(11.14)

and

∂E^p/∂y^p = −(d^p − y^p)    ...(11.15)

such that

Δ_p w_j = γ δ^p x_j    ...(11.16)
where δ^p = d^p − y^p is the difference between the target output and the actual output for pattern p. The delta rule modifies the weights appropriately for target and actual outputs of either polarity and for both continuous and binary input and output units. These characteristics have opened up a wealth of new applications.
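The update (11.16) can be sketched as a small gradient-descent loop that fits a line; the training data and the learning rate γ below are illustrative.

```python
# Delta rule (11.16): delta_p w_j = gamma * (d^p - y^p) * x_j, linear unit (11.10).
# Fit y = 2*x + 1 from noiseless samples; gamma and the data are illustrative.
data = [(x, 2 * x + 1) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]
w, theta, gamma = 0.0, 0.0, 0.05

for _ in range(2000):           # repeated pattern-by-pattern updates
    for x, d in data:
        y = w * x + theta       # linear activation, eq. (11.10)
        w += gamma * (d - y) * x
        theta += gamma * (d - y)

print(round(w, 3), round(theta, 3))  # approaches 2.0 and 1.0
```

Because the mapping is exactly representable by a linear unit, the summed squared error (11.11) is driven to zero and the weights converge to the true slope and offset.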
In the previous sections we have discussed two learning algorithms for single layer networks, but we
have not discussed the limitations on the representation of these networks.
Table 11.1 The exclusive-or truth table

x1    x2    d
−1    −1    −1
−1     1     1
 1    −1     1
 1     1    −1
One of Minsky and Papert's most discouraging results shows that a single layer perceptron cannot represent a simple exclusive-or function. Table 11.1 shows the desired relationships between inputs and output units for this function.
In a simple network with two inputs and one output, as depicted in Fig. 11.1, the net input is equal
to:
s = w1 x1 + w2 x2 + θ    ...(11.17)
According to eq. (11.2), the output of the perceptron is −1 when s is negative and +1 when s is positive. In Fig. 11.5 a geometrical representation of the input domain is given. For a constant θ, the output of the perceptron is equal to +1 on one side of the dividing line, which is defined by:
w1 x1 + w2 x2 = −θ    ...(11.18)
and equal to −1 on the other side of this line.
To see that such a solution cannot be found, take a look at Fig. 11.5. The input space consists of four points, and the two solid circles at (1, −1) and (−1, 1) cannot be separated by a straight line from the two open circles at (−1, −1) and (1, 1). The obvious question to ask is: how can this problem be overcome?
Minsky and Papert prove that for binary inputs, any transformation can be carried out by adding a layer
of predicates which are connected to all inputs. The proof is given in the next section.
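The non-separability claim can also be checked numerically: a brute-force search over a grid of weight settings finds solutions for AND and OR but none for XOR. A grid search is of course only a demonstration, not a proof; the function and grid below are illustrative.

```python
def solvable(targets, grid):
    """Return True if some (w1, w2, theta) in the grid classifies all four
    (+/-1)-valued input points according to `targets`."""
    pts = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    for w1 in grid:
        for w2 in grid:
            for theta in grid:
                ok = all(
                    (1 if w1 * x1 + w2 * x2 + theta > 0 else -1) == d
                    for (x1, x2), d in zip(pts, targets)
                )
                if ok:
                    return True
    return False

grid = [x / 2 for x in range(-6, 7)]  # -3.0 ... 3.0 in steps of 0.5
AND = [-1, -1, -1, 1]
OR  = [-1, 1, 1, 1]
XOR = [-1, 1, 1, -1]
print(solvable(AND, grid), solvable(OR, grid), solvable(XOR, grid))  # True True False
```

AND is solved by, e.g., w = (1, 1), θ = −1, and OR by w = (1, 1), θ = 1; for XOR no straight line exists, in accordance with the geometric argument.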
Fig. 11.5 Geometric representation of the input space for the AND, OR, and XOR functions.
For the specific XOR problem we geometrically show that by introducing hidden units, thereby
extending the network to a multi-layer perceptron, the problem can be solved. Fig. 11.6a demonstrates
that the four input points are now embedded in a three-dimensional space defined by the two inputs plus
the single hidden unit. These four points are now easily separated by a linear manifold (plane) into two
groups, as desired. This simple example demonstrates that adding hidden units increases the class of
problems that are soluble by feed-forward, perceptron-like networks. However, by this generalization of the basic architecture we have also incurred a serious loss: we no longer have a learning rule to determine the optimal weights.
Fig. 11.6 (a) The perceptron of Fig. 11.1 with an extra hidden unit. With the indicated values of the weights wij (next to the connecting lines) and the thresholds θi (in the circles) this perceptron solves the XOR problem. (b) This is accomplished by mapping the four points of Fig. 11.5 onto the four points indicated here; clearly, separation (by a linear manifold) into the required groups is now possible.
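A concrete weight assignment (one of many possible choices; these particular values are chosen for illustration and need not match the figure) shows how a single extra hidden unit solves XOR for ±1-valued inputs:

```python
def sgn(s):
    """Threshold of eq. (11.2): +1 if s > 0, else -1."""
    return 1 if s > 0 else -1

def xor_net(x1, x2):
    # Hidden unit: an AND detector, h = +1 only for (1, 1).
    h = sgn(x1 + x2 - 1)
    # Output unit: fires for a positive total input, suppressed by h,
    # so only the mixed-sign inputs (1, -1) and (-1, 1) yield +1.
    return sgn(x1 + x2 - 3 * h - 1)

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, xor_net(x1, x2))  # +1 exactly when x1 != x2
```

In the three-dimensional space (x1, x2, h), the four points become linearly separable, which is exactly what the output unit exploits.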
In the previous section we showed that by adding an extra hidden unit, the XOR problem can be solved. For binary units, one can prove that this architecture is able to perform any transformation given the correct connections and weights. The most primitive proof is the following. For a given transformation y = d(x), we can divide the set of all possible input vectors into two classes:
X+ = {x | d(x) = 1} and X− = {x | d(x) = −1}    ...(11.19)
Since there are N input units, the total number of possible input vectors x is 2^N. For every x^p ∈ X+ a hidden unit h can be reserved of which the activation y_h is 1 if and only if the specific pattern p is present at the input: we can choose its weights w_ih equal to the specific pattern x^p and the bias θ_h equal to ½ − N, such that

y_h^p = sgn(Σ_i w_ih x_i^p − N + ½)    ...(11.20)
is equal to 1 for xp = wh only. Similarly, the weights to the output neuron can be chosen such that the
output is one as soon as one of the M predicate neurons is one:
y_o^p = sgn(Σ_{h=1}^{M} y_h + M − ½)    ...(11.21)
This perceptron will give y_o = 1 only if x ∈ X+: it performs the desired mapping. The problem is the large number of predicate units, which is equal to the number of patterns in X+, which is at most 2^N. Of course we can do the same trick for X−, and we will always take the minimal number of mask units, which is at most 2^(N−1). A more elegant proof is given by Minsky and Papert, but the point is that for complex transformations the number of required units in the hidden layer is exponential in N.
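The construction of (11.20)-(11.21) can be verified directly in code. Here it is instantiated for XOR (N = 2), so X+ = {(−1, 1), (1, −1)} and M = 2; the helper names are of course ours, not the text's.

```python
from itertools import product

def sgn(s):
    return 1 if s > 0 else -1

def build_net(x_plus, n):
    """One hidden 'predicate' unit per pattern in X+ (eq. 11.20), plus an
    output unit that fires as soon as any predicate fires (eq. 11.21)."""
    m = len(x_plus)
    def net(x):
        # Each hidden unit's weights equal its reserved pattern; the unit is
        # +1 only when exactly that pattern is present at the input.
        hidden = [sgn(sum(wi * xi for wi, xi in zip(p, x)) - n + 0.5)
                  for p in x_plus]
        return sgn(sum(hidden) + m - 0.5)
    return net

xor = build_net([(-1, 1), (1, -1)], n=2)
for x in product((-1, 1), repeat=2):
    print(x, xor(x))  # +1 exactly on the patterns in X+
```

The same construction works for any Boolean function of N inputs, at the cost of up to 2^(N−1) hidden units, which is precisely the exponential blow-up the section warns about.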