Artificial Neural Network: Lecture Module 22


Neural Networks
An artificial neural network (ANN) is a machine learning
approach loosely modeled on the human brain; it consists of a
number of artificial neurons.
Neurons in ANNs tend to have fewer connections than
biological neurons.
Each neuron in an ANN receives a number of inputs.
An activation function is applied to these inputs, which
yields the activation level of the neuron (the output value of
the neuron).
Knowledge about the learning task is given in the
form of examples called training examples.

An artificial neural network is specified by:

a neuron model: the information processing unit of the NN,
an architecture: a set of neurons and links connecting
neurons, where each link has a weight,
a learning algorithm: used for training the NN by modifying
the weights so that it models a particular learning task
correctly on the training examples.
The aim is to obtain a NN that is trained and
generalizes well.
It should behave correctly on new instances of the
learning task.
Neuron
The neuron is the basic information processing unit of
a NN. It consists of:
1. A set of links, describing the neuron inputs, with weights $w_1, w_2, \dots, w_m$.
2. An adder function (linear combiner) for computing the
weighted sum of the inputs (real numbers):
$$u = \sum_{j=1}^{m} w_j x_j$$
3. An activation function $\varphi$ for limiting the amplitude of the
neuron output. Here b denotes the bias.
$$y = \varphi(u + b)$$
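The Neuron Diagram
[Figure: the input values x1, ..., xm are multiplied by the weights w1, ..., wm and combined with the bias b by the summing function to give the induced field v; the activation function maps v to the output y.]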
Bias of a Neuron

The bias b has the effect of applying an affine transformation
to the weighted sum u:
$$v = u + b$$
The bias is an external parameter of the neuron. It
can be modeled by adding an extra input $x_0 = 1$ whose weight is $w_0 = b$.
v is called the induced field of the neuron:
$$v = \sum_{j=0}^{m} w_j x_j, \qquad w_0 = b$$
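The two ways of handling the bias can be checked numerically. The following is a minimal sketch, assuming a logistic activation and made-up weights and inputs; the function names are illustrative and do not come from the slides.

```python
import math

def weighted_sum(weights, inputs):
    """Linear combiner: u = sum_j w_j * x_j."""
    return sum(w * x for w, x in zip(weights, inputs))

def logistic(v):
    """Example activation function (an assumption; any activation works)."""
    return 1.0 / (1.0 + math.exp(-v))

def neuron_output(weights, inputs, bias, activation=logistic):
    """y = activation(u + b), with the bias kept as a separate parameter."""
    v = weighted_sum(weights, inputs) + bias
    return activation(v)

def neuron_output_bias_as_input(weights, inputs, bias, activation=logistic):
    """Same neuron, with the bias folded in as weight w_0 on a constant input x_0 = 1."""
    v = weighted_sum([bias] + list(weights), [1.0] + list(inputs))
    return activation(v)

# Hypothetical values chosen only for illustration.
w, x, b = [0.5, -1.0], [2.0, 1.0], 0.25
y_separate = neuron_output(w, x, b)
y_folded = neuron_output_bias_as_input(w, x, b)
```

Both calls compute the same induced field v = 0.25, so the two outputs agree.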
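Neuron Models
The choice of activation function determines the
neuron model.
Examples:
step function:
$$\varphi(v) = \begin{cases} a & \text{if } v < c \\ b & \text{if } v \ge c \end{cases}$$
ramp function:
$$\varphi(v) = \begin{cases} a & \text{if } v < c \\ b & \text{if } v > d \\ a + \dfrac{(v - c)(b - a)}{d - c} & \text{otherwise} \end{cases}$$
sigmoid function with z, x, y parameters:
$$\varphi(v) = z + \frac{1}{1 + \exp(-xv + y)}$$
Gaussian function:
$$\varphi(v) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2}\left(\frac{v - \mu}{\sigma}\right)^{2}\right)$$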
[Figures: plots of the step function (threshold at c), the ramp function (rising between c and d), and the sigmoid function.]
The Gaussian function is the probability density function of the
normal distribution. It is sometimes also called the frequency
curve.
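The four activation functions listed above can be sketched as plain functions. The default parameter values are assumptions chosen for illustration, and the step function is assumed to return b exactly at the threshold c.

```python
import math

def step(v, a=0.0, b=1.0, c=0.0):
    """Step function: a below the threshold c, b at or above it (boundary choice is an assumption)."""
    return a if v < c else b

def ramp(v, a=0.0, b=1.0, c=0.0, d=1.0):
    """Ramp function: a below c, b above d, linear interpolation in between."""
    if v < c:
        return a
    if v > d:
        return b
    return a + (v - c) * (b - a) / (d - c)

def sigmoid(v, z=0.0, x=1.0, y=0.0):
    """Sigmoid with parameters z (offset), x (slope) and y (shift)."""
    return z + 1.0 / (1.0 + math.exp(-x * v + y))

def gaussian(v, mu=0.0, sigma=1.0):
    """Gaussian function: density of the normal distribution with mean mu and deviation sigma."""
    return math.exp(-0.5 * ((v - mu) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)
```

With the defaults, `sigmoid` is the standard logistic function and `gaussian` is the standard normal density.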
Network Architectures
Three different classes of network architectures

single-layer feed-forward
multi-layer feed-forward
recurrent

The architecture of a neural network is linked
with the learning algorithm used to train it.
Single Layer Feed-forward

[Figure: an input layer of source nodes connected directly to an output layer of neurons.]
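Perceptron: Neuron Model
(Special form of single layer feed forward)
The perceptron, first proposed by Rosenblatt (1958), is a
simple neuron that is used to classify its input into one of two
categories.
A perceptron uses a step function that returns +1 if the weighted
sum of its inputs is at least 0 and -1 otherwise:
$$\varphi(v) = \begin{cases} +1 & \text{if } v \ge 0 \\ -1 & \text{if } v < 0 \end{cases}$$
[Figure: the inputs x1, ..., xn with weights w1, ..., wn and the bias b are summed to give v; the output is y, obtained by applying the step function to v.]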
Perceptron for Classification
The perceptron is used for binary classification.
First train a perceptron for a classification task.
Find suitable weights in such a way that the training examples are
correctly classified.
Geometrically try to find a hyper-plane that separates the examples
of the two classes.
The perceptron can only model linearly separable classes.
When the two classes are not linearly separable, it may be
desirable to obtain a linear separator that minimizes the
mean squared error.
Given training examples of classes C1 and C2, train the
perceptron in such a way that:
If the output of the perceptron is +1, then the input is assigned to
class C1.
If the output is -1, then the input is assigned to C2.
[Figure: the Boolean function OR is linearly separable.]
Learning Process for Perceptron

Initially assign random weights between -0.5
and +0.5 to the inputs.
Training data is presented to the perceptron and its output is
observed.
If the output is incorrect, the weights are adjusted
using the following formula:
wi ← wi + (a * xi * e), where e is the error produced and a (between -1 and 1;
a small positive value in practice) is the learning rate.
e is defined as 0 if the output is correct; it is positive if the output is too low
and negative if the output is too high.
Once the weights have been modified, the next piece
of training data is used in the same way.
Once all the training data have been applied, the process starts
again until all the weights are correct and all errors are zero.
Each pass through the training data is known as an epoch.
Example: Perceptron to learn OR
function
Initially consider w1 = -0.2 and w2 = 0.4. Here Step(v) returns 1 if v > 0 and 0 otherwise.
For training data x1 = 0 and x2 = 0, the desired output is 0.
Compute y = Step(w1*x1 + w2*x2) = Step(0) = 0. The output is correct,
so the weights are not changed.
For training data x1 = 0 and x2 = 1, the desired output is 1.
Compute y = Step(w1*x1 + w2*x2) = Step(0.4) = 1. The output is
correct, so the weights are not changed.
For the next training data, x1 = 1 and x2 = 0, the desired output is 1.
Compute y = Step(w1*x1 + w2*x2) = Step(-0.2) = 0. The output is
incorrect, hence the weights are to be changed.
Assume a = 0.2; the error is e = 1 (desired output minus actual output).
wi = wi + (a * xi * e) gives w1 = -0.2 + 0.2*1*1 = 0 and w2 = 0.4 + 0.2*0*1 = 0.4.
With these weights, test the remaining training data.
Repeat the process until the result is stable.
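The worked example can be run to completion with a short script. This is a minimal sketch, assuming Step(v) returns 1 for v > 0 and 0 otherwise, no bias input, the error e taken as desired minus actual output, and the starting weights and learning rate from the example; the function names are illustrative.

```python
def step(v):
    """Threshold at 0: 1 if v > 0, else 0 (assumed from the worked example)."""
    return 1 if v > 0 else 0

def train_perceptron(data, weights, rate=0.2, max_epochs=100):
    """Apply w_i <- w_i + rate * x_i * e until an epoch produces no errors."""
    w = list(weights)
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for x, target in data:
            y = step(sum(wi * xi for wi, xi in zip(w, x)))
            e = target - y  # 0 if correct, +1 if too low, -1 if too high
            if e != 0:
                errors += 1
                w = [wi + rate * xi * e for wi, xi in zip(w, x)]
        if errors == 0:
            return w, epoch
    return w, max_epochs

OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
final_weights, epochs_used = train_perceptron(OR_DATA, [-0.2, 0.4])
predictions = [step(sum(w * xi for w, xi in zip(final_weights, x))) for x, _ in OR_DATA]
```

Under these assumptions w1 is raised from -0.2 to 0 in the first epoch and to 0.2 in the second, and the third epoch passes without errors.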
Perceptron: Limitations
The perceptron can only model linearly separable
functions,
that is, functions whose values, when plotted on a 2-dimensional graph,
can be separated into two parts by a single straight line.
The Boolean functions given below are linearly
separable:
AND
OR
COMPLEMENT
It cannot model the XOR function, as XOR is not linearly
separable.
XOR: a non-linearly separable function

A typical example of a non-linearly separable function is
XOR, which computes the logical exclusive or.
This function takes two input arguments with values in
{0,1} and returns one output in {0,1}.
Here 0 and 1 encode the truth values false and
true.
The output is true if and only if the two inputs have
different truth values.
XOR is a non-linearly separable function, so it cannot be
modeled by a perceptron.
For such functions we have to use a multi-layer feed-forward
network.
The two classes (true and false) cannot be separated using a
line. Hence XOR is not linearly separable.
Multi layer feed-forward NN (FFNN)

The FFNN is a more general network architecture, in which there
are hidden layers between the input and output layers.
Hidden nodes neither receive inputs directly from nor send
outputs directly to the external environment.
FFNNs overcome the limitation of single-layer NNs.
They can handle non-linearly separable learning tasks.

[Figure: a 3-4-2 network with an input layer, one hidden layer, and an output layer.]
FFNN for XOR

The ANN for XOR has two hidden nodes that realize this non-linear
separation, and it uses the sign (step) activation function.
Arrows from the input nodes to the two hidden nodes indicate the
directions of the weight vectors (1,-1) and (-1,1).
The output node is used to combine the outputs of the two hidden
nodes.
Since we are representing the two states by 0 (false) and 1 (true),
we map the negative outputs (-1, -0.5) of the hidden and output
layers to 0 and the positive output (0.5) to 1.
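A hand-wired network of this shape can be checked against the XOR truth table. This is a minimal sketch: the hidden weight vectors (1,-1) and (-1,1) come from the slide, while the thresholds of 0.5 and the output weights of 1 are assumptions chosen so that the network works, since the figure with the original values is not available.

```python
def step01(v):
    """Map a positive induced field to 1 and a non-positive one to 0."""
    return 1 if v > 0 else 0

def xor_network(x1, x2):
    # Hidden node 1: weight vector (1, -1), assumed threshold 0.5.
    h1 = step01(1 * x1 - 1 * x2 - 0.5)
    # Hidden node 2: weight vector (-1, 1), assumed threshold 0.5.
    h2 = step01(-1 * x1 + 1 * x2 - 0.5)
    # Output node combines the two hidden outputs (assumed weights 1, 1 and threshold 0.5).
    return step01(h1 + h2 - 0.5)

truth_table = {(x1, x2): xor_network(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
```

Each hidden node fires for exactly one of the two "true" inputs, and the output node acts as an OR of the hidden nodes.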
FFNN NEURON MODEL
The classical learning algorithm of the FFNN is based on
the gradient descent method.
For this reason the activation functions used in FFNNs
are continuous functions of the weights, differentiable
everywhere.
The activation function for node i may be defined as a
simple form of the sigmoid function in the following
manner:
$$\varphi(V_i) = \frac{1}{1 + e^{-A V_i}}$$
where A > 0 and $V_i = \sum_j W_{ij} Y_j$, such that $W_{ij}$ is the weight of the
link from node j to node i and $Y_j$ is the output of node j.
Training Algorithm: Backpropagation
The backpropagation algorithm learns in the same way
as a single perceptron.
It searches for weight values that minimize the total
error of the network over the set of training examples
(the training set).
Backpropagation consists of the repeated application of
the following two passes:
Forward pass: in this step, the network is activated on one
example and the error of (each neuron of) the output layer is
computed.
Backward pass: in this step, the network error is used for
updating the weights. The error is propagated backwards from
the output layer through the network, layer by layer. This is done
by recursively computing the local gradient of each neuron.
Backpropagation

[Figure: the back-propagation training algorithm. Forward step: network activation. Backward step: error propagation.]

Backpropagation adjusts the weights of the NN in order
to minimize the total mean squared error of the network.
Consider a network of three layers.
Let us use i to represent nodes in the input layer, j to
represent nodes in the hidden layer, and k to represent nodes
in the output layer.
$w_{ij}$ refers to the weight of the connection between node i in the
input layer and node j in the hidden layer.
The following equation is used to derive the output
value $Y_j$ of node j:
$$Y_j = \frac{1}{1 + e^{-X_j}}$$
where $X_j = \sum_i x_i w_{ij} - \theta_j$, $1 \le i \le n$; n is the number of inputs to
node j, and $\theta_j$ is the threshold for node j.
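The output of a single hidden node can be computed directly from this definition. This is a minimal sketch with made-up inputs, weights, and threshold; the function name is illustrative.

```python
import math

def node_output(inputs, weights, theta):
    """Y_j = 1 / (1 + exp(-X_j)), with X_j = sum_i x_i * w_ij - theta_j."""
    x_j = sum(x * w for x, w in zip(inputs, weights)) - theta
    return 1.0 / (1.0 + math.exp(-x_j))

# Hypothetical values: X_j = 1*0.5 + 0*0.3 - 0.5 = 0, so Y_j = 0.5.
y_j = node_output([1.0, 0.0], [0.5, 0.3], theta=0.5)
```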
Total Mean Squared Error

The error of output neuron k after the activation of the
network on the n-th training example (x(n), d(n)) is:
$$e_k(n) = d_k(n) - y_k(n)$$
The network error is the sum of the squared errors of the
output neurons:
$$E(n) = \sum_k e_k^2(n)$$
The total mean squared error is the average of the
network errors of the training examples:
$$E_{AV} = \frac{1}{N} \sum_{n=1}^{N} E(n)$$
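These error measures translate into a few lines of code. This is a minimal sketch, assuming the network error is the plain sum of squared output errors as stated in the text; the desired and actual outputs are made up for illustration.

```python
def network_error(desired, actual):
    """E(n): sum over output neurons of the squared error e_k = d_k - y_k."""
    return sum((d - y) ** 2 for d, y in zip(desired, actual))

def total_mean_squared_error(desired_list, actual_list):
    """E_AV: average of E(n) over the N training examples."""
    errors = [network_error(d, y) for d, y in zip(desired_list, actual_list)]
    return sum(errors) / len(errors)

# Hypothetical outputs of a network with two output neurons on two examples.
desired = [[1.0, 0.0], [0.0, 1.0]]
actual = [[0.5, 0.5], [0.0, 1.0]]
e_av = total_mean_squared_error(desired, actual)  # (0.5 + 0.0) / 2
```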
Weight Update Rule
The Backprop weight update rule is based on the
gradient descent method:
It takes a step in the direction yielding the maximum
decrease of the network error E.
This direction is the opposite of the gradient of E.
Iteration of the Backprop algorithm is usually
terminated when the sum of squares of errors of the
output values for all training data in an epoch is less
than some threshold such as 0.01.
$$w_{ij} \leftarrow w_{ij} + \Delta w_{ij}, \qquad \Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}$$
where $\eta$ is the learning rate.
Backprop learning algorithm
(incremental-mode)
n = 1;
initialize weights randomly;
while (stopping criterion not satisfied and n < max_iterations)
    for each example (x, d)
        - run the network with input x and compute the output y
        - update the weights in backward order, starting from
          those of the output layer:
              w_ji ← w_ji + Δw_ji
          with Δw_ji computed using the (generalized) Delta rule
    end-for
    n = n + 1;
end-while;
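The loop above can be sketched in runnable form for a network with one hidden layer. This is an illustration under stated assumptions, not the slides' own implementation: sigmoid units with thresholds, a fixed number of epochs as the stopping criterion, a learning rate of 0.5, XOR as the training set, and the constant factor from differentiating the squared error absorbed into the learning rate. All names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_ih, theta_h, w_ho, theta_o):
    """Forward pass: returns the hidden-layer outputs and the output-layer outputs."""
    hidden = [sigmoid(sum(x[i] * w_ih[i][j] for i in range(len(x))) - theta_h[j])
              for j in range(len(theta_h))]
    out = [sigmoid(sum(hidden[j] * w_ho[j][k] for j in range(len(hidden))) - theta_o[k])
           for k in range(len(theta_o))]
    return hidden, out

def average_error(data, params):
    """E_AV: mean over the examples of the summed squared output errors."""
    total = 0.0
    for x, d in data:
        _, y = forward(x, *params)
        total += sum((dk - yk) ** 2 for dk, yk in zip(d, y))
    return total / len(data)

def train(data, n_in, n_hid, n_out, eta=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    w_ih = [[rng.uniform(-0.5, 0.5) for _ in range(n_hid)] for _ in range(n_in)]
    w_ho = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_hid)]
    theta_h = [rng.uniform(-0.5, 0.5) for _ in range(n_hid)]
    theta_o = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
    params = (w_ih, theta_h, w_ho, theta_o)
    initial_error = average_error(data, params)
    for _ in range(epochs):
        for x, d in data:  # incremental mode: update after every example
            hidden, y = forward(x, *params)
            # Local gradients of the output neurons.
            delta_o = [(d[k] - y[k]) * y[k] * (1.0 - y[k]) for k in range(n_out)]
            # Local gradients of the hidden neurons, computed before the output weights change.
            delta_h = [hidden[j] * (1.0 - hidden[j]) *
                       sum(delta_o[k] * w_ho[j][k] for k in range(n_out))
                       for j in range(n_hid)]
            # Update in backward order: output layer first, then hidden layer.
            for j in range(n_hid):
                for k in range(n_out):
                    w_ho[j][k] += eta * delta_o[k] * hidden[j]
            for k in range(n_out):
                theta_o[k] -= eta * delta_o[k]
            for i in range(n_in):
                for j in range(n_hid):
                    w_ih[i][j] += eta * delta_h[j] * x[i]
            for j in range(n_hid):
                theta_h[j] -= eta * delta_h[j]
    return params, initial_error, average_error(data, params)

XOR = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
params, error_before, error_after = train(XOR, n_in=2, n_hid=3, n_out=1)
```

Comparing `error_before` with `error_after` shows how far the average squared error has fallen; whether the network fully learns XOR depends on the random initial weights and the number of epochs.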
Stopping criteria
Total mean squared error change:
Back-prop is considered to have converged when the
absolute rate of change in the average squared error per
epoch is sufficiently small (in the range [0.01, 0.1]).
Generalization-based criterion:
After each epoch, the NN is tested for generalization.
If the generalization performance is adequate, then stop.
If this stopping criterion is used, then the part of the training
set used for testing the network generalization is not used
for updating the weights.
NN DESIGN ISSUES

Data representation
Network Topology
Network Parameters
Training
Validation
Data Representation
Data representation depends on the problem.
In general ANNs work on continuous (real valued) attributes.
Therefore symbolic attributes are encoded into continuous
ones.
Attributes of different types may have different ranges of
values which affect the training process.
Normalization may be used, like the following one, which
scales each attribute to assume values between 0 and 1:
$$x_i \leftarrow \frac{x_i - \min_i}{\max_i - \min_i}$$
for each value $x_i$ of the i-th attribute, where $\min_i$ and $\max_i$ are the minimum and
maximum values of that attribute over the training set.
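This min-max scaling can be sketched for a single attribute as follows. Mapping a constant attribute to 0.0 is an assumption added here to avoid division by zero, and the sample values are made up.

```python
def min_max_normalize(values):
    """Scale the values of one attribute to the range [0, 1] using its training-set min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: no spread to scale (assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical attribute values from a training set.
scaled = min_max_normalize([10.0, 20.0, 30.0])
```

In practice the minimum and maximum found on the training set would also be reused to scale new inputs.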
Network Topology
The number of layers and neurons depends on the
specific task.
In practice this issue is solved by trial and error.
Two types of adaptive algorithms can be used:
start from a large network and successively remove some
neurons and links until network performance degrades.
begin with a small network and introduce new neurons until
performance is satisfactory.
Network parameters

How are the weights initialized?


How is the learning rate chosen?
How many hidden layers and how many
neurons?
How many examples in the training set?
Initialization of weights
In general, initial weights are randomly chosen, with
typical values between -1.0 and 1.0 or -0.5 and 0.5.
If some inputs are much larger than others, random
initialization may bias the network to give much more
importance to larger inputs.
In such a case, weights can be initialized as follows:

For weights from the input to the first layer:
$$w_{ij} = \pm \frac{1}{2N} \sum_{i=1,\dots,N} \frac{1}{|x_i|}$$
For weights from the first to the second layer:
$$w_{jk} = \pm \frac{1}{2N} \sum_{i=1,\dots,N} \frac{1}{\varphi\left(\sum w_{ij} x_i\right)}$$
Choice of learning rate

The right value of the learning rate $\eta$ depends on the application.
Values between 0.1 and 0.9 have been used in
many applications.
Another heuristic is to adapt $\eta$ during
training, as described in previous slides.
Training
Rule of thumb:
the number of training examples should be at least five to
ten times the number of weights of the network.
Other rule:
$$N > \frac{|W|}{1 - a}$$
where |W| is the number of weights and a is the expected accuracy on the test set.
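Both rules are easy to evaluate for a concrete network. This is a minimal sketch, assuming a fully connected 3-4-2 network whose weights are counted without bias terms; the function names are illustrative.

```python
def weight_count(layer_sizes):
    """Number of link weights in a fully connected feed-forward network (biases not counted)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

def rule_of_thumb_range(num_weights):
    """Five to ten times the number of weights."""
    return 5 * num_weights, 10 * num_weights

def accuracy_rule_bound(num_weights, expected_accuracy):
    """Lower bound |W| / (1 - a) on the number of training examples N."""
    if not 0.0 <= expected_accuracy < 1.0:
        raise ValueError("expected_accuracy must be in [0, 1)")
    return num_weights / (1.0 - expected_accuracy)

n_weights = weight_count([3, 4, 2])           # 3*4 + 4*2 = 20
low, high = rule_of_thumb_range(n_weights)    # 100 to 200 examples
bound = accuracy_rule_bound(n_weights, 0.9)   # about 200 examples
```

For this network and an expected accuracy of 0.9, the second rule agrees with the upper end of the rule of thumb.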
Recurrent Network
An FFNN is acyclic: data passes from the input to the
output nodes and not vice versa.
Once the FFNN is trained, its state is fixed and does not alter as
new data is presented to it. It does not have memory.
A recurrent network can have connections that go
backward from output to input nodes, and it can model
dynamic systems.
In this way, a recurrent network's internal state can be altered
as sets of input data are presented. It can be said to have
memory.
It is useful in solving problems where the solution depends not
just on the current inputs but on all previous inputs.
Applications
stock market price prediction,
weather forecasting
Recurrent Network Architecture
Recurrent network with hidden neurons: the unit delay operator d
is used to model a dynamic system.

[Figure: input, hidden, and output nodes, with feedback connections passing through unit delay operators d.]
Learning and Training
During the learning phase,
a recurrent network feeds its inputs through the network,
including feeding data back from outputs to inputs;
the process is repeated until the values of the outputs do not
change.
This state is called equilibrium or stability.
Recurrent networks can be trained by using the back-propagation
algorithm.
In this method, at each step, the activation of the
output is compared with the desired activation and
errors are propagated backward through the network.
Once this training process is completed, the network
becomes capable of performing a sequence of actions.
Hopfield Network
A Hopfield network is a kind of recurrent network, as
output values are fed back to the input in an undirected
way.
It consists of a set of N connected neurons with weights that
are symmetric, and no unit is connected to itself.
There are no special input and output neurons.
The activation of a neuron is a binary value decided by the sign
of the weighted sum of the connections to it.
A threshold value for each neuron determines whether it is a firing
neuron.
A firing neuron is one that activates all neurons that are
connected to it with a positive weight.
The input is simultaneously applied to all neurons, which then
output to each other.
This process continues until a stable state is reached.
Activation Algorithm
An active unit is represented by 1 and an inactive unit by 0.

Repeat
Choose any unit randomly. The chosen unit may be
active or inactive.
For the chosen unit, compute the sum of the weights
on the connections to the active neighbours only, if any.
If sum > 0 (the threshold is assumed to be 0), then the
chosen unit becomes active; otherwise it becomes
inactive.
If the chosen unit has no active neighbours, then ignore it;
its status remains the same.
Until the network reaches a stable state
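The activation algorithm can be sketched as follows. The three-unit weight matrix is a made-up example, a "neighbour" is assumed to be any unit connected by a non-zero weight, and the network is treated as stable when no unit would change if chosen.

```python
import random

def next_state_of_unit(state, weights, i):
    """Apply the update rule to unit i and return its new state (1 active, 0 inactive)."""
    active = [j for j in range(len(state))
              if j != i and state[j] == 1 and weights[i][j] != 0]
    if not active:
        return state[i]  # no active neighbours: status remains the same
    total = sum(weights[i][j] for j in active)
    return 1 if total > 0 else 0  # threshold assumed to be 0

def is_stable(state, weights):
    return all(next_state_of_unit(state, weights, i) == state[i]
               for i in range(len(state)))

def run_until_stable(state, weights, rng, max_steps=1000):
    """Repeatedly update a randomly chosen unit until the state stops changing."""
    state = list(state)
    for _ in range(max_steps):
        if is_stable(state, weights):
            break
        i = rng.randrange(len(state))
        state[i] = next_state_of_unit(state, weights, i)
    return state

# Hypothetical symmetric weights with a zero diagonal.
W = [[0, 1, -1],
     [1, 0, -1],
     [-1, -1, 0]]
final_state = run_until_stable([1, 0, 0], W, random.Random(0))
```

Starting from [1, 0, 0], only unit 1 can change (it is pulled on by unit 0 through a positive weight), so the network settles in [1, 1, 0] whatever order the units are chosen in.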
Stable Networks
Weight Computation Method
Weights are determined using training examples:
$$W = \sum_{i=1}^{M} X_i (X_i)^T - M I$$
Here
W is the weight matrix,
$X_i$ is an input example represented by a vector of N values
from the set {-1, 1}.
Here, N is the number of units in the network; 1 and -1
represent active and inactive units respectively.
$(X_i)^T$ is the transpose of the input $X_i$,
M denotes the number of training input vectors,
I is an N × N identity matrix.
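The weight matrix can be built from the quantities defined above as the sum of the outer products of the training vectors minus M times the identity. This is a minimal sketch under that assumption; the two four-unit patterns are made up for illustration and are not the vectors of the example that follows, and the synchronous sign-based recall is used only to check that the stored patterns are stable.

```python
def hopfield_weights(patterns):
    """W = sum_i X_i X_i^T - M * I for patterns with entries in {-1, 1}."""
    n, m = len(patterns[0]), len(patterns)
    w = [[0] * n for _ in range(n)]
    for x in patterns:
        for r in range(n):
            for c in range(n):
                w[r][c] += x[r] * x[c]
    for r in range(n):
        w[r][r] -= m  # subtracting M*I leaves a zero diagonal (no self-connections)
    return w

def recall(weights, x):
    """One synchronous update: each unit takes the sign of its weighted input."""
    n = len(x)
    return [1 if sum(weights[r][c] * x[c] for c in range(n)) >= 0 else -1
            for r in range(n)]

# Hypothetical training vectors (mutually orthogonal, four units).
P1 = [1, 1, -1, -1]
P2 = [1, -1, 1, -1]
W = hopfield_weights([P1, P2])
```

The resulting matrix is symmetric with a zero diagonal, and both stored patterns are reproduced unchanged by `recall`.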
Example
Let us now consider a Hopfield network with four units
and three training input vectors that are to be learned by
the network.
Consider three input examples, namely, X1, X2, and X3
defined as follows:
The networks generated using these weights
and input vectors are stable, except for X2.
X2 stabilizes to X1 (which is at Hamming
distance 1).
Finally, with the obtained weights and stable
states (X1 and X3), we can stabilize any new
(partial) pattern to one of those.
Radial-Basis Function Networks

A function is said to be a radial basis function (RBF) if
its output depends on the distance of the input from a
given stored vector.
The RBF neural network has an input layer, a hidden layer and
an output layer.
In such RBF networks, the hidden layer uses neurons with RBFs
as activation functions.
The outputs of all these hidden neurons are combined linearly at
the output node.
These networks have a wide variety of applications, such
as
function approximation,
time series prediction,
control and regression,
complex (non-linear) pattern classification tasks.
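RBF Architecture

[Figure: the inputs x1, ..., xm feed m1 hidden RBF units, whose outputs are weighted by w1, ..., wm1 and summed to give the output y.]
One hidden layer with RBF activation functions $\varphi_1, \dots, \varphi_{m1}$.
Output layer with linear activation function:
$$y = w_1 \varphi_1(\lVert x - t_1 \rVert) + \dots + w_{m1} \varphi_{m1}(\lVert x - t_{m1} \rVert)$$
where $\lVert x - t \rVert$ is the distance of $x = (x_1, \dots, x_m)$ from the center t.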
Here we require only the weights $w_i$ from the hidden layer to the
output layer.
The weights $w_i$ can be determined with the help of any
of the standard iterative methods described earlier for
neural networks.
However, since the approximating function given below
is linear with respect to $w_i$, the weights can be calculated directly using the
matrix methods of linear least squares, without having to
determine $w_i$ iteratively.
$$Y = f(X) = \sum_{i=1}^{N} w_i \varphi(\lVert X - t_i \rVert)$$
It should be noted that the approximating function f(X) is
differentiable with respect to $w_i$.
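The direct calculation of the output weights can be sketched with the normal equations of linear least squares. Several choices here are assumptions made for illustration: a Gaussian basis function of width 1, one center placed on each training input (so the fit interpolates the data exactly), XOR as the data set, and a small hand-written Gaussian elimination in place of a linear algebra library.

```python
import math

def distance(x, t):
    return math.sqrt(sum((xi - ti) ** 2 for xi, ti in zip(x, t)))

def gaussian_rbf(r, width=1.0):
    """Assumed basis function: exp(-(r / width)^2)."""
    return math.exp(-(r / width) ** 2)

def rbf_output(x, centers, weights):
    """f(x) = sum_i w_i * phi(||x - t_i||)."""
    return sum(w * gaussian_rbf(distance(x, t)) for w, t in zip(weights, centers))

def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_rbf_weights(inputs, targets, centers):
    """Least-squares weights from the normal equations (Phi^T Phi) w = Phi^T d."""
    phi = [[gaussian_rbf(distance(x, t)) for t in centers] for x in inputs]
    k, n = len(centers), len(inputs)
    a = [[sum(phi[p][i] * phi[p][j] for p in range(n)) for j in range(k)] for i in range(k)]
    rhs = [sum(phi[p][i] * targets[p] for p in range(n)) for i in range(k)]
    return solve_linear_system(a, rhs)

inputs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
targets = [0.0, 1.0, 1.0, 0.0]  # XOR
weights = fit_rbf_weights(inputs, targets, centers=inputs)
predictions = [rbf_output(x, inputs, weights) for x in inputs]
```

No iteration over epochs is needed: a single linear solve gives the weights, and with one center per training input the predictions reproduce the targets up to rounding error.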
Comparison

RBF NN | FF NN
Non-linear layered feed-forward networks. | Non-linear layered feed-forward networks.
Hidden layer of RBF is non-linear; the output layer of RBF is linear. | Hidden and output layers of FFNN are usually non-linear.
One single hidden layer. | May have more hidden layers.
Neuron model of the hidden neurons is different from the one of the output nodes. | Hidden and output neurons share a common neuron model.
Activation function of each hidden neuron in an RBF NN computes the Euclidean distance between the input vector and the center of that unit. | Activation function of each hidden neuron in a FFNN computes the inner product of the input vector and the synaptic weight vector of that neuron.
