Artificial Neural Network
Biological Neuron
A nerve cell (neuron) is a special biological cell that processes information.
According to one estimate, the human brain contains approximately 10^11 neurons
with numerous interconnections, approximately 10^15.
Schematic Diagram
Working of a Biological Neuron
As shown in the above diagram, a typical neuron consists of the following four parts
with the help of which we can explain its working −
• Dendrites − They are tree-like branches, responsible for receiving information
from the other neurons the cell is connected to. In a sense, they act as the
ears of the neuron.
• Soma − It is the cell body of the neuron and is responsible for processing the
information received from the dendrites.
• Axon − It is like a cable through which the neuron sends information.
• Synapses − They are the connections between the axon and the dendrites of
other neurons.
Before taking a look at the differences between Artificial Neural Network (ANN) and
Biological Neural Network (BNN), let us take a look at the similarities based on the
terminology between these two.
BNN          ANN
Soma         Node
Dendrites    Input
Axon         Output
The following table shows the comparison between ANN and BNN based on some
criteria mentioned.
Criteria      BNN                                              ANN
Processing    Massively parallel, slow but superior to ANN     Massively parallel, fast but inferior to BNN
Size          10^11 neurons and 10^15 interconnections         10^2 to 10^4 nodes (mainly depends on the type of application and network designer)
Learning      They can tolerate ambiguity                      Very precise, structured and formatted data is required to tolerate ambiguity
• Network Topology
• Adjustments of Weights or Learning
• Activation Functions
In this chapter, we will discuss these three building blocks of ANN in detail.
Network Topology
A network topology is the arrangement of a network along with its nodes and
connecting lines. According to the topology, ANN can be classified as the following
kinds −
Feedforward Network
In this network, the signal flows in only one direction, from the input layer to the
output layer, without any feedback loops.
Feedback Network
As the name suggests, a feedback network has feedback paths, which means the
signal can flow in both directions using loops. This makes it a non-linear dynamic
system, which changes continuously until it reaches a state of equilibrium. It may be
divided into the following types −
• Recurrent networks − They are feedback networks with closed loops.
Following are the two types of recurrent networks.
• Fully recurrent network − It is the simplest neural network architecture
because all nodes are connected to all other nodes and each node works as
both input and output.
• Jordan network − It is a closed loop network in which the output will go to
the input again as feedback as shown in the following diagram.
Supervised Learning
As the name suggests, this type of learning is done under the supervision of a
teacher, which makes the learning process dependent.
During the training of ANN under supervised learning, the input vector is presented
to the network, which produces an output vector. This output vector is compared
with the desired/target output vector, and an error signal is generated if the two
differ. On the basis of this error signal, the weights are adjusted until the actual
output matches the desired output.
Unsupervised Learning
As the name suggests, this type of learning is done without the supervision of a
teacher, which makes the learning process independent.
During the training of ANN under unsupervised learning, the input vectors of similar
type are combined to form clusters. When a new input pattern is applied, then the
neural network gives an output response indicating the class to which the input
pattern belongs.
There is no feedback from the environment as to what the desired output should be
or whether it is correct or incorrect. Hence, in this type of learning, the network itself
must discover the patterns and features in the input data, and the relation between
the input data and the output.
Reinforcement Learning
As the name suggests, this type of learning is used to reinforce or strengthen the
network over some critic information. This learning process is similar to supervised
learning; however, we might have very little information.
During the training of network under reinforcement learning, the network receives
some feedback from the environment. This makes it somewhat similar to supervised
learning. However, the feedback obtained here is evaluative, not instructive, which
means there is no teacher as in supervised learning. After receiving the feedback,
the network adjusts its weights to obtain better critic information in the future.
Activation Functions
It may be defined as the extra force or effort applied over the input to obtain an
exact output. In ANN, we can also apply activation functions over the input to get
the exact output. Following are some activation functions of interest −
• Linear activation function − It is also called the identity function as it
performs no input editing. It can be defined as −

F(x) = x

• Binary sigmoidal function − This activation function performs input editing
between 0 and 1. It is positive in nature, always bounded and strictly
increasing. It can be defined as −

F(x) = sigm(x) = 1/(1 + exp(−x))
• Bipolar sigmoidal function − This activation function performs input editing
between −1 and 1. Its output can be positive or negative. It is always
bounded, which means its output cannot be less than −1 or more than 1. Like
the binary sigmoid function, it is strictly increasing. It can be defined as −

F(x) = sigm(x) = 2/(1 + exp(−x)) − 1 = (1 − exp(−x))/(1 + exp(−x))
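For concreteness, here is a minimal Python (NumPy) sketch of these three activation
functions; the function names are our own −

    import numpy as np

    def linear(x):
        # Identity function: performs no input editing.
        return x

    def binary_sigmoid(x):
        # Bounded between 0 and 1, strictly increasing.
        return 1.0 / (1.0 + np.exp(-x))

    def bipolar_sigmoid(x):
        # Bounded between -1 and 1, strictly increasing.
        return 2.0 / (1.0 + np.exp(-x)) - 1.0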
Basically, learning means to adapt and change in response to a change in the
environment. ANN is a complex system, or more precisely a complex adaptive
system, which can change its internal structure based on the information passing
through it.
Why Is It Important?
Being a complex adaptive system, learning in ANN implies that a processing unit is
capable of changing its input/output behavior due to the change in environment.
Learning is important in ANN because, once a particular network is constructed, its
activation function and input/output vectors are fixed. Hence, to change the
input/output behavior, we need to adjust the weights.
Classification
It may be defined as the process of learning to distinguish sample data into
different classes by finding common features between samples of the same class.
For example, to perform the training of ANN, we have some training samples with
unique features, and to perform its testing we have some testing samples with
other unique features. Classification is an example of supervised learning.
This rule, one of the oldest and simplest, was introduced by Donald Hebb in his
book The Organization of Behavior in 1949. It is a kind of feed-forward,
unsupervised learning.
Basic Concept − This rule is based on a proposal given by Hebb, who wrote −
“When an axon of cell A is near enough to excite a cell B and repeatedly or
persistently takes part in firing it, some growth process or metabolic change takes
place in one or both cells such that A’s efficiency, as one of the cells firing B, is
increased.”
From the above postulate, we can conclude that the connections between two
neurons might be strengthened if the neurons fire at the same time and might
weaken if they fire at different times.
Mathematical Formulation − According to Hebbian learning rule, following is the
formula to increase the weight of connection at every time step.
Δw_ji(t) = α x_i(t) y_j(t)
Here, Δw_ji(t) = increment by which the weight of connection increases at
time step t

α = the positive and constant learning rate

x_i(t) = the input value from the pre-synaptic neuron at time step t

y_j(t) = the output of the post-synaptic neuron at the same time step t
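The following minimal Python (NumPy) sketch performs one Hebbian update step;
the function and variable names are our own, and the outer product applies the rule
to every connection at once −

    import numpy as np

    def hebb_update(w, x, y, alpha=0.1):
        # w: (outputs x inputs) weight matrix; x: input vector; y: output vector.
        # Implements delta_w_ji = alpha * x_i(t) * y_j(t) for all connections at once.
        return w + alpha * np.outer(y, x)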
This rule is an error-correcting, supervised learning algorithm for single-layer
feedforward networks with a linear activation function, introduced by Rosenblatt.
Basic Concept − Being supervised in nature, the error is calculated by comparing
the desired/target output with the actual output. If any difference is found, then the
weights of the connections must be changed.
Mathematical Formulation − To explain its mathematical formulation, suppose we
have a finite number ‘N’ of input vectors, x(n), along with their desired/target output
vectors, t(n), where n = 1 to N.
Now the output ‘y’ can be calculated, as explained earlier on the basis of the net
input, and activation function being applied over that net input can be expressed as
follows −
y = f(y_in) = 1,  if y_in > θ
            = 0,  if y_in ⩽ θ

where θ is the threshold.
The updating of weight can be done in the following two cases −
Case I − when t ≠ y, then
w(new) = w(old) + tx
Case II − when t = y, then
No change in weight
Introduced by Bernard Widrow and Marcian Hoff and also called the Least Mean
Square (LMS) method, this rule minimizes the error over all training patterns. It is a
kind of supervised learning algorithm with a continuous activation function.
Basic Concept − The basis of this rule is the gradient-descent approach. The delta
rule updates the synaptic weights so as to minimize the difference between the net
input to the output unit and the target value.
Mathematical Formulation − To update the synaptic weights, delta rule is given by
Δw_i = α x_i e_j

Here, Δw_i = weight change for the ith pattern;

α = the positive and constant learning rate;

x_i = the input value from the pre-synaptic neuron;

e_j = (t − y_in), the difference between the desired/target output and the actual
output y_in
The above delta rule is for a single output unit only.
The updating of weight can be done in the following two cases −
Case-I − when t ≠ y, then
w(new) = w(old) + Δw
Case-II − when t = y, then
No change in weight
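A minimal Python (NumPy) sketch of one delta-rule update for a single output unit
(all names are our own) −

    import numpy as np

    def delta_update(w, b, x, t, alpha=0.1):
        # Net input of the single (linear) output unit.
        y_in = b + np.dot(w, x)
        e = t - y_in                      # e = (t - y_in), the computed error
        # delta_w_i = alpha * x_i * e; the bias is updated with the same error.
        return w + alpha * e * x, b + alpha * e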
It is concerned with unsupervised training in which the output nodes try to compete
with each other to represent the input pattern. To understand this learning rule, we
must understand the competitive network which is given as follows −
Basic Concept of Competitive Network − This network is just like a single layer
feedforward network with feedback connection between outputs. The connections
between outputs are inhibitory type, shown by dotted lines, which means the
competitors never support themselves.
y_k = 1,  if v_k > v_j for all j, j ≠ k
    = 0,  otherwise

It means that if any neuron, say y_k, wants to win, then its induced local field
(the output of the summation unit), say v_k, must be the largest among all the other
neurons in the network.
• Condition of sum total of weight − Another constraint over the competitive
learning rule is, the sum total of weights to a particular output neuron is going
to be 1. For example, if we consider neuron k then −
∑_j w_kj = 1 for all k
• Change of weight for winner − If a neuron does not respond to the input
pattern, then no learning takes place in that neuron. However, if a particular
neuron wins, then the corresponding weights are adjusted as follows
Δw_kj = α(x_j − w_kj),  if neuron k wins
      = 0,               if neuron k loses
Perceptron
Developed by Frank Rosenblatt by using McCulloch and Pitts model, perceptron is
the basic operational unit of artificial neural networks. It employs supervised
learning rule and is able to classify the data into two classes.
Operational characteristics of the perceptron: It consists of a single neuron with an
arbitrary number of inputs along with adjustable weights, but the output of the
neuron is 1 or 0 depending upon the threshold. It also consists of a bias whose
weight is always 1. Following figure gives a schematic representation of the
perceptron.
Training Algorithm
Perceptron network can be trained for single output unit as well as multiple output
units.
Step 1 − Initialize the following to start the training −
• Weights
• Bias
• Learning rate α
For easy calculation and simplicity, weights and bias must be set equal to 0 and the
learning rate must be set equal to 1.
Step 2 − Continue step 3-8 when the stopping condition is not true.
Step 3 − Continue step 4-6 for every training vector x.
Step 4 − Activate each input unit as follows −
x_i = s_i (i = 1 to n)
Step 5 − Now obtain the net input with the following relation −
y_in = b + ∑_{i=1}^{n} x_i w_i
Here ‘b’ is bias and ‘n’ is the total number of input neurons.
Step 6 − Apply the following activation function to obtain the final output.
f(y_in) = 1,   if y_in > θ
        = 0,   if −θ ⩽ y_in ⩽ θ
        = −1,  if y_in < −θ

Step 7 − Adjust the weight and bias as follows −
Case 1 − if y ≠ t then,
w_i(new) = w_i(old) + α t x_i
b(new) = b(old) + α t
Case 2 − if y = t then,
No change in weight
Step 8 − Test for the stopping condition, which would happen when there is no
change in weight.
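A minimal Python (NumPy) sketch of this single-output training algorithm, assuming
bipolar (−1/+1) targets; all names are our own −

    import numpy as np

    def train_perceptron(samples, targets, theta=0.2, alpha=1.0, epochs=100):
        # samples: (N, n) array; targets: (N,) array of bipolar (-1/+1) labels.
        n = samples.shape[1]
        w, b = np.zeros(n), 0.0                  # Step 1: weights and bias set to 0
        for _ in range(epochs):                  # Step 2: until stopping condition
            changed = False
            for x, t in zip(samples, targets):   # Steps 3-4: every training vector
                y_in = b + np.dot(x, w)          # Step 5: net input
                y = 1 if y_in > theta else (-1 if y_in < -theta else 0)  # Step 6
                if y != t:                       # Step 7: adjust weights on error
                    w, b = w + alpha * t * x, b + alpha * t
                    changed = True
            if not changed:                      # Step 8: no weight change -> stop
                break
        return w, b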
The following diagram is the architecture of perceptron for multiple output classes.
Step 1 − Initialize the following to start the training −
• Weights
• Bias
• Learning rate α
For easy calculation and simplicity, weights and bias must be set equal to 0 and the
learning rate must be set equal to 1.
Step 2 − Continue step 3-8 when the stopping condition is not true.
Step 3 − Continue step 4-6 for every training vector x.
Step 4 − Activate each input unit as follows −
x_i = s_i (i = 1 to n)
Step 5 − Obtain the net input with the following relation −
y_inj = b_j + ∑_{i=1}^{n} x_i w_ij
Here ‘b’ is bias and ‘n’ is the total number of input neurons.
Step 6 − Apply the following activation function to obtain the final output for each
output unit j = 1 to m −
f(y_inj) = 1,   if y_inj > θ
         = 0,   if −θ ⩽ y_inj ⩽ θ
         = −1,  if y_inj < −θ

Step 7 − Adjust the weight and bias for i = 1 to n and j = 1 to m as follows −
Case 1 − if y_j ≠ t_j then,
w_ij(new) = w_ij(old) + α t_j x_i
b_j(new) = b_j(old) + α t_j
Case 2 − if y_j = t_j then,
No change in weight
Step 8 − Test for the stopping condition, which would happen when there is no
change in weight.
Architecture
The basic structure of Adaline is similar to perceptron having an extra feedback loop
with the help of which the actual output is compared with the desired/target output.
After comparison on the basis of training algorithm, the weights and bias will be
updated.
Training Algorithm
Step 1 − Initialize the following to start the training −
• Weights
• Bias
• Learning rate α
For easy calculation and simplicity, weights and bias must be set equal to 0 and the
learning rate must be set equal to 1.
Step 2 − Continue step 3-8 when the stopping condition is not true.
Step 3 − Continue step 4-6 for every bipolar training pair s:t.
Step 4 − Activate each input unit as follows −
x_i = s_i (i = 1 to n)
Step 5 − Obtain the net input with the following relation −
y_in = b + ∑_{i=1}^{n} x_i w_i
Here ‘b’ is bias and ‘n’ is the total number of input neurons.
Step 6 − Apply the following activation function to obtain the final output −
f(y_in) = 1,   if y_in ⩾ 0
        = −1,  if y_in < 0
Step 7 − Adjust the weight and bias as follows −
Case 1 − if y ≠ t then,
w_i(new) = w_i(old) + α(t − y_in)x_i

b(new) = b(old) + α(t − y_in)
Case 2 − if y = t then,
w_i(new) = w_i(old)

b(new) = b(old)
Here ‘y’ is the actual output and ‘t’ is the desired/target output.
(t − y_in) is the computed error.
Step 8 − Test for the stopping condition, which will happen when there is no change
in weight or the highest weight change occurred during training is smaller than the
specified tolerance.
Architecture
Training Algorithm
By now we know that only the weights and bias between the input and the Adaline
layer are to be adjusted, and the weights and bias between the Adaline and the
Madaline layer are fixed.
Step 1 − Initialize the following to start the training −
• Weights
• Bias
• Learning rate α
For easy calculation and simplicity, weights and bias must be set equal to 0 and the
learning rate must be set equal to 1.
Step 2 − Continue step 3-8 when the stopping condition is not true.
Step 3 − Continue step 4-7 for every bipolar training pair s:t.
Step 4 − Activate each input unit as follows −
x_i = s_i (i = 1 to n)
Step 5 − Obtain the net input at each hidden layer, i.e. the Adaline layer with the
following relation −
Q_inj = b_j + ∑_{i=1}^{n} x_i w_ij,  j = 1 to m
Here ‘b’ is bias and ‘n’ is the total number of input neurons.
Step 6 − Apply the following activation function to obtain the final output at the
Adaline and the Madaline layer −
f(x) = 1,   if x ⩾ 0
     = −1,  if x < 0
Output at the hidden (Adaline) unit
Q_j = f(Q_inj)
Final output of the network
y = f(y_in)

i.e. y_in = b_0 + ∑_{j=1}^{m} Q_j v_j
Step 7 − Calculate the error and adjust the weights as follows −
Case 1 − if y ≠ t and t = 1 then,
w_ij(new) = w_ij(old) + α(1 − Q_inj)x_i

b_j(new) = b_j(old) + α(1 − Q_inj)
In this case, the weights would be updated on Qj where the net input is close to 0
because t = 1.
Case 2 − if y ≠ t and t = -1 then,
w_ik(new) = w_ik(old) + α(−1 − Q_ink)x_i

b_k(new) = b_k(old) + α(−1 − Q_ink)
In this case, the weights would be updated on Qk where the net input is positive
because t = -1.
Here ‘y’ is the actual output and ‘t’ is the desired/target output.
Case 3 − if y = t then
There would be no change in weights.
Step 8 − Test for the stopping condition, which will happen when there is no change
in weight or the highest weight change occurred during training is smaller than the
specified tolerance.
As shown in the diagram, the architecture of BPN has three interconnected layers
having weights on them. The hidden layer as well as the output layer also has bias,
whose weight is always 1, on them. As is clear from the diagram, the working of
BPN is in two phases. One phase sends the signal from the input layer to the output
layer, and the other phase back propagates the error from the output layer to the
input layer.
Training Algorithm
For training, BPN will use binary sigmoid activation function. The training of BPN will
have the following three phases.
• Phase 1 − Feed Forward Phase
• Phase 2 − Back Propagation of error
• Phase 3 − Updating of weights
All these steps will be concluded in the algorithm as follows
Step 1 − Initialize the following to start the training −
• Weights
• Learning rate α
For easy calculation and simplicity, take some small random values.
Step 2 − Continue step 3-11 when the stopping condition is not true.
Step 3 − Continue step 4-10 for every training pair.
Phase 1
Step 4 − Each input unit receives input signal xi and sends it to the hidden unit for
all i = 1 to n
Step 5 − Calculate the net input at the hidden unit using the following relation −
Q_inj = b_0j + ∑_{i=1}^{n} x_i v_ij,  j = 1 to p
Here b0j is the bias on hidden unit, vij is the weight on j unit of the hidden layer
coming from i unit of the input layer.
Now calculate the net output by applying the following activation function
Q_j = f(Q_inj)
Send these output signals of the hidden layer units to the output layer units.
Step 6 − Calculate the net input at the output layer unit using the following relation −
y_ink = b_0k + ∑_{j=1}^{p} Q_j w_jk,  k = 1 to m
Here b0k is the bias on output unit, wjk is the weight on k unit of the output layer
coming from j unit of the hidden layer.
Calculate the net output by applying the following activation function
y_k = f(y_ink)
Phase 2
Step 7 − Compute the error correcting term, in correspondence with the target
pattern received at each output unit, as follows −
δ_k = (t_k − y_k) f′(y_ink)
On this basis, update the weight and bias as follows −
Δw_jk = α δ_k Q_j

Δb_0k = α δ_k
Then, send δ_k back to the hidden layer.
Step 8 − Now the net delta input at each hidden unit will be the sum of the delta
inputs received from the output units.
δ_inj = ∑_{k=1}^{m} δ_k w_jk
Error term can be calculated as follows −
δ_j = δ_inj f′(Q_inj)
On this basis, update the weight and bias as follows −
Δv_ij = α δ_j x_i

Δb_0j = α δ_j
Phase 3
Step 9 − Each output unit (y_k, k = 1 to m) updates the weight and bias as follows −
w_jk(new) = w_jk(old) + Δw_jk
b_0k(new) = b_0k(old) + Δb_0k
Step 10 − Each hidden unit (Q_j, j = 1 to p) updates the weight and bias as follows −
v_ij(new) = v_ij(old) + Δv_ij
b_0j(new) = b_0j(old) + Δb_0j
Step 11 − Check for the stopping condition, which may be either the number of
epochs reached or the target output matches the actual output.
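A compact Python (NumPy) sketch of one training pair passing through these three
phases with a single hidden layer; the array names follow the text, everything else
is our own assumption −

    import numpy as np

    def f(x):
        # Binary sigmoid activation used throughout training.
        return 1.0 / (1.0 + np.exp(-x))

    def train_bpn(x, t, v, b0j, w, b0k, alpha=0.25):
        # One training pair: x (n,), t (m,); v (n, p) input->hidden weights,
        # w (p, m) hidden->output weights; b0j, b0k the corresponding biases.
        # Phase 1: feed forward
        Q = f(b0j + x @ v)                       # Step 5: hidden-layer outputs
        y = f(b0k + Q @ w)                       # Step 6: output-layer outputs
        # Phase 2: back propagation of error; for the sigmoid, f'(z) = f(z)(1 - f(z))
        delta_k = (t - y) * y * (1 - y)          # Step 7: output error terms
        delta_j = (delta_k @ w.T) * Q * (1 - Q)  # Step 8: hidden error terms
        # Phase 3: updating of weights
        w += alpha * np.outer(Q, delta_k)        # Step 9: hidden->output weights
        b0k += alpha * delta_k
        v += alpha * np.outer(x, delta_j)        # Step 10: input->hidden weights
        b0j += alpha * delta_j
        return v, b0j, w, b0k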
Mathematical Formulation
For the activation function y_k = f(y_ink), the net input on the hidden layer as well
as on the output layer can be given by

y_ink = ∑_{j} Q_j w_jk

And Q_inj = ∑_{i} x_i v_ij

Now the error which has to be minimized is

E = (1/2) ∑_{k} [t_k − y_k]^2

By using the chain rule, we have

∂E/∂w_jk = ∂/∂w_jk ((1/2) ∑_{k} [t_k − y_k]^2)

= ∂/∂w_jk ((1/2) [t_k − f(y_ink)]^2)

= −[t_k − y_k] ∂/∂w_jk f(y_ink)

= −[t_k − y_k] f′(y_ink) ∂/∂w_jk (y_ink)

= −[t_k − y_k] f′(y_ink) Q_j

Now let us say δ_k = [t_k − y_k] f′(y_ink)

The weights on connections to the hidden unit Q_j can be given by −

∂E/∂v_ij = −∑_{k} δ_k ∂/∂v_ij (y_ink)

Putting the value of y_ink, we will get the following

δ_j = ∑_{k} δ_k w_jk f′(Q_inj)

Weight updating can be done as follows −

For the output unit −

Δw_jk = −α ∂E/∂w_jk = α δ_k Q_j

For the hidden unit −

Δv_ij = −α ∂E/∂v_ij = α δ_j x_i
Unsupervised Learning
As the name suggests, this type of learning is done without the supervision of a
teacher. This learning process is independent. During the training of ANN under
unsupervised learning, the input vectors of similar type are combined to form
clusters. When a new input pattern is applied, then the neural network gives an
output response indicating the class to which the input pattern belongs. In this,
there is no feedback from the environment as to what the desired output should be
and whether it is correct or incorrect. Hence, in this type of learning, the network
itself must discover the patterns and features in the input data, and the relation
between the input data and the output.
Winner-Takes-All Networks
These kinds of networks are based on the competitive learning rule and use the
strategy of choosing the neuron with the greatest total input as the winner. The
connections between the output neurons show the competition between them: one
of them would be ‘ON’, which means it would be the winner, and the others would
be ‘OFF’.
Following are some of the networks based on this simple concept using
unsupervised learning.
Hamming Network
Max Net
This is also a fixed weight network, which serves as a subnet for selecting the node
having the highest input. All the nodes are fully interconnected and there exists
symmetrical weights in all these weighted interconnections.
Architecture
It uses the mechanism which is an iterative process and each node receives
inhibitory inputs from all other nodes through connections. The single node whose
value is maximum would be active or winner and the activations of all other nodes
would be inactive. Max Net uses identity activation function with
f(x) = x,  if x > 0
     = 0,  if x ≤ 0
The task of this net is accomplished by the self-excitation weight of +1 and the
mutual inhibition magnitude, which is set like [0 < ε < 1/m], where “m” is the total
number of nodes.
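A minimal Python (NumPy) sketch of the Max Net iteration (all names are our own) −

    import numpy as np

    def maxnet(a, eps=0.05, max_iter=100):
        # a: initial activations of the m nodes; requires 0 < eps < 1/m.
        for _ in range(max_iter):
            # Self-excitation weight +1, mutual inhibition -eps from every other node.
            a = np.maximum(0.0, a - eps * (a.sum() - a))
            if np.count_nonzero(a) <= 1:     # a single winner remains active
                break
        return a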
This network is just like a single layer feed-forward network having feedback
connection between the outputs. The connections between the outputs are
inhibitory type, which is shown by dotted lines, which means the competitors never
support themselves.
Mathematical Formulation
Following are the three important factors for mathematical formulation of this
learning rule −
• Condition to be a winner
Suppose if a neuron yk wants to be the winner, then there would be the
following condition
y_k = 1,  if v_k > v_j for all j, j ≠ k
    = 0,  otherwise
It means that if any neuron, say, yk wants to win, then its induced local field
(the output of the summation unit), say vk, must be the largest among all the
other neurons in the network.
• Condition of the sum total of weight
Another constraint over the competitive learning rule is the sum total of
weights to a particular output neuron is going to be 1. For example, if we
consider neuron k then
∑_j w_kj = 1 for all k
• Change of weight for the winner
If a neuron does not respond to the input pattern, then no learning takes
place in that neuron. However, if a particular neuron wins, then the
corresponding weights are adjusted as follows −
Δw_kj = α(x_j − w_kj),  if neuron k wins
      = 0,               if neuron k loses
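A minimal Python (NumPy) sketch of one step of this rule, using the standard winner
update (all names are our own) −

    import numpy as np

    def competitive_update(W, x, alpha=0.1):
        # W: (m, n) weights, one row per output neuron; x: input vector.
        v = W @ x                        # induced local fields
        k = int(np.argmax(v))            # condition to be a winner
        W[k] += alpha * (x - W[k])       # only the winner's weights move toward x
        return W, k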
K-means is one of the most popular clustering algorithms, in which we use the
concept of a partitioning procedure. We start with an initial partition and repeatedly
move patterns from one cluster to another until we get a satisfactory result.
Algorithm
Step 1 − Select k points as the initial centroids. Initialize k prototypes (w_1,…,w_k),
for example by identifying them with randomly chosen input vectors −

w_j = i_p, where j ∈ {1,…,k} and p ∈ {1,…,n}
Each cluster Cj is associated with prototype wj.
Step 2 − Repeat step 3-5 until E no longer decreases, or the cluster membership no
longer changes.
Step 3 − For each input vector ip where p ∈ {1,…,n}, put ip in the cluster Cj* with the
nearest prototype wj* having the following relation
|i_p − w_j*| ≤ |i_p − w_j|,  j ∈ {1,…,k}
Step 4 − For each cluster Cj, where j ∈ { 1,…,k}, update the prototype wj to be the
centroid of all samples currently in Cj , so that
w_j = (∑_{i_p ∈ C_j} i_p) / |C_j|
Step 5 − Compute the total quantization error as follows −
E = ∑_{j=1}^{k} ∑_{i_p ∈ C_j} |i_p − w_j|^2
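A minimal Python (NumPy) sketch of the algorithm above (all names are our own) −

    import numpy as np

    def kmeans(inputs, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1: initialize prototypes with k randomly chosen input vectors.
        w = inputs[rng.choice(len(inputs), k, replace=False)].copy()
        for _ in range(max_iter):
            # Step 3: assign each input to the cluster with the nearest prototype.
            labels = np.argmin(((inputs[:, None] - w[None]) ** 2).sum(-1), axis=1)
            # Step 4: move each prototype to the centroid of its cluster.
            new_w = np.array([inputs[labels == j].mean(0) if np.any(labels == j)
                              else w[j] for j in range(k)])
            if np.allclose(new_w, w):        # Step 2: stop when nothing changes
                break
            w = new_w
        # Step 5: total quantization error.
        E = ((inputs - w[labels]) ** 2).sum()
        return w, labels, E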
Neocognitron
It is a multilayer feedforward network, which was developed by Fukushima in the 1980s.
This model is based on supervised learning and is used for visual pattern
recognition, mainly hand-written characters. It is basically an extension of Cognitron
network, which was also developed by Fukushima in 1975.
Architecture
Training Algorithm
Calculations in S-cell
The S-cell receives an excitatory signal from the previous layer and inhibitory
signals from within the same layer.
θ = √(∑ ∑ t_i c_i^2)
Here, ti is the fixed weight and ci is the output from C-cell.
The scaled input of S-cell can be calculated as follows −
x = (1 + e)/(1 + v w_0) − 1
Here, e = ∑_i c_i w_i
wi is the weight adjusted from C-cell to S-cell.
w0 is the weight adjustable between the input and S-cell.
v is the excitatory input from C-cell.
The activation of the output signal is,
s = x,  if x ≥ 0
  = 0,  if x < 0
Calculations in C-cell
Architecture
The following figure shows the architecture of LVQ, which is quite similar to the
architecture of KSOM. As we can see, there are “n” input units and “m” output
units. The layers are fully interconnected, with weights on the connections.
Parameters Used
Following are the parameters used in LVQ training process as well as in the
flowchart
• x = training vector (x1,...,xi,...,xn)
• T = class for training vector x
• wj = weight vector for jth output unit
• Cj = class associated with the jth output unit
Training Algorithm
Step 1 − Initialize reference vectors, which can be done as follows −
• Step 1(a) − From the given set of training vectors, take the first “m” (number
of clusters) training vectors and use them as weight vectors. The remaining
vectors can be used for training.
• Step 1(b) − Assign the initial weight and classification randomly.
• Step 1(c) − Apply K-means clustering method.
Step 2 − Initialize the learning rate α.
Step 3 − Continue with steps 4-9, if the condition for stopping this algorithm is not
met.
Step 4 − Follow steps 5-6 for every training input vector x.
Step 5 − Calculate the square of the Euclidean distance for j = 1 to m −

D(j) = ∑_{i=1}^{n} (x_i − w_ij)^2
Step 6 − Obtain the winning unit J where D(j) is minimum.
Step 7 − Calculate the new weight of the winning unit by the following relation −
if T = C_j then w_j(new) = w_j(old) + α[x − w_j(old)]

if T ≠ C_j then w_j(new) = w_j(old) − α[x − w_j(old)]
Step 8 − Reduce the learning rate α.
Step 9 − Test for the stopping condition. It may be a specified number of epochs
reached or the learning rate reduced to a negligible value.
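A minimal Python (NumPy) sketch of this LVQ training loop; the names are our own,
and the stopping condition is simplified to a fixed number of epochs −

    import numpy as np

    def train_lvq(X, T, w, C, alpha=0.1, decay=0.99, epochs=50):
        # X: (N, n) training vectors; T: (N,) class labels;
        # w: (m, n) reference vectors; C: (m,) class of each output unit.
        for _ in range(epochs):                    # Steps 3-4
            for x, t in zip(X, T):
                D = ((x - w) ** 2).sum(axis=1)     # Step 5: squared distances
                J = int(np.argmin(D))              # Step 6: winning unit
                if t == C[J]:                      # Step 7: move toward or away
                    w[J] += alpha * (x - w[J])
                else:
                    w[J] -= alpha * (x - w[J])
            alpha *= decay                         # Step 8: reduce learning rate
        return w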
Flowchart
Variants
Three other variants, namely LVQ2, LVQ2.1 and LVQ3, have been developed by
Kohonen. Because the runner-up unit learns along with the winner in these three
variants, their complexity is greater than that of LVQ.
LVQ2
As discussed above for the other variants of LVQ, the condition of LVQ2 is formed
by a window. This window is based on the following parameters −
• x − the current input vector
• yc − the reference vector closest to x
• yr − the other reference vector, which is next closest to x
• dc − the distance from x to yc
• dr − the distance from x to yr
The input vector x falls in the window, if
d_c/d_r > 1 − θ and d_r/d_c > 1 + θ
Here, θ is a small positive constant that defines the width of the window.
Updating can be done with the following formula −
y_c(t + 1) = y_c(t) − α(t)[x(t) − y_c(t)]  (if y_c belongs to a different class than x)

y_r(t + 1) = y_r(t) + α(t)[x(t) − y_r(t)]  (if y_r belongs to the same class as x)

Here, α is the learning rate.
LVQ2.1
In LVQ2.1, we will take the two closest vectors, namely y_c1 and y_c2, and the
condition for the window is as follows −

Min[d_c1/d_c2, d_c2/d_c1] > (1 − θ)

Max[d_c1/d_c2, d_c2/d_c1] < (1 + θ)
Updating can be done with the following formula −
y_c1(t + 1) = y_c1(t) + α(t)[x(t) − y_c1(t)]  (if y_c1 belongs to the same class as x)

y_c2(t + 1) = y_c2(t) − α(t)[x(t) − y_c2(t)]  (if y_c2 belongs to a different class than x)

Here, α is the learning rate.
LVQ3
In LVQ3, we will take the two closest vectors, namely y_c1 and y_c2, and the
condition for the window is as follows −

Min[d_c1/d_c2, d_c2/d_c1] > (1 − θ)(1 + θ)

Here, θ ≈ 0.2
Updating can be done with the following formula −
y_c1(t + 1) = y_c1(t) + β(t)[x(t) − y_c1(t)]

y_c2(t + 1) = y_c2(t) + β(t)[x(t) − y_c2(t)]

(when both y_c1 and y_c2 belong to the same class as x)
Here, β is a multiple of the learning rate α, i.e. β = m α(t), for every 0.1 < m < 0.5
Adaptive Resonance Theory
This network was developed by Stephen Grossberg and Gail Carpenter in 1987. It
is based on competition and uses an unsupervised learning model. Adaptive
Resonance Theory (ART) networks, as the name suggests, are always open to new
learning (adaptive) without losing the old patterns (resonance). Basically, an ART
network is a vector classifier which accepts an input vector and classifies it into one
of the categories depending upon which of the stored patterns it resembles the most.
Operating Principle
The main operation of ART classification can be divided into the following phases −
• Recognition phase − The input vector is compared with the classification
presented at every node in the output layer. The output of the neuron
becomes “1” if it best matches with the classification applied, otherwise it
becomes “0”.
• Comparison phase − In this phase, a comparison of the input vector to the
comparison layer vector is done. The condition for reset is that the degree of
similarity would be less than vigilance parameter.
• Search phase − In this phase, the network will search for reset as well as the
match done in the above phases. Hence, if there would be no reset and the
match is quite good, then the classification is over. Otherwise, the process
would be repeated and the other stored pattern must be sent to find the
correct match.
ART1
It is a type of ART, which is designed to cluster binary vectors. We can understand
it with the help of its architecture.
Architecture of ART1
Algorithm
Step 1 − Initialize the learning rate, the vigilance parameter, and the weights as
follows −
α > 1 and 0 < ρ ≤ 1

0 < b_ij(0) < α/(α − 1 + n) and t_ij(0) = 1
Step 2 − Continue step 3-9, when the stopping condition is not true.
Step 3 − Continue step 4-6 for every training input.
Step 4 − Set the activations of all F2 units to 0 and the activations of the F1(a)
units to the input vector.
Step 5 − Input signal from F1(a) to F1(b) layer must be sent like
s_i = x_i
Step 6 − For every F2 node that is not inhibited (i.e. y_j ≠ −1), calculate

y_j = ∑_i b_ij x_i
Step 7 − Perform step 8-10, when the reset is true.
Step 8 − Find J such that y_J ≥ y_j for all nodes j
Step 9 − Again calculate the activation on F1(b) as follows
x_i = s_i t_Ji
Step 10 − Now, after calculating the norm of vector x and vector s, we need to
check the reset condition as follows −
If ‖x‖/‖s‖ < vigilance parameter ρ, then inhibit node J and go to Step 7.

Else if ‖x‖/‖s‖ ≥ vigilance parameter ρ, then proceed further.
Step 11 − Weight updating for node J can be done as follows −
b_ij(new) = α x_i / (α − 1 + ‖x‖)

t_ij(new) = x_i
Step 12 − Check the stopping condition for the algorithm; it may be no change in
weights or a maximum number of epochs reached.
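A compact Python (NumPy) sketch of this ART1 loop for binary input vectors; the
names and the fixed pool of F2 nodes are our own assumptions −

    import numpy as np

    def art1(inputs, n, alpha=2.0, rho=0.7, max_nodes=10):
        # Step 1: b = bottom-up weights, t = top-down weights, initial values.
        b = np.full((max_nodes, n), alpha / (alpha - 1 + n))
        t = np.ones((max_nodes, n))
        for s in inputs:                              # Steps 3-5: present each input
            inhibited = np.zeros(max_nodes, bool)
            while True:
                y = np.where(inhibited, -1.0, b @ s)  # Step 6: uninhibited nodes
                J = int(np.argmax(y))                 # Step 8: largest activation
                x = s * t[J]                          # Step 9: F1(b) activation
                if x.sum() / max(s.sum(), 1e-12) >= rho:   # Step 10: reset test
                    # Step 11: update the weights of the accepted node J.
                    b[J] = alpha * x / (alpha - 1 + x.sum())
                    t[J] = x
                    break
                inhibited[J] = True                   # inhibit J and search again
                if inhibited.all():
                    break
        return b, t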
Rectangular Grid Topology
This topology has 24 nodes in the distance-2 grid, 16 nodes in the distance-1 grid,
and 8 nodes in the distance-0 grid, which means the difference between each
rectangular grid is 8 nodes. The winning unit is indicated by #.
Hexagonal Grid Topology
This topology has 18 nodes in the distance-2 grid, 12 nodes in the distance-1 grid,
and 6 nodes in the distance-0 grid, which means the difference between each
hexagonal grid is 6 nodes. The winning unit is indicated by #.
Architecture
The architecture of KSOM is similar to that of the competitive network. With the help
of neighborhood schemes, discussed earlier, the training can take place over the
extended region of the network.
Algorithm for training
Step 1 − Initialize the weights, the learning rate α and the neighborhood topological
scheme.
Step 2 − Continue step 3-9, when the stopping condition is not true.
Step 3 − Continue step 4-6 for every input vector x.
Step 4 − Calculate the square of the Euclidean distance for j = 1 to m −

D(j) = ∑_{i=1}^{n} (x_i − w_ij)^2
Step 5 − Obtain the winning unit J where D(j) is minimum.
Step 6 − Calculate the new weight of the winning unit by the following relation −
w_ij(new) = w_ij(old) + α[x_i − w_ij(old)]
Step 7 − Update the learning rate α by the following relation −
α(t + 1) = 0.5 α(t)
Step 8 − Reduce the radius of topological scheme.
Step 9 − Check for the stopping condition for the network.
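A minimal Python (NumPy) sketch of this training loop, simplified to a neighborhood
radius of zero, i.e. only the winner is updated (all names are our own) −

    import numpy as np

    def train_ksom(X, m, n, alpha=0.5, epochs=20, seed=0):
        # X: (N, n) inputs; m: number of map units.
        rng = np.random.default_rng(seed)
        w = rng.random((m, n))                     # Step 1: initialize weights
        for _ in range(epochs):                    # Step 2
            for x in X:                            # Step 3
                D = ((x - w) ** 2).sum(axis=1)     # Step 4: squared distances
                J = int(np.argmin(D))              # Step 5: winning unit
                w[J] += alpha * (x - w[J])         # Step 6: update the winner
            alpha *= 0.5                           # Step 7: alpha(t+1) = 0.5 alpha(t)
        return w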
Architecture
Training Algorithm
For training, this network uses the Hebb or delta learning rule.
Step 1 − Initialize all the weights to zero as wij = 0 (i = 1 to n, j = 1 to n)
Step 2 − Perform steps 3-4 for each input vector.
Step 3 − Activate each input unit as follows −
x_i = s_i (i = 1 to n)
Step 4 − Activate each output unit as follows −
y_j = s_j (j = 1 to n)
Step 5 − Adjust the weights as follows −
w_ij(new) = w_ij(old) + x_i y_j
Testing Algorithm
Step 1 − Set the weights obtained during training for Hebb’s rule.
Step 2 − Perform steps 3-5 for each input vector.
Step 3 − Set the activation of the input units equal to that of the input vector.
Step 4 − Calculate the net input to each output unit j = 1 to n
y_inj = ∑_{i=1}^{n} x_i w_ij
Step 5 − Apply the following activation function to calculate the output
y_j = f(y_inj) = +1,  if y_inj > 0
               = −1,  if y_inj ⩽ 0
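A minimal Python (NumPy) sketch of storing and recalling bipolar patterns with this
network (all names are our own) −

    import numpy as np

    def store(patterns):
        # Hebb rule: w_ij(new) = w_ij(old) + x_i * y_j, with y = x (auto-association).
        n = patterns.shape[1]
        W = np.zeros((n, n))
        for s in patterns:                 # bipolar vectors of -1/+1
            W += np.outer(s, s)
        return W

    def recall(W, x):
        y_in = x @ W                       # Step 4: net input to each output unit
        return np.where(y_in > 0, 1, -1)   # Step 5: activation function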
Architecture
Training Algorithm
For training, this network uses the Hebb or delta learning rule.
Step 1 − Initialize all the weights to zero as wij = 0 (i = 1 to n, j = 1 to m)
Step 2 − Perform steps 3-4 for each input vector.
Step 3 − Activate each input unit as follows −
x_i = s_i (i = 1 to n)
Step 4 − Activate each output unit as follows −
y_j = s_j (j = 1 to m)
Step 5 − Adjust the weights as follows −
w_ij(new) = w_ij(old) + x_i y_j
Testing Algorithm
Step 1 − Set the weights obtained during training for Hebb’s rule.
Step 2 − Perform steps 3-5 for each input vector.
Step 3 − Set the activation of the input units equal to that of the input vector.
Step 4 − Calculate the net input to each output unit j = 1 to m;
y_inj = ∑_{i=1}^{n} x_i w_ij
Step 5 − Apply the following activation function to calculate the output
y_j = f(y_inj) = +1,  if y_inj > 0
               = 0,   if y_inj = 0
               = −1,  if y_inj < 0
Architecture
Following are some important points to keep in mind about discrete Hopfield
network −
• This model consists of neurons with one inverting and one non-inverting
output.
• The output of each neuron should be an input to the other neurons but not
an input to itself.
• Weight/connection strength is represented by wij.
• Connections can be excitatory as well as inhibitory. It would be excitatory, if
the output of the neuron is same as the input, otherwise inhibitory.
• Weights should be symmetrical, i.e. wij = wji
The output from Y1 going to Y2, Yi and Yn have the
weights w12, w1i and w1n respectively. Similarly, other arcs have the weights on them.
Training Algorithm
Testing Algorithm
Step 1 − Initialize the weights, which are obtained from training algorithm by using
Hebbian principle.
Step 2 − Perform steps 3-9 while the activations of the network have not converged.
Step 3 − For each input vector X, perform steps 4-8.
Step 4 − Make initial activation of the network equal to the external input
vector X as follows −
y_i = x_i for i = 1 to n
Step 5 − For each unit Yi, perform steps 6-9.
Step 6 − Calculate the net input of the network as follows −
y_ini = x_i + ∑_j y_j w_ji
Step 7 − Apply the activation as follows over the net input to calculate the output −
y_i = 1,    if y_ini > θ_i
    = y_i,  if y_ini = θ_i
    = 0,    if y_ini < θ_i

Here, θ_i is the threshold.
Step 8 − Broadcast this output yi to all other units.
Step 9 − Test the network for convergence.
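A minimal Python (NumPy) sketch of this testing procedure for binary units (all
names are our own) −

    import numpy as np

    def hopfield_recall(W, x, theta=0.0, max_sweeps=100, seed=0):
        # W: symmetric weight matrix with zero diagonal; x: external input vector.
        rng = np.random.default_rng(seed)
        y = x.astype(float).copy()                 # Step 4: initial activations
        for _ in range(max_sweeps):
            prev = y.copy()
            for i in rng.permutation(len(y)):      # Step 5: update units one by one
                y_in = x[i] + y @ W[:, i]          # Step 6: net input
                if y_in > theta:                   # Step 7: activation
                    y[i] = 1.0
                elif y_in < theta:
                    y[i] = 0.0                     # unchanged when y_in == theta
            if np.array_equal(y, prev):            # Step 9: converged
                break
        return y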
E_f = (1/2) ∑_{i=1}^{n} ∑_{j=1, j≠i}^{n} y_i y_j w_ij − ∑_{i=1}^{n} x_i y_i + (1/λ) ∑_{i=1}^{n} ∑_{j=1, j≠i}^{n} w_ij g_ri ∫_0^{y_i} a^{−1}(y) dy

Here, λ is the gain parameter and g_ri the input conductance.
Boltzmann Machine
These are stochastic learning processes having recurrent structure and are the
basis of the early optimization techniques used in ANN. Boltzmann Machine was
invented by Geoffrey Hinton and Terry Sejnowski in 1985. More clarity can be
observed in the words of Hinton on Boltzmann Machine.
“A surprising feature of this network is that it uses only locally available information.
The change of weight depends only on the behavior of the two units it connects,
even though the change optimizes a global measure” - Ackley, Hinton 1985.
Some important points about Boltzmann Machine −
• They use recurrent structure.
• They consist of stochastic neurons, which have one of the two possible
states, either 1 or 0.
• Some of the neurons in this are adaptive (free state) and some are clamped
(frozen state).
• If we apply simulated annealing on discrete Hopfield network, then it would
become Boltzmann Machine.
Architecture
The following diagram shows the architecture of Boltzmann machine. It is clear from
the diagram, that it is a two-dimensional array of units. Here, weights on
interconnections between units are –p where p > 0. The weights of self-connections
are given by b where b > 0.
Training Algorithm
As Boltzmann machines have fixed weights, there is no training algorithm, since we
do not need to update the weights in the network. However, to test the network we
have to set the weights and find the consensus function (CF).
Boltzmann machine has a set of units Ui and Uj and has bi-directional connections
on them.
• We are considering the fixed weight say wij.
• wij ≠ 0 if Ui and Uj are connected.
• There also exists a symmetry in weighted interconnection, i.e. wij = wji.
• wii also exists, i.e. there would be the self-connection between units.
• For any unit Ui, its state ui would be either 1 or 0.
The main objective of Boltzmann Machine is to maximize the Consensus Function
(CF) which can be given by the following relation
CF = ∑_i ∑_{j⩽i} w_ij u_i u_j
Now, when the state changes from either 1 to 0 or from 0 to 1, then the change in
consensus can be given by the following relation −
ΔCF = (1 − 2u_i)(w_ii + ∑_{j≠i} u_j w_ij)
Here ui is the current state of Ui.
The variation in coefficient (1 - 2ui) is given by the following relation −
(1 − 2u_i) = +1,  if U_i is currently off
           = −1,  if U_i is currently on
Generally, unit Ui does not change its state, but if it does then the information would
be residing local to the unit. With that change, there would also be an increase in
the consensus of the network.
Probability of the network to accept the change in the state of the unit is given by
the following relation −
AF(i, T) = 1 / (1 + exp[−ΔCF(i)/T])
Here, T is the controlling parameter. It will decrease as CF reaches the maximum
value.
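A minimal Python (NumPy) sketch of the consensus change and the acceptance
probability for a single unit flip (all names are our own) −

    import numpy as np

    def accept_flip(u, W, i, T, rng):
        # Change in consensus if unit i flips: (1 - 2u_i)(w_ii + sum_{j!=i} u_j w_ij).
        others = np.dot(np.delete(W[i], i), np.delete(u, i))
        delta_cf = (1 - 2 * u[i]) * (W[i, i] + others)
        # Probability of accepting the state change at temperature T.
        af = 1.0 / (1.0 + np.exp(-delta_cf / T))
        return rng.random() < af

Here, rng can be created with np.random.default_rng(), and T is decreased as CF
approaches its maximum value.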
Testing Algorithm
Brain-State-in-a-Box Network
The Brain-State-in-a-Box (BSB) neural network is a nonlinear auto-associative
neural network and can be extended to hetero-association with two or more layers.
It is also similar to Hopfield network. It was proposed by J.A. Anderson, J.W.
Silverstein, S.A. Ritz and R.S. Jones in 1977.
Some important points to remember about BSB Network −
• It is a fully connected network with the maximum number of nodes depending
upon the dimensionality n of the input space.
• All the neurons are updated simultaneously.
• Neurons take values between -1 to +1.
Mathematical Formulations
The node function used in BSB network is a ramp function, which can be defined as
follows −
f(net) = min(1, max(−1, net))
This ramp function is bounded and continuous.
As we know that each node would change its state, it can be done with the help of
the following mathematical relation −
x_i(t + 1) = f(∑_{j=1}^{n} w_ij x_j(t))
Here, xi(t) is the state of the ith node at time t.
Weights from ith node to jth node can be measured with the following relation −
w_ij = (1/P) ∑_{p=1}^{P} (v_p,i v_p,j)
Here, P is the number of training patterns, which are bipolar.
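A minimal Python (NumPy) sketch of these two relations (all names are our own) −

    import numpy as np

    def bsb_weights(patterns):
        # w_ij = (1/P) * sum_p v_p,i * v_p,j over P bipolar training patterns.
        P = len(patterns)
        return sum(np.outer(v, v) for v in patterns) / P

    def bsb_step(W, x):
        # Ramp function f(net) = min(1, max(-1, net)), applied to all nodes at once.
        return np.clip(W @ x, -1.0, 1.0)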
Matrix Representation
Actually, each tour of an n-city TSP can be expressed as an n × n matrix whose ith
row describes the ith city’s location. This matrix, M, for 4 cities A, B, C, D can be
expressed as follows −

M = [ A: 1 0 0 0
      B: 0 1 0 0
      C: 0 0 1 0
      D: 0 0 0 1 ]
To be the optimized solution, the energy function must be minimum. On the basis of
the following constraints, we can calculate the energy function as follows −
Constraint-I
First constraint, on the basis of which we will calculate the energy function, is that
one element must be equal to 1 in each row of matrix M and the other elements in
each row must be equal to 0, because each city can occur in only one position in
the TSP tour. This constraint can mathematically be written as follows −
∑_{j=1}^{n} M_x,j = 1 for x ∈ {1,…,n}
Now the energy function to be minimized, based on the above constraint, will
contain a term proportional to −
∑_{x=1}^{n} (1 − ∑_{j=1}^{n} M_x,j)^2
Constraint-II
As we know, in TSP one city can occur in any position in the tour; hence, in each
column of matrix M, one element must be equal to 1 and the other elements must
be equal to 0. This constraint can mathematically be written as follows −
∑_{x=1}^{n} M_x,j = 1 for j ∈ {1,…,n}
Now the energy function to be minimized, based on the above constraint, will
contain a term proportional to −
∑_{j=1}^{n} (1 − ∑_{x=1}^{n} M_x,j)^2
Hence, the energy function to be minimized, combining both constraints, will
contain a term proportional to −

[γ_1 ∑_x (1 − ∑_i M_x,i)^2 + γ_2 ∑_i (1 − ∑_x M_x,i)^2]
Here, γ1 and γ2 are two weighing constants.
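A minimal Python (NumPy) sketch of this combined constraint term (all names are
our own) −

    import numpy as np

    def tsp_constraint_energy(M, gamma1=1.0, gamma2=1.0):
        # Penalty that is zero only when every row and every column of the
        # tour matrix M sums to exactly 1.
        rows = ((1.0 - M.sum(axis=1)) ** 2).sum()
        cols = ((1.0 - M.sum(axis=0)) ** 2).sum()
        return gamma1 * rows + gamma2 * cols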
We can understand the main working idea of gradient descent with the help of the
following steps −
• First, start with an initial guess of the solution.
• Then, take the gradient of the function at that point.
• Later, repeat the process by stepping the solution in the negative direction of
the gradient.
By following the above steps, the algorithm will eventually converge where the
gradient is zero.
Mathematical Concept
Suppose we have a function f(x) and we are trying to find the minimum of this
function. Following are the steps to find the minimum of f(x).
• First, give some initial value x_0 for x.
• Now take the gradient ∇f of the function, with the intuition that the gradient
gives the slope of the curve at that x and that its direction points to the
increase in the function, in order to find out the best direction to minimize it.
• Now change x as follows −
x_{n+1} = x_n − θ ∇f(x_n)
Here, θ > 0 is the training rate (step size) that forces the algorithm to take small
jumps.
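A minimal Python (NumPy) sketch of this procedure, stopping when the step
becomes tiny (all names are our own) −

    import numpy as np

    def gradient_descent(grad_f, x0, theta=0.1, tol=1e-6, max_iter=1000):
        # x_{n+1} = x_n - theta * grad_f(x_n); stop when the step is negligible.
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            step = theta * grad_f(x)
            x = x - step
            if np.linalg.norm(step) < tol:
                break
        return x

    # Example: minimum of f(x) = (x - 3)^2, whose gradient is 2(x - 3).
    x_min = gradient_descent(lambda x: 2 * (x - 3), x0=[0.0])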
Simulated Annealing
The basic concept of Simulated Annealing (SA) is motivated by the annealing in
solids. In the process of annealing, if we heat a metal above its melting point and
cool it down then the structural properties will depend upon the rate of cooling. We
can also say that SA simulates the metallurgy process of annealing.
Use in ANN
Algorithm
Advantages of GAs
GAs have various advantages which have made them immensely popular. These
include −
• Does not require any derivative information (which may not be available for
many real-world problems).
• Is faster and more efficient as compared to the traditional methods.
• Has very good parallel capabilities.
• Optimizes both continuous and discrete functions as well as multi-objective
problems.
• Provides a list of “good” solutions and not just a single solution.
• Always gets an answer to the problem, which gets better over the time.
• Useful when the search space is very large and there are large number of
parameters involved.
Limitations of GAs
Like any technique, GAs also suffer from a few limitations. These include −
• GAs are not suited for all problems, especially problems which are simple and
for which derivative information is available.
• Fitness value is calculated repeatedly, which might be computationally
expensive for some problems.
• Being stochastic, there are no guarantees on the optimality or the quality of
the solution.
• If not implemented properly, GA may not converge to the optimal solution.
GA – Motivation
Genetic Algorithms have the ability to deliver a “good-enough” solution “fast-
enough”. This makes GAs attractive for use in solving optimization problems. The
reasons why GAs are needed are as follows −
In computer science, there is a large set of problems, which are NP-Hard. What this
essentially means is that, even the most powerful computing systems take a very
long time (even years!) to solve that problem. In such a scenario, GAs prove to be
an efficient tool to provide usable near-optimal solutions in a short amount of
time.
Some difficult problems like the Travelling Salesman Problem (TSP), have real-
world applications like path finding and VLSI Design. Now imagine that you are
using your GPS Navigation system, and it takes a few minutes (or even a few
hours) to compute the “optimal” path from the source to destination. Delay in such
real-world applications is not acceptable and therefore a “good-enough” solution,
which is delivered “fast” is what is required.
Areas of Application
Following are some of the areas where ANN is being used. It suggests that ANN
has an interdisciplinary approach in its development and applications.
Speech Recognition
Character Recognition
Signatures are one of the most useful ways to authorize and authenticate a person
in legal transactions. Signature verification technique is a non-vision based
technique.
For this application, the first approach is to extract the feature or rather the
geometrical feature set representing the signature. With these feature sets, we have
to train the neural networks using an efficient neural network algorithm. This trained
neural network will classify the signature as being genuine or forged under the
verification stage.
Human Face Recognition
It is one of the biometric methods used to identify a given face. It is a challenging
task because of the difficulty of characterizing “non-face” images. However, if a
neural network is well trained, then it can classify images into two classes, namely
images having faces and images that do not have faces.
First, all the input images must be preprocessed. Then, the dimensionality of that
image must be reduced. And, at last it must be classified using neural network
training algorithm. Following neural networks are used for training purposes with
preprocessed image −
• Fully-connected multilayer feed-forward neural network trained with the help
of back-propagation algorithm.
• For dimensionality reduction, Principal Component Analysis (PCA) is used.