Neural Network Notes Unit 1
Humans have made several attempts to mimic biological systems, and one of them
is the artificial neural network, inspired by the biological neural networks in living
organisms. However, the artificial version differs from its model in several ways, just as
airplanes, which were inspired by birds, and cars, which were inspired by four-legged
animals, do not work the way their natural models do.
The artificial counterparts are often more powerful and make our lives better.
Perceptrons, the predecessors of today's artificial neurons, were created to mimic
certain parts of a biological neuron, such as the dendrites, axon, and cell body, using
mathematical models, electronics, and whatever limited information we have
about biological neural networks.
The neuron is the fundamental building block of neural networks. In biological
systems, a neuron is a cell like any other cell of the body: it carries a DNA code
and is generated in the same way as the other cells. Although the DNA differs between
organisms, the function of the neuron is similar in all of them. A neuron comprises three
major parts: the cell body (also called the soma), the dendrites, and the axon. The dendrites
are fibers that branch in different directions and are connected to many cells in that
cluster.
Dendrites receive signals from surrounding neurons, and the axon transmits the
signal to other neurons. At the terminal end of the axon, contact with the
dendrite of the next neuron is made through a synapse. The axon is a long fiber that
transports the output signal as electric impulses along its length. Each neuron has one
axon. Axons pass impulses from one neuron to another like a domino effect.
ANN Structure
ANNs contain artificial neurons. There can be any number of neurons, depending on the
requirements of the application. These neurons are grouped into layers. The most commonly
used ANN structure comprises an input layer, one or more hidden layers, and an output layer.
Biological neurons communicate with each other by sending electric pulses, whereas ANNs
pass information across layers and nodes. At the nodes, the importance of each input signal
is determined by its associated weight. The value of a weight can be positive or negative. A
positive weight makes the input push the neuron towards activation, whereas a negative weight
makes the input inhibit it. A neuron multiplies each input by the associated weight and sums
the results.
ANN Layers
ANNs are arranged in layers, and the nodes of adjacent layers are interconnected. Each node is
characterized by a specific activation function. Typically, an ANN has the following layers.
Input layer
It receives the data, which is usually in the form of vectors. The vector may contain any number
of parameters. Generally, the number of input nodes in the input layer is equal to the number of
parameters in the input vector. The input layer feeds the data to the subsequent hidden layers.
The input nodes do not change the data; they simply check that the data is in a valid
format and then pass it along to the next layer.
Hidden Layer
The main processing happens in the hidden layers. The number of hidden layers can vary. The
incoming information passes through weighted connections in the hidden layer, where the input
values are multiplied with weights. Subsequently, the weighted inputs are summed up to
produce a single number.
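The weighted sum described above can be sketched in a few lines of Python. This is only an illustration: the function name `layer_net_input`, the example numbers, and the use of a separate bias per node are assumptions, not part of the notes.

```python
def layer_net_input(inputs, weights, biases):
    """Return one weighted sum per node of a layer.

    weights[k] holds the weights of node k, one weight per input value.
    """
    return [
        sum(w * x for w, x in zip(node_weights, inputs)) + b
        for node_weights, b in zip(weights, biases)
    ]

# Illustrative values: three inputs feeding two hidden nodes.
x = [0.5, -1.0, 2.0]
W = [[0.2, 0.4, 0.1],    # weights of hidden node 1
     [-0.3, 0.1, 0.5]]   # weights of hidden node 2
b = [0.0, 0.0]           # biases set to zero for simplicity

net = layer_net_input(x, W, b)
print(net)  # one number per hidden node
```

Each hidden node reduces its whole input vector to a single number; an activation function is then applied to that number.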
Output layer
The processed information is directed towards the output layer. The output layer can be
connected with the hidden layer, the input layer, or both. In some cases, the output layer feeds
information back to the input layer. The output layer generates the final prediction value. In
binary classification networks there is typically one output node, while multi-class networks
usually have one node per class. The activation functions at the nodes of the output layer
transform the summed data to produce the output values. Proper weight
adjustment is vital for neural networks to find useful data patterns and prevent overfitting.
In recent times, the artificial neurons of deep learning models have often been compared to
biological neurons, and the two are indeed somewhat related in how they function. However, the
differences between them are numerous and distinct, so let us explore them. Here we'll
see a few key characteristics of biological neurons and how they are simplified to obtain
artificial neurons. We'll then try to understand how these differences impose limits on deep
learning networks, and how moving towards a model closer to the biological neuron
could improve AI as we know it now.
Biological Neurons
Neurons are the basic functional units of the nervous system, and they generate electrical
signals called action potentials, which allow them to quickly transmit information over long
distances. Almost all neurons have three basic functions essential for the normal
functioning of all the cells in the body.
These are to receive signals, to integrate the incoming signals (deciding whether or not a
signal should be passed along), and to communicate signals to target cells.
Now let us look at the basic parts of a neuron to understand how it actually works. It is
composed of three main parts:
1. Dendrite
Dendrites are thin filaments responsible for receiving incoming signals from outside and
propagating the electrochemical stimulation received from other neural cells to the cell body,
or soma, of the neuron.
2. Cell Body/Soma
Soma is the cell body responsible for the processing of input signals and deciding whether a
neuron should fire an output signal. It contains the cell’s nucleus.
3. Axon
Axon is responsible for typically conducting electrical impulses known as action potentials
away from the nerve cell body. It normally ends with a number of synapses connecting to
the dendrites of other neurons.
Most neurons receive multiple input signals throughout their dendritic trees. A single neuron
may have more than one set of dendrites and can receive thousands of input signals.
Whether or not a neuron is excited into firing an impulse depends on the sum of all of the
excitatory and inhibitory signals it receives. The processing of this information happens
in the soma, the neuron's cell body. When a neuron does end up firing, the nerve impulse,
or action potential, is conducted down the axon.
Towards the end of it, the axon divides into many branches and has large swellings known
as axon terminals (or nerve terminals). These axon terminals make contact with the
target cells.
Artificial Neurons
An artificial neuron, also known as a perceptron, is the basic unit of a neural network. In simple
terms, it is a mathematical function based on a model of biological neurons. It can also be
seen as a simple logic gate with binary outputs. A perceptron on its own is a single-layer
neural network, and a multi-layer perceptron is called a neural network or deep neural
network.
Activation functions are mathematical equations that determine the output of a neural network model.
They also have a major effect on the network's ability to converge and on its convergence speed;
in some cases, an activation function can prevent the network from converging in the first place.
An activation function also helps to normalize the output for any input to a range such as
-1 to 1 or 0 to 1.
An activation function must also be computationally efficient, because a neural network is
sometimes trained on millions of data points.
Let's consider a simple neural network model without any hidden layers. Its output is
$$Y = \sum_i (w_i x_i) + b$$
where the $w_i$ are the weights, the $x_i$ are the inputs, and $b$ is the bias. Y can range from
-infinity to +infinity, so it is necessary to bound the output to get the desired
prediction or generalized results.
So the activation function is an important part of an artificial neural network. It decides whether a
neuron should be activated or not, and it is a non-linear transformation applied to the input
before sending it to the next layer of neurons or finalizing the output.
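To make the bounding effect concrete, here is a minimal sketch of a single neuron whose weighted sum is passed through different activation functions. The helper names and the input values are illustrative assumptions.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

inputs = [3.0, -2.0]
weights = [4.0, 1.5]
bias = 0.5

raw = neuron_output(inputs, weights, bias, lambda z: z)   # no bounding
squashed_01 = neuron_output(inputs, weights, bias, sigmoid)
squashed_pm1 = neuron_output(inputs, weights, bias, math.tanh)
print(raw, squashed_01, squashed_pm1)
```

Without an activation the output simply grows with the inputs and weights; the sigmoid keeps it between 0 and 1, and tanh keeps it between -1 and 1.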
What is Backpropagation?
Backpropagation is the essence of neural network training. It is the method
of fine-tuning the weights of a neural network based on the error rate obtained
in the previous epoch (i.e., iteration). Proper tuning of the weights allows you
to reduce error rates and make the model reliable by increasing its
generalization.
Backpropagation in neural network is a short form for “backward propagation
of errors.” It is a standard method of training artificial neural networks. This
method helps calculate the gradient of a loss function with respect to all the
weights in the network.
5. Travel back from the output layer to the hidden layer to adjust the
weights such that the error is decreased.
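The idea of propagating the error backward can be sketched on the smallest possible network: one input, one hidden neuron, and one output neuron. The sigmoid activations, the squared-error loss, the starting weights, and the learning rate are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    h = sigmoid(w1 * x)   # hidden-layer activation
    y = sigmoid(w2 * h)   # output-layer activation
    return h, y

def loss(x, target, w1, w2):
    _, y = forward(x, w1, w2)
    return 0.5 * (y - target) ** 2

def gradients(x, target, w1, w2):
    """Gradient of the loss with respect to both weights (chain rule)."""
    h, y = forward(x, w1, w2)
    delta_out = (y - target) * y * (1.0 - y)        # error at the output neuron
    grad_w2 = delta_out * h
    delta_hidden = delta_out * w2 * h * (1.0 - h)   # error sent back to the hidden neuron
    grad_w1 = delta_hidden * x
    return grad_w1, grad_w2

x, target = 1.0, 0.0
w1, w2 = 0.5, 0.5
learning_rate = 0.5

initial_loss = loss(x, target, w1, w2)
for epoch in range(50):
    g1, g2 = gradients(x, target, w1, w2)
    w1 -= learning_rate * g1   # move each weight against its gradient
    w2 -= learning_rate * g2
final_loss = loss(x, target, w1, w2)
print(initial_loss, final_loss)
```

The error term of the output neuron is computed first and then reused to obtain the error term of the hidden neuron, which is what "traveling back" through the layers means in practice.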
• Static Back-propagation
• Recurrent Backpropagation
Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a static input
for static output. It is useful to solve static classification issues like optical character
recognition.
Recurrent Backpropagation:
Recurrent backpropagation in data mining is fed forward until a fixed value is
achieved. After that, the error is computed and propagated backward.
The main difference between these two methods is that the mapping is rapid in
static backpropagation, while it is non-static in recurrent backpropagation.
• The signals in a feedforward network flow in one direction, from the input, through successive
hidden layers, to the output.
• The connections between the nodes do not form a cycle; in this respect, a feedforward network
is different from a recurrent neural network.
Perceptron
Single-layer Perceptron
A perceptron has just two layers: an input layer and an output
layer. It is often called a single-layer network because it
has only one layer of links, between input and output.
RNN works on the principle of saving the output of a particular layer and feeding this back to the
input in order to predict the output of the layer.
Below is how you can convert a Feed-Forward Neural Network into a Recurrent Neural
Network:
The nodes in different layers of the neural network are compressed to form a single layer of
recurrent neural networks. A, B, and C are the parameters of the network.
Fig: Fully connected Recurrent Neural Network
Here, “x” is the input layer, “h” is the hidden layer, and “y” is the output layer. A, B, and C are
the network parameters used to improve the output of the model. At any given time t, the current
input is a combination of input at x(t) and x(t-1). The output at any given time is fetched back to
the network to improve on the output.
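The recurrence described above can be sketched as follows. The notes only say that A, B, and C are the network parameters, so the roles given to them here (A: input to hidden, B: hidden to hidden, C: hidden to output), the tanh activation, and all numeric values are assumptions for illustration.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_forward(xs, A, B, C, h0):
    """Run a simple recurrent network over the input sequence xs."""
    h = h0
    outputs = []
    for x in xs:
        # The new hidden state mixes the current input with the previous state.
        pre = [a + b for a, b in zip(matvec(A, x), matvec(B, h))]
        h = [math.tanh(p) for p in pre]
        outputs.append(matvec(C, h))
    return outputs, h

A = [[0.5], [-0.3]]            # input -> hidden (assumed role)
B = [[0.1, 0.2], [0.0, 0.4]]   # hidden -> hidden (assumed role)
C = [[1.0, -1.0]]              # hidden -> output (assumed role)

xs = [[1.0], [1.0], [1.0]]     # the same input fed three times
outputs, h_final = rnn_forward(xs, A, B, C, h0=[0.0, 0.0])
print(outputs)
```

Although the input is identical at every step, the outputs differ from step to step, because the hidden state carries information from earlier steps. The same parameters A, B, and C are reused at every step.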
Now that you understand what a recurrent neural network is let’s look at the different types of
recurrent neural networks.
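RNNs were created because there were a few issues in the feed-forward neural network:
• It cannot handle sequential data.
• It considers only the current input.
• It cannot memorize previous inputs.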
The solution to these issues is the RNN. An RNN can handle sequential data, accepting the
current input data, and previously received inputs. RNNs can memorize previous inputs due to
their internal memory.
How Do Recurrent Neural Networks Work?
In Recurrent Neural networks, the information cycles through a loop to the middle hidden layer.
The input layer ‘x’ takes in the input to the neural network and processes it and passes it onto the
middle layer.
The middle layer ‘h’ can consist of multiple hidden layers, each with its own activation functions,
weights, and biases. If you have a neural network where the various parameters of different
hidden layers are not affected by the previous layer, i.e., the neural network does not have
memory, then you can use a recurrent neural network.
The Recurrent Neural Network will standardize the different activation functions and weights
and biases so that each hidden layer has the same parameters. Then, instead of creating multiple
hidden layers, it will create one and loop over it as many times as required.
In a feed-forward neural network, the decisions are based on the current input. It doesn’t
memorize the past data, and there’s no future scope. Feed-forward neural networks are used in
general regression and classification problems.
Image Captioning
Any time series problem, like predicting the prices of stocks in a particular month, can be solved
using an RNN.
Text mining and Sentiment analysis can be carried out using an RNN for Natural Language
Processing (NLP).
Machine Translation
Given an input in one language, RNNs can be used to translate the input into different languages
as output.
1. One to One
2. One to Many
3. Many to One
4. Many to Many
One to One RNN
This type of neural network is known as the Vanilla Neural Network. It's used for general
machine learning problems that have a single input and a single output.
One to Many RNN
This type of neural network has a single input and multiple outputs. An example of this is
image captioning.
Many to One RNN
This RNN takes a sequence of inputs and generates a single output. Sentiment analysis is a good
example of this kind of network where a given sentence can be classified as expressing positive
or negative sentiments.
Many to Many RNN
This RNN takes a sequence of inputs and generates a sequence of outputs. Machine translation is
one of the examples.
Two Issues of Standard RNNs
While training a neural network, if the slope tends to grow exponentially instead of decaying,
this is called an exploding gradient. This problem arises when large error gradients accumulate,
resulting in very large updates to the neural network model weights during the training process.
Long training time, poor performance, and bad accuracy are the major issues in gradient
problems.
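A rough way to see why the gradient grows or shrinks exponentially is a deliberately simplified model, assumed here for illustration only: a scalar, linear recurrence h(t) = w * h(t-1). Backpropagating through T timesteps then multiplies the gradient by w once per step.

```python
def gradient_scale(recurrent_weight, timesteps):
    """Factor by which a gradient is scaled after being passed back
    through `timesteps` steps of the linear recurrence h(t) = w * h(t-1)."""
    scale = 1.0
    for _ in range(timesteps):
        scale *= recurrent_weight
    return scale

exploding = gradient_scale(1.5, 50)   # weight above 1: grows exponentially
vanishing = gradient_scale(0.5, 50)   # weight below 1: decays exponentially
print(exploding, vanishing)
```

Real RNNs use matrices and non-linear activations, so the behaviour is more subtle, but the repeated multiplication is the same mechanism.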
Suppose you want to predict the last word in the text: “The clouds are
in the ______.”
The most obvious answer to this is the “sky.” We do not need any
further context to predict the last word in the above sentence.
Consider this sentence: “I have been staying in Spain for the last 10
years…I can speak fluent ______.”
The word you predict will depend on the previous few words in
context. Here, you need the context of Spain to predict the last word in
the text, and the most suitable answer to this sentence is “Spanish.”
The gap between the relevant information and the point where it's
needed may have become very large. LSTMs help you solve this
problem.
In a typical RNN, one input is fed into the network at a time, and a
single output is obtained. But in backpropagation, you use the current
as well as the previous inputs as input. This is called a timestep and
one timestep will consist of many time series data points entering the
RNN simultaneously.
Once the neural network has trained on a timestep and given you an
output, that output is used to calculate and accumulate the errors.
After this, the network is rolled back up and the weights are recalculated
and updated keeping the errors in mind.
Unit 2
MCP Neuron
Definition
We need a basic building block of ANNs: the artificial neuron. The first mathematical model dates back to
Warren McCulloch and Walter Pitts (MCP) [MP43], who proposed it in 1943, hence at the very beginning of
the electronic computer age during World War II. The MCP neuron depicted in Fig. 4 is a basic ingredient
of all ANNs discussed in this course. It is built on very simple general rules, inspired neatly by the
biological neuron:
• The signal enters the nucleus via dendrites from other neurons.
• The synaptic connection for each dendrite may have a different (and adjustable) strength (weight).
• In the nucleus, the signal from all the dendrites is combined (summed up) into $s$.
• If the combined signal is stronger than a given threshold, then the neuron fires along the axon, in
the opposite case it remains still.
• In the simplest realization, the strength of the fired signal has two possible levels: on or off, i.e. 1
or 0. No intermediate values are needed.
• Axon terminal connects to dendrites of other neurons.
Fig. 4 MCP neuron: $x_i$ are the inputs, $w_i$ are the weights, $s$ is the signal, $b$ is
the bias, and $f(s;b)$ represents an activation function, yielding the
output $y=f(s;b)$. The blue oval encircles the whole neuron, as used e.g.
in Fig. 3.
Translating this into a mathematical prescription, one assigns to the input cells the
numbers $x_1, x_2, \dots, x_n$ (input data point). The strength of the synaptic connections is controlled
with the weights $w_i$. Then the combined signal is defined as the weighted sum
$$s = \sum_{i=1}^{n} x_i w_i.$$
The signal becomes an argument of the activation function, which, in the simplest case, takes the form
of the step function
$$f(s;b) = \begin{cases} 1 & \text{for } s \ge b, \\ 0 & \text{for } s < b. \end{cases}$$
When the combined signal $s$ reaches or exceeds the bias (threshold) $b$, the nucleus fires, i.e. the signal
passed along the axon is 1. In the opposite case, the generated signal value is 0 (no firing). This is
precisely what we need to mimic the biological prototype.
There is a convenient notational convention that is frequently used. Instead of splitting the bias from the
input data, we may treat all uniformly. The condition for firing may be trivially transformed as
$$s \ge b \;\to\; s - b \ge 0 \;\to\; \sum_{i=1}^{n} x_i w_i - b \ge 0 \;\to\; \sum_{i=1}^{n} x_i w_i + x_0 w_0 \ge 0 \;\to\; \sum_{i=0}^{n} x_i w_i \ge 0,$$
where $x_0 = 1$ and $w_0 = -b$. In other words, we may treat the bias as a weight on the edge
connected to an additional cell with the input always fixed to 1. This notation is shown in Fig. 5. Now, the
activation function is simply
$$f(s) = \begin{cases} 1 & \text{for } s \ge 0, \\ 0 & \text{for } s < 0, \end{cases}$$
with the summation index in $s$ starting from 0:
$$s = \sum_{i=0}^{n} x_i w_i = x_0 w_0 + x_1 w_1 + \dots + x_n w_n.$$
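The two equivalent forms of the firing condition (comparing the signal with the threshold b, or folding the bias into an extra weight w0 = -b on a constant input x0 = 1) can be checked with a short sketch. The function names and the example weights are illustrative assumptions.

```python
def mcp_neuron(x, w, b):
    """MCP neuron with an explicit threshold: fires (1) when s >= b."""
    s = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if s >= b else 0

def mcp_neuron_bias_as_weight(x, w, b):
    """Same neuron with the bias treated as weight w0 = -b on input x0 = 1."""
    x_ext = [1] + list(x)
    w_ext = [-b] + list(w)
    s = sum(xi * wi for xi, wi in zip(x_ext, w_ext))
    return 1 if s >= 0 else 0

# Illustrative weights and threshold; with these values the neuron acts as AND.
w, b = [1, 1], 2
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, mcp_neuron(x, w, b), mcp_neuron_bias_as_weight(x, w, b))
```

Both functions return the same value for every input, which is exactly what the chain of inequalities above states.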
Hyperparameters
The Hebbian Learning Rule, also known as the Hebb Learning Rule, was proposed by
Donald O. Hebb. It is one of the first and also easiest learning rules for neural
networks. It is used for pattern classification. It applies to a single-layer neural network,
i.e. one that has one input layer and one output layer. The input layer can have many
units, say n. The output layer has only one unit. The Hebbian rule works by updating
the weights between neurons in the neural network for each training sample.
Hebbian Learning Rule Algorithm :
1. Set all weights to zero, wi = 0 for i=1 to n, and bias to zero.
2. For each input vector, S(input vector) : t(target output pair), repeat steps 3-5.
3. Set activations for input units with the input vector Xi = Si for i = 1 to n.
4. Set the corresponding output value to the output neuron, i.e. y = t.
5. Update weight and bias by applying Hebb rule for all i = 1 to n:
wi(new) = wi(old) + xi y
b(new) = b(old) + y
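The algorithm above can be sketched in Python. The update used here is the usual form of the Hebb rule, w_i ← w_i + x_i·y and b ← b + y. The bipolar (+1/-1) encoding of the AND function and the sign threshold in `predict` are assumptions chosen for the example.

```python
def hebb_train(samples):
    """Train a single output unit with the Hebb rule.

    samples is a list of (input_vector, target) pairs.
    """
    n = len(samples[0][0])
    weights = [0] * n          # step 1: all weights and the bias start at zero
    bias = 0
    for x, t in samples:       # step 2: loop over the training pairs
        y = t                  # step 4: output is set to the target
        for i in range(n):     # step 5: Hebb update
            weights[i] += x[i] * y
        bias += y
    return weights, bias

def predict(weights, bias, x):
    net = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if net > 0 else -1

# AND function with bipolar inputs and targets (assumed encoding).
and_samples = [([1, 1], 1), ([1, -1], -1), ([-1, 1], -1), ([-1, -1], -1)]
weights, bias = hebb_train(and_samples)
print(weights, bias)
```

With this encoding a single pass over the four samples is enough for the unit to reproduce the AND function.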
Perceptron
Perceptron was introduced by Frank Rosenblatt in 1957. He proposed a Perceptron learning rule
based on the original MCP neuron. A Perceptron is an algorithm for supervised learning of
binary classifiers. This algorithm enables neurons to learn and processes elements in the training
set one at a time.
There are two types of Perceptrons: Single layer and Multilayer.
• Single layer - Single layer perceptrons can learn only linearly separable patterns
• Multilayer - Multilayer perceptrons or feedforward neural networks with two or more layers have
the greater processing power
The Perceptron algorithm learns the weights for the input signals in order to draw a linear
decision boundary.
This enables you to distinguish between the two linearly separable classes +1 and -1.
Note: Supervised Learning is a type of Machine Learning used to learn models from labeled
training data. It enables output prediction for future or unseen data. Let us focus on the
Perceptron Learning Rule in the next section.
The Perceptron Learning Rule states that the algorithm automatically learns the optimal weight
coefficients. The input features are then multiplied by these weights to determine whether a neuron
fires or not.
The Perceptron receives multiple input signals. If the weighted sum of the input signals exceeds a
certain threshold, it outputs a signal; otherwise it does not return an output. In the context of
supervised learning and classification, this can then be used to predict the class of a sample.
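A minimal sketch of the perceptron learning rule follows. It uses the classic mistake-driven update (weights change only when a sample is misclassified) with class labels +1 and -1; the learning rate, the epoch limit, and the AND data set are illustrative assumptions.

```python
def perceptron_predict(weights, bias, x):
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s >= 0 else -1

def perceptron_train(samples, learning_rate=1, max_epochs=20):
    """Process the training samples one at a time and correct each mistake."""
    weights = [0] * len(samples[0][0])
    bias = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in samples:
            if perceptron_predict(weights, bias, x) != target:
                # Shift the decision boundary towards the misclassified sample.
                weights = [w + learning_rate * target * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * target
                mistakes += 1
        if mistakes == 0:      # every sample is classified correctly
            break
    return weights, bias

# Linearly separable example: logical AND with labels +1 / -1.
training_set = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
weights, bias = perceptron_train(training_set)
print(weights, bias)
```

Because the two classes are linearly separable, the loop stops after a finite number of corrections; for data that is not linearly separable it would simply run until `max_epochs`.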
Implementation of Perceptron
Algorithm for AND Logic Gate with
2-bit Binary Input
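In the field of Machine Learning, the Perceptron is a Supervised Learning
Algorithm for binary classifiers. The Perceptron Model implements the
following function:
$$\hat{y} = \Theta(w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b) = \Theta(\mathbf{w} \cdot \mathbf{x} + b), \qquad \Theta(v) = \begin{cases} 1 & \text{if } v \ge 0, \\ 0 & \text{otherwise.} \end{cases}$$
For a particular choice of the weight vector $\mathbf{w}$ and bias parameter $b$, the model
predicts output $\hat{y}$ for the corresponding input vector $\mathbf{x}$.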
AND logical function truth table for 2-bit binary variables, i.e., the input vector
(x1, x2) and the corresponding output y:
x1  x2  y
0   0   0
0   1   0
1   0   0
1   1   1
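The truth table above can be reproduced by a perceptron with fixed parameters. The weights (1, 1), the bias -1.5, and the convention that the unit outputs 1 when the weighted sum plus bias is non-negative are assumptions chosen for this sketch; many other parameter choices work equally well.

```python
def step(v):
    """Unit step: 1 when v >= 0, otherwise 0."""
    return 1 if v >= 0 else 0

def perceptron(x, w, b):
    """Weighted sum of the inputs plus bias, passed through the step function."""
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)

def and_gate(x1, x2):
    # Assumed parameters: only the input (1, 1) lifts the sum above zero.
    return perceptron((x1, x2), w=(1, 1), b=-1.5)

truth_table = [(x1, x2, and_gate(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)]
for row in truth_table:
    print(*row)
```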
f(x) = kx
where k is a constant.
On the other hand, when the word is to be read from an associative memory,
the content of the word, or part of the word, is specified. The words which
match the specified content are located by the memory and are marked for
reading.
Architecture
As shown in the following figure, the architecture of Auto Associative memory
network has ‘n’ number of input training vectors and similar ‘n’ number of
output target vectors.
Training Algorithm
For training, this network is using the Hebb or Delta learning rule.
Step 1 − Initialize all the weights to zero: wij = 0 (i = 1 to n, j = 1 to n)
Step 2 − Perform steps 3-4 for each input vector.
Step 3 − Activate each input unit as follows −
xi = si (i = 1 to n)
Step 4 − Activate each output unit as follows −
yj = sj (j = 1 to n)
Step 5 − Adjust the weights as follows −
wij(new) = wij(old) + xi yj
Testing Algorithm
Step 1 − Set the weights obtained during training for Hebb’s rule.
Step 2 − Perform steps 3-5 for each input vector.
Step 3 − Set the activation of the input units equal to that of the input vector.
Step 4 − Calculate the net input to each output unit j = 1 to n
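The training and testing steps for the auto associative network can be sketched as follows. The bipolar pattern, the sign threshold applied to the net input, and the function names are assumptions for illustration; the notes do not state which activation is applied after the net input is computed.

```python
def train_auto_associative(patterns):
    """Hebb training: wij(new) = wij(old) + xi * yj, with y equal to x."""
    n = len(patterns[0])
    W = [[0] * n for _ in range(n)]
    for s in patterns:
        for i in range(n):
            for j in range(n):
                W[i][j] += s[i] * s[j]
    return W

def recall(W, x):
    """Compute the net input to each output unit and threshold it (assumed)."""
    n = len(x)
    net = [sum(x[i] * W[i][j] for i in range(n)) for j in range(n)]
    return [1 if v > 0 else -1 for v in net]

stored = [1, 1, -1, -1]
W = train_auto_associative([stored])

noisy = [1, 1, -1, 1]           # last component flipped
print(recall(W, stored))        # the stored pattern itself
print(recall(W, noisy))         # the corrupted pattern
```

In this small example the network returns the stored pattern even when one component of the input is corrupted, which is the purpose of an auto associative memory.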
Hetero Associative memory
Similar to Auto Associative Memory network, this is also a single layer neural
network. However, in this network the input training vector and the output
target vectors are not the same. The weights are determined so that the
network stores a set of patterns. Hetero associative network is static in
nature, hence, there would be no non-linear and delay operations.
Architecture
As shown in the following figure, the architecture of Hetero Associative
Memory network has ‘n’ number of input training vectors and ‘m’ number of
output target vectors.
Training Algorithm
For training, this network is using the Hebb or Delta learning rule.
Step 1 − Initialize all the weights to zero: wij = 0 (i = 1 to n, j = 1 to m)
Step 2 − Perform steps 3-4 for each input vector.
Step 3 − Activate each input unit as follows −
xi = si (i = 1 to n)
Step 4 − Activate each output unit as follows −
yj = sj (j = 1 to m)
Step 5 − Adjust the weights as follows −
wij(new) = wij(old) + xi yj
Testing Algorithm
Step 1 − Set the weights obtained during training for Hebb’s rule.
Step 2 − Perform steps 3-5 for each input vector.
Step 3 − Set the activation of the input units equal to that of the input vector.
Step 4 − Calculate the net input to each output unit j = 1 to m;
Example
Outer product rule for training and testing
The outer product of two coordinate vectors is a matrix. If the two vectors have
dimensions n and m, then their outer product is an n × m matrix. More generally, given
two tensors (multidimensional arrays of numbers), their outer product is a tensor. The outer
product of tensors is also referred to as their tensor product, and can be used to define
the tensor algebra.
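As a small illustration, the outer product of an n-dimensional and an m-dimensional vector can be computed directly, and summing such outer products over all training pairs gives the same weight matrix as applying the update wij(new) = wij(old) + xi yj once per pair. The helper names and the example vectors are assumptions.

```python
def outer(u, v):
    """Outer product of u (length n) and v (length m): an n x m matrix."""
    return [[ui * vj for vj in v] for ui in u]

def outer_product_weights(pairs):
    """Sum of the outer products x^T y over all (x, y) training pairs."""
    n, m = len(pairs[0][0]), len(pairs[0][1])
    W = [[0] * m for _ in range(n)]
    for x, y in pairs:
        P = outer(x, y)
        for i in range(n):
            for j in range(m):
                W[i][j] += P[i][j]
    return W

single = outer([1, -1, 1], [1, -1])        # 3 x 2 matrix
pairs = [([1, -1, 1], [1, -1]), ([1, 1, -1], [-1, 1])]
W = outer_product_weights(pairs)
print(single)
print(W)
```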
What Does Storage Capacity Mean?
Storage capacity refers to the specific amount of data storage that a
device or system can accommodate.
BAM Architecture:
When BAM accepts an input of n-dimensional vector X from
set A then the model recalls m-dimensional vector Y from set B.
Similarly when Y is treated as input, the BAM recalls X.
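The two-way recall can be sketched with a small example. It assumes bipolar patterns, a weight matrix built as the sum of outer products of the stored pairs, and a simple sign threshold that maps a net input of zero to +1; these details are assumptions, not taken from the notes.

```python
def bam_weights(pairs):
    """n x m weight matrix: sum of outer products of the stored (X, Y) pairs."""
    n, m = len(pairs[0][0]), len(pairs[0][1])
    return [[sum(x[i] * y[j] for x, y in pairs) for j in range(m)]
            for i in range(n)]

def sign(values):
    return [1 if v >= 0 else -1 for v in values]

def recall_y(W, x):
    """Forward direction: n-dimensional X from set A recalls m-dimensional Y."""
    return sign([sum(x[i] * W[i][j] for i in range(len(x)))
                 for j in range(len(W[0]))])

def recall_x(W, y):
    """Backward direction: Y from set B recalls X, using the transposed weights."""
    return sign([sum(W[i][j] * y[j] for j in range(len(y)))
                 for i in range(len(W))])

X1, Y1 = [1, -1, 1, -1], [1, 1]
X2, Y2 = [1, 1, -1, -1], [1, -1]
W = bam_weights([(X1, Y1), (X2, Y2)])

print(recall_y(W, X1), recall_x(W, Y1))
print(recall_y(W, X2), recall_x(W, Y2))
```

The same weight matrix serves both directions: it is used as is for X to Y and transposed for Y to X.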
Algorithm
Limitations of BAM:
• Storage capacity of the BAM: the number of associations
stored in the BAM should not exceed the number of
neurons in the smaller layer.
• computer games
• mechanical engineering
This method was proposed before the era of modern
computers and there was an intensive development meantime
which led to numerous improved versions of
Function requirements
• differentiable
• convex
Properties (of the linear activation function f(x) = ax):
1. Range is -infinity to +infinity
2. Provides a convex error surface so optimisation can be achieved faster
3. df(x)/dx = a, which is constant, so gradient descent gets no input-dependent information from it
Limitations:
1. Since the derivative is constant, the gradient has no relation to the input
2. The backpropagated change is constant, because it does not depend on x
Hyperbolic Tangent
The function produces outputs in the range (-1, 1) and is continuous. In other
words, the function produces an output for every x value.
Y = tanh(x)
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
Inverse Tangent (arctan)
It is similar to sigmoid and tanh, but the output ranges over (-pi/2, pi/2).
Softmax
The softmax function is sometimes called the soft argmax function, or multi-class logistic
regression. This is because the softmax is a generalization of logistic regression that can be
used for multi-class classification, and its formula is very similar to the sigmoid function
which is used for logistic regression. The softmax function can be used in a classifier only
when the classes are mutually exclusive.
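A minimal softmax sketch is shown below. Subtracting the largest score before exponentiating is a common numerical-stability trick and is an addition of this example; the scores themselves are made up.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    shift = max(scores)                       # avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.0, 3.0]        # one score per class (illustrative)
probs = softmax(scores)
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

Because the outputs sum to 1, they can be read as the probability of each class, which is why the classes must be mutually exclusive.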
Gudermannian
The Gudermannian function relates circular functions and hyperbolic functions without
explicitly using complex numbers.
$$\mathrm{GELU}(x) = 0.5\,x\left(1 + \tanh\left(\sqrt{2/\pi}\,\left(x + 0.044715\,x^{3}\right)\right)\right)$$
So it’s just a combination of some functions (e.g. hyperbolic tangent tanh) and approximated
numbers.
For negative x the output is slightly negative, and it turns positive as x becomes positive. So
when x is greater than zero, the output will be approximately x, except from x=0 to x=1, where it
leans to a slightly smaller y-value.
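The tanh-based GELU formula can be compared with the exact definition 0.5·x·(1 + erf(x/√2)), which uses the Gaussian error function. The function names and the test points below are illustrative assumptions.

```python
import math

def gelu_tanh(x):
    """GELU via the tanh approximation with the constant 0.044715."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def gelu_exact(x):
    """GELU defined through the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in [-3.0, -1.0, 0.0, 0.5, 1.0, 3.0]:
    print(x, gelu_tanh(x), gelu_exact(x))
```

The two versions stay very close to each other, small negative inputs give small negative outputs, and for large positive x the output approaches x.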
2. Exploding Gradients
Exploding gradients are a problem where large error gradients accumulate and result in very
large updates to neural network model weights during training. These large updates in turn
result in an unstable network. At an extreme, the values of the weights can become so large as
to overflow and result in NaN values.
ReLU(x) = max(0,x)
So if the input is negative, the output of ReLU is 0, and for positive values it is x.
Though it looks like a linear function, it is not: it is only piecewise linear. ReLU has a
derivative (1 for positive inputs and 0 for negative inputs), which allows for
backpropagation.
There is one problem with ReLU. Suppose most of the input values to a neuron are negative or 0.
ReLU then produces 0 as output, its gradient is also 0, and backpropagation can no longer
update that neuron's weights. This is called the Dying ReLU problem. Also, ReLU is an unbounded
function, which means there is no maximum value.
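The dying ReLU problem can be seen directly from the function and its derivative. The derivative at exactly x = 0 is not defined mathematically; using 0 there is a common convention and an assumption of this sketch.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (0 at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0

# A neuron whose inputs are all negative or zero: outputs and gradients are all 0,
# so gradient-based weight updates for this neuron vanish.
pre_activations = [-3.0, -0.5, 0.0, -1.2]
outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]
print(outputs, grads)
```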
Pros:
Parametric ReLU
PReLU is actually not so different from Leaky ReLU: the slope alpha for negative inputs is
learned during training instead of being fixed.
So for negative values of x, the output of PReLU is alpha times x, and for positive values it
is x.
Parametric ReLU is the most common and effective method to solve the dying ReLU problem,
but again it doesn't solve the exploding gradient problem.
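A short sketch of the PReLU rule and its gradient follows. The value alpha = 0.1 is only an example; in PReLU alpha is a trainable parameter, which this sketch does not train.

```python
def prelu(x, alpha):
    """alpha * x for negative inputs, x otherwise."""
    return x if x > 0 else alpha * x

def prelu_grad(x, alpha):
    """Derivative with respect to x: never exactly zero when alpha != 0."""
    return 1.0 if x > 0 else alpha

alpha = 0.1   # illustrative value; Leaky ReLU would keep it fixed
for x in [-2.0, -0.5, 0.5, 2.0]:
    print(x, prelu(x, alpha), prelu_grad(x, alpha))
```

Because the gradient for negative inputs is alpha rather than 0, a neuron that receives mostly negative inputs can still be updated.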
Softplus or SmoothReLU
The derivative of the softplus function is the logistic function.
Here, β is a parameter that must be tuned. If β gets closer to ∞, then the function looks like
ReLU. Authors of the Swish function proposed to assign β as 1 for reinforcement learning
tasks.
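Both statements above can be checked numerically. The sketch assumes the usual definitions softplus(x) = ln(1 + e^x) and swish(x) = x · sigmoid(βx); the step size of the finite difference and the large value β = 50 are illustrative choices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    return math.log1p(math.exp(x))

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

def numerical_derivative(f, x, eps=1e-6):
    """Central finite difference, used only to check the analytic claim."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

# The derivative of softplus matches the logistic (sigmoid) function.
for x in [-2.0, 0.0, 1.5]:
    print(x, numerical_derivative(softplus, x), sigmoid(x))

# With a large beta, swish is close to ReLU.
for x in [-2.0, -0.5, 0.5, 2.0]:
    print(x, swish(x, beta=50.0), max(0.0, x))
```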