Neural Network Notes Unit 1


Neural Network

Humans have made several attempts to mimic biological systems, and one of them
is the artificial neural network, inspired by the biological neural networks in living
organisms. However, the two differ in several ways, much as airplanes were inspired by
birds and cars by four-legged animals without copying them exactly.
The artificial counterparts are powerful tools in their own right. Perceptrons, which
are the predecessors of modern artificial neurons, were created to mimic
certain parts of a biological neuron, such as the dendrites, axon, and cell body, using
mathematical models, electronics, and whatever limited information we have
about biological neural networks.

Components and Working of Biological Neural Networks
Image caption: Parts of a biological neural network
In living organisms, the brain is the control unit of the neural network, and it has
different subunits that take care of vision, senses, movement, and hearing. The brain is
connected with a dense network of nerves to the rest of the body’s sensors and actors.
There are approximately 10¹¹ neurons in the brain, and these are the building blocks of
the complete central nervous system of the living body.

The neuron is the fundamental building block of neural networks. In the biological
systems, a neuron is a cell just like any other cell of the body, which has a DNA code
and is generated in the same way as the other cells. Though it might have different
DNA, the function is similar in all the organisms. A neuron comprises three major
parts: the cell body (also called Soma), the dendrites, and the axon. The dendrites are
like fibers branched in different directions and are connected to many cells in that
cluster.
Dendrites receive the signals from surrounding neurons, and the axon transmits the
signal to the other neurons. At the ending terminal of the axon, the contact with the
dendrite is made through a synapse. Axon is a long fiber that transports the output
signal as electric impulses along its length. Each neuron has one axon. Axons pass
impulses from one neuron to another like a domino effect.

Why Understand Biological Neural Networks?


For creating mathematical models for artificial neural networks, theoretical analysis of
biological neural networks is essential as they have a very close relationship. And this
understanding of the brain’s neural networks has opened horizons for the development
of artificial neural network systems and adaptive systems designed to learn and adapt
to the situations and inputs.
Image caption: An artificial neuron
Neural Network Structure
Artificial Neural Networks are abbreviated as ANNs. They are fundamentally computational
models. ANNs are structured such that they emulate the functions of neurons present in the
human brain. A human brain learns from the environment by connecting various neurons,
whereas neural networks learn from input and output data. On a fundamental level, ANNs
behave as nonlinear statistical data models that capture the relationship between input and output.
ANNs try to figure out different patterns using this relationship.

Similarity with Human Brain


ANNs are computational algorithms that are based on the inner workings of a biological nervous
system. ANNs can use machine learning to recognize patterns, classify objects, and perform
predictions. ANNs are composed of interconnected layers of nodes. An ANN can also be
described as a directed graph. The input layer first receives information, then the
weights are applied, and activation functions are used to process the information. ANNs have
many interconnected processing units called nodes, which are analogous to biological neurons.
Connections link the nodes of each layer to the nodes of the next.

ANN Structure
ANNs contain artificial neurons. There can be any number of neurons, depending on the
requirements of the application. These neurons are grouped into layers. The most commonly
used ANN structure comprises an input layer, one or more hidden layers, and an output layer.
Biological neurons communicate with each other by sending electric pulses, whereas ANNs
pass the information across layers and nodes. At the nodes, the importance of each input signal
is determined by its associated weight. The value of a weight can be positive or negative. A
positive weight makes the input excitatory (it pushes the neuron towards activating), whereas a
negative weight makes it inhibitory. A neuron multiplies each input by its associated weight
and sums the results.

ANN Layers
ANNs are arranged in layers, and connections link the nodes of adjacent layers. Each node is
characterized by a specific activation function. Typically, an ANN has the following layers.

Input layer
It receives the data, which usually is in the form of vectors. The vector may contain any number
of parameters. Generally, the number of input nodes in the input layer is equal to the number of
parameters in the input vector. The input layer feeds the data to the subsequent
hidden layers. The input nodes do not change the data; they simply pass it along to the next
layer.

Hidden Layer
The main processing happens in the hidden layers. The number of hidden layers can vary. The
incoming information passes through weighted connections in the hidden layer, where the input
values are multiplied with weights. Subsequently, the weighted inputs are summed up to
produce a single number.

Output layer
The processed information is directed towards the output layer. The output layer can be
connected with the hidden layer, input layer, or both. In some cases, the output layers feed the
information back to the input layer. The output layer generates the final prediction value. There
is typically one output node for binary classification and one node per class for multi-class
classification. The activation functions at the nodes of the output layer transform the summed
inputs to produce the output values. Proper weight adjustment is vital for neural networks to
find useful data patterns and prevent overfitting.

Biological Neurons Vs Artificial Neurons

In recent times, the artificial neurons of deep learning models have often been compared to
biological neurons, since the two are somewhat related in how they function. However, the
differences between them are numerous and distinct, so let us explore them. In this post we’ll
see a few key characteristics of biological neurons and how they are simplified to obtain
artificial neurons. We’ll then try to understand how these differences impose limits on deep
learning networks, and how moving towards a model closer to the biological neuron
could improve AI as we know it now.
Biological Neurons
Neurons are the basic functional units of the nervous system, and they generate electrical
signals called action potentials, which allow them to quickly transmit information over long
distances. Almost all neurons perform three basic functions.

These are to

1. Receive signals from sensory receptors or other neurons.
2. Process the incoming signals and determine whether or not the information should be
passed along.
3. Communicate signals to target cells, which might be other neurons, muscles, or other
parts of the body.

Now let us look at the basic parts of a neuron to understand how it actually works. A neuron
is composed of three main parts:

Neuron; Source: Wikipedia

1. Dendrite
Dendrites are thin filaments responsible for receiving incoming signals from outside. They
propagate the electrochemical stimulation received from other neural cells to the cell body,
or soma, of the neuron.

2. Cell Body/Soma
Soma is the cell body responsible for the processing of input signals and deciding whether a
neuron should fire an output signal. It contains the cell’s nucleus.
3. Axon
Axon is responsible for typically conducting electrical impulses known as action potentials
away from the nerve cell body. It normally ends with a number of synapses connecting to
the dendrites of other neurons.

Working of the parts


The function of receiving the incoming information is done by dendrites, and processing
usually takes place in the Soma. Incoming signals can be either excitatory — which means
they tend to make the neuron fire (generate electrical impulses) — or inhibitory — which
means that they tend to block/ keep the neuron from firing.

Most neurons receive multiple input signals throughout their dendritic trees. A single neuron
may have more than one set of dendrites and can receive thousands of input signals.
Whether or not a neuron is excited into firing an impulse depends on the scale of all of the
excitatory and inhibitory signals it receives. The processing of this information happens
in soma which is neuron cell body. When a neuron does end up firing, the nerve impulse,
or action potential, is conducted down the axon.

Towards the end of it, the axon divides into many branches and has large swellings known
as axon terminals (or nerve terminals). These axon terminals make contact with the
target cells.

Artificial Neurons
An artificial neuron, also known as a perceptron, is the basic unit of a neural network. In simple
terms, it is a mathematical function based on a model of biological neurons. It can also be
seen as a simple logic gate with binary outputs. A single perceptron forms a single-layer
neural network, and a multi-layer perceptron is called a Neural Network / Deep Neural Network.

Each artificial neuron has the following main functions:

1. Takes inputs from the input layer.
2. Weighs them separately and sums them.
3. Passes this sum through a nonlinear function to produce the output.
An artificial neuron is made up of the following components:
1. Input values
The input values are passed to the neuron through the input layer. The input can be as simple
as an array of values. It is similar to a dendrite in biological neurons.
2. Weights and Bias
Weights are an array of values that are multiplied by the corresponding input
values. We then take the sum of all these products, which is called the weighted sum.
Next, we add a bias value to the weighted sum to get the value the neuron passes to its
activation function. The bias makes it possible to shift the activation function curve left or
right along the input axis, which fine-tunes the numeric output of the perceptron.
3. Activation Function
The activation function decides whether or not a neuron is fired. It maps the input values to the
required output values and decides which of the two output values should be generated by
the neuron.
4. Output Layer
The output layer gives the final output of a neuron, which can then be passed to other neurons in
the network or taken as the final output value.
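
The components above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `step` and `neuron_output`, and the example inputs, weights, and bias, are assumptions chosen only for this sketch.

```python
def step(z):
    """Binary step activation: fire (1) when z is non-negative, else 0."""
    return 1 if z >= 0 else 0


def neuron_output(inputs, weights, bias, activation=step):
    """Weighted sum of the inputs plus the bias, passed through an activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum + bias)


# Hypothetical values, picked only to show one firing and one silent case.
weights = [0.5, -0.6, 0.2]
bias = -0.4
fires = neuron_output([1.0, 0.0, 1.0], weights, bias)   # 0.7 - 0.4 is positive
silent = neuron_output([0.0, 1.0, 0.0], weights, bias)  # -0.6 - 0.4 is negative
```

The negative weight on the second input acts as an inhibitory connection, so switching that input on alone keeps the neuron from firing.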

Activation functions are mathematical equations that determine the output of a neural network model.
Activation functions also have a major effect on the neural network’s ability to converge and on the
convergence speed; in some cases, a poorly chosen activation function might prevent the network from
converging in the first place. Many activation functions also normalize the output for any input to a
range between -1 and 1 or between 0 and 1.

An activation function must be cheap to compute, because a neural network is sometimes trained on
millions of data points and the function is evaluated for every neuron.

Let’s consider a simple neural network model without any hidden layers.

Here is the output:

Y = ∑ (weights*input) + bias

It can range from -infinity to +infinity, so it is necessary to bound the output to get the desired
prediction or generalized results:

Y = Activation function(∑ (weights*input) + bias)

So the activation function is an important part of an artificial neural network. They decide whether a
neuron should be activated or not and it is a non-linear transformation that can be done on the input
before sending it to the next layer of neurons or finalizing the output.
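
As a rough illustration of how an activation function bounds the unbounded weighted sum, the sketch below applies the sigmoid (range 0 to 1) and tanh (range -1 to 1) functions to the same sum. The helper name `activated_output` and the sample numbers are assumptions made for this example only.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def activated_output(inputs, weights, bias, activation):
    """Y = activation(sum(weights * input) + bias)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Hypothetical inputs and weights giving a large weighted sum (z = 12.5).
inputs, weights, bias = [3.0, 4.0], [2.5, 1.0], 1.0
bounded_01 = activated_output(inputs, weights, bias, sigmoid)      # stays within 0..1
bounded_pm1 = activated_output(inputs, weights, bias, math.tanh)   # stays within -1..1
```

However large the weighted sum becomes, the two outputs stay inside their fixed ranges, which is the bounding behaviour described above.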

What is Feedforward Neural Network?


Commonly known as a multi-layered network of neurons, feedforward neural
networks are called so due to the fact that all the information travels only in the
forward direction.
The information first enters the input nodes, moves through the hidden layers, and
finally comes out through the output nodes. The network contains no connections to
feed the information coming out at the output node back into the network.
Feedforward neural networks are meant to approximate functions.

Here’s how it works.

There is a classifier y = f*(x), which maps an input x to a category y.
The feedforward network defines a mapping y = f (x; θ) and learns the value of θ that
approximates the function f* the best.
Feedforward neural networks form the basis for object recognition in images, as you can
spot in the Google Photos app.
The Layers of a Feedforward Neural Network
A feedforward neural network consists of the following.
Input layer
It contains the input-receiving neurons. They then pass the input to the next layer. The
total number of neurons in the input layer is equal to the attributes in the dataset.
Hidden layer
This is the middle layer, hidden between the input and output layers. There is a huge
number of neurons in this layer that apply transformations to the inputs. They then
pass it on to the output layer.
Output layer
It is the last layer, and its design depends on how the model is built: the number of
output neurons matches the number of values you want the network to predict.
Neuron weights
The strength of a connection between two neurons is called its weight. A weight can be any
real number, positive or negative, although weights are often initialized to small values.
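
To make the layer-by-layer flow concrete, here is a small sketch of a forward pass through a hypothetical 2-3-1 network. The function names, the hand-picked weights (arbitrary real numbers, some of them negative), and the choice of the sigmoid activation are all assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """One fully connected layer: weights[j][i] links input i to neuron j."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def feedforward(x, layers):
    """Pass x through each (weights, biases) pair in order, input to output."""
    activation = x
    for weights, biases in layers:
        activation = layer_forward(activation, weights, biases)
    return activation


# Hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output neuron.
hidden_layer = ([[0.2, -0.4], [0.7, 0.1], [-0.5, 0.6]], [0.0, 0.1, -0.1])
output_layer = ([[0.3, -0.2, 0.8]], [0.05])
prediction = feedforward([1.0, 0.5], [hidden_layer, output_layer])
```

Information only moves forward here: each layer's output becomes the next layer's input, and nothing is fed back.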

What is Backpropagation?
Backpropagation is the essence of neural network training. It is the method
of fine-tuning the weights of a neural network based on the error rate obtained
in the previous epoch (i.e., iteration). Proper tuning of the weights allows you
to reduce error rates and make the model reliable by increasing its
generalization.
Backpropagation in neural network is a short form for “backward propagation
of errors.” It is a standard method of training artificial neural networks. This
method helps calculate the gradient of a loss function with respect to all the
weights in the network.

How Backpropagation Algorithm Works


The backpropagation algorithm computes the gradient of
the loss function with respect to each weight by the chain rule. It computes the
gradient efficiently one layer at a time, iterating backward from the last layer,
unlike a naive direct computation. It computes the
gradient, but it does not define how the gradient is used. It generalizes the
computation in the delta rule.

Consider the following backpropagation neural network example diagram to understand:

1. Inputs X arrive through the preconnected path.
2. The input is modeled using real weights W. The weights are usually
randomly selected.
3. Calculate the output for every neuron from the input layer, to the hidden
layers, to the output layer.
4. Calculate the error in the outputs:

Error = Actual Output - Desired Output

5. Travel back from the output layer to the hidden layer to adjust the
weights such that the error is decreased.

Keep repeating the process until the desired output is achieved.
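
The steps above can be sketched for a tiny network with two inputs, two hidden neurons, and one output. Everything specific here is an assumption made for illustration: the sigmoid activation, the squared-error loss 0.5 * (y - target)^2, plain gradient descent as the weight update, the OR-gate data, and fixed (rather than random) starting weights so the run is reproducible.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, W2, b2):
    """Step 3: compute the hidden activations h and the network output y."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y


def backward(x, target, h, y, W2):
    """Steps 4-5: output error, propagated backward with the chain rule."""
    error = y - target                   # actual output minus desired output
    delta_out = error * y * (1.0 - y)    # gradient at the output pre-activation
    grad_W2 = [delta_out * hi for hi in h]
    delta_hidden = [delta_out * w * hi * (1.0 - hi) for w, hi in zip(W2, h)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    return grad_W1, delta_hidden, grad_W2, delta_out


def total_loss(data, W1, b1, W2, b2):
    """Sum of the losses 0.5 * (y - target)^2 over the data set."""
    return sum(0.5 * (forward(x, W1, b1, W2, b2)[1] - t) ** 2 for x, t in data)


def train(data, W1, b1, W2, b2, learning_rate=0.5, epochs=2000):
    """Repeat forward pass, backward pass, and gradient-descent update."""
    for _ in range(epochs):
        for x, t in data:
            h, y = forward(x, W1, b1, W2, b2)
            gW1, gb1, gW2, gb2 = backward(x, t, h, y, W2)
            for j in range(len(W1)):
                for i in range(len(x)):
                    W1[j][i] -= learning_rate * gW1[j][i]
                b1[j] -= learning_rate * gb1[j]
                W2[j] -= learning_rate * gW2[j]
            b2 -= learning_rate * gb2
    return W1, b1, W2, b2


# Hypothetical training data (logical OR) and fixed starting weights.
or_data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
           ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
W1, b1 = [[0.1, -0.2], [0.3, 0.4]], [0.0, 0.0]
W2, b2 = [0.2, -0.1], 0.0
loss_before = total_loss(or_data, W1, b1, W2, b2)
W1, b1, W2, b2 = train(or_data, W1, b1, W2, b2)
loss_after = total_loss(or_data, W1, b1, W2, b2)
```

Note that `backward` only computes gradients; how they are used (here, a fixed learning rate) is a separate choice, in line with the remark that backpropagation does not define how the gradient is used.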

Why We Need Backpropagation?


Most prominent advantages of Backpropagation are:

• Backpropagation is fast, simple and easy to program


• It has no parameters to tune apart from the numbers of input
• It is a flexible method as it does not require prior knowledge about the network
• It is a standard method that generally works well
• It does not need any special mention of the features of the function to be
learned.

What is a Feed Forward Network?


A feedforward neural network is an artificial neural network where the nodes never
form a cycle. This kind of neural network has an input layer, hidden layers, and an
output layer. It is the first and simplest type of artificial neural network.

Types of Backpropagation Networks


Two Types of Backpropagation Networks are:

• Static Back-propagation
• Recurrent Backpropagation

Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a static input
for static output. It is useful to solve static classification issues like optical character
recognition.
Recurrent Backpropagation:
In recurrent backpropagation, the activations are fed forward until a fixed value is
reached. After that, the error is computed and propagated backward.

The main difference between these two methods is that the mapping is rapid in
static back-propagation, while it is nonstatic in recurrent backpropagation.

Feed-forward neural networks:

• The signals in a feedforward network flow in one direction, from input, through successive
hidden layers, to the output.

• The connections between the nodes do not form a cycle; as such, it is different from recurrent
neural networks.

Backpropagation is a training algorithm consisting of 2 steps:

• Feedforward the values.

• Calculate the error and propagate it back to the earlier layers.


Forward-propagation is a part of the backpropagation algorithm but comes before back-
propagating the signals from the nodes. The basic type of neural network is a multi-layer
perceptron, which is a Feed-forward backpropagation neural network.

Perceptron

The fundamental building block of Deep Learning is the
Perceptron, which is a single neuron in a Neural Network. The
Perceptron is an artificial neuron inspired by biological
neurons.

Single-layer Perceptron
A single-layer perceptron has just two layers: an input layer and
an output layer. It is often called a single-layer network because
it has one layer of links between input and output.

Input nodes are connected fully to a node or multiple nodes
in the next layer. A node in the next layer takes a weighted
sum of all its inputs.

Multi-Layer Perceptron (MLP)

A multilayer perceptron (MLP) is a feed-forward artificial
neural network that generates a set of outputs from a set of
inputs. An MLP is a neural network connecting multiple layers
in a directed graph, which means that the signal path
through the nodes only goes one way. The MLP network
consists of input, output, and hidden layers. Each hidden
layer consists of numerous perceptrons, which are called
hidden units.
Fig: Input layer, hidden layers, and output layer of an MLP

Input Layer: – The input layer provides information from
the outside world (environment) to the network.

Hidden Layers: – Hidden layers perform computations and
transfer information from the input layer to the output layer.
Hidden layers have no direct connection with the outside
world.

Output Layer: – The output layer is responsible for
computations and for transferring information from the network
to the outside world (environment).

• This type of network is trained with the backpropagation
learning algorithm.
• Multi-Layer Perceptron (MLP) can solve problems which
are not linearly separable.
• Multi-layer perceptron is often applied to supervised
learning problems.
• Multi-Layer Perceptron (MLP) are widely used for pattern
recognition, classification, prediction, and
approximation.
What Is a Recurrent Neural Network (RNN)?

RNN works on the principle of saving the output of a particular layer and feeding this back to the
input in order to predict the output of the layer.

Below is how you can convert a Feed-Forward Neural Network into a Recurrent Neural
Network:

Fig: Simple Recurrent Neural Network

The nodes in different layers of the neural network are compressed to form a single layer of
recurrent neural networks. A, B, and C are the parameters of the network.
Fig: Fully connected Recurrent Neural Network

Here, “x” is the input layer, “h” is the hidden layer, and “y” is the output layer. A, B, and C are
the network parameters used to improve the output of the model. At any given time t, the current
input is a combination of input at x(t) and x(t-1). The output at any given time is fetched back to
the network to improve on the output.

Now that you understand what a recurrent neural network is let’s look at the different types of
recurrent neural networks.

Why Recurrent Neural Networks?

RNNs were created because there were a few issues in the feed-forward neural network:

• Cannot handle sequential data

• Considers only the current input

• Cannot memorize previous inputs

The solution to these issues is the RNN. An RNN can handle sequential data, accepting the
current input data, and previously received inputs. RNNs can memorize previous inputs due to
their internal memory.
How Do Recurrent Neural Networks Work?

In Recurrent Neural networks, the information cycles through a loop to the middle hidden layer.

Fig: Working of Recurrent Neural Network

The input layer ‘x’ takes in the input to the neural network and processes it and passes it onto the
middle layer.

The middle layer ‘h’ can consist of multiple hidden layers, each with its own activation functions
and weights and biases. If you have a neural network where the various parameters of different
hidden layers are not affected by the previous layer, i.e., the neural network does not have
memory, then you can use a recurrent neural network to give it memory.

The Recurrent Neural Network will standardize the different activation functions and weights
and biases so that each hidden layer has the same parameters. Then, instead of creating multiple
hidden layers, it will create one and loop over it as many times as required.
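
The idea of one hidden layer looped over with shared parameters can be sketched with a scalar RNN. This is a simplified illustration under stated assumptions: A is taken to be the input-to-hidden weight, B the hidden-to-hidden (recurrent) weight, and C the hidden-to-output weight; the tanh activation and the function name `rnn_forward` are also choices made for this example.

```python
import math


def rnn_forward(xs, A, B, C, h0=0.0):
    """Run a scalar RNN over the sequence xs, reusing A, B, C at every step."""
    h = h0
    hidden_states, outputs = [], []
    for x in xs:
        # The new hidden state mixes the current input with the previous state.
        h = math.tanh(A * x + B * h)
        hidden_states.append(h)
        outputs.append(C * h)
    return hidden_states, outputs


# Hypothetical parameter values and two sequences with the same elements
# in a different order.
A, B, C = 0.5, 0.8, 1.0
states_early, outputs_early = rnn_forward([1.0, 0.0, 0.0], A, B, C)
states_late, outputs_late = rnn_forward([0.0, 0.0, 1.0], A, B, C)
```

Because the hidden state carries information forward, the two sequences end in different states even though they contain the same values; with B set to 0 the network would have no memory of earlier inputs.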

Feed-Forward Neural Networks vs Recurrent Neural Networks


A feed-forward neural network allows information to flow only in the forward direction, from
the input nodes, through the hidden layers, and to the output nodes. There are no cycles or loops
in the network.

Below is a simplified representation of a feed-forward neural network:

Fig: Feed-forward Neural Network

In a feed-forward neural network, the decisions are based on the current input. It doesn’t
memorize the past data, and there’s no future scope. Feed-forward neural networks are used in
general regression and classification problems.

Applications of Recurrent Neural Networks

Image Captioning

RNNs are used to caption an image by analyzing the activities present.


Time Series Prediction

Any time series problem, like predicting the prices of stocks in a particular month, can be solved
using an RNN.

Natural Language Processing

Text mining and Sentiment analysis can be carried out using an RNN for Natural Language
Processing (NLP).
Machine Translation

Given an input in one language, RNNs can be used to translate the input into different languages
as output.

Types of Recurrent Neural Networks

There are four types of Recurrent Neural Networks:

1. One to One

2. One to Many

3. Many to One

4. Many to Many

One to One RNN

This type of neural network is known as the Vanilla Neural Network. It's used for general
machine learning problems that have a single input and a single output.
One to Many RNN

This type of neural network has a single input and multiple outputs. An example of this is
image captioning.
Many to One RNN

This RNN takes a sequence of inputs and generates a single output. Sentiment analysis is a good
example of this kind of network where a given sentence can be classified as expressing positive
or negative sentiments.
Many to Many RNN

This RNN takes a sequence of inputs and generates a sequence of outputs. Machine translation is
one of the examples.
Two Issues of Standard RNNs

1. Vanishing Gradient Problem

Recurrent Neural Networks enable you to model time-dependent and
sequential data problems, such as stock market prediction, machine
translation, and text generation. You will find, however, that RNNs are hard to
train because of the gradient problem.

RNNs suffer from the problem of vanishing gradients. The gradients
carry information used in the RNN, and when the gradient becomes
too small, the parameter updates become insignificant. This makes
the learning of long data sequences difficult.
2. Exploding Gradient Problem

While training a neural network, if the slope tends to grow exponentially instead of decaying,
this is called an Exploding Gradient. This problem arises when large error gradients accumulate,
resulting in very large updates to the neural network model weights during the training process.

Long training time, poor performance, and bad accuracy are the major issues in gradient
problems.
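
A rough way to see both problems is to note that a gradient passed back through many time steps is multiplied by a similar factor at every step. The sketch below is a deliberately simplified model of that effect, assuming the same scalar factor at each step; the function name `gradient_scale` and the factors 0.5 and 1.5 are illustrative assumptions.

```python
def gradient_scale(recurrent_factor, steps):
    """Size of a gradient after being multiplied by the same factor `steps` times."""
    scale = 1.0
    for _ in range(steps):
        scale *= recurrent_factor
    return scale


# A factor below 1 shrinks the gradient towards zero (vanishing), while a
# factor above 1 makes it grow exponentially (exploding).
vanishing = gradient_scale(0.5, 50)
exploding = gradient_scale(1.5, 50)
```

After only 50 steps the first gradient is far too small to change the weights noticeably, and the second is large enough to cause very large weight updates.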

Gradient Problem Solutions


Now, let’s discuss the most popular and efficient way to deal with
gradient problems, i.e., Long Short-Term Memory Network (LSTMs).

First, let’s understand Long-Term Dependencies.

Suppose you want to predict the last word in the text: “The clouds are
in the ______.”

The most obvious answer to this is the “sky.” We do not need any
further context to predict the last word in the above sentence.

Consider this sentence: “I have been staying in Spain for the last 10
years…I can speak fluent ______.”

The word you predict will depend on the previous few words in
context. Here, you need the context of Spain to predict the last word in
the text, and the most suitable answer to this sentence is “Spanish.”
The gap between the relevant information and the point where it's
needed may have become very large. LSTMs help you solve this
problem.

Backpropagation Through Time


Backpropagation through time is when we apply a Backpropagation
algorithm to a Recurrent Neural network that has time series data as
its input.

In a typical RNN, one input is fed into the network at a time, and a
single output is obtained. In backpropagation through time, however, you
use the current as well as the previous inputs as input. Each such group is
called a timestep, and one timestep consists of many time series data
points entering the RNN simultaneously.

Once the neural network has trained on a timestep and given you an
output, that output is used to calculate and accumulate the errors.
After this, the network is rolled back up, and the weights are recalculated
and updated with the errors in mind.

Unit 2
MCP Neuron

Definition
We need a basic building block of ANNs: the artificial neuron. The first mathematical model dates back to
Warren McCulloch and Walter Pitts (MCP) [MP43], who proposed it in 1943, hence at the very beginning of
the electronic computer age during World War II. The MCP neuron depicted in Fig. 4 is a basic ingredient
of all ANNs discussed in this course. It is built on very simple general rules, inspired by the
biological neuron:

• The signal enters the nucleus via dendrites from other neurons.
• The synaptic connection for each dendrite may have a different (and adjustable) strength (weight).
• In the nucleus, the signal from all the dendrites is combined (summed up) into $s$.
• If the combined signal is stronger than a given threshold, then the neuron fires along the axon, in
the opposite case it remains still.
• In the simplest realization, the strength of the fired signal has two possible levels: on or off, i.e. 1
or 0. No intermediate values are needed.
• Axon terminal connects to dendrites of other neurons.

Fig. 4 MCP neuron: $x_i$ is the input, $w_i$ are the weights, $s$ is the signal, $b$ is
the bias, and $f(s;b)$ represents an activation function, yielding the
output $y=f(s;b)$. The blue oval encircles the whole neuron, as used e.g.
in Fig. 3.
Translating this into a mathematical prescription, one assigns to the input cells the
numbers $x_1, x_2, \dots, x_n$ (input data point). The strength of the synaptic connections is controlled
with the weights $w_i$. Then the combined signal is defined as the weighted sum

$$ s = \sum_{i=1}^{n} x_i w_i . $$

The signal becomes an argument of the activation function, which, in the simplest case, takes the form
of the step function

$$ f(s;b) = \begin{cases} 1 & \text{for } s \ge b \\ 0 & \text{for } s < b \end{cases} $$

When the combined signal $s$ is at least as large as the bias (threshold) $b$, the nucleus fires, i.e. the
signal passed along the axon is 1. In the opposite case, the generated signal value is 0 (no firing). This
is precisely what we need to mimic the biological prototype.
There is a convenient notational convention that is frequently used. Instead of splitting the bias from the
input data, we may treat all uniformly. The condition for firing may be trivially transformed as

$$ s \ge b \;\to\; s - b \ge 0 \;\to\; \sum_{i=1}^{n} x_i w_i - b \ge 0 \;\to\; \sum_{i=1}^{n} x_i w_i + x_0 w_0 \ge 0 \;\to\; \sum_{i=0}^{n} x_i w_i \ge 0, $$

where $x_0 = 1$ and $w_0 = -b$. In other words, we may treat the bias as a weight on the edge
connected to an additional cell with the input always fixed to 1. This notation is shown in Fig. 5. Now, the
activation function is simply

$$ f(s) = \begin{cases} 1 & \text{for } s \ge 0 \\ 0 & \text{for } s < 0, \end{cases} \qquad (1) $$

with the summation index in $s$ starting from $0$:

$$ s = \sum_{i=0}^{n} x_i w_i = x_0 w_0 + x_1 w_1 + \dots + x_n w_n . \qquad (2) $$

Fig. 5 Alternative, more uniform representation of the MCP neuron,
with $x_0 = 1$ and $w_0 = -b$.
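
The two equivalent forms of the MCP neuron can be compared directly in code. The sketch below is an illustration with assumed names (`mcp_neuron`, `mcp_neuron_uniform`) and assumed integer weights $w = (1, 1)$ with threshold $b = 2$, which happen to make the neuron behave as a logical AND gate.

```python
def mcp_neuron(x, w, b):
    """Threshold form: fire (1) when the weighted sum s reaches the bias b."""
    s = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if s >= b else 0


def mcp_neuron_uniform(x, w, b):
    """Uniform form: prepend x0 = 1 and w0 = -b, then compare the sum with 0."""
    x_ext = [1] + list(x)
    w_ext = [-b] + list(w)
    s = sum(xi * wi for xi, wi in zip(x_ext, w_ext))
    return 1 if s >= 0 else 0


# Hypothetical weights and threshold; both forms give the same outputs.
w, b = [1, 1], 2
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
threshold_outputs = [mcp_neuron(x, w, b) for x in points]
uniform_outputs = [mcp_neuron_uniform(x, w, b) for x in points]
```

Treating the bias as the weight $w_0$ of a constant input keeps the code for the sum uniform, which is convenient once the weights are adjusted during training.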

Hyperparameters

The weights $w_0 = -b, w_1, \dots, w_n$ are the parameters of the neuron (these notes refer to them as
hyperparameters, although that term more commonly denotes settings such as the learning rate that are
chosen before training). They
determine the functionality of the MCP neuron and may be changed during the learning (training) process
(see the following). However, they are kept fixed when using the trained neuron on a particular input data
sample.
Hebb Rule

Hebbian Learning Rule, also known as Hebb Learning Rule, was proposed by
Donald O Hebb. It is one of the first and also easiest learning rules in the neural
network. It is used for pattern classification. It is a single layer neural network,
i.e. it has one input layer and one output layer. The input layer can have many
units, say n. The output layer only has one unit. Hebbian rule works by updating
the weights between neurons in the neural network for each training sample.
Hebbian Learning Rule Algorithm :
1. Set all weights to zero, wi = 0 for i=1 to n, and bias to zero.
2. For each input vector, S(input vector) : t(target output pair), repeat steps 3-5.
3. Set activations for input units with the input vector Xi = Si for i = 1 to n.
4. Set the corresponding output value to the output neuron, i.e. y = t.
5. Update weight and bias by applying Hebb rule for all i = 1 to n:
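
As an illustration of steps 1 to 5, the sketch below assumes the commonly stated form of the Hebb update, w_i(new) = w_i(old) + x_i * y and b(new) = b(old) + y, and applies it to the AND function with bipolar (+1 / -1) inputs and targets. The function names and the choice of the AND data set are assumptions made for this example.

```python
def hebb_train(samples, n_inputs):
    """One pass of the Hebb rule over (input vector, target) pairs."""
    weights = [0] * n_inputs          # step 1: all weights and the bias start at zero
    bias = 0
    for x, t in samples:              # step 2: visit each training pair
        y = t                         # step 4: output unit is set to the target
        for i in range(n_inputs):     # step 5: assumed update w_i += x_i * y
            weights[i] += x[i] * y
        bias += y                     # assumed bias update b += y
    return weights, bias


def hebb_predict(x, weights, bias):
    """Classify with the sign of the net input."""
    net = sum(wi * xi for wi, xi in zip(weights, x)) + bias
    return 1 if net > 0 else -1


# Bipolar AND: the target is +1 only when both inputs are +1.
and_samples = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
weights, bias = hebb_train(and_samples, 2)
predictions = [hebb_predict(x, weights, bias) for x, _ in and_samples]
```

A single pass over the four samples is enough here; the resulting weights separate the one positive case from the three negative ones.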

Perceptron

Perceptron was introduced by Frank Rosenblatt in 1957. He proposed a Perceptron learning rule
based on the original MCP neuron. A Perceptron is an algorithm for supervised learning of
binary classifiers. This algorithm enables neurons to learn and processes elements in the training
set one at a time.
There are two types of Perceptrons: Single layer and Multilayer.

• Single layer - Single layer perceptrons can learn only linearly separable patterns

• Multilayer - Multilayer perceptrons, or feedforward neural networks with two or more layers, have
greater processing power

The Perceptron algorithm learns the weights for the input signals in order to draw a linear
decision boundary.

This enables you to distinguish between the two linearly separable classes +1 and -1.

Note: Supervised Learning is a type of Machine Learning used to learn models from labeled
training data. It enables output prediction for future or unseen data. Let us focus on the
Perceptron Learning Rule in the next section.

Perceptron Learning Rule

Perceptron Learning Rule states that the algorithm would automatically learn the optimal weight
coefficients. The input features are then multiplied with these weights to determine if a neuron
fires or not.
The Perceptron receives multiple input signals, and if the weighted sum of the input signals exceeds a
certain threshold, it outputs a signal; otherwise, it does not return an output. In the context of
supervised learning and classification, this can then be used to predict the class of a sample.
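
One common formulation of the perceptron learning rule adjusts each weight by learning_rate * (target - prediction) * input whenever a sample is misclassified. The sketch below assumes that formulation, 0/1 class labels, zero initial weights, and a learning rate of 1; the function names and the AND training set are illustrative choices, not taken from the text above.

```python
def predict(x, weights, bias):
    """Fire (1) when the weighted sum plus bias is non-negative, else 0."""
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s >= 0 else 0


def train_perceptron(samples, n_inputs, learning_rate=1, max_epochs=50):
    """Process the training samples one at a time until none is misclassified."""
    weights = [0] * n_inputs
    bias = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(x, weights, bias)
            if error != 0:
                mistakes += 1
                for i in range(n_inputs):
                    weights[i] += learning_rate * error * x[i]
                bias += learning_rate * error
        if mistakes == 0:      # every sample is classified correctly
            break
    return weights, bias


# Linearly separable example: the logical AND of two binary inputs.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples, 2)
learned = [predict(x, weights, bias) for x, _ in and_samples]
```

Because AND is linearly separable, the loop stops after a handful of passes; for data that is not linearly separable it would simply run until `max_epochs`.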

Implementation of Perceptron Algorithm for AND Logic Gate with 2-bit Binary Input
In the field of Machine Learning, the Perceptron is a Supervised Learning
Algorithm for binary classifiers. The Perceptron Model implements the
following function:

For a particular choice of the weight vector and bias parameter , the model
predicts output for the corresponding input vector .
AND logical function truth table for 2-bit binary variables, i.e, the input vector
and the corresponding output –

0 0 0

0 1 0

1 0 0

1 1 1

Now for the corresponding weight vector of the input


vector , the associated Perceptron Function can be defined as:

For the implementation, considered weight parameters are


and the bias parameter is
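A minimal sketch of the AND-gate perceptron follows. The weight and bias values used here (w1 = w2 = 1, b = -1.5) and the convention that the step function fires at v ≥ 0 are assumptions chosen for illustration; any values that place the decision boundary between (1, 1) and the other three inputs would work.

```python
import numpy as np

def step(v):
    """Unit step: 1 if v >= 0, else 0 (assumed threshold convention)."""
    return 1 if v >= 0 else 0

def perceptron(x, w, b):
    """Weighted sum of the inputs plus bias, passed through the step function."""
    return step(np.dot(w, x) + b)

# Assumed parameters for illustration only
w = np.array([1.0, 1.0])
b = -1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [perceptron(np.array(x), w, b) for x in inputs]
print(outputs)  # [0, 0, 0, 1]
```

Only the input (1, 1) gives a non-negative weighted sum (1 + 1 - 1.5 = 0.5), so only that row of the truth table produces 1.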

Linear Activation Functions


It is a simple straight-line function which is directly proportional to the
input i.e. the weighted sum of neurons. It has the equation:

f(x) = kx

where k is a constant.

Unlike Binary Step Function, Linear Activation Function can handle
more than one class but it has its own drawbacks.

The problem with Linear Activation Function is that its output is not
confined to a particular range. It has a range of (-∞, ∞). No matter how
many layers the neural network has, the final layer always works as a
linear function of the first layer. This makes the neural network unable
to deal with complex problems.
Another problem is that the gradient is a constant and does not
depend on the input at all. As a result during backpropagation the rate
of change of error is constant. Thus the neural network will not really
improve with constant gradient.
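The claim that stacked linear layers collapse into a single linear function can be checked numerically. The sketch below uses hypothetical random weight matrices W1 and W2 and the activation f(x) = kx; it is an illustration, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first layer: 3 inputs -> 4 units (hypothetical)
W2 = rng.normal(size=(2, 4))   # second layer: 4 units -> 2 outputs (hypothetical)
k = 0.5                        # slope of the linear activation f(x) = kx

def linear(x):
    return k * x

x = rng.normal(size=3)

# Two layers, each followed by the linear activation
two_layer = linear(W2 @ linear(W1 @ x))

# A single equivalent linear map: k^2 * W2 * W1
W_equiv = (k * k) * (W2 @ W1)
one_layer = W_equiv @ x

print(np.allclose(two_layer, one_layer))  # True
```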

Comparison Between Linear Activation Function and Other Activation Functions
Unit 4

Associative memory can be considered as a memory unit whose stored
data can be identified for access by the content of the data itself rather than
by an address or memory location.

Associative memory is often referred to as Content Addressable Memory
(CAM).

When a write operation is performed on associative memory, no address or
memory location is given to the word. The memory itself is capable of finding
an empty unused location to store the word.

On the other hand, when the word is to be read from an associative memory,
the content of the word, or part of the word, is specified. The words which
match the specified content are located by the memory and are marked for
reading.

The following diagram shows the block representation of an Associative
memory.
Auto Associative Memory
This is a single layer neural network in which the input training vector and the
output target vectors are the same. The weights are determined so that the
network stores a set of patterns.

Architecture
As shown in the following figure, the architecture of Auto Associative memory
network has ‘n’ number of input training vectors and similar ‘n’ number of
output target vectors.
Training Algorithm
For training, this network uses the Hebb or Delta learning rule.
Step 1 − Initialize all the weights to zero as wij = 0 (i = 1 to n, j = 1 to n)
Step 2 − Perform steps 3-5 for each input vector.
Step 3 − Activate each input unit as follows −
xi = si (i = 1 to n)
Step 4 − Activate each output unit as follows −
yj = sj (j = 1 to n)
Step 5 − Adjust the weights as follows −
wij(new) = wij(old) + xiyj

Testing Algorithm
Step 1 − Set the weights obtained during training for Hebb’s rule.
Step 2 − Perform steps 3-5 for each input vector.
Step 3 − Set the activation of the input units equal to that of the input vector.
Step 4 − Calculate the net input to each output unit j = 1 to n:
yinj = ∑i xiwij
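The training and testing steps above can be sketched as follows. The bipolar threshold applied to the net input at recall time (+1 for a positive net input, -1 otherwise) and the example pattern are assumptions made for illustration.

```python
import numpy as np

def train_hebb(patterns):
    """Hebb training: start from zero weights, add xi*yj for every stored pattern."""
    n = patterns.shape[1]
    w = np.zeros((n, n))
    for s in patterns:
        x, y = s, s                      # input and target are the same vector
        for i in range(n):
            for j in range(n):
                w[i, j] += x[i] * y[j]   # wij(new) = wij(old) + xi*yj
    return w

def recall(w, x):
    """Net input to each output unit, then a bipolar threshold (assumed)."""
    y_in = x @ w
    return np.where(y_in > 0, 1, -1)

stored = np.array([[1, 1, -1, -1]])      # hypothetical bipolar pattern
w = train_hebb(stored)
noisy = np.array([1, 1, -1, 1])          # last component flipped

print(recall(w, stored[0]))              # recovers the stored pattern
print(recall(w, noisy))                  # also recovers the stored pattern
```

With a single stored pattern, the net input is a positive multiple of that pattern for both the clean and the one-bit-corrupted input, which is why both recalls return it.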
Hetero Associative memory
Similar to Auto Associative Memory network, this is also a single layer neural
network. However, in this network the input training vector and the output
target vectors are not the same. The weights are determined so that the
network stores a set of patterns. Hetero associative network is static in
nature, hence, there would be no non-linear and delay operations.

Architecture
As shown in the following figure, the architecture of Hetero Associative
Memory network has ‘n’ number of input training vectors and ‘m’ number of
output target vectors.

Training Algorithm
For training, this network uses the Hebb or Delta learning rule.
Step 1 − Initialize all the weights to zero as wij = 0 (i = 1 to n, j = 1 to m)
Step 2 − Perform steps 3-5 for each input vector.
Step 3 − Activate each input unit as follows −
xi = si (i = 1 to n)
Step 4 − Activate each output unit as follows −
yj = sj (j = 1 to m)
Step 5 − Adjust the weights as follows −
wij(new) = wij(old) + xiyj

Testing Algorithm
Step 1 − Set the weights obtained during training for Hebb’s rule.
Step 2 − Perform steps 3-5 for each input vector.
Step 3 − Set the activation of the input units equal to that of the input vector.
Step 4 − Calculate the net input to each output unit j = 1 to m:
yinj = ∑i xiwij

Hebb rule for Training and Testing

Training Algorithm
Step 1 − Initialize all the weights to zero as wij = 0 i = 1 to n, j=1 to n
Step 2 − Perform steps 3-5 for each input vector.
Step 3 − Activate each input unit as follows −
xi=si(i=1 to n)
Step 4 − Activate each output unit as follows −
yj=sj(j=1 to n)
Step 5 − Adjust the weights as follows −
wij(new)=wij(old)+xiyj

Testing Algorithm
Step 1 − Set the weights obtained during training for Hebb’s rule.
Step 2 − Perform steps 3-5 for each input vector.
Step 3 − Set the activation of the input units equal to that of the input vector.
Step 4 − Calculate the net input to each output unit j = 1 to n

Example
Outer product rule for training and Testing
The outer product of two coordinate vectors is a matrix. If the two vectors have
dimensions n and m, then their outer product is an n × m matrix. More generally, given
two tensors (multidimensional arrays of numbers), their outer product is a tensor. The outer
product of tensors is also referred to as their tensor product, and can be used to define
the tensor algebra.
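As a small illustration of how the outer product relates to Hebbian weight storage, the sketch below builds an n × m matrix from one pattern pair and then checks that summing outer products over several pairs equals a single matrix product. The pattern values are hypothetical.

```python
import numpy as np

s = np.array([1, -1, 1])         # input pattern, n = 3
t = np.array([1, -1])            # target pattern, m = 2
W = np.outer(s, t)               # n x m matrix with entries s[i] * t[j]
print(W.shape)                   # (3, 2)

# Storing several pairs: the sum of outer products equals S^T T
S = np.array([[1, -1, 1], [-1, 1, 1]])
T = np.array([[1, -1], [-1, 1]])
W_sum = sum(np.outer(s_p, t_p) for s_p, t_p in zip(S, T))
print(np.array_equal(W_sum, S.T @ T))  # True
```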
What Does Storage Capacity Mean?
Storage capacity refers to the specific amount of data storage that a
device or system can accommodate.

In order to precisely represent storage capacity, IT professionals
and others use terms like kilobytes, megabytes and gigabytes. In
the earlier days of computing, storage capacity, or disk space, was
often measured in kilobytes. As new storage media began to
accommodate the storage of digital image and video, megabytes
quickly replaced kilobytes, and gigabytes quickly replaced
megabytes. New storage capacity measurements are often
presented in terms of hundreds of gigabytes.

One major advance in storage capacity has been powered by
something called solid-state design. In more primitive data storage
hard drives, data was encoded into the physical drive on a platter
and read by a read/write head as that platter revolved. Now, many of these
types of hard drives have been replaced by a solid-state storage
system. In solid-state data storage, large amounts of data can be
written on very small storage media through the use of silicon or
similar materials and various chemical elements that provide
charging at a molecular level to encode data. This process is called
doping. Assessing storage capacity is a major part of providing
upgrades to systems. It's also part of looking at the most
fundamental advances in IT manufacturing that will power the next
generation of devices and systems.

Bidirectional Associative Memory

Bidirectional Associative Memory (BAM) is a supervised
learning model in Artificial Neural Network. It is a hetero-associative
memory: for an input pattern, it returns another
pattern which is potentially of a different size. This behaviour is
very similar to the human brain. Human memory is necessarily
associative. It uses a chain of mental associations to recover a lost
memory, like associations of faces with names, or of exam questions
with answers.
In such memory associations for one type of object with another,
a Recurrent Neural Network (RNN) is needed to receive a pattern
of one set of neurons as an input and generate a related, but
different, output pattern of another set of neurons.

Why is BAM required?
The main objective of introducing such a network model is to store
hetero-associative pattern pairs.
This is used to retrieve a pattern given a noisy or incomplete
pattern.

BAM Architecture:
When BAM accepts an input of n-dimensional vector X from
set A then the model recalls m-dimensional vector Y from set B.
Similarly when Y is treated as input, the BAM recalls X.
Algorithm
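A minimal sketch of BAM storage and recall, under stated assumptions: bipolar patterns, a weight matrix built as the sum of outer products of the stored pairs, and a threshold that breaks ties toward +1. The pattern pairs are hypothetical.

```python
import numpy as np

def bipolar(v):
    """Bipolar threshold; ties (net input 0) go to +1, which is an assumption."""
    return np.where(v >= 0, 1, -1)

# Two hypothetical pattern pairs: X from set A (n = 4), Y from set B (m = 2)
X = np.array([[1, 1, -1, -1],
              [1, -1, 1, -1]])
Y = np.array([[1, 1],
              [1, -1]])

W = X.T @ Y                      # n x m weight matrix (sum of outer products)

def recall_forward(x):
    """Given X as input, recall the associated Y."""
    return bipolar(x @ W)

def recall_backward(y):
    """Given Y as input, recall the associated X."""
    return bipolar(W @ y)

print(recall_forward(X[0]), recall_backward(Y[0]))
print(recall_forward(X[1]), recall_backward(Y[1]))
```

The same matrix W is used in both directions: multiplying from the left recalls Y from X, and multiplying from the right recalls X from Y.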
Limitations of BAM:
• Storage capacity of the BAM: In the BAM, the number of stored
associations should not exceed the number of
neurons in the smaller layer.

• Incorrect convergence: BAM may not always produce the closest
association.
Unit 3

What is Machine Learning?
Machine learning refers to the field of study which enables
machines to keep improving their performance without being
explicitly programmed for each task.
Through machine learning, your software and bots can keep learning new
things and give better results.
Those machines require a lot of programming in the beginning. But
once they start the process, they begin to learn different aspects of
the task themselves. As machine learning can help so many
industries, the future scope of machine learning is bright.
Machine learning is an essential branch of AI, and it finds its uses in
multiple sectors, including:
• E-commerce
• Healthcare
• Social Media
• Finance
• Automotive

Types of Machine Learning


Supervised Learning
Supervised learning is when you provide the machine with a lot of
training data to perform a specific task.
For example, to teach a kid the color red, you’d show him a bunch
of red things like an apple or a red ball.
After showing the kid a bunch of red things, you’d then show
him a red thing and ask him what color it is to find out if the kid has
learned it or not.
In supervised learning, you similarly teach the machine.
It is the most accessible type of ML to implement, and it’s also the
most common one.
In the training data, you’d feed the machine with a lot of similar
examples, and the computer will predict the answer. You would
then give feedback to the computer as to whether it made the right
prediction or not.
Example of Supervised Learning
You give the machine the following information:
2,7 = 9
5,6 = 11
9,10 = 19

Now you give the machine the following questions:
9,1 = ?
8,9 = ?
20,4 = ?

Depending on the machine’s answers, you’d give it more training
data or give it more complex problems.
Supervised learning is task-specific, and that’s why it’s quite
common.
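The number example above can be sketched as a tiny supervised learning problem. The choice of a linear model fitted by least squares is an assumption made for illustration; the notes do not say which model the machine uses.

```python
import numpy as np

# Labeled training data from the example: (a, b) -> answer
X_train = np.array([[2, 7], [5, 6], [9, 10]], dtype=float)
y_train = np.array([9, 11, 19], dtype=float)

# Fit answer ≈ w1*a + w2*b by least squares (assumed model)
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Unseen questions from the example
X_test = np.array([[9, 1], [8, 9], [20, 4]], dtype=float)
predictions = X_test @ w

print(np.round(w, 6))
print(np.round(predictions, 6))
```

The fitted weights come out as approximately (1, 1), meaning the model has learned that the answer is the sum of the two inputs, and it applies that to the unseen pairs.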
Unsupervised Learning
As the name suggests, unsupervised learning is the opposite of
supervised learning. In this case, you don’t provide the machine
with any labeled training data.
The machine has to reach conclusions without any labeled data. It’s
a little more challenging to implement than supervised learning.
It is used for clustering data and for finding anomalies.
Following the example we discussed above, suppose you didn’t
show the kid different red-colored things in the beginning.
Instead, you put a bunch of red-colored and green-colored things in
front of him and asked him to separate them.
Unsupervised learning is similar to this example.
Example of Unsupervised Learning
Suppose you have different news articles, and you want them sorted
into different categories. You’d give the articles to the machine, and
it will detect commonalities between them.
It will then divide the articles into different categories according to
the data it finds.
Now, when you give a new article to the machine, it will categorize
it automatically.
Just like other machine learning types, it is also quite popular as it is
data-driven.
Reinforcement Learning
Reinforcement learning is quite different from other types of
machine learning (supervised and unsupervised).
The relation between data and machine is quite different from other
machine learning types as well.
In reinforcement learning, the machine learns from its mistakes. You
give the machine a specific environment in which it can perform a
given set of actions. Now, it will learn by trial and error.
In the example we discussed above, suppose you show the kid an
apple and a banana then ask him which one is red.
If the child answers correctly, you give him candy (or chocolate),
and if the kid gives a wrong answer, you don’t give him any.
In reinforcement learning, the machine learns similarly.
Example of Reinforcement Learning
You give the machine a maze to solve. The machine will attempt to
decipher the maze and make mistakes. Whenever it fails in solving
the maze, it will try again. And with each error, the machine will
learn what to avoid.

By repeating this activity, the machine will keep learning more
information about the maze. By using that information, it will
eventually solve the maze.
Although reinforcement learning is quite challenging to implement,
it finds applications in many industries.
Applications of Different Types of
Machine Learning
Supervised Learning
• Face Recognition – Recognizing faces in images (Facebook and
Google Photos)
• Spam Filter – Identify spam emails by checking their content
Unsupervised Learning
• Recommendation systems – Recommend products to buyers (such
as Amazon)
• Data categorization – Categorize data for better organization
• Customer segmentation – Classify customers into different
categories according to different qualities
Reinforcement Learning
• Manufacturing Industry – Streamline the automated manufacturing
process
• Robotics – Teach machines how to avoid mistakes
• Video Games – Better AI for video game characters and NPCs

Gradient descent (GD) is an iterative first-order
optimisation algorithm used to find a local
minimum/maximum of a given function. This method is
commonly used in machine learning (ML) and deep
learning (DL) to minimise a cost/loss function (e.g. in a linear
regression). Due to its importance and ease of
implementation, this algorithm is usually taught at the
beginning of almost all machine learning courses.

However, its use is not limited to ML/DL; it is also
widely used in areas like:

• control engineering (robotics, chemical, etc.)

• computer games

• mechanical engineering
This method was proposed before the era of modern
computers and there was an intensive development meantime
which led to numerous improved versions of

Function requirements

Gradient descent algorithm does not work for all functions.
There are two specific requirements. A function has to be:

• differentiable

• convex

Gradient Descent Algorithm

Gradient Descent Algorithm iteratively calculates the next
point using gradient at the current position, then scales it (by a
learning rate) and subtracts obtained value from the current
position (makes a step). It subtracts the value because we want
to minimise the function (to maximise it would be adding).
This process can be written as:

p(n+1) = p(n) − η∇f(p(n))

There’s an important parameter η which scales the gradient
and thus controls the step size. In machine learning, it is
called learning rate and has a strong influence on
performance.

• The smaller the learning rate, the longer GD takes to converge, or it may
reach the maximum number of iterations before reaching the optimum
point

• If the learning rate is too big, the algorithm may not converge
to the optimal point (jump around) or may even diverge
completely.

In summary, Gradient Descent method’s steps are:

1. choose a starting point (initialisation)

2. calculate gradient at this point

3. make a scaled step in the opposite direction to the gradient
(objective: minimise)

4. repeat points 2 and 3 until one of the criteria is met:

• maximum number of iterations reached

• step size is smaller than the tolerance.
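The steps above can be sketched for a function of one variable. The function name, default parameter values, and the example function f(x) = (x - 3)² are illustrative assumptions.

```python
def gradient_descent(grad, start, learning_rate=0.1, max_iter=1000, tol=1e-8):
    """Minimise a 1-D function given its gradient `grad`.

    Stops when the maximum number of iterations is reached or when the
    step size falls below the tolerance.
    """
    x = start                              # 1. starting point
    for _ in range(max_iter):
        step = learning_rate * grad(x)     # 2. gradient, scaled by the learning rate
        if abs(step) < tol:                # stopping criterion: tiny step
            break
        x = x - step                       # 3. move against the gradient
    return x

# Hypothetical example: f(x) = (x - 3)^2 has gradient 2(x - 3) and minimum at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(x_min, 4))  # 3.0
```

Replacing the subtraction with an addition would turn this into gradient ascent, which maximises instead of minimising.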


Derivation of Gradient Descent Algorithm
Widrow-Hoff Learning rule
The Widrow-Hoff learning rule is very similar
to the perceptron learning rule. However the origins are
different.
The units with linear activation functions are called linear
units. A network with a single linear unit is called an Adaline
(adaptive linear neuron). That is, in an Adaline, the input-output
relationship is linear. Adaline uses bipolar activation
for its input signals and its target output. The weights
between the input and the output are adjustable. Adaline is a
net which has only one output unit. The Adaline network may
be trained using the delta learning rule. The delta learning
rule may also be called the least mean square
(LMS) rule or Widrow-Hoff rule. This learning rule is found
to minimize the mean-squared error between the activation
and the target value.

Delta Learning rule


▪ The perceptron learning rule originates from the Hebbian
assumption while the delta rule is derived from the
gradient-descent method (it can be generalised to more
than one layer).
▪ The delta rule updates the weights between the connections
so as to minimize the difference between the net input to
the output unit and the target value.
▪ The major aim is to minimize all errors over all training
patterns. This is done by reducing the error for each
pattern, one at a time.
▪ The delta rule for adjusting the weight of ith pattern (i = 1
to n) is
Δwi = α(t − yin)xi
where α is the learning rate, t is the target value, yin is the net
input to the output unit and xi is the ith input.
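A minimal sketch of training a single linear unit with the delta (LMS) rule follows. It assumes the usual form of the update, in which each weight changes by the learning rate times the difference between target and net input times the corresponding input; the learning rate, epoch count, and bipolar AND data are illustrative choices.

```python
import numpy as np

def train_adaline(X, t, alpha=0.1, epochs=50):
    """Delta rule: reduce the error for each pattern, one at a time."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, t):
            y_in = w @ x + b           # net input to the output unit
            err = target - y_in        # difference between target and net input
            w += alpha * err * x       # weight update
            b += alpha * err           # bias treated as a weight on a constant input 1
    return w, b

# Bipolar AND function (hypothetical training set)
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
t = np.array([1, -1, -1, -1], dtype=float)

w, b = train_adaline(X, t)
pred = np.where(X @ w + b >= 0, 1, -1)   # threshold the linear output to classify
print(np.round(w, 2), round(b, 2))
print(pred)
```

The weights settle near the least-squares solution (about 0.5, 0.5 with bias about -0.5), and thresholding the linear output reproduces the AND targets.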
Hebbian Learning

Donald O. Hebb proposed a mechanism to update weights
between neurons in a neural network. This method of weight
updating enabled neurons to learn and was named
Hebbian Learning.

Hebbian Learning is inspired by the biological neural weight
adjustment mechanism. It describes how a neuron that is
initially unable to learn can develop
cognition in response to external stimuli. These concepts
are still the basis for neural learning today.

Three major points were stated as a part of this learning
mechanism:

• Information is stored in the connections between
neurons in neural networks, in the form of weights.

• Weight change between neurons is proportional to the
product of activation values for neurons.

• As learning takes place, simultaneous or repeated
activation of weakly connected neurons incrementally
changes the strength and pattern of weights, leading to
stronger connections.

Implementation of Hebbian Learning in a Perceptron
In the 1950s, Frank Rosenblatt inferred that a threshold neuron
cannot be used for modeling cognition as it cannot learn or
adapt from the environment or develop capabilities for
classification, recognition or similar capabilities.

A perceptron draws inspiration from a biological visual
neural model with three layers illustrated as follows:

• Input Layer is synonymous to sensory cells in the retina,
with random connections to neurons of the succeeding
layer.

• Association layers have threshold neurons with
bi-directional connections to the response layer.

• Response layer has threshold neurons that are
interconnected with each other for competitive inhibitory
signaling.

Response layer neurons compete with each other by sending
inhibitory signals to produce output. Threshold functions are
set at the origin for the association and response layers. This
forms the basis of learning between these layers. The goal of
the perceptron is to activate correct response neurons for
each input pattern.

Competitive Learning in ANN or Winner-takes-all
It is concerned with unsupervised training in which the output nodes try to
compete with each other to represent the input pattern. To understand this
learning rule we will have to understand competitive net which is explained as
follows −
Basic Concept of Competitive Network
This network is just like a single layer feed-forward network having feedback
connection between the outputs. The connections between the outputs are
inhibitory type, which is shown by dotted lines, which means the competitors
never support themselves.

Basic Concept of Competitive Learning Rule
As said earlier, there would be competition among the output nodes so the
main concept is - during training, the output unit that has the highest
activation to a given input pattern, will be declared the winner. This rule is
also called Winner-takes-all because only the winning neuron is updated and
the rest of the neurons are left unchanged.
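One winner-takes-all training step can be sketched as below. The notes only state that the winner alone is updated; the specific update used here (moving the winner's weight vector a fraction of the way toward the input) is a common choice and is an assumption, as are the example weights and input.

```python
import numpy as np

def competitive_step(W, x, lr=0.5):
    """Pick the output unit with the highest activation and update only that unit."""
    activations = W @ x
    winner = int(np.argmax(activations))
    W[winner] += lr * (x - W[winner])    # assumed update: move winner toward the input
    return winner

# Two output units, each with a 2-D weight vector (hypothetical values)
W = np.array([[1.0, 0.0],
              [0.0, 1.0]])
x = np.array([0.9, 0.1])

before = W.copy()
winner = competitive_step(W, x)
print(winner)   # 0
print(W)        # row 0 moved toward x, row 1 unchanged
```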
What is Backpropagation?
Backpropagation is the essence of neural network training. It is the method
of fine-tuning the weights of a neural network based on the error rate
obtained in the previous epoch (i.e., iteration). Proper tuning of the weights
allows you to reduce error rates and make the model reliable by increasing its
generalization.
Backpropagation in neural network is a short form for “backward
propagation of errors.” It is a standard method of training artificial neural
networks. This method helps calculate the gradient of a loss function with
respect to all the weights in the network.

How Backpropagation Algorithm Works or Architecture of Backpropagation network
The Back propagation algorithm in neural network computes the gradient of
the loss function for a single weight by the chain rule. It efficiently computes
one layer at a time, unlike a naive direct computation. It computes the
gradient, but it does not define how the gradient is used. It generalizes the
computation in the delta rule.

Consider the following Back propagation neural network example diagram to
understand:

1. Inputs X arrive through the preconnected path.
2. Input is modeled using real weights W. The weights are usually
randomly selected.
3. Calculate the output for every neuron from the input layer, to the
hidden layers, to the output layer.
4. Calculate the error in the outputs:
Error = Actual Output – Desired Output
5. Travel back from the output layer to the hidden layer to adjust the
weights such that the error is decreased.
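The forward and backward passes described above can be sketched for a tiny network with one hidden layer. The layer sizes, sigmoid activations, squared-error loss, learning rate, and random initial weights are all assumptions made for illustration. The finite-difference value at the end is only a sanity check that the chain-rule gradient is computed correctly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    h = sigmoid(W1 @ x + b1)             # hidden layer output
    y = sigmoid(W2 @ h + b2)             # network output
    return h, y

def loss(params, x, target):
    _, y = forward(params, x)
    return 0.5 * np.sum((y - target) ** 2)

def backward(params, x, target):
    """Gradients of the loss w.r.t. W1, b1, W2, b2 via the chain rule."""
    W1, b1, W2, b2 = params
    h, y = forward(params, x)
    delta2 = (y - target) * y * (1 - y)       # error term at the output layer
    delta1 = (W2.T @ delta2) * h * (1 - h)    # error propagated back to the hidden layer
    return [np.outer(delta1, x), delta1, np.outer(delta2, h), delta2]

rng = np.random.default_rng(0)
params = [rng.normal(size=(3, 2)), np.zeros(3), rng.normal(size=(1, 3)), np.zeros(1)]
x, target = np.array([0.5, -1.0]), np.array([1.0])

grads = backward(params, x, target)

# Sanity check of one gradient entry with a central finite difference
eps = 1e-6
plus = [p.copy() for p in params]
minus = [p.copy() for p in params]
plus[0][0, 0] += eps
minus[0][0, 0] -= eps
numeric = (loss(plus, x, target) - loss(minus, x, target)) / (2 * eps)
print(abs(numeric - grads[0][0, 0]) < 1e-6)

# One weight adjustment step against the gradient
new_params = [p - 0.1 * g for p, g in zip(params, grads)]
print(loss(params, x, target), loss(new_params, x, target))
```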

Types of Backpropagation Networks
Two Types of Backpropagation Networks are:
• Static Back-propagation
• Recurrent Backpropagation

Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a
static input for static output. It is useful to solve static classification issues
like optical character recognition.

Recurrent Backpropagation:
Recurrent Back propagation in data mining is fed forward until a fixed value
is achieved. After that, the error is computed and propagated backward.

The main difference between these two methods is that the mapping is
rapid in static back-propagation while it is nonstatic in recurrent
backpropagation.

Backpropagation Key Points
• Simplifies the network structure by removing weighted links that have
the least effect on the trained network
• You need to study a group of input and activation values to develop the
relationship between the input and hidden unit layers.
• It helps to assess the impact that a given input variable has on a
network output. The knowledge gained from this analysis should be
represented in rules.
• Backpropagation is especially useful for deep neural networks working
on error-prone projects, such as image or speech recognition.
• Backpropagation takes advantage of the chain and power rules, which allows
it to function with any number of outputs.

Disadvantages of using Backpropagation
• The actual performance of backpropagation on a specific problem is
dependent on the input data.
• Back propagation algorithm in data mining can be quite sensitive to
noisy data
• You need to use the matrix-based approach for backpropagation instead
of mini-batch.
Activation Functions in Neural Networks Explained
Introduction
Activation functions are mathematical equations that determine the output of a neural
network model. Activation functions also have a major effect on the neural network’s ability
to converge and the convergence speed, or in some cases, activation functions might prevent
neural networks from converging in the first place. Activation function also helps to
normalize the output of any input to the range between -1 and 1 or between 0 and 1.

Activation function must be efficient and it should reduce the computation time because the
neural network is sometimes trained on millions of data points.

Let’s consider the simple neural network model without any hidden layers.

Here is the output-

Y = ∑ (weights*input) + bias

and it can range from -infinity to +infinity. So it is necessary to bound the output to get the
desired prediction or generalized results.

Y = Activation function(∑ (weights*input) + bias)

So the activation function is an important part of an artificial neural network. It decides
whether a neuron should be activated or not, and it is a non-linear transformation that can be
done on the input before sending it to the next layer of neurons or finalizing the output.

Properties of activation functions


1. Non Linearity
2. Continuously differentiable
3. Range
4. Monotonic
5. Approximates identity near the origin

Types of Activation Functions
The activation function can be broadly classified into 3 categories.

1. Binary Step Function
2. Linear Activation Function
3. Non-Linear Activation Functions

Binary Step Function
A binary step function is generally used in the Perceptron linear classifier. It thresholds the
input values to 1 and 0, if they are greater or less than zero, respectively.
The step function is mainly used in binary classification problems and works well for
linearly separable problems. It can’t classify the multi-class problems.


Linear Activation Function

The equation for Linear activation function is:

f(x) = a.x

When a = 1 then f(x) = x and this is a special case known as identity.

Properties:
1. Range is -infinity to +infinity
2. Provides a convex error surface so optimisation can be achieved faster
3. df(x)/dx = a which is constant. So cannot be optimised with gradient descent

Limitations:
1. Since the derivative is constant, the gradient has no relation with input
2. The update computed during back propagation is constant, since the gradient does not change with the input x

Non-Linear Activation Functions
Modern neural network models use non-linear activation functions. They allow the model to
create complex mappings between the network’s inputs and outputs, such as images, video,
audio, and data sets that are non-linear or have high dimensionality.

There are 3 major types of Non-Linear Activation functions.

1. Sigmoid Activation Functions
2. Rectified Linear Units or ReLU
3. Complex Nonlinear Activation Functions

Sigmoid Activation Functions
Sigmoid functions are bounded, differentiable, real functions that are defined for all real
input values, and have a non-negative derivative at each point.

Sigmoid or Logistic Activation Function
The sigmoid function is a logistic function and the output ranges between 0 and 1.
The output of the activation function is always going to be in range (0,1) compared to (-inf,
inf) of linear function. It is non-linear, continuously differentiable, monotonic, and has a
fixed output range. But it is not zero centred.

Hyperbolic Tangent
The function produces outputs in the range (-1, 1) and it is a continuous function. In other
words, function produces output for every x value.

Y = tanh(x)
tanh(x) = (e^x – e^-x) / (e^x + e^-x)
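A short sketch of the logistic sigmoid and tanh follows. It assumes the standard logistic form 1 / (1 + e^-x) for the sigmoid, and it also checks the known identity tanh(x) = 2·sigmoid(2x) − 1, which shows that tanh is a rescaled, zero-centred sigmoid.

```python
import numpy as np

def sigmoid(x):
    """Logistic function, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent written out from exponentials, output in (-1, 1)."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.linspace(-5, 5, 11)
print(sigmoid(0.0))                                   # 0.5
print(np.allclose(tanh(x), np.tanh(x)))               # True
print(np.allclose(tanh(x), 2 * sigmoid(2 * x) - 1))   # True
```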
Inverse Tangent (arctan)
It is similar to sigmoid and tanh but the output ranges over (-pi/2, pi/2)

Softmax
The softmax function is sometimes called the soft argmax function, or multi-class logistic
regression. This is because the softmax is a generalization of logistic regression that can be
used for multi-class classification, and its formula is very similar to the sigmoid function
which is used for logistic regression. The softmax function can be used in a classifier only
when the classes are mutually exclusive.
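A minimal softmax sketch follows, assuming the standard definition (exponentiate each score and divide by the sum). Subtracting the maximum score first is a common numerical-stability trick and does not change the result. The example scores are hypothetical.

```python
import numpy as np

def softmax(z):
    """Convert a vector of scores into probabilities that sum to 1."""
    shifted = z - np.max(z)          # avoids overflow in exp; result is unchanged
    exp = np.exp(shifted)
    return exp / np.sum(exp)

scores = np.array([2.0, 1.0, 0.1])   # hypothetical class scores
probs = softmax(scores)
print(probs, probs.sum())

# With two classes and scores (z, 0), softmax reduces to the sigmoid of z
z = 0.7
two_class = softmax(np.array([z, 0.0]))[0]
print(np.isclose(two_class, 1.0 / (1.0 + np.exp(-z))))  # True
```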

Gudermannian
The Gudermannian function relates circular functions and hyperbolic functions without
explicitly using complex numbers.

The below is the mathematical equation for Gudermannian function:

gd(x) = ∫ from 0 to x of sech(t) dt = 2 arctan(tanh(x/2))

GELU (Gaussian Error Linear Units)
An activation function used in Transformers such as Google’s BERT and
OpenAI’s GPT-2. This activation function takes the form of this equation:

GELU(x) = 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))

So it’s just a combination of some functions (e.g. hyperbolic tangent tanh) and approximated
numbers.

It has a negative coefficient, which shifts to a positive coefficient. So when x is greater than
zero, the output will be x, except from when x=0 to x=1, where it slightly leans to a smaller
y-value.
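The tanh-based GELU formula can be compared against the exact definition x·Φ(x), where Φ is the standard normal cumulative distribution function. The exact form via the error function is a known definition of GELU that is not stated in the notes above, and the sample points are arbitrary.

```python
import math

def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def gelu_exact(x):
    """Exact GELU: x times the standard normal CDF, written with erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

points = (-2.0, -0.5, 0.0, 0.5, 2.0)
for x in points:
    print(x, round(gelu_tanh(x), 4), round(gelu_exact(x), 4))

max_gap = max(abs(gelu_tanh(x) - gelu_exact(x)) for x in points)
print(max_gap < 1e-2)
```

For large positive x both versions approach x, and for large negative x both approach 0, with a small dip below zero for moderately negative inputs.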


Problems with Sigmoid Activation Functions

1. Vanishing Gradients Problem


The main problem with deep neural networks is that the gradient diminishes dramatically as
it is propagated backward through the network. The error may be so small by the time it
reaches layers close to the input of the model that it may have very little effect. As such, this
problem is referred to as the “vanishing gradients” problem.
A small gradient means that the weights and biases of the initial layers will not be updated
effectively with each training session. Since these initial layers are often crucial to
recognizing the core elements of the input data, it can lead to overall inaccuracy of the whole
network.

2. Exploding Gradients
Exploding gradients are a problem where large error gradients accumulate and result in very
large updates to neural network model weights during training. These large updates in turn
result in an unstable network. At an extreme, the values of weights can become so large as
to overflow and result in NaN values.

Rectified Linear Units or ReLU


The sigmoid and hyperbolic tangent activation functions are difficult to use in networks with
many layers due to the vanishing gradient problem. The rectified linear activation function
mitigates the vanishing gradient problem, allowing models to learn faster and perform
better. The rectified linear activation is the default activation when developing multilayer
Perceptron and convolutional neural networks.

Rectified Linear Units (ReLU)
ReLU is the most commonly used activation function in neural networks. The
mathematical equation for ReLU is:

ReLU(x) = max(0,x)
So if the input is negative, the output of ReLU is 0 and for positive values, it is x.

Though it looks like a linear function, it’s not. ReLU has a derivative (everywhere except at
x = 0) and allows for backpropagation.

There is one problem with ReLU. Suppose most of the input values to a neuron are negative or 0:
the ReLU produces the output 0, its gradient is also 0, and the weights of that neuron stop
being updated during back propagation. This is called the Dying ReLU problem. Also, ReLU is
an unbounded function, which means there is no maximum value.

Pros:
1. Less time and space complexity
2. Avoids the vanishing gradient problem.
Cons:
1. Introduces the dead ReLU problem.
2. Does not avoid the exploding gradient problem.
Leaky ReLU
The dying ReLU problem is likely to occur when:

1. Learning rate is too high
2. There is a large negative bias
Leaky ReLU is the most common and effective method to solve a dying ReLU problem. It
adds a slight slope in the negative range to prevent the dying ReLU issue.

Again this doesn’t solve the exploding gradient problem.

Parametric ReLU
PReLU is actually not so different from Leaky ReLU. The difference is that the slope
for negative inputs, alpha, is a parameter learned during training:

PReLU(x) = x if x > 0, alpha·x otherwise

So for negative values of x, the output of PReLU is alpha times x and for positive values, it
is x.
Parametric ReLU is the most common and effective method to solve a dying ReLU problem
but again it doesn’t solve exploding gradient problem.
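ReLU, Leaky ReLU, and PReLU can be sketched side by side. The Leaky ReLU slope of 0.01 and the PReLU alpha of 0.25 are illustrative values; in PReLU the alpha would normally be learned during training rather than fixed.

```python
import numpy as np

def relu(x):
    """max(0, x): negative inputs become 0."""
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    """Small fixed slope for negative inputs (0.01 is an assumed value)."""
    return np.where(x > 0, x, slope * x)

def prelu(x, alpha):
    """Same shape as Leaky ReLU, but alpha is a learnable parameter."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))                # negatives clipped to 0
print(leaky_relu(x))          # negatives scaled by 0.01
print(prelu(x, alpha=0.25))   # negatives scaled by alpha
```

Because the negative side has a non-zero slope in the last two functions, a neuron with negative net input still receives a gradient, which is how they address the dying ReLU problem.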

Exponential Linear Unit (ELU)
ELU speeds up the learning in neural networks and leads to higher classification accuracies,
and it mitigates the vanishing gradient problem. ELUs have improved learning characteristics
compared to the other activation functions. ELUs have negative values that allow them to
push mean unit activations closer to zero like batch normalization but with lower
computational complexity.

The mathematical expression for ELU is:

ELU(x) = x if x > 0, α(e^x − 1) if x ≤ 0

ELU is designed to combine the good parts of ReLU and leaky ReLU and it doesn’t have the
dying ReLU problem. It saturates for large negative values, allowing them to be essentially
inactive.
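A minimal ELU sketch follows, assuming the standard definition (identity for positive inputs, α(e^x − 1) otherwise) with α = 1 as an illustrative default.

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    # exp is only evaluated on the non-positive part to avoid overflow warnings
    negative_part = alpha * (np.exp(np.minimum(x, 0.0)) - 1.0)
    return np.where(x > 0, x, negative_part)

x = np.array([-50.0, -1.0, 0.0, 2.0])
print(elu(x))   # saturates near -alpha for large negative inputs
```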

Scaled Exponential Linear Unit (SELU)


SELU incorporates normalization based on the central limit theorem. SELU is a
monotonically increasing function, where it has an approximately constant negative output
for large negative input. SELUs are most commonly used in Self Normalizing Networks
(SNN).
The output of a SELU is normalized, which could be called internal normalization, so
the outputs tend toward a mean of zero and standard deviation of one. The main
advantage of SELU is that it is designed to avoid both the vanishing and exploding gradient
problems. Since it is a relatively new activation function, it requires more testing before usage.

Softplus or SmoothReLU
The derivative of the softplus function is the logistic function.

The mathematical expression is:

f(x) = ln(1 + e^x)

And the derivative of softplus is:

f'(x) = 1 / (1 + e^-x)
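Taking softplus in its standard form f(x) = ln(1 + e^x), the statement that its derivative is the logistic function can be checked numerically with a central finite difference. The step size and sample points below are arbitrary choices.

```python
import numpy as np

def softplus(x):
    """ln(1 + e^x), computed in a numerically stable way."""
    return np.logaddexp(0.0, x)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)
h = 1e-5
numeric_derivative = (softplus(x + h) - softplus(x - h)) / (2 * h)

print(np.allclose(numeric_derivative, logistic(x), atol=1e-8))  # True
```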


Swish function
The Swish function was developed by Google, and it is reported to give better performance at a
similar level of computational efficiency to the ReLU function. ReLU still plays an important role in
deep learning studies even today. But experiments show that this new activation function
outperforms ReLU for deeper networks.

The mathematical expression for Swish Function is:

f(x) = x · sigmoid(x)

The modified version of swish function is:

f(x) = x · sigmoid(βx)

Here, β is a parameter that must be tuned. If β gets closer to ∞, then the function looks like
ReLU. Authors of the Swish function proposed to assign β as 1 for reinforcement learning
tasks.
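A short sketch of Swish, assuming the form x · sigmoid(βx), illustrates the statement that the function approaches ReLU as β grows. The value β = 50 and the sample points are arbitrary.

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x)."""
    return x / (1.0 + np.exp(-beta * x))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
relu = np.maximum(0.0, x)

print(swish(x))                                       # beta = 1
gap = np.max(np.abs(swish(x, beta=50.0) - relu))      # large beta: close to ReLU
print(gap < 1e-6)
```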
