DL Unit 2


Unit 2: Artificial Neural Networks

Prof. Sachin S. Patil

D. Y. Patil University, Ambi, Pune
Prof.Sachin Sambhaji Patil 1
The Perceptron

• The Perceptron was introduced by Frank Rosenblatt in 1957.

• He proposed a Perceptron learning rule based on the original McCulloch-Pitts
(MCP) neuron.

• A Perceptron is an algorithm for supervised learning of binary classifiers.

• This algorithm enables neurons to learn, and it processes elements in the
training set one at a time.


Basic Components of Perceptron
• Input Layer: The input layer consists of one or more input neurons, which
receive input signals from the external world or from other layers of the neural
network.

• Weights: Each input neuron is associated with a weight, which represents the
strength of the connection between the input neuron and the output neuron.

• Bias: A bias term is added to the weighted sum of the inputs. It gives the
perceptron additional flexibility by shifting the decision boundary away from
the origin.

• Activation Function: The activation function determines the output of the
perceptron based on the weighted sum of the inputs and the bias term.
Common activation functions include the step function (used by the classic
perceptron), the sigmoid function, and the ReLU function.

• Output: The output of the perceptron is a single binary value, either 0 or 1,
which indicates the class or category to which the input data belongs.

• Training Algorithm: The perceptron is typically trained using a supervised
learning algorithm such as the perceptron learning rule; multilayer networks
built from such units are trained with backpropagation. During training, the
weights and bias of the perceptron are adjusted to reduce the error between
the predicted output and the true output for a given set of training examples.

• Overall, the perceptron is a simple yet powerful algorithm that can be used to
perform binary classification tasks and has paved the way for more complex
neural networks used in deep learning today.
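The components above can be tied together in a short sketch of the perceptron learning rule. This is a minimal illustration, not the only way to train a perceptron: the logical-AND data set, the learning rate of 1.0, the epoch count and the function names `step` and `train_perceptron` are assumptions chosen for the example.

```python
import numpy as np

def step(z):
    # Step activation: 1 if the weighted sum is positive, otherwise 0
    return np.where(z > 0, 1, 0)

def train_perceptron(X, y, lr=1.0, epochs=20):
    """Perceptron learning rule (lr and epochs are illustrative choices)."""
    w = np.zeros(X.shape[1])   # one weight per input
    b = 0.0                    # bias term
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = step(np.dot(w, xi) + b)
            error = target - pred        # -1, 0 or +1
            w += lr * error * xi         # adjust weights only when the prediction is wrong
            b += lr * error
    return w, b

# Toy data (assumed for illustration): logical AND, which is linearly separable
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w, b = train_perceptron(X, y)
predictions = step(X @ w + b)
print(w, b, predictions)
```

Because AND is linearly separable, the rule stops changing the weights once every example is classified correctly.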
Biological Neuron


• A human brain has billions of neurons.

• Neurons are interconnected nerve cells in the human brain that are
involved in processing and transmitting chemical and electrical
signals.

• Dendrites are branches that receive information from other neurons.

• The cell body (soma), which contains the nucleus, processes the information
received from dendrites.

• The axon is a cable that neurons use to send information.

• A synapse is the connection between an axon and the dendrites of another
neuron.


What is Artificial Neuron
• An artificial neuron is a mathematical function based on a model of
biological neurons, where each neuron takes inputs, weights them
separately, sums them up and passes this sum through a nonlinear
function to produce output.



Compare the biological neuron with the artificial neuron.

Biological Neuron        Artificial Neuron
Cell body (soma)         Node
Dendrites                Input
Synapse                  Weights or interconnections
Axon                     Output
Artificial Neuron
• A neuron is a mathematical function modeled on the working of biological
neurons

• It is an elementary unit in an artificial neural network

• One or more inputs are separately weighted

• Inputs are summed and passed through a nonlinear function to produce the
output

• Every neuron holds an internal state called the activation signal

• Each connection link carries information about the input signal

• Every neuron is connected to other neurons via connection links

Types of Perceptron:
• Single layer: Single layer perceptron can learn only linearly separable
patterns.

• Multilayer: Multilayer perceptrons have two or more layers, which gives them
greater processing power.

• The Perceptron algorithm learns the weights for the input signals in order
to draw a linear decision boundary.

• Supervised learning is a type of machine learning used to learn models
from labeled training data. It enables output prediction for future or
unseen data.


How Does Perceptron Work?

• The perceptron is considered a single-layer neural link with four main
parameters: input values, weights and bias, net sum, and an activation function.

• The perceptron model begins by multiplying all input values by their
weights, then adds these products to create the weighted sum.

• This weighted sum is then passed to the activation function ‘f’ to obtain
the desired output.

• In the perceptron, this activation function is a step function and is
represented by ‘f’.

• The step (activation) function is vital in ensuring that the output is
mapped to a fixed range such as (0,1) or (-1,1).

• Note that the weight of an input indicates the strength of that input.
Similarly, the bias value gives the ability to shift the activation function
curve up or down.

• Step 1: Multiply all input values by their corresponding weight values and
then add them to calculate the weighted sum. For m inputs, the mathematical
expression is:

• $\sum_{i=1}^{m} w_i x_i = x_1 w_1 + x_2 w_2 + x_3 w_3 + \dots + x_m w_m$

• Add a term called bias ‘b’ to this weighted sum to improve the model’s
performance.

• Step 2: An activation function is applied to the above-mentioned weighted
sum, giving us an output either in binary form or as a continuous value, as
follows:

• $Y = f\left(\sum_{i=1}^{m} w_i x_i + b\right)$
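The two steps can be sketched in a few lines of Python. The input values, weights and bias below are hypothetical numbers picked only to make the arithmetic easy to follow, and `perceptron_output` is an illustrative name.

```python
import numpy as np

def perceptron_output(x, w, b):
    # Step 1: weighted sum of the inputs plus the bias
    weighted_sum = np.dot(w, x) + b
    # Step 2: step activation turns the sum into a binary output
    return 1 if weighted_sum > 0 else 0

x = np.array([1.0, 0.0, 1.0])     # hypothetical input values
w = np.array([0.5, -0.6, 0.2])    # hypothetical weights
b = -0.4                          # hypothetical bias

# Weighted sum: 0.5*1 + (-0.6)*0 + 0.2*1 = 0.7; adding the bias gives 0.3 > 0
y = perceptron_output(x, w, b)
print(y)
```

With all inputs set to zero, only the bias remains, so the same weights give an output of 0.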
Types of Perceptron models
• Single Layer Perceptron model: This is one of the simplest ANN (Artificial
Neural Network) types. It consists of a feed-forward network and includes a
threshold transfer function inside the model. The main objective of the
single-layer perceptron model is to classify linearly separable objects with
binary outcomes. A single-layer perceptron can learn only linearly separable
patterns.
• Multi-Layered Perceptron model: It is similar to a single-layer
perceptron model but has one or more hidden layers.
• Forward Stage: In the forward stage, activations start at the input layer
and terminate at the output layer.
• Backward Stage: In the backward stage, weight and bias values are modified
per the model’s requirement. The error between the actual output and the
desired output is propagated backward, starting from the output layer. A
multilayer perceptron model has greater processing power and can process
linear and non-linear patterns. Further, it can also implement logic gates
such as AND, OR, XOR, XNOR, and NOR.
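To make the XOR claim concrete, here is a minimal sketch of a two-layer perceptron with hand-picked (not learned) weights. The specific weight values and the decomposition XOR = AND(OR, NAND) are assumptions chosen for illustration; a trained network may find different weights.

```python
import numpy as np

def step(z):
    # Step activation applied element-wise
    return (z > 0).astype(int)

# Hidden layer: unit 0 computes OR, unit 1 computes NAND (hand-picked weights)
W_hidden = np.array([[1.0, 1.0],
                     [-1.0, -1.0]])
b_hidden = np.array([-0.5, 1.5])

# Output unit computes AND of the two hidden units
w_out = np.array([1.0, 1.0])
b_out = -1.5

def xor_mlp(x):
    h = step(W_hidden @ x + b_hidden)      # forward stage: input -> hidden
    return int(w_out @ h + b_out > 0)      # hidden -> output

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [xor_mlp(np.array(p, dtype=float)) for p in inputs]
print(outputs)
```

No single-layer perceptron can produce this truth table, because XOR is not linearly separable; the hidden layer is what makes it possible.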
Perceptron models
• Advantages:

• A multi-layered perceptron model can solve complex non-linear problems.

• It works well with both small and large input data.

• Helps us to obtain quick predictions after the training.

• Helps us obtain the same accuracy ratio with big and small data.

• Disadvantages:

• In a multi-layered perceptron model, computations are time-consuming and
complex.

• It is tough to predict how much each independent variable affects the
dependent variable.

• The model’s functioning depends on the quality of training.


Characteristics of the Perceptron Model
• It is a machine learning algorithm that uses supervised learning of binary classifiers.

• In Perceptron, the weight coefficient is automatically learned.

• Initially, weights are multiplied with input features, and then the decision is made whether
the neuron is fired or not.

• The activation function applies a step rule to check whether the weighted sum is greater
than zero.

• The linear decision boundary is drawn, enabling the distinction between the two linearly
separable classes +1 and -1.

• If the weighted sum of all input values is more than the threshold value, the neuron
produces an output signal; otherwise, no output is shown.
Limitation of Perceptron Model

• The output of a perceptron can only be a binary number (0 or 1) due to
the hard-edge transfer function.

• It can only be used to classify linearly separable sets of input vectors.
If the input vectors are not linearly separable, it cannot classify them all
correctly.


Perceptron Function
• A Perceptron is a function that maps its input “x”, multiplied by the
learned weight coefficients, to an output value “f(x)”:

$$f(x) = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{otherwise} \end{cases} \qquad \text{where } w \cdot x = \sum_{i=1}^{m} w_i x_i$$

In the equation given above:

“w” = vector of real-valued weights
“b” = bias (an element that adjusts the boundary away from origin without
any dependence on the input value)
“x” = vector of input x values
“m” = number of inputs to the Perceptron

The output can be represented as “1” or “0.” It can also be represented as “1” or “-1”
depending on which activation function is used.

https://www.simplilearn.com/tutorials/deep-learning-tutorial/perceptron



The Architecture of the Multilayer Feed-Forward Neural Network:

• This Neural Network or Artificial Neural Network has multiple hidden layers,
which makes it a multilayer neural network. It is feed-forward because signals
flow in one direction only, from the input layer through the hidden layers to
the output layer. In this network there are the following layers:

1. Input Layer

2. Hidden Layer

3. Output Layer

• Input Layer: It is the starting layer of the network. It receives the input
signals, each of which has a weight associated with it.

• Hidden Layer: This layer lies after the input layer and contains multiple
neurons that perform all computations and pass the result to the output
unit.

• Output Layer: It is a layer that contains output units or neurons and receives
processed data from the hidden layer. If there are further hidden layers
connected, each hidden layer passes its weighted output to the next hidden
layer for further processing to get the desired result.

• The input and hidden layers use sigmoid and linear activation functions,
whereas the output layer uses a step activation function at its nodes, whose
two-valued output helps in predicting results as per requirements.

• All units, also known as neurons, have weights. The calculation at each
hidden unit is the dot product of its weights and incoming signals (a weighted
sum), followed by the sigmoid function of the calculated sum.

• Adding hidden layers can increase the accuracy of the output.
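A forward pass through such a network can be sketched as follows. The layer sizes (2 inputs, 3 hidden units, 1 output), the random weights and the names `forward`, `W1`, `b1`, `W2`, `b2` are assumptions for illustration only; they follow the description above of sigmoid hidden units and a step output unit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    # Hidden layer: weighted sum of the signals, then the sigmoid function
    hidden = sigmoid(W1 @ x + b1)
    # Output layer: weighted sum of hidden activations, then a step function
    score = W2 @ hidden + b2
    return (score > 0).astype(int), hidden

rng = np.random.default_rng(0)            # fixed seed so the run is repeatable
W1 = rng.normal(size=(3, 2)); b1 = np.zeros(3)   # input -> hidden
W2 = rng.normal(size=(1, 3)); b2 = np.zeros(1)   # hidden -> output

x = np.array([0.5, -1.2])                 # hypothetical input signals
output, hidden = forward(x, W1, b1, W2, b2)
print(hidden, output)
```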
What is a neural network

• A neural network is a method in artificial intelligence that teaches
computers to process data in a way that is inspired by the human brain.

• It is a type of machine learning process, called deep learning, that uses
interconnected nodes or neurons in a layered structure that resembles the
human brain.

Neural network
A neural network is a series of algorithms that endeavors to recognize
underlying relationships in a set of data through a process that mimics the
way the human brain operates. In this sense, neural networks refer to systems
of neurons, either organic or artificial in nature.


Backpropagation and Forward Propagation
• Backward propagation is the process of moving from right (output layer) to
left (input layer).

• Forward propagation is the way data moves from left (input layer) to right
(output layer) in the neural network.

• A neural network can be understood as a collection of connected
input/output nodes.

• The accuracy of the network is expressed by a loss function or error rate.
Backpropagation calculates the slope (gradient) of the loss function with
respect to the weights in the neural network.

To train a neural network, there are 2 passes (phases):
1. Forward
2. Backward

The process of propagating the inputs from the input layer to the output layer
is called forward propagation. Once the network error is calculated, the
forward propagation phase has ended, and the backward pass starts.
Forward and backward passes in Neural Networks

• The forward and backward phases are repeated for a number of epochs. In each
epoch, the following occurs:

• The inputs are propagated from the input to the output layer.

• The network error is calculated.

• The error is propagated from the output layer to the input layer.

• In the forward pass, we start by propagating the data inputs to the input layer,
go through the hidden layer(s), measure the network’s predictions from the
output layer, and finally calculate the network error based on the predictions
the network made.

• This network error measures how far the network is from making the correct
prediction. For example, if the correct output is 4 and the network’s prediction
is 1.3, then the absolute error of the network is 4-1.3=2.7.



How backpropagation algorithm works
• How the algorithm works is best explained based on a simple network, like the
one given in the next figure. It only has an input layer with 2 inputs (X1 and X2),
and an output layer with 1 output. There are no hidden layers.

• The weights of the inputs are W1 and W2, respectively. The bias is treated as a
new input neuron to the output neuron which has a fixed value +1 and a
weight b. Both the weights and biases could be referred to as parameters.

The output layer uses the sigmoid activation function defined by the following
equation:

$$f(s) = \frac{1}{1 + e^{-s}}$$

where s is the sum of products (SOP) between each input and its corresponding
weight, plus the bias:
s = X1*W1 + X2*W2 + b

Forward pass
The input of the activation function is the SOP between each input and its
weight. The SOP is then added to the bias to return the output of the neuron:
s = X1*W1 + X2*W2 + b
s = 0.1*0.5 + 0.3*0.2 + 1.83
s = 1.94
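The forward pass above, followed by one backward pass, can be sketched for this two-input network. The inputs, weights and bias are the values used in the worked example; the target value 0.03, the learning rate 0.01 and the squared-error loss are assumptions added here so that the backward pass has something to compute.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# Values from the worked example
x1, x2 = 0.1, 0.3
w1, w2, b = 0.5, 0.2, 1.83

# Assumed for illustration: desired output and learning rate
target = 0.03
lr = 0.01

# Forward pass: sum of products plus bias, then sigmoid
s = x1 * w1 + x2 * w2 + b
pred = sigmoid(s)
error = 0.5 * (target - pred) ** 2        # assumed squared-error loss

# Backward pass: chain rule dE/dw = dE/dpred * dpred/ds * ds/dw
dE_dpred = pred - target
dpred_ds = pred * (1.0 - pred)            # derivative of the sigmoid
grad_w1 = dE_dpred * dpred_ds * x1
grad_w2 = dE_dpred * dpred_ds * x2
grad_b = dE_dpred * dpred_ds * 1.0        # the bias input is fixed at +1

# Parameter update (all gradients computed before any parameter changes)
w1_new = w1 - lr * grad_w1
w2_new = w2 - lr * grad_w2
b_new = b - lr * grad_b

new_pred = sigmoid(x1 * w1_new + x2 * w2_new + b_new)
new_error = 0.5 * (target - new_pred) ** 2
print(s, pred, error, new_error)
```

After the update the error is slightly smaller, which is the purpose of one forward and backward pass.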
Compare Single and Multi layer Feed-Forward Neural Network



Activation Functions: ReLU, Linear, Sigmoid, Softmax, Tanh

Activation functions are generally of two types:

1. Linear or Identity Activation Function

2. Non-Linear Activation Function


Non-linear Activation Functions
• Generally, neural networks use non-linear activation functions, which
help the network learn complex data, approximate almost any function that
maps inputs to outputs, and provide accurate predictions.

• They allow backpropagation because they have a derivative function
which is related to the inputs.

• The activation functions listed above, other than the linear one, are
non-linear activation functions. They are discussed in more detail below.

• 1. Sigmoid Activation Function:

• The sigmoid activation function is very simple: it takes a real value as
input and returns a value that is always between 0 and 1, which can be read
as a probability. Its curve looks like an ‘S’ shape.
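A minimal NumPy sketch of the sigmoid and its derivative, written in the same style as the tanh snippet in this unit. The function names and the sample inputs are illustrative.

```python
import numpy as np

# Sigmoid activation function: 1 / (1 + e^-z)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Derivative of the sigmoid: sigmoid(z) * (1 - sigmoid(z))
def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-2.0, 0.0, 2.0])    # sample inputs
values = sigmoid(z)
slopes = sigmoid_prime(z)
print(values, slopes)
```

The derivative peaks at 0.25 when z = 0 and shrinks toward 0 for large positive or negative z, which is where the vanishing gradient problem comes from.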


• 2. Tanh or Hyperbolic tangent:
• Tanh helps to solve the non-zero-centered problem of the sigmoid function.
Tanh squashes a real-valued number to the range [-1, 1]. It’s non-linear too.

It solves this drawback of the sigmoid, but it still can’t remove the vanishing
gradient problem completely.
When we compare the tanh activation function with the sigmoid, the difference
is clear: the tanh curve is centered at zero, while the sigmoid curve is
centered at 0.5.
import numpy as np

# tanh activation function: (e^z - e^-z) / (e^z + e^-z)
def tanh(z):
    # np.tanh evaluates the same formula without overflowing for large |z|
    return np.tanh(z)

# Derivative of the tanh activation function: 1 - tanh(z)^2
def tanh_prime(z):
    return 1 - np.power(tanh(z), 2)


• 3. ReLU (Rectified Linear Unit):
• This is the most popular activation function used in the hidden layers of a NN.

• The formula is deceptively simple: $\max(0, z)$. Despite its name and
appearance, it’s not linear and provides the same benefits as Sigmoid but with
better performance.

• Its main advantage is that it mitigates the vanishing gradient problem and is less
computationally expensive than tanh and sigmoid.

• But it also has a drawback. Sometimes gradients can be fragile during training
and can die. That leads to dead neurons.

• In other words, for activations in the region (z<0) of ReLU, the gradient will be 0, because
of which the weights will not get adjusted during descent.

• That means those neurons which go into that state will stop responding to variations
in error/input (simply because the gradient is 0, nothing changes). So we should choose
the activation function carefully, and it should be as per the business requirement.

• 4. Leaky ReLU
• It prevents the dying ReLU problem. This variation of ReLU has a small
positive slope in the negative area, so it does enable backpropagation,
even for negative input values.
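ReLU, Leaky ReLU and their gradients can be sketched in NumPy as follows. The negative-side slope of 0.01 is a common default but is an assumption here, as are the function names.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    # Gradient is 0 for z <= 0: this is what produces dead neurons
    return (z > 0).astype(float)

def leaky_relu(z, slope=0.01):
    # Small positive slope on the negative side (slope value is an assumption)
    return np.where(z > 0, z, slope * z)

def leaky_relu_prime(z, slope=0.01):
    # Gradient never becomes exactly 0, so weights keep being adjusted
    return np.where(z > 0, 1.0, slope)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), relu_prime(z))
print(leaky_relu(z), leaky_relu_prime(z))
```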


• 5. Softmax
• Generally, we use this function at the last layer of a neural network. It calculates
the probability distribution over ‘n’ different events (classes). The main
advantage of the function is that it is able to handle multiple classes.
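A short sketch of softmax in NumPy. Subtracting the maximum before exponentiating is a standard numerical-stability trick that does not change the result; the example scores are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Shift by the maximum so np.exp never overflows; the output is unchanged
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])    # hypothetical last-layer outputs for 3 classes
probs = softmax(scores)
print(probs, probs.sum())
```

The outputs are positive and sum to 1, so they can be read as class probabilities, with the largest score receiving the largest probability.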


Losses in neural network
• When you train Deep learning models, you feed data to the network, generate
predictions, compare them with the actual values (the targets) and then
compute what is known as a loss.

• This loss essentially tells you something about the performance of the network:
the higher it is, the worse your network performs overall.

• Loss functions are mainly classified into two categories: classification
loss and regression loss.

• Classification loss applies when the aim is to predict the output from a
set of categorical values. For example, if we have a dataset of handwritten
digit images and the digit (0–9) is to be predicted, classification loss is
used.

• Regression loss applies when the problem is to predict continuous values,
for example predicting weather conditions or predicting house prices on the
basis of some features.


Losses in neural network
1. Mean Absolute Error (L1 Loss)

2. Mean Squared Error (L2 Loss)

3. Huber Loss

4. Cross-Entropy(a.k.a Log loss)

5. Relative Entropy(a.k.a Kullback–Leibler divergence)

6. Squared Hinge

• Mean Absolute Error (MAE)
• Mean absolute error (MAE), also called L1 loss, is a loss function used
for regression problems. It is the average of the absolute differences
between the original and predicted values over the data set.

• Use mean absolute error when you are doing regression and don’t want
outliers to play a big role. It can also be useful if you know that your
distribution is multimodal, and it’s desirable to have predictions at one of
the modes, rather than at the mean of them.

• Example: When doing image reconstruction, MAE encourages less blurry
images compared to MSE. This is used for example in the paper Image-to-Image
Translation with Conditional Adversarial Networks.


Mean Squared Error (MSE)
• Mean Squared Error (MSE), also called L2 loss, is also a loss function used
for regression. It is the average of the squared differences between the
original and predicted values over the data set.

• MSE is sensitive to outliers. Given several examples with the same
input feature values, the optimal prediction will be their mean target value.

• This should be compared with Mean Absolute Error, where the optimal
prediction is the median.

• MSE is thus good to use if you believe that your target data, conditioned on
the input, is normally distributed around a mean value, and when it’s
important to penalize outliers heavily.

• When to use it?

• Use MSE when doing regression, if you believe that your target, conditioned
on the input, is normally distributed, and you want large errors to be
significantly (quadratically) more penalized than small ones.

• Example: You want to predict future house prices.

• The price is a continuous value, and therefore we want to do
regression. MSE can be used here as the loss function.
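The difference in outlier sensitivity between MAE and MSE can be seen in a small sketch. The target and prediction values are made up for illustration, with the last target acting as an outlier.

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean of the absolute differences (L1 loss)
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    # Mean of the squared differences (L2 loss)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical targets and predictions; the last target is an outlier
y_true = np.array([3.0, 5.0, 2.0, 100.0])
y_pred = np.array([2.0, 5.0, 4.0, 10.0])

mae_value = mae(y_true, y_pred)   # (1 + 0 + 2 + 90) / 4 = 23.25
mse_value = mse(y_true, y_pred)   # (1 + 0 + 4 + 8100) / 4 = 2026.25
print(mae_value, mse_value)
```

The single outlier contributes 90 to the MAE sum but 8100 to the MSE sum, which is why MSE is said to penalize large errors quadratically.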


Huber Loss
• Huber Loss is typically used in regression problems. It’s less sensitive to
outliers than the MSE as it treats the error as a square only inside an interval.

• Consider an example where we have a dataset of 100 values we would like our
model to be trained to predict. Out of all that data, 25% of the expected
values are 5 while the other 75% are 10. MAE would push predictions toward
the median (10) and ignore the 25%, while MSE would pull them toward the
mean (8.75).

• The Huber Loss offers the best of both worlds by balancing the MSE and
MAE together. We can define it using the following piecewise function:

$$L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \le \delta \\ \delta\,|y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$

Here, δ (delta) is a hyperparameter that defines the range for MAE and MSE.

In simple terms, the above says: for errors smaller than δ, use the MSE;
for errors greater than δ, use the MAE.
This way Huber loss provides the best of both MAE and MSE.
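A minimal NumPy sketch of the standard Huber loss: quadratic for small errors, linear for large ones. The default `delta=1.0` and the sample values are assumptions for illustration.

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2                      # MSE-like branch, |error| <= delta
    linear = delta * abs_error - 0.5 * delta ** 2     # MAE-like branch, |error| > delta
    return np.mean(np.where(abs_error <= delta, quadratic, linear))

small = huber(np.array([0.0]), np.array([0.5]))   # 0.5 * 0.5^2 = 0.125
large = huber(np.array([0.0]), np.array([3.0]))   # 1 * 3 - 0.5 = 2.5
print(small, large)
```

The two branches meet at |error| = delta, so the loss stays continuous while growing only linearly for outliers.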
Cross-Entropy Loss
• The concept of cross-entropy traces back to the field of Information Theory,
where Shannon introduced the concept of entropy in 1948.

• Entropy is a measure of disorder, or unpredictability, in a system.

• For a random variable X with probability distribution p(x), entropy is
defined as follows:

$$H(X) = -\sum_{x} p(x) \log p(x)$$

• Cross-Entropy loss is also called logarithmic loss, log loss, or
logistic loss.

• Each predicted class probability is compared to the actual class
desired output, 0 or 1:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

• Here x represents the results that can be predicted by the ML algorithm,
p(x) is the probability distribution of the “true” label from the training
samples, and q(x) is the estimate produced by the ML algorithm.
https://www.theaidream.com/post/loss-functions-in-neural-networks
• Cross-entropy loss measures the performance of a classification model
whose output is a probability value between 0 and 1.
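Cross-entropy between a one-hot true label and a predicted distribution can be sketched as follows. The three-class probabilities are hypothetical, and the clipping constant `eps` is an assumption added only to avoid taking log(0).

```python
import numpy as np

def cross_entropy(p_true, q_pred, eps=1e-12):
    # Clip predictions so that log(0) can never occur
    q = np.clip(q_pred, eps, 1.0)
    return -np.sum(p_true * np.log(q))

p = np.array([0.0, 1.0, 0.0])                 # true class is index 1 (one-hot)
confident = np.array([0.05, 0.90, 0.05])      # hypothetical good prediction
wrong = np.array([0.70, 0.20, 0.10])          # hypothetical poor prediction

loss_confident = cross_entropy(p, confident)  # equals -log(0.90)
loss_wrong = cross_entropy(p, wrong)          # equals -log(0.20)
print(loss_confident, loss_wrong)
```

The loss grows as the probability assigned to the true class falls, so the poor prediction is penalized more heavily.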


Basic concepts of artificial neurons
• The artificial neuron is the building component of the ANN, designed to
simulate the function of the biological neuron. The arriving signals, called
inputs, are multiplied by the (adjustable) connection weights, summed
(combined), and then passed through a transfer function to produce the
output for that neuron.

• Artificial neurons (also called perceptrons, units or nodes) are the simplest
elements or building blocks in a neural network. They are inspired by
biological neurons that are found in the human brain.

• A biological neuron receives its input signals from other neurons through dendrites (small
fibers). Likewise, a perceptron receives its data from other perceptrons through input
neurons that take numbers.

• The connection points between dendrites and biological neurons are called synapses.
Likewise, the connections between inputs and perceptrons are called weights. They measure
the importance level of each input.

• In a biological neuron, the nucleus produces an output signal based on the signals provided by
dendrites. Likewise, the nucleus (colored in blue) in a perceptron performs some calculations
based on the input values and produces an output.

• In a biological neuron, the output signal is carried away by the axon. Likewise, the axon in a
perceptron is the output value, which will be the input for the next perceptrons.
Optimizers
• An optimizer is an algorithm or function that adapts the neural network's
attributes, like learning rate and weights. Hence, it assists in improving the
accuracy and reduces the total loss.

Hyperparameters:
1. Learning Rate

2. Regularization

3. Momentum

4. Gradient-Based Learning

1. Learning Rate
It denotes how much the model weights should be updated.
The amount by which the weights are updated during training is referred to as
the step size or the “learning rate.” Specifically, the learning rate is a
configurable hyperparameter used in the training of neural networks that has
a small positive value, often in the range between 0.0 and 1.0.

• Try a few different values and see which one gives you the best loss without
sacrificing speed of training. We might start with a large value like 0.1,
then try exponentially lower values: 0.01, 0.001, etc.


Hyperparameters:
• Epoch: It denotes the number of times the algorithm operates on the entire training
dataset.
• Batch: It is the number of samples to be considered for updating the model
parameters.
• Cost Function/Loss Function: A cost function helps you calculate the cost,
representing the difference between the actual value and the predicted value.
• Learning rate: It offers a degree that denotes how much the model weights should
be updated.
• Weights/ Bias: They are learnable parameters that control the signal between two
neurons in a deep learning model.
• Regularization is a set of techniques that can prevent overfitting in
neural networks and thus improve the generalization of a deep learning
model, that is, its accuracy when facing completely new data from the
problem domain.

• Popular regularization techniques include L1, L2, and dropout.

• The momentum method is a technique that can accelerate gradient
descent by taking account of previous gradients in the update rule at
each iteration.

• Momentum is a widely used strategy for accelerating the convergence
of gradient-based optimization techniques. Momentum was designed
to speed up learning in directions of low curvature, without becoming
unstable in directions of high curvature.
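One common form of the momentum update can be sketched as below. The cost function J(w) = (w - 3)^2, the momentum coefficient 0.9, the learning rate 0.1 and the name `gd_momentum` are assumptions chosen for illustration; other formulations of momentum exist.

```python
def gd_momentum(grad_fn, w0, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with momentum (classical 'heavy ball' form)."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        # The velocity accumulates a decaying sum of previous gradients
        velocity = beta * velocity - lr * grad_fn(w)
        w = w + velocity
    return w

# Assumed toy cost J(w) = (w - 3)^2 with gradient dJ/dw = 2 * (w - 3)
w_final = gd_momentum(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)
```

Setting `beta=0.0` removes the memory of previous gradients and recovers plain gradient descent.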


Hyperparameters:
• In deep learning, a variant called stochastic gradient descent (SGD) is
often used. It updates the parameters based on a randomly selected
subset of training samples in each iteration, rather than the entire
dataset. This helps in speeding up the training process and making it
feasible for large-scale problems.



Gradient-Based Optimizers in Deep Learning



Role of Learning Rate
• The learning rate represents the size of the steps our optimization algorithm
takes to reach the global minimum. To ensure that the gradient descent
algorithm reaches a minimum, we must set the learning rate to an
appropriate value, which is neither too low nor too high.

• Taking very large steps, i.e., a large value of the learning rate, may skip
over the minimum, and the model will never reach the optimal value for the
loss function. On the contrary, taking very small steps, i.e., a small value
of the learning rate, will take a very long time to converge.
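The effect of the learning rate can be seen on a one-parameter toy problem. The cost J(w) = w^2, the starting point and the three learning rates are assumptions picked to show the three regimes.

```python
def run_gd(lr, steps=50, w0=1.0):
    # Toy cost J(w) = w^2, whose gradient is dJ/dw = 2w and whose minimum is at w = 0
    w = w0
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w

too_small = run_gd(0.001)    # each step shrinks w by only 0.2%, so progress is very slow
reasonable = run_gd(0.1)     # each step shrinks w by 20%, so w quickly approaches 0
too_large = run_gd(1.1)      # each step overshoots the minimum and |w| grows
print(too_small, reasonable, too_large)
```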


Role of Gradient
• In general, the gradient represents the slope of a function. Gradients
are partial derivatives, and they describe the change in the loss
function with respect to a small change in the parameters of the function.

• This change in the loss function tells us the next step to take to
reduce the output of the loss function.



https://www.analyticsvidhya.com/blog/2021/06/complete-guide-to-gradient-based-optimizers/



Implementing Gradient Descent
• Implementing gradient descent involves updating the
parameters iteratively.
• The update formula for parameter w is given by
• $w := w - \alpha \frac{dJ}{dw}$,
• where α is the learning rate and
• $\frac{dJ}{dw}$ is the derivative of the cost function J with respect
to w.

• Gradient descent is an iterative optimization algorithm used to find the
values of model parameters that result in the smallest possible cost.

• It aims to minimize the cost function by adjusting the parameters in a
systematic way.

• The algorithm makes small updates to the parameters based on the
calculated gradient of the cost function.


The Process of Gradient Descent
• To apply gradient descent, we start with initial guesses for the
parameters.

• The algorithm then iteratively updates the parameters by taking
steps proportional to the negative gradient of the cost function.

• By repeating this process, the algorithm gradually converges towards
the optimal parameter values that minimize the cost.


Visualizing Gradient Descent

Implementing Gradient Descent

Simultaneous updates of both parameters (weight and bias) are crucial for a
correct gradient descent implementation.
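A simultaneous update of weight and bias can be sketched for a one-feature linear model. The data (generated from y = 2x + 1), the mean-squared-error cost, the learning rate and the iteration count are assumptions for illustration.

```python
import numpy as np

# Hypothetical training data generated from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0      # initial guesses for the parameters
alpha = 0.05         # learning rate (assumed value)

for _ in range(2000):
    pred = w * x + b
    # Gradients of J = (1 / 2m) * sum((pred - y)^2) with respect to w and b
    dJ_dw = np.mean((pred - y) * x)
    dJ_db = np.mean(pred - y)
    # Simultaneous update: both gradients were computed from the old w and b
    w, b = w - alpha * dJ_dw, b - alpha * dJ_db

print(w, b)
```

Updating `w` first and then computing the gradient for `b` from the new `w` would mix old and new parameters, which is the mistake the simultaneous update avoids.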


Types of Gradient Descent

Prof.Sachin Sambhaji Patil 86


Types of Gradient Descent 1. Batch Gradient Descent
• Batch gradient descent, also known as vanilla gradient descent,
computes the gradient using the entire training dataset at each
iteration.

• It calculates the average of the gradients for all training examples


before updating the model’s parameters.

https://medium.com/@yennhi95zz/4-a-beginners-guide-to-gradient-descent-in-machine-learning-773ba7cd3dfe#:~:text=III.-,
Implementing%20Gradient%20Descent,function%20with%20respect%20to%20w.



Types of Gradient Descent 1. Batch Gradient Descent
• Batch gradient descent gives stable, low-noise updates during training but
can be computationally expensive on large datasets, because every update
requires a pass over all training examples.

• Additionally, it may converge more slowly on noisy or redundant data.
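A minimal sketch of batch gradient descent, assuming a one-parameter model y = w * x with squared-error cost. The data and the name `batch_gradient` are hypothetical.

```python
def batch_gradient(w, xs, ys):
    """Average gradient of (w * x - y)^2 over the ENTIRE training set."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated from y = 2x
w = 0.0
alpha = 0.01
for _ in range(500):
    # Every single update touches all training examples.
    w -= alpha * batch_gradient(w, xs, ys)
print(w)  # close to 2
```

The cost of each update grows with the size of the dataset, which is the expense the slide refers to.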



Types of Gradient Descent 2. Stochastic Gradient Descent
• Stochastic gradient descent (SGD) takes a different approach by updating
the parameters for each training example individually.

• It computes the gradient using only one randomly selected training
example, making each update much cheaper than a batch gradient descent update.

• SGD has the advantage of adapting quickly to changing patterns in the data.

• However, it can exhibit more oscillations and may take longer to converge
due to the noise introduced by individual samples.
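A minimal sketch of SGD, assuming the same kind of one-parameter model y = w * x with squared-error cost. The function name `sgd`, the fixed random seed, and the step count are illustrative assumptions.

```python
import random


def sgd(xs, ys, alpha=0.01, num_steps=1000, seed=0):
    """Stochastic gradient descent: one randomly chosen example per update."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    w = 0.0
    for _ in range(num_steps):
        i = rng.randrange(len(xs))             # pick one training example
        grad = 2 * (w * xs[i] - ys[i]) * xs[i]  # gradient from that example only
        w -= alpha * grad
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated from y = 2x
w_sgd = sgd(xs, ys)
print(w_sgd)  # close to 2
```

This toy data is noise-free, so the iterates settle down; on real, noisy data the individual updates would fluctuate around the minimum, as the slide describes.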



Types of Gradient Descent 3. Mini-Batch Gradient Descent
• Mini-batch gradient descent is a compromise between batch gradient
descent and stochastic gradient descent.

• It computes the gradient using a small subset, or mini-batch, of training
examples.

• This approach combines the advantages of both previous methods.

• By using mini-batches, the algorithm achieves a balance between stability
and computational efficiency.

• Compared with a single sample, a mini-batch reduces the noise in each update
and provides a more accurate estimate of the true gradient.
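A minimal sketch of mini-batch gradient descent, again assuming a model y = w * x with squared-error cost. Shuffling once per epoch and the names `minibatch_gd` and `batch_size` are assumptions made for illustration.

```python
import random


def minibatch_gd(xs, ys, batch_size=2, alpha=0.01, num_epochs=100, seed=0):
    """Mini-batch gradient descent: average the gradient over a small subset."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(num_epochs):
        rng.shuffle(indices)  # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= alpha * grad
    return w


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]  # generated from y = 2x
w_mb = minibatch_gd(xs, ys)
print(w_mb)  # close to 2
```

Setting `batch_size` to 1 gives per-example (stochastic) updates, and setting it to `len(xs)` gives batch gradient descent.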
The Importance of Learning Rate in Gradient Descent
• Gradient descent is a fundamental optimization algorithm used in machine
learning to minimize a cost function and thereby find the optimal values of
the model parameters.

• The learning rate, denoted as alpha (α), plays a crucial role in determining
how quickly the algorithm converges to the minimum of the cost function.

• It essentially controls the step size taken in each iteration of the gradient
descent process.



The Importance of Learning Rate in Gradient Descent
• To better understand the impact of the learning rate, let’s consider two
scenarios:

• 1. a learning rate that is too small and

• 2. a learning rate that is too large.

https://medium.com/@yennhi95zz/4-a-beginners-guide-to-gradient-descent-in-machine-learning-773ba7cd3dfe



1. a learning rate that is too small
• Learning Rate Too Small: When the learning rate is set to a very small
value, the algorithm takes tiny steps towards the minimum of the cost
function.

• These small steps can cause the convergence process to be extremely slow.

• Imagine taking small, hesitant steps towards a destination: it would take a
significant amount of time to reach your goal.

• Similarly, with a small learning rate, gradient descent takes many iterations to
approach the minimum, resulting in slower convergence.
2. a learning rate that is too large
• Learning Rate Too Large: Conversely, if the learning rate is set to a very large
value, gradient descent can overshoot the minimum and fail to converge.

• With a large learning rate, the algorithm takes big steps towards the
minimum, but it may continuously overshoot, causing the cost function to
increase rather than decrease.

• This can lead to divergence, where the algorithm fails to find the optimal
solution and keeps moving away from the minimum.
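The two scenarios can be compared on a toy problem. The sketch below assumes the cost J(w) = w^2 (derivative 2w, minimum at w = 0); the function name `run` and the three learning rates are illustrative choices, not values from the slides.

```python
def run(alpha, num_iters=20, w0=1.0):
    """Run gradient descent on the hypothetical cost J(w) = w^2 and return w."""
    w = w0
    for _ in range(num_iters):
        w -= alpha * 2 * w  # dJ/dw = 2w
    return w


too_small = run(0.001)   # tiny steps: w has barely moved from 1.0
reasonable = run(0.1)    # steady progress toward the minimum at 0
too_large = run(1.1)     # every step overshoots and |w| grows
print(too_small, reasonable, too_large)
```

With α = 1.1 each update multiplies w by 1 - 2.2 = -1.2, so the iterate flips sign and grows in magnitude at every step, which is the divergence described above.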


Gradient descent — The learning rate



Finding the Right Learning Rate:

• Selecting an appropriate learning rate is crucial to ensure efficient
convergence of gradient descent.

• Ideally, you want to find a learning rate that allows the algorithm to
converge quickly without overshooting or getting stuck in local
minima.



Here are some steps to guide you in choosing an appropriate learning rate:

• 1. Experimentation: Finding the optimal learning rate is often a
trial-and-error process.

• Start with a reasonable initial value and observe the behavior of the
algorithm.

• If it converges too slowly, increase the learning rate; if it diverges or
overshoots, decrease the learning rate. Iterate this process until you find
the right balance.
Here are some steps to guide you in choosing an appropriate learning rate:

• 2. Learning Rate Schedules:

• Instead of using a fixed learning rate throughout the entire training process,
you can employ learning rate schedules.

• These schedules gradually decrease the learning rate over time, allowing for
faster convergence in the beginning and finer adjustments towards the end.
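One possible schedule is inverse-time decay, sketched below. The specific formula α_t = α_0 / (1 + decay * t), the cost J(w) = (w - 3)^2, and all function and parameter names are assumptions; the slides do not prescribe a particular schedule.

```python
def decayed_learning_rate(alpha0, decay, step):
    """Inverse-time decay (an assumed schedule): alpha0 / (1 + decay * step)."""
    return alpha0 / (1.0 + decay * step)


def gradient_descent_with_schedule(grad_fn, w0, alpha0=0.4, decay=0.1, num_iters=100):
    """Gradient descent whose learning rate shrinks as training progresses."""
    w = w0
    for step in range(num_iters):
        alpha = decayed_learning_rate(alpha0, decay, step)  # large early, small late
        w -= alpha * grad_fn(w)
    return w


# Hypothetical cost J(w) = (w - 3)^2 with derivative 2 * (w - 3).
w_final = gradient_descent_with_schedule(lambda w: 2 * (w - 3), w0=0.0)
print(w_final)  # close to 3
```

Other common choices, such as step decay or exponential decay, follow the same pattern: only the body of `decayed_learning_rate` changes.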



Here are some steps to guide you in choosing an appropriate learning rate:

• 3. Adaptive Learning Rates: Advanced optimization algorithms, such as
AdaGrad, RMSprop, or Adam, automatically adapt the learning rate during
training based on the gradients observed in previous iterations.

• These adaptive methods can handle different learning rates for different
parameters and mitigate some of the challenges associated with manually
tuning the learning rate.
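To illustrate the idea, here is a minimal single-parameter sketch of an AdaGrad-style update, in which the effective step size shrinks as squared gradients accumulate. The cost J(w) = (w - 3)^2, the hyperparameter values, and the function name `adagrad` are assumptions; production code would normally rely on a library optimizer rather than this hand-written loop.

```python
import math


def adagrad(grad_fn, w0, alpha=0.5, eps=1e-8, num_iters=200):
    """AdaGrad-style update for one parameter (illustrative sketch)."""
    w = w0
    grad_sq_sum = 0.0  # running sum of squared gradients seen so far
    for _ in range(num_iters):
        g = grad_fn(w)
        grad_sq_sum += g * g
        # The effective learning rate is alpha / sqrt(sum of squared gradients).
        w -= alpha / (math.sqrt(grad_sq_sum) + eps) * g
    return w


# Hypothetical cost J(w) = (w - 3)^2 with derivative 2 * (w - 3).
w_ada = adagrad(lambda w: 2 * (w - 3), w0=0.0)
print(w_ada)  # close to 3
```

With several parameters, each one keeps its own running sum, which is how these methods end up using different learning rates for different parameters.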



Gradient descent
• Gradient descent is a powerful optimization algorithm used in various
machine learning applications.

• By iteratively updating model parameters based on the gradient of the cost
function, it helps find the values that minimize the cost.

• Understanding and implementing gradient descent allows for effective
model training and optimization.

• By following the principles of gradient descent, you can make significant
strides in model optimization.
Back propagation Algorithm



Back propagation Algorithm
• The back propagation algorithm is the heart of neural network training.

• The signal needs to flow properly both in the forward direction when making
predictions as well as in the backward direction while calculating gradients.

• After the input features are propagated forward through the hidden layers
(which may use the same or different activation functions) to the output
layer, the network produces a prediction; for classification tasks this is
generally the predicted probability that a sample belongs to the positive class.



Back propagation Algorithm

• Now, the back propagation algorithm propagates backward from the
output layer to the input layer, calculating the error gradients on the way.

• Once the gradients of the cost function with respect to each
parameter (weights and biases) in the neural network have been computed, the
algorithm takes a gradient descent step towards the minimum, using these
gradients to update the value of each parameter in the network.
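The forward pass, backward pass, and update can be sketched on a deliberately tiny network. Everything specific here is an assumption for illustration: one input, one sigmoid hidden unit, one sigmoid output, binary cross-entropy loss, and the names `forward`, `loss`, `backward`, and `params`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Forward pass: input -> hidden unit -> predicted probability."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)  # hidden activation
    p = sigmoid(w2 * h + b2)  # predicted probability of the positive class
    return h, p


def loss(params, x, y):
    """Binary cross-entropy for a single example."""
    _, p = forward(params, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def backward(params, x, y):
    """Backward pass: chain rule from the output layer back to the input layer."""
    w1, b1, w2, b2 = params
    h, p = forward(params, x)
    dz2 = p - y               # gradient at the output pre-activation
    dw2, db2 = dz2 * h, dz2
    dh = dz2 * w2             # error passed back to the hidden unit
    dz1 = dh * h * (1 - h)    # sigmoid derivative is h * (1 - h)
    dw1, db1 = dz1 * x, dz1
    return [dw1, db1, dw2, db2]


params = [0.5, -0.3, 0.8, 0.1]  # hypothetical initial w1, b1, w2, b2
x, y = 1.5, 1.0
alpha = 0.1
grads = backward(params, x, y)
new_params = [p - alpha * g for p, g in zip(params, grads)]  # gradient descent step
print(loss(params, x, y), loss(new_params, x, y))  # the loss decreases after the step
```

The expression `p - y` for the output gradient is specific to the sigmoid output combined with cross-entropy loss; a different loss or output activation would change that line.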



What is the Vanishing Gradient Problem?

• As the back propagation algorithm advances backward from the output layer
towards the input layer, the gradients often get smaller and smaller and
approach zero, which eventually leaves the weights of the initial or lower
layers nearly unchanged.

• As a result, gradient descent may converge extremely slowly or stall before
reaching the optimum. This is known as the vanishing gradients problem.
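A small numeric sketch of this effect, assuming a chain of single sigmoid units that all share one weight. The architecture and the name `gradient_magnitudes` are hypothetical. Since the sigmoid derivative a * (1 - a) is at most 0.25, each layer multiplies the backward signal by a small factor.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_magnitudes(num_layers, weight, x=0.5):
    """Size of d(output)/d(layer input) for each layer, from output back to input."""
    # Forward pass through a chain a_k = sigmoid(weight * a_{k-1}).
    activations = [x]
    for _ in range(num_layers):
        activations.append(sigmoid(weight * activations[-1]))
    # Backward pass: multiply the local derivatives layer by layer.
    grad = 1.0
    magnitudes = []
    for a in reversed(activations[1:]):
        grad *= weight * a * (1 - a)  # local derivative of one sigmoid layer
        magnitudes.append(abs(grad))
    return magnitudes  # index 0 is nearest the output, last is nearest the input


mags = gradient_magnitudes(num_layers=10, weight=1.0)
print(mags[0], mags[-1])  # the gradient near the input is many orders of magnitude smaller
```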



What is the Exploding Gradient Problem?

• On the contrary, in some cases, the gradients keep on getting larger
and larger as the backpropagation algorithm progresses.

• This, in turn, causes very large weight updates and causes
gradient descent to diverge. This is known as the exploding
gradients problem.
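The opposite effect can be sketched with a chain of linear layers a_k = weight * a_{k-1}, where each layer multiplies the backward signal by the weight. The linear chain, the weight value 1.5, and the name `backward_gradient_norms` are assumptions for illustration.

```python
def backward_gradient_norms(weight, num_layers):
    """Gradient magnitude after passing backward through each linear layer."""
    grad = 1.0
    norms = []
    for _ in range(num_layers):
        grad *= weight  # local derivative of a_k = weight * a_{k-1} is weight
        norms.append(abs(grad))
    return norms


norms = backward_gradient_norms(weight=1.5, num_layers=20)
alpha = 0.1
print(norms[-1], alpha * norms[-1])  # large gradient, hence a large weight update
```

With a weight larger than 1 the magnitude grows geometrically with depth (here as 1.5 to the power of the layer count), so even a modest learning rate produces a very large update for the earliest layers.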



Compare vanishing and exploding gradients.



Thank You
