DL Unit 2
Networks
• Weights: Each input neuron is associated with a weight, which represents the
strength of the connection between the input neuron and the output neuron.
• Bias: A bias term is added to the weighted sum of the inputs to provide the
perceptron with additional flexibility in modeling complex patterns in the input data.
• Overall, the perceptron is a simple yet powerful algorithm that can be used to
perform binary classification tasks and has paved the way for more complex
neural networks used in deep learning today.
Prof.Sachin Sambhaji Patil 5
Biological Neuron
• Neurons are interconnected nerve cells in the human brain that are
involved in processing and transmitting chemical and electrical
signals.
Biological neuron → Artificial neuron
Dendrites → Input
Synapse → Weights or interconnections
Axon → Output
Artificial Neuron
• A neuron is a mathematical function modeled on the working of biological
neurons
• The Perceptron algorithm learns the weights for the input signals in order
to draw a linear decision boundary.
• The perceptron model begins with multiplying all input values and their
weights, then adds these values to create the weighted sum.
• Further, this weighted sum is applied to the activation function ‘f’ to obtain
the desired output.
• This activation function is also known as the step function and is represented
by ‘f’.
• Take note that the weight of an input indicates a node’s strength. Similarly,
the bias value gives the ability to shift the activation function curve up
or down.
• Add a term called bias ‘b’ to this weighted sum to improve the model’s
performance.
• Y=f(∑wi*xi + b)
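The formula Y=f(∑wi*xi + b) can be sketched in a few lines of Python. This is only a minimal illustration: it assumes a step function that outputs 1 when the weighted sum is positive and 0 otherwise, and the function names and example weights are hypothetical, not taken from the slides.

```python
def step(s):
    # Step activation: fire (1) if the weighted sum is positive, else 0.
    return 1 if s > 0 else 0

def perceptron_output(x, w, b):
    # Weighted sum of the inputs plus the bias, passed through the step function.
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return step(s)

# Hypothetical weights chosen so the perceptron behaves like a logical AND.
w_and, b_and = [1.0, 1.0], -1.5
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [perceptron_output(x, w_and, b_and) for x in inputs]
```

With these assumed weights only the input (1, 1) produces a positive weighted sum, so only that input fires.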
Types of Perceptron models
• Single Layer Perceptron model: One of the simplest types of ANN (Artificial
Neural Network). It consists of a feed-forward network and includes a threshold
transfer function inside the model. The main objective of the single-layer
perceptron model is to classify linearly separable objects with binary outcomes.
A single-layer perceptron can learn only linearly separable patterns.
• Multi-Layered Perceptron model: It is mainly similar to a single-layer
perceptron model but has one or more hidden layers.
• Forward Stage: In the forward stage, activation functions start from the
input layer and terminate on the output layer.
• Backward Stage: In the backward stage, weight and bias values are modified
per the model’s requirement. The error between the actual output and the
desired output is propagated backward, starting from the output layer. A
multilayer perceptron model has greater processing power and can process
linear and non-linear patterns. Further, it also implements logic gates such as
AND, OR, XOR, XNOR, and NOR.
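To illustrate why hidden layers matter, here is a small sketch of XOR built from step-function neurons. XOR is not linearly separable, so a single-layer perceptron cannot represent it, but two hidden neurons are enough. The weights below are hand-picked for illustration (an assumption, not learned values), and all names are hypothetical.

```python
def step(s):
    # Step activation: 1 if the weighted sum is positive, else 0.
    return 1 if s > 0 else 0

def neuron(x, w, b):
    # One perceptron unit: weighted sum plus bias, then the step function.
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)

def xor_mlp(x1, x2):
    # Hidden layer: one neuron acts as OR, the other as NAND (hand-picked weights).
    h_or = neuron((x1, x2), (1.0, 1.0), -0.5)
    h_nand = neuron((x1, x2), (-1.0, -1.0), 1.5)
    # Output layer: AND of the two hidden activations gives XOR.
    return neuron((h_or, h_nand), (1.0, 1.0), -1.5)

xor_table = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
```

Each hidden neuron draws one linear boundary, and the output neuron combines the two, which is what a single layer cannot do on its own.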
Perceptron models
• Advantages:
• Helps us obtain the same accuracy ratio with big and small data.
• Initially, weights are multiplied with input features, and then the decision is made whether
the neuron is fired or not.
• The activation function applies a step rule to check whether the weighted sum is greater
than zero.
• The linear decision boundary is drawn, enabling the distinction between the two linearly
separable classes +1 and -1.
• If the added sum of all input values is more than the threshold value, it must have an
output signal; otherwise, no output will be shown.
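The points above describe how a trained perceptron decides between the classes +1 and -1. As a hedged sketch of how the weights could be learned, the code below uses the classic perceptron learning rule (update the weights only on misclassified samples). The AND-style dataset, the learning rate of 1.0, and the function names are assumptions made for illustration.

```python
def predict(x, w, b):
    # Step rule: class +1 if the weighted sum is greater than zero, else -1.
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s > 0 else -1

def train_perceptron(data, lr=1.0, epochs=20):
    # Start from zero weights and correct them on every misclassified sample.
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            if predict(x, w, b) != y:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Linearly separable toy data: logical AND with labels +1 / -1.
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = train_perceptron(and_data)
predictions = [predict(x, w, b) for x, _ in and_data]
```

Because the two classes are linearly separable, the rule stops changing the weights once every sample is classified correctly.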
Limitation of Perceptron Model
• It can only be used to classify linearly separable sets of input vectors.
If the input vectors are not linearly separable, it cannot classify them correctly.
https://www.simplilearn.com/tutorials/deep-learning-tutorial/perceptron
• This Neural Network or Artificial Neural Network has multiple hidden layers
that make it a multilayer neural network, and it is feed-forward because data
flows through the network in one direction, from the input to the output. In
this network there are the following layers:
1. Input Layer
2. Hidden Layer
3. Output Layer
• Hidden Layer: This layer lies after the input layer and contains multiple
neurons that perform all computations and pass the result to the output
unit.
• Output Layer: It is a layer that contains output units or neurons and receives
processed data from the hidden layer. If there are further hidden layers
connected to it, then it passes the weighted unit to the connected hidden
layer for further processing to get the desired result.
The Architecture of the Multilayer Feed-Forward Neural Network:
• The input and hidden layers use sigmoid and linear activation functions,
whereas the output layer uses a step activation function at its nodes, because
the step function has two output levels, which helps in predicting results as
per requirements.
• All units, also known as neurons, have weights. The calculation at the hidden
layer is the weighted sum (dot product) of all weights and their input signals,
followed by the sigmoid function of the calculated sum.
• Multiple hidden and output layers can increase the accuracy of the output.
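The hidden-layer calculation described above (weighted sum, then sigmoid) can be sketched as follows. This is a minimal illustration, not the slides' implementation: for simplicity it assumes a sigmoid in every layer, including the output, and the weights, biases, and inputs are arbitrary hypothetical values.

```python
import math

def sigmoid(s):
    # Squashes any real value into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-s))

def layer_forward(inputs, weights, biases):
    # weights[j] holds the incoming weights of neuron j in this layer.
    # Each neuron computes the dot product with the inputs, adds its bias,
    # and applies the sigmoid to the result.
    return [
        sigmoid(sum(w * x for w, x in zip(w_j, inputs)) + b_j)
        for w_j, b_j in zip(weights, biases)
    ]

# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
x = [0.5, -1.0]
hidden = layer_forward(x, [[0.1, 0.4], [-0.3, 0.2]], [0.0, 0.1])
output = layer_forward(hidden, [[0.7, -0.5]], [0.05])
```

The output of one layer is simply fed in as the input of the next, which is all that "feed-forward" means here.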
What is a neural network
• Forward propagation is the way data moves from left (input layer) to right
(output layer) in the neural network.
• The forward and backward phases are repeated for a number of epochs. In each
epoch, the following occurs:
• The inputs are propagated from the input to the output layer.
• The error is propagated from the output layer to the input layer.
• This network error measures how far the network is from making the correct
prediction. For example, if the correct output is 4 and the network’s prediction
is 1.3, then the absolute error of the network is 4-1.3=2.7.
• The weights of the inputs are W1 and W2, respectively. The bias is treated as a
new input neuron to the output neuron which has a fixed value +1 and a
weight b. Both the weights and biases could be referred to as parameters.
S is the sum of products (SOP) between each input and its corresponding
weight, plus the bias:
S = X1* W1 + X2*W2 + b
How backpropagation algorithm works
Forward pass
The SOP between each input and its weight is computed first. The bias is then
added to the SOP to give S, the input of the activation function:
S = X1* W1 + X2*W2 + b
S = 0.1* 0.5 + 0.3*0.2 + 1.83
S = 1.94
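The arithmetic above can be checked with a short script. The values of the inputs, weights, and bias are the ones from the worked example; passing S through a sigmoid is an assumption added here for illustration, since the slide does not fix the activation function at this point.

```python
import math

# Values from the worked example above.
x1, x2 = 0.1, 0.3
w1, w2 = 0.5, 0.2
b = 1.83

# Sum of products (SOP) plus the bias.
s = x1 * w1 + x2 * w2 + b

# Assumed activation (sigmoid) applied to S to get the neuron's output.
neuron_output = 1.0 / (1.0 + math.exp(-s))
```

Here `s` evaluates to 1.94 (up to floating-point rounding), matching the hand calculation.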
Compare Single and Multi layer Feed-Forward Neural Network
• All the activation functions listed above are non-linear activation functions.
We discuss them in more detail below.
• The Sigmoid activation function is very simple: it takes a real value as input and
outputs a value that is always between 0 and 1, which can be read as a probability.
Its curve has an ‘S’ shape.
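As a hedged sketch, the sigmoid and its derivative can be written in the same style as the other activation functions in these notes. The names `sigmoid` and `sigmoid_prime` are chosen here for illustration, and plain `math` is used instead of NumPy to keep the example dependency-free.

```python
import math

def sigmoid(z):
    # Maps any real input to a value strictly between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid, expressed through the sigmoid itself.
    s = sigmoid(z)
    return s * (1.0 - s)

samples = [sigmoid(z) for z in (-5.0, 0.0, 5.0)]
```

The derivative peaks at z = 0 with a value of 0.25 and shrinks toward 0 for large positive or negative inputs.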
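import numpy as np

def tanh(z):
    # Hyperbolic tangent: squashes z into the range (-1, 1).
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def tanh_prime(z):
    # Derivative of tanh: 1 - tanh(z)^2.
    return 1 - np.power(tanh(z), 2)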
• The ReLU formula is deceptively simple: max(0, z). Despite its name and
appearance, it’s not linear and provides the same benefits as Sigmoid but with
better performance.
• But it also has a drawback. Sometimes gradients can be fragile during training
and can die. That leads to dead neurons.
• In other words, for activations in the region (x<0) of ReLU, the gradient will be 0, because
of which the weights will not get adjusted during descent.
• That means the neurons which go into that state will stop responding to variations
in error/input (simply because the gradient is 0, nothing changes). So we should be very
careful when choosing an activation function, and the activation function should be chosen
as per the business requirement.
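The dead-neuron effect can be seen in a tiny example. This is a simplified sketch with a single weight and a single input; the helper `updated_weight`, the learning rate, and the numbers are all hypothetical.

```python
def relu(z):
    # ReLU: max(0, z).
    return max(0.0, z)

def relu_prime(z):
    # Gradient of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if z > 0 else 0.0

def updated_weight(w, x, upstream_grad, lr=0.1):
    # One gradient descent step for a neuron computing relu(w * x).
    z = w * x
    grad_w = upstream_grad * relu_prime(z) * x
    return w - lr * grad_w

# z = -1.0: the gradient is 0, so the weight does not move ("dead" region).
dead_w = updated_weight(w=-0.5, x=2.0, upstream_grad=1.0)
# z = 1.0: the gradient is non-zero, so the weight is adjusted.
live_w = updated_weight(w=0.5, x=2.0, upstream_grad=1.0)
```

In the first call the weight stays at -0.5 no matter how large the upstream error is, which is exactly why such a neuron stops learning.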
Non-linear Activation Functions
• 4. Leaky ReLU
• It prevents the dying ReLU problem. This variation of ReLU has a small
positive slope in the negative area, so it does enable backpropagation,
even for negative input values.
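A minimal sketch of Leaky ReLU and its gradient follows. The slope value `alpha=0.01` is a commonly used default assumed here for illustration; the slides do not specify it.

```python
def leaky_relu(z, alpha=0.01):
    # Identity for positive inputs, small positive slope alpha for negative ones.
    return z if z > 0 else alpha * z

def leaky_relu_prime(z, alpha=0.01):
    # The gradient is never exactly zero, so weights keep receiving updates.
    return 1.0 if z > 0 else alpha

negative_output = leaky_relu(-2.0)
negative_gradient = leaky_relu_prime(-2.0)
```

Because the gradient for negative inputs is `alpha` rather than 0, a neuron in the negative region can still recover during training.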
• This loss essentially tells you something about the performance of the network:
the higher it is, the worse your network performs overall.
• Classification loss is the case where the aim is to predict the output
from the different categorical values
3. Huber Loss
6. Squared Hinge
• Use Mean absolute error when you are doing regression and don’t want
outliers to play a big role. It can also be useful if you know that your
distribution is multimodal, and it’s desirable to have predictions at one of
the modes, rather than at the mean of them.
• This should be compared with Mean Absolute Error, where the optimal
prediction is the median.
• MSE is thus good to use if you believe that your target data, conditioned on
the input, is normally distributed around a mean value, and when it’s
important to penalize outliers heavily.
• Use MSE when doing regression, believing that your target, conditioned
on the input, is normally distributed, and want large errors to be
significantly (quadratically) more penalized than small ones.
• Consider an example where we have a dataset of 100 values we would like our
model to be trained to predict. Out of all that data, 25% of the expected
values are 5 while the other 75% are 10.
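The dataset in this example makes the difference between the two losses concrete. The sketch below assumes a model that can only output one constant prediction and searches a small grid of candidates (the grid and the function names are illustrative choices).

```python
# 100 target values: 25% are 5 and 75% are 10, as in the example above.
targets = [5.0] * 25 + [10.0] * 75

def mse(c):
    # Mean squared error of always predicting the constant c.
    return sum((t - c) ** 2 for t in targets) / len(targets)

def mae(c):
    # Mean absolute error of always predicting the constant c.
    return sum(abs(t - c) for t in targets) / len(targets)

# Candidate predictions from 5.0 to 10.0 in steps of 0.25.
candidates = [5.0 + 0.25 * i for i in range(21)]
best_for_mse = min(candidates, key=mse)
best_for_mae = min(candidates, key=mae)
```

On this data MSE is minimized at 8.75 (the mean), while MAE is minimized at 10 (the median), in line with the statements about the mean and the median above.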
Here, δ (delta) is a hyperparameter that defines the ranges for MAE and MSE.
In simple terms, this says: for errors smaller than δ, use the MSE;
for errors greater than δ, use the MAE.
This way Huber loss provides the best of both MAE and MSE.
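A hedged sketch of the Huber loss for a single prediction is shown below. It uses the standard textbook form (quadratic inside δ, linear outside, with constants chosen so the two pieces join smoothly); the function name and the default `delta=1.0` are assumptions.

```python
def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    if abs(error) <= delta:
        # Small errors: quadratic, like MSE.
        return 0.5 * error ** 2
    # Large errors: linear, like MAE, shifted so both pieces meet at |error| = delta.
    return delta * (abs(error) - 0.5 * delta)

small_error_loss = huber_loss(1.0, 0.5)   # |error| = 0.5, quadratic branch
large_error_loss = huber_loss(4.0, 1.0)   # |error| = 3.0, linear branch
```

With δ = 1 the small error gives 0.125 and the large error gives 2.5, instead of the 4.5 that half the squared error would produce, so outliers are penalized less aggressively.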
Cross-Entropy Loss
• The concept of cross-entropy traces back to the field of Information Theory,
where Shannon introduced the concept of entropy in 1948.
• Artificial neurons (also called Perceptrons, Units or Nodes) are the simplest
elements or building blocks in a neural network. They are inspired by
biological neurons that are found in the human brain.
• The connection points between dendrites and biological neurons are called synapses.
Likewise, the connections between inputs and perceptrons are called weights. They measure
the importance level of each input.
• In a biological neuron, the nucleus produces an output signal based on the signals provided by
dendrites. Likewise, the nucleus (colored in blue) in a perceptron performs some calculations
based on the input values and produces an output.
• In a biological neuron, the output signal is carried away by the axon. Likewise, the axon in a
perceptron is the output value, which will be the input for the next perceptrons.
Optimizers
• An optimizer is an algorithm or function that adapts the neural network's
attributes, like learning rate and weights. Hence, it assists in improving the
accuracy and reducing the total loss.
2. Regularization,
3. Momentum,
4. Gradient-Based Learning,
• Try a few different learning rate values and see which one gives you the best
loss without sacrificing speed of training. We might start with a large value
like 0.1, then try exponentially lower values: 0.01, 0.001, etc.
• Popular regularization techniques include L1, L2, and dropout.
• Taking very large steps, i.e., a large value of the learning rate, may skip the
global minimum, and the model will never reach the optimal value for the
loss function. On the contrary, taking very small steps, i.e., a small value of
the learning rate, will take forever to converge.
• Now, this slight change in loss functions can tell us about the next step to
reduce the output of the loss function.
https://www.analyticsvidhya.com/blog/2021/06/complete-guide-to-gradient-based-optimizers/
Simultaneous updates of both parameters (weight and bias) are crucial for
correct gradient descent implementation.
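The simultaneous update can be sketched for a one-feature linear model f(x) = w*x + b with a mean squared error cost. The toy data, learning rate, and function names are assumptions for illustration; the key point is that both gradients are computed from the old w and b before either parameter is changed.

```python
def gradients(xs, ys, w, b):
    # Partial derivatives of the cost (1 / 2n) * sum((w*x + b - y)^2)
    # with respect to w and b.
    n = len(xs)
    dw = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db

def gradient_descent(xs, ys, lr=0.1, steps=2000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Both gradients are evaluated at the current (old) w and b ...
        dw, db = gradients(xs, ys, w, b)
        # ... and only then are both parameters updated together.
        w, b = w - lr * dw, b - lr * db
    return w, b

# Hypothetical noiseless data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = gradient_descent(xs, ys)
```

Updating w first and then computing the gradient for b with the new w would mix old and new parameter values in a single step, which is not the algorithm as defined.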
https://medium.com/@yennhi95zz/4-a-beginners-guide-to-gradient-descent-in-machine-learning-773ba7cd3dfe#:~:text=III.-,Implementing%20Gradient%20Descent,function%20with%20respect%20to%20w.
• SGD has the advantage of adapting quickly to changing patterns in the data.
• However, it can exhibit more oscillations and may take longer to converge
due to the noise introduced by individual samples.
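As a rough sketch of stochastic gradient descent, the loop below updates the parameters after every individual sample instead of after a full pass over the data. The linear model, the toy data, the learning rate, and the fixed random seed are all assumptions made to keep the example small and reproducible.

```python
import random

def sgd(xs, ys, lr=0.05, epochs=1000, seed=0):
    rng = random.Random(seed)   # fixed seed so the run is reproducible
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)    # visit the samples in a new random order
        for i in indices:
            # Gradient of the squared error of ONE sample only.
            error = w * xs[i] + b - ys[i]
            w, b = w - lr * error * xs[i], b - lr * error
    return w, b

# Hypothetical noiseless data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w_sgd, b_sgd = sgd(xs, ys)
```

Each step is cheap and reacts immediately to the latest sample, but the path toward the minimum is noisier than with full-batch gradient descent.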
• The learning rate, denoted as alpha (α), plays a crucial role in determining
how quickly the algorithm converges to the minimum of the cost function.
• It essentially controls the step size taken in each iteration of the gradient
descent process.
https://medium.com/@yennhi95zz/4-a-beginners-guide-to-gradient-descent-in-machine-learning-773ba7cd3dfe#:~:text=III.,Implementing%20Gradient%20Descent,function%20with%20respect%20to%20w.
• These small steps can cause the convergence process to be extremely slow.
• Similarly, with a small learning rate, gradient descent takes many iterations to
approach the minimum, resulting in slower convergence.
1. a learning rate that is too small
• Learning Rate Too Large: Conversely, if the learning rate is set to a very large
value, gradient descent can overshoot the minimum and fail to converge.
• With a large learning rate, the algorithm takes big steps towards the
minimum, but it may continuously overshoot, causing the cost function to
increase rather than decrease.
• This can lead to divergence, where the algorithm fails to find the optimal
solution and keeps moving away from the minimum.
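The effect of the learning rate can be seen on a very simple cost function. The sketch below assumes the cost f(w) = w^2 (gradient 2w, minimum at w = 0) and compares three hypothetical learning rates after the same number of steps.

```python
def run_gradient_descent(lr, steps=20, w0=1.0):
    # Minimize f(w) = w^2, whose gradient is 2w and whose minimum is at w = 0.
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

too_small = run_gradient_descent(lr=0.001)   # barely moves away from w0
reasonable = run_gradient_descent(lr=0.1)    # gets close to the minimum
too_large = run_gradient_descent(lr=1.1)     # overshoots further on every step
```

With the large rate each step multiplies w by -1.2, so the iterates jump across the minimum and grow in magnitude, which is the divergence described above.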
Gradient descent — The learning rate
• Ideally, you want to find a learning rate that allows the algorithm to
converge quickly without overshooting or getting stuck in local
minima.
• Start with a reasonable initial value and observe the behavior of the
algorithm.
• Instead of using a fixed learning rate throughout the entire training process,
you can employ learning rate schedules.
• These schedules gradually decrease the learning rate over time, allowing for
faster convergence in the beginning and finer adjustments towards the end.
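One simple schedule of this kind is exponential decay, sketched below. The function name, the initial rate of 0.1, and the decay factor of 0.5 are hypothetical choices; other schedules (step decay, cosine, and so on) follow the same idea.

```python
def exponential_decay(initial_lr, decay_rate, epoch):
    # Multiply the learning rate by decay_rate once per epoch.
    return initial_lr * decay_rate ** epoch

# Learning rate used in each of the first four epochs.
schedule = [exponential_decay(0.1, 0.5, epoch) for epoch in range(4)]
```

The rate starts at 0.1 and is halved every epoch, giving large steps early in training and finer adjustments later.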
• These adaptive methods can handle different learning rates for different
parameters and mitigate some of the challenges associated with manually
tuning the learning rate.
• The signal needs to flow properly both in the forward direction when making
predictions as well as in the backward direction while calculating gradients.
• After propagating the input features forward to the output layer through the
various hidden layers consisting of different/same activation functions, we
come up with a predicted probability of a sample belonging to the positive
class ( generally, for classification tasks).
• Once the computation for gradients of the cost function w.r.t each
parameter (weights and biases) in the neural network is done, the
algorithm takes a gradient descent step towards the minimum to update
the value of each parameter in the network using these gradients.
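• As backpropagation moves down toward the lower layers, the gradients often get
smaller and smaller and approach zero, which eventually leaves the weights of
the initial or lower layers nearly unchanged. This is known as the vanishing
gradients problem.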
• Conversely, in some cases the gradients keep growing as backpropagation
proceeds. This, in turn, causes very large weight updates and causes the
gradient descent to diverge. This is known as the exploding
gradients problem.
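Both problems can be illustrated with a deliberately simplified model of backpropagation. The sketch assumes a chain of single-neuron sigmoid layers in which every layer multiplies the gradient by weight * sigmoid'(z); the function names and the chosen weights are hypothetical.

```python
import math

def sigmoid_prime(z):
    # Derivative of the sigmoid; its maximum value is 0.25 at z = 0.
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def backprop_factor(weight, pre_activation, n_layers):
    # Factor by which the gradient is scaled after passing back through
    # n_layers identical layers (a simplifying assumption).
    factor = 1.0
    for _ in range(n_layers):
        factor *= weight * sigmoid_prime(pre_activation)
    return factor

vanishing = backprop_factor(weight=1.0, pre_activation=0.0, n_layers=10)
exploding = backprop_factor(weight=8.0, pre_activation=0.0, n_layers=10)
```

With a weight of 1 each layer scales the gradient by 0.25, so after 10 layers it is below one millionth of its original size; with a weight of 8 each layer doubles it, giving a factor of 1024.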