BACK PROPAGATION and REGULARIZATION, BATCH NORMALIZATION

PRESENTED BY,
VIJAYAPRIYA V(BE CSE - IV)
Key Concepts
•Neural Network: A model built from layers of connected neurons, loosely inspired by the human brain, that learns to process data.
•Perceptron: The simplest neural unit; it computes a weighted sum of its inputs and applies an activation to detect patterns and features.
•Weights (Parameters): Represent the strength of the connections between neurons.
•Bias: A learnable offset added to a neuron's weighted sum. (Not to be confused with statistical bias, the systematic difference between expected and predicted values.)
•Activation Function: Introduces non-linearity so that the network can learn complex data patterns.
•Loss Function: Measures the prediction error that training tries to minimize.
•Learning Rate (Hyperparameter): Controls how quickly weights adjust during training.
•Overfitting: When the model performs well on training data but poorly on new data.
•Penalty: A term added to the loss function that discourages overly complex models, reducing overfitting and improving generalization.
•Generalization: The ability to perform well on new data similar to the training data.
•Outlier: A data point noticeably different from the rest.
BACK PROPAGATION
•Problem: learn the mapping y = 2x from example pairs.
I/P (X)   O/P (y = 2x)
5         10
7         14
•Network: I/P -> H -> O/P, with the connections labelled W1, W2, W3 in the original figure.
•Initially, random weights are taken.
•Purpose: backpropagation is used for updating the weights.
FORWARD PASS: (Left to Right)
1. Summation (Σ): Net = Σ (weight × feature) + bias
2. Activation Function: Out = 1/(1+e^-Net)
BACKWARD PASS: (Right to Left)
1. Error Computation: Error = ½[actual output - predicted output]^2
2. Weight Update (use the partial derivative of the error with respect to each weight)
3. Bias Update

Implementation
FORWARD PASS:
Network: X=5 -> i -> h -> o, with W1=0.15 (i to h), W2=0.4 (h to o), target Y=1 (bias terms omitted)
At node h,
1. Net (h) = W1.X = 0.15 * 5 = 0.75
2. Out (h) = 1/(1+e^-0.75) = 0.6791
At node o,
1. Net (o) = W2.Out(h) = 0.4 * 0.6791 = 0.2716
2. Out (o) = 1/(1+e^-0.2716) = 0.5674 = predicted output
Error, E = ½ [1 - 0.5674]^2 = 0.0935

BACKWARD PASS (update W2):
Network: X=5 -> i -> h -> o, W1=0.15, W2=0.4, Y=1
Dependency chain: W2 -> Net (o) -> Out (o) -> E
∂E/∂Out(o) = 2/2[Y-Out(o)][-1] = [Out(o)-Y]
= 0.5674 - 1
= -0.4326
∂Out(o)/∂Net(o) = Out(o)[1-Out(o)]
= 0.5674[1-0.5674]
= 0.2454
∂Net(o)/∂W2 = Out(h) = 0.6791
∂E/∂W2 = ∂E/∂Out(o) * ∂Out(o)/∂Net(o) * ∂Net(o)/∂W2
= -0.4326 * 0.2454 * 0.6791
= -0.072
W2(new) = W2(old) - η ∂E/∂W2
= 0.4 - 0.5(-0.072) [assume η=0.5]
= 0.436
BACKWARD PASS (update W1):
Network: X=5 -> i -> h -> o, W1=0.15, W2=0.4, Y=1
Dependency chain: W1 -> Net (h) -> Out (h) -> Net (o) -> Out (o) -> E
∂E/∂Out(o) = 2/2[Y-Out(o)][-1]
= [Out(o)-Y]
= 0.5674 - 1
= -0.4326
∂Out(o)/∂Net(o) = Out(o)[1-Out(o)]
= 0.5674[1-0.5674]
= 0.2454
∂Net(o)/∂Out(h) = W2 = 0.4
∂Out(h)/∂Net(h) = Out(h)[1-Out(h)] = 0.6791[1-0.6791] = 0.2179
∂Net(h)/∂W1 = X = 5
∂E/∂W1 = ∂E/∂Out(o) * ∂Out(o)/∂Net(o) * ∂Net(o)/∂Out(h) * ∂Out(h)/∂Net(h) * ∂Net(h)/∂W1
= -0.4326 * 0.2454 * 0.4 * 0.2179 * 5
= -0.046
W1(new) = W1(old) - η ∂E/∂W1 = 0.15 - 0.5(-0.046) = 0.173
=> This is only the first iteration. Repeat the cycle:
• do the forward pass again
• compute the error
• do the backward pass
• In the forward pass, we begin by propagating the data inputs through the input layer, passing them through the hidden layer(s), measuring the network's predictions at the output layer, and finally calculating the network error based on these predictions. This network error indicates how far the network is from making the correct prediction. For example, if the correct output is 4 and the network's prediction is 1.3, then the absolute error is 4 - 1.3 = 2.7. The process of propagating the inputs from the input layer to the output layer is called forward propagation.
• Once the network error is calculated, the forward propagation phase ends and the backward pass begins. In the backward pass, the flow is reversed: the error is propagated from the output layer back to the input layer, passing through the hidden layer(s).
• The process of propagating the network error from the output layer to the input layer is called backward propagation, or simply backpropagation. The backpropagation algorithm consists of the steps used to update the network weights so as to minimize the network error.
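The first iteration of the worked example (X=5, Y=1, W1=0.15, W2=0.4, η=0.5) can be checked numerically. The following is a minimal Python sketch, assuming the same bias-free i -> h -> o network with sigmoid activations. The function names `sigmoid` and `backprop_step` are illustrative and do not come from the slides.

```python
import math

def sigmoid(net):
    """Logistic activation: Out = 1 / (1 + e^-Net)."""
    return 1.0 / (1.0 + math.exp(-net))

def backprop_step(x, y, w1, w2, eta=0.5):
    """One forward and one backward pass for the i -> h -> o network (no bias)."""
    # Forward pass: summation, then activation, at each node
    out_h = sigmoid(w1 * x)
    out_o = sigmoid(w2 * out_h)
    error = 0.5 * (y - out_o) ** 2

    # Backward pass: chain rule, from the output back towards the input
    dE_dout_o = out_o - y
    dout_o_dnet_o = out_o * (1.0 - out_o)
    dE_dw2 = dE_dout_o * dout_o_dnet_o * out_h
    dout_h_dnet_h = out_h * (1.0 - out_h)
    dE_dw1 = dE_dout_o * dout_o_dnet_o * w2 * dout_h_dnet_h * x

    # Gradient descent update; both gradients were computed with the old weights
    return w1 - eta * dE_dw1, w2 - eta * dE_dw2, error

w1_new, w2_new, error = backprop_step(x=5, y=1, w1=0.15, w2=0.4)
print(round(error, 4), round(w1_new, 3), round(w2_new, 3))
```

The rounded values agree with the hand calculation: E = 0.0935, W1(new) = 0.173, W2(new) = 0.436.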
ALGORITHM:
1. Inputs X arrive through the preconnected path.
2. The input is modelled using real-valued weights W. The initial weights are usually chosen randomly.
3. Calculate the output of each neuron, from the input layer through the hidden layer to the output layer.
4. Calculate the error in the outputs. Backpropagation Error = Actual Output - Desired Output
5. From the output layer, go back to the hidden layer and adjust the weights to reduce the error.
6. Repeat the process until the desired output is achieved.
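The algorithm steps can be sketched as a loop. This is only an illustration under stated assumptions: it reuses the bias-free i -> h -> o sigmoid network and the single training pair (X=5, Y=1) from the worked example, and the stopping threshold `tol` and iteration cap `max_iters` are hypothetical choices that do not appear in the slides.

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def train(x, y, w1, w2, eta=0.5, tol=1e-3, max_iters=10_000):
    """Repeat forward and backward passes until the error drops below tol."""
    errors = []
    for _ in range(max_iters):
        # Step 3: forward pass through the hidden and output neurons
        out_h = sigmoid(w1 * x)
        out_o = sigmoid(w2 * out_h)
        # Step 4: error in the output (squared-error form used in the example)
        error = 0.5 * (y - out_o) ** 2
        errors.append(error)
        if error < tol:  # Step 6: stop once the output is close enough
            break
        # Step 5: propagate the error backwards and adjust the weights
        delta_o = (out_o - y) * out_o * (1.0 - out_o)
        delta_h = delta_o * w2 * out_h * (1.0 - out_h)  # uses the old w2
        w2 -= eta * delta_o * out_h
        w1 -= eta * delta_h * x
    return w1, w2, errors

w1, w2, errors = train(x=5, y=1, w1=0.15, w2=0.4)
print(len(errors), errors[0], errors[-1])
```

The list `errors` records the error at each iteration, so the decrease from the first pass to the last can be inspected directly.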
REGULARIZATION
WHY?
• Overfitting refers to the phenomenon where a neural network models the training data very well but fails when it sees new data from the same problem domain.
• Overfitting is caused by noise in the training data that the neural network picks up during training and learns as if it were an underlying concept of the data.
• This learned noise, however, is unique to each training set. As soon as the model sees new data from the same problem domain that does not contain this noise, the performance of the neural network gets much worse.
• The reason is that the complexity of the network is too high.
• A model with higher complexity is able to pick up and learn patterns (noise) in the data that are caused only by random fluctuation or error.
• Less complex neural networks are less susceptible to overfitting. To prevent overfitting (high variance), we use a technique called regularization.
WHAT IS REGULARIZATION?
Regularization means constraining a model to avoid overfitting by shrinking the coefficient estimates toward zero. When a model suffers from overfitting, we should control the model's complexity. Technically, regularization avoids overfitting by adding a penalty to the model's loss function.
Regularized Cost = Loss Function + Penalty
For a sum-of-squares loss: Cost = Σ(target output - predicted output)^2 + penalty term (which varies with each technique)
There are three commonly used regularization techniques to control the complexity of machine learning models, as follows:
-> L2 regularization
-> L1 regularization
-> Elastic Net
L2 Regularization
◦ A linear regression that uses the L2 regularization technique is called ridge regression. In other words,
in ridge regression, a regularization term is added to the cost function of the linear regression, which
keeps the magnitude of the model’s weights (coefficients) as small as possible. The L2 regularization
technique tries to keep the model’s weights close to zero, but not zero, which means each feature
should have a low impact on the output while the model’s accuracy should be as high as possible.
◦ Ridge Regression Cost Function = Loss Function + ½ λΣw^2
◦ Where λ controls the strength of regularization, and w are the model’s weights(coefficients).
◦ By increasing λ, the model becomes flatter and tends to underfit. On the other hand, by decreasing λ, the model becomes more prone to overfitting, and with λ = 0 the regularization term is eliminated.
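To make the ridge cost concrete, here is a minimal Python sketch, assuming a sum-of-squares loss for a linear model with no intercept. The function name `ridge_cost` and the toy data are made up for illustration only.

```python
def ridge_cost(X, y, w, lam):
    """Sum-of-squares loss plus the L2 penalty (1/2) * lambda * sum(w^2)."""
    loss = 0.0
    for row, target in zip(X, y):
        prediction = sum(wi * xi for wi, xi in zip(w, row))
        loss += (target - prediction) ** 2
    penalty = 0.5 * lam * sum(wi ** 2 for wi in w)
    return loss + penalty

# Toy data (illustrative): w = [1, 2] fits it exactly, so the loss term is 0
X = [[1.0, 2.0], [3.0, 4.0]]
y = [5.0, 11.0]
w = [1.0, 2.0]

print(ridge_cost(X, y, w, lam=0.0))  # 0.0: no regularization
print(ridge_cost(X, y, w, lam=2.0))  # 5.0: the penalty alone, 0.5 * 2 * (1 + 4)
```

Even with a perfect fit, a positive λ makes large weights costly, which is what pushes the optimizer toward smaller coefficients.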
L1 Regularization
◦ Least Absolute Shrinkage and Selection Operator (lasso) regression is an alternative to ridge for regularizing linear regression. Lasso regression also adds a penalty term to the cost function, but a slightly different one, called L1 regularization. L1 regularization drives some coefficients to exactly zero, meaning the model ignores those features. Ignoring the least important features helps emphasize the model's essential features.
◦ Lasso Regression Cost Function = Loss Function + λΣ|w|
◦ Where λ controls the strength of regularization, and w are the model's weights (coefficients).
◦ Lasso regression therefore performs feature selection automatically by eliminating the least important features.
Elastic Net Regularization
◦ The third type of regularization uses both the L1 and L2 penalties, combining the behavior of lasso and ridge.
◦ In addition to choosing a lambda value, elastic net also allows us to tune the alpha parameter, where α = 0 corresponds to ridge and α = 1 to lasso. Simply put, if we plug in 0 for alpha, the penalty function reduces to the L2 (ridge) term, and if we set alpha to 1 we get the L1 (lasso) term.
◦ Cost function of Elastic Net Regularization:
◦ J(β1, β2, ..., βm) = Σ(i=1 to n) (yi - Σ(j=1 to m) xij βj)^2 + λ(α Σ|βj| + (1-α)/2 Σ βj^2)
◦ Therefore we can choose an alpha value between 0 and 1 to tune the elastic net (adjusting the weight given to each penalty is what gives the method its name). Effectively, this shrinks some coefficients and sets others to 0 for sparse selection.
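The way alpha blends the two penalties can be checked with a short Python sketch of the penalty term λ(αΣ|β| + (1-α)/2 Σβ^2). The function name `elastic_net_penalty` and the coefficient values are illustrative assumptions.

```python
def elastic_net_penalty(beta, lam, alpha):
    """Elastic net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * L2)."""
    l1 = sum(abs(b) for b in beta)   # lasso part, sum of |beta|
    l2 = sum(b ** 2 for b in beta)   # ridge part, sum of beta^2
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2)

beta = [0.5, -2.0, 0.0]  # illustrative coefficients

print(elastic_net_penalty(beta, lam=1.0, alpha=0.0))  # 2.125: pure ridge, (1/2) * 4.25
print(elastic_net_penalty(beta, lam=1.0, alpha=1.0))  # 2.5: pure lasso, sum of |beta|
print(elastic_net_penalty(beta, lam=1.0, alpha=0.5))  # 2.3125: equal mix of both
```

With α = 0 the result matches the ridge penalty ½λΣw^2, and with α = 1 it matches the lasso penalty λΣ|w|.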
Early stopping
◦ Early stopping is a kind of cross-validation strategy in which we hold out one part of the training set as the validation set. When we see that the performance on the validation set is getting worse, we immediately stop training the model. This is known as early stopping.
◦ In the accompanying figure (not reproduced here), training stops at the dotted line, since beyond that point the model starts overfitting the training data.
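The stopping rule can be sketched in a few lines of Python. This is an illustration only: the validation losses are made-up numbers, and the `patience` parameter (how many non-improving epochs to tolerate) is an assumption added here; `patience=1` corresponds to stopping immediately when the validation loss stops improving.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation performance improved: remember this epoch
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            # Validation performance got worse (or stalled)
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch  # never triggered: ran to the end

val_losses = [0.90, 0.70, 0.60, 0.65, 0.70]  # illustrative values
stop_epoch, best_epoch = early_stopping_epoch(val_losses)
print(stop_epoch, best_epoch)  # 3 2
```

In practice the weights saved at `best_epoch` would be the ones kept, since later epochs fit the training data at the expense of validation performance.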
NORMALIZATION
Some algorithms are sensitive to the scale of feature values. If the feature values are too high or too low, the algorithms may not perform well. Therefore, it is essential to normalize the features.
For example, Min-Max normalization is defined as follows:
Min-Max normalization = (X - min)/(max - min)
This normalization method ensures that all values are scaled to fall within the range of 0 to 1.
Data (x)   Min-Max normalized
144        0.68
101        0.00
120        0.30
112        0.17
164        1.00
Max -> 164
Min -> 101
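The table values can be reproduced with a minimal Python sketch of the Min-Max formula. The function name `min_max_normalize` is illustrative, and mapping a constant feature to 0 is an assumption made here to avoid dividing by zero.

```python
def min_max_normalize(values):
    """Scale values to the range [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no scale, so map it to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

data = [144, 101, 120, 112, 164]
normalized = min_max_normalize(data)
print([round(v, 2) for v in normalized])  # [0.68, 0.0, 0.3, 0.17, 1.0]
```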
Initially, our inputs x1, x2, x3, x4 are in normalized form, as they come from the pre-processing stage. When the input passes through the first layer, it is transformed: a sigmoid function is applied to the dot product of the input x and the weight matrix w.
In backpropagation training, the forward pass performs two operations: summation and the activation function. Normalization is performed either before or after the activation function, so that the output of each step is normalized.
BATCH NORMALIZATION
• Batch normalization is not performed on a single row of training data.
• If the training data contains 100 samples, it can be normalized in mini-batches, such as 32, 32, etc., without distorting the shape of the data.
• This process helps to stabilize the numerical data.
• In this normalization method, outliers have less impact.
• By adding an extra normalization layer, training becomes faster and more stable.
1. Normalization is a data pre-processing technique that scales numerical data without altering its shape, helping machine
learning models generalize better.
2. Batch normalization, used in deep neural networks, improves speed and stability by adding layers that standardize and
normalize the input from previous layers.
3. In a feedforward network, it normalizes the activations before passing them to the activation function.
4. During training, batch normalization updates the mean, standard deviation, and learnable parameters (gamma and beta) to
minimize the loss function.
HOW BATCH NORMALIZATION WORKS:

i) Compute the mean and standard deviation of the activations of each layer over a mini-batch of data.
ii) Normalize the activations of each layer by subtracting the mean and dividing by the standard deviation.
iii) Scale and shift the normalized activations by learnable parameters, known as gamma and beta, which are updated during training.
iv) Pass the normalized and transformed activations to the next layer.
• During training, the mean and standard deviation of the activations are computed over each mini-batch of data, and the gamma and beta parameters are updated using gradient descent to minimize the loss function of the network.
Apply batch normalization:
1. μ = 1/m Σ x
2. σ^2 = 1/m Σ(x-μ)^2
3. x^ = (x - μ)/ √(σ^2 + ε)
4. y = γx^ + β = BN γ,β (x)
Example:
Mini-batch of 4 examples with 1 feature
X = [1,2,3,4]
1. μ = (1+2+3+4)/4 = 2.5
2. σ^2 = 1.25, so σ = 1.118
3. x^ = (x - μ)/ σ (ε neglected)
= [-1.3416,-0.4472,0.4472,1.3416]
4. γ = 1, β = 0
z = γ * x^ + β
z = [-1.3416,-0.4472,0.4472,1.3416]
Pass to next layer
ADVANTAGES:
• Improved Stability: Reduces internal covariate shift, enhancing training stability and convergence speed.
• Improved Performance: Reduces overfitting, improves generalization, and allows for larger learning rates.
• Faster Convergence: Speeds up convergence by decreasing gradient dependence on parameter scales, making optimization more efficient.
DRAWBACKS:
• Increased Computational Cost: Requires additional calculations for the mean and standard deviation of each layer in each mini-batch, along with scaling and shifting using the gamma and beta parameters.
• Limited by Batch Size: Most effective with larger batch sizes; small batches may yield noisy estimates of the mean and standard deviation, affecting network performance.
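The four batch normalization steps can be sketched in plain Python for a single feature. This is a minimal illustration, assuming training-time statistics computed from the mini-batch itself. The function name `batch_norm` and the default `eps=1e-5` are assumptions, and gamma and beta are fixed here rather than learned.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    m = len(batch)
    mu = sum(batch) / m                                        # step 1: mean
    var = sum((x - mu) ** 2 for x in batch) / m                # step 2: variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]   # step 3: normalize
    return [gamma * xh + beta for xh in x_hat]                 # step 4: scale, shift

z = batch_norm([1, 2, 3, 4])
print([round(v, 4) for v in z])  # [-1.3416, -0.4472, 0.4472, 1.3416]
```

With γ = 1 and β = 0 the output is simply the normalized batch, matching the example; the small ε only guards against division by zero and does not change the rounded values.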
THANKING YOU
