BACK PROPAGATION, REGULARIZATION, AND BATCH NORMALIZATION
PRESENTED BY,
VIJAYAPRIYA V(BE CSE - IV)
Key Concepts
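•Neural Network: A model of interconnected nodes (neurons), loosely inspired by the human brain, that learns to map inputs to outputs.
•Perceptron: A single artificial neuron; it learns to detect patterns and features in input data.
•Weights (Parameters): Represent the strength of the connections between neurons.
•Bias: A learnable offset added to a neuron's weighted sum. (As an error term, bias also denotes the difference between expected and predicted values.)
•Activation Function: Introduces non-linearity, which helps neural networks learn complex data patterns.
•Loss Function: Calculates the prediction error that training minimizes.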
•Learning Rate (Hyperparameter): Controls how quickly weights adjust during training.
•Overfitting: When the model performs well on training data but poorly on new data.
•Penalty: A term added to the loss function that discourages overfitting, improving generalization.
•Generalization: The ability to perform well on new data similar to training data.
•Outlier: A data point noticeably different from the rest.
BACK PROPAGATION
[Figure: network diagram I/P -> H -> O/P with weights W1, W2, W3]
Example data (y = 2x):
X     y = 2x
5     10
7     14
•Problem: Initially, random weights are taken.
•Purpose: Back propagation is used for updating the weights.
FORWARD PASS: (Left to Right)
1. Summation (Σ)
2. Activation Function
BACKWARD PASS: (Right to Left)
1. Error Computation
2. Weight Update
3. Bias Update
At node h,
1. Net (h) = W1.X = 0.15 * 5 = 0.75
2. Out (h) = 1/(1+e^-0.75) = 0.6791
At node o,
1. Net (o) = W2.Out(h) = 0.4 * 0.6791 = 0.2716
2. Out (o) = 1/(1+e^-0.2716) = 0.5674 = predicted output
Error: E = 1/2 [Y - Out(o)]^2, with target Y = 1
Gradient for W2:
∂E/∂Out(o) = 2/2 [Y - Out(o)][-1]
= [Out(o) - Y]
= 0.5674 – 1
= -0.4326
∂Out(o)/∂Net(o) = Out(o)[1-Out(o)]
= 0.5674[1-0.5674]
= 0.2454
∂Net(o)/∂W2 = Out(h) = 0.6791
∂E/∂W2 = ∂E/∂Out(o) * ∂Out(o)/∂Net(o) * ∂Net(o)/∂W2
= -0.4326 * 0.2454 * 0.6791
= -0.072
Gradient for W1:
∂Net(o)/∂Out(h) = W2 = 0.4
∂Out(h)/∂Net(h) = Out(h)[1-Out(h)] = 0.2179
∂Net(h)/∂W1 = X = 5
∂E/∂W1 = ∂E/∂Out(o) * ∂Out(o)/∂Net(o) * ∂Net(o)/∂Out(h) * ∂Out(h)/∂Net(h) * ∂Net(h)/∂W1
= -0.4326 * 0.2454 * 0.4 * 0.2179 * 5
= -0.046
Weight update (learning rate η = 0.5):
W2(new) = W2(old) – η ∂E/∂W2 = 0.4 – 0.5 (-0.072) = 0.436
W1(new) = W1(old) – η ∂E/∂W1 = 0.15 – 0.5 (-0.046) = 0.173
=>This is the first iteration only. Repeat: do the forward pass again, get the error, then do the backward pass.
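The single iteration worked through in this section can be checked with a short script. This is a minimal sketch, assuming the values used in the calculation (X = 5, target Y = 1, W1 = 0.15, W2 = 0.4, η = 0.5), a sigmoid activation at both nodes, and no bias terms; the function name `backprop_step` is hypothetical.

```python
import math

def sigmoid(z):
    """Logistic activation: 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(x, y, w1, w2, lr):
    """One forward and one backward pass for the chain I/P -> H -> O/P (no bias)."""
    # Forward pass: summation, then activation, at each node
    net_h = w1 * x
    out_h = sigmoid(net_h)
    net_o = w2 * out_h
    out_o = sigmoid(net_o)
    error = 0.5 * (y - out_o) ** 2

    # Backward pass: chain rule from the output back to each weight
    dE_dout_o = out_o - y
    dout_o_dnet_o = out_o * (1.0 - out_o)
    delta_o = dE_dout_o * dout_o_dnet_o
    dE_dw2 = delta_o * out_h
    dE_dw1 = delta_o * w2 * out_h * (1.0 - out_h) * x

    # Gradient-descent update; both gradients are computed from the old weights
    return w1 - lr * dE_dw1, w2 - lr * dE_dw2, error

w1_new, w2_new, error = backprop_step(x=5.0, y=1.0, w1=0.15, w2=0.4, lr=0.5)
print(round(w1_new, 3), round(w2_new, 3))
```

Rounded to three decimals, the printed weights should agree with the hand calculation; small differences in later digits come from the rounding of intermediate values on the slides.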
In the forward pass, we begin by propagating the data inputs through the input layer, passing them through the hidden
layer(s), measuring the network’s predictions at the output layer, and finally calculating the network error based on
these predictions. This network error indicates how far the network is from making the correct prediction. For example,
if the correct output is 4 and the network’s prediction is 1.3, then the absolute error is 4 - 1.3 = 2.7. The process of
propagating the inputs from the input layer to the output layer is called forward propagation.
Once the network error is calculated, the forward propagation phase ends, and the backward pass begins. In the
backward pass, the flow is reversed, starting by propagating the error from the output layer to the input layer, passing
through the hidden layer(s).
The process of propagating the network error from the output layer to the input layer is called backward propagation
or simply backpropagation. The backpropagation algorithm consists of the steps used to update network weights to
minimize the network error.
ALGORITHM:
1. Inputs X arrive through the preconnected path.
2. The input is modelled using real weights W. The weights are usually initialized randomly.
3. Calculate the output of each neuron from the input layer, through the hidden layer, to the output layer.
4. Calculate the error in the outputs. Backpropagation Error = Actual Output – Desired Output
5. From the output layer, go back to the hidden layer to adjust the weights to reduce error.
6. Repeat the process until the desired output is achieved.
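The six steps above can be sketched as a training loop. This is an illustrative sketch only: it assumes the same one-input, one-hidden-node, one-output chain with sigmoid activations and no bias terms, and the stopping tolerance, iteration cap, weight range, and random seed are all assumptions chosen for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(x, y, lr=0.5, tol=0.01, max_iters=5000, seed=0):
    """Repeat forward and backward passes until the error drops below tol."""
    rng = random.Random(seed)
    # Step 2: weights start from small random values (range is an assumption)
    w1, w2 = rng.uniform(0.1, 0.5), rng.uniform(0.1, 0.5)
    history = []
    for _ in range(max_iters):
        # Step 3: forward pass through the hidden node to the output node
        out_h = sigmoid(w1 * x)
        out_o = sigmoid(w2 * out_h)
        # Step 4: error in the output
        error = 0.5 * (y - out_o) ** 2
        history.append(error)
        # Step 6: stop once the output is close enough to the desired value
        if error < tol:
            break
        # Step 5: go back from the output layer and adjust the weights
        delta_o = (out_o - y) * out_o * (1.0 - out_o)
        grad_w2 = delta_o * out_h
        grad_w1 = delta_o * w2 * out_h * (1.0 - out_h) * x
        w1 -= lr * grad_w1
        w2 -= lr * grad_w2
    return w1, w2, history

w1, w2, history = train(x=5.0, y=1.0)
print(len(history), history[0], history[-1])
```

The `history` list records the error at every iteration, which makes it easy to confirm that the error falls as the weights are adjusted.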
REGULARIZATION
WHY?
Regularization means restricting a model to avoid overfitting by shrinking the coefficient estimates toward zero. When a
model suffers from overfitting, we should control the model’s complexity. Technically, regularization avoids overfitting by
adding a penalty to the model’s loss function.
Loss = Σ(target output – predicted output)^2 + penalty term that controls regularization (which varies with
each technique)
There are three commonly used regularization techniques to control the complexity of machine learning models,
as follows:
-> L2 regularization
-> L1 regularization
-> Elastic Net
L2 Regularization
◦ A linear regression that uses the L2 regularization technique is called ridge regression. In other words,
in ridge regression, a regularization term is added to the cost function of the linear regression, which
keeps the magnitude of the model’s weights (coefficients) as small as possible. The L2 regularization
technique tries to keep the model’s weights close to zero, but not zero, which means each feature
should have a low impact on the output while the model’s accuracy should be as high as possible.
◦ Cost = Σ(target output – predicted output)^2 + λ Σ w^2, where λ controls the strength of regularization, and w
are the model’s weights (coefficients).
◦ By increasing λ, the model becomes flatter and tends to underfit. On the other hand, by decreasing λ, the
model becomes more prone to overfitting, and with λ = 0, the regularization term is eliminated.
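The effect of λ can be seen on a one-weight model. The sketch below assumes a linear model y = w·x with no intercept and the cost Σ(y – w·x)^2 + λw^2, for which the minimizer has the closed form w = Σxy / (Σx^2 + λ). The data points (5, 10) and (7, 14) follow y = 2x; the function name `ridge_weight` and the λ values are hypothetical choices for illustration.

```python
def ridge_weight(xs, ys, lam):
    """Minimizer of sum((y - w*x)^2) + lam * w^2 for a single weight w (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs, ys = [5.0, 7.0], [10.0, 14.0]  # y = 2x

# Larger lam shrinks the weight toward zero, but never exactly to zero
for lam in (0.0, 10.0, 74.0, 1000.0):
    print(lam, round(ridge_weight(xs, ys, lam), 4))
```

With λ = 0 the fit recovers w = 2 exactly; as λ grows, the weight shrinks and the model underfits the data.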
L1 Regularization
◦ Least Absolute Shrinkage and Selection Operator (lasso) regression is an alternative to ridge for
regularizing linear regression. Lasso regression also adds a penalty term to the cost function, but a
slightly different one, called L1 regularization. L1 regularization makes some coefficients zero, meaning the
model will ignore those features. Ignoring the least important features helps emphasize the model’s
essential features.
◦ Cost = Σ(target output – predicted output)^2 + λ Σ |w|, where λ controls the strength of regularization, and w
are the model’s weights (coefficients).
◦ Lasso regression automatically performs feature selection by eliminating the least important features.
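How an L1 penalty drives a coefficient exactly to zero can be shown on a one-weight model. This sketch assumes a linear model y = w·x with no intercept and the cost Σ(y – w·x)^2 + λ|w|; setting the derivative to zero gives a soft-thresholding rule, w = sign(Σxy) · max(|Σxy| – λ/2, 0) / Σx^2. The function name `lasso_weight` and the λ values are hypothetical.

```python
def lasso_weight(xs, ys, lam):
    """Minimizer of sum((y - w*x)^2) + lam * |w| for a single weight w (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Soft-thresholding: the correlation term is reduced by lam / 2 and clipped at zero
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * shrunk / sxx

xs, ys = [5.0, 7.0], [10.0, 14.0]  # y = 2x

for lam in (0.0, 148.0, 296.0, 400.0):
    print(lam, lasso_weight(xs, ys, lam))
```

Once λ is large enough the weight is exactly 0, so the feature is dropped from the model, whereas an L2 penalty would only shrink the weight toward zero.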
Elastic Net Regularization
◦ The third type of regularization uses both L1 and L2 regularization to balance their effects.
◦ In addition to choosing a lambda value, elastic net also allows us to tune the alpha
parameter, where α = 0 corresponds to ridge and α = 1 to lasso. Simply put, if you plug in 0 for alpha, the
penalty function reduces to the L2 (ridge) term, and if we set alpha to 1 we get the L1 (lasso) term.
◦ Therefore we can choose an alpha value between 0 and 1 to optimize the elastic net (here we can adjust
the weightage of each regularization, thus giving the name elastic). Effectively this will shrink some
coefficients and set some to 0 for sparse selection.
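The three penalty terms can be compared side by side. This is a sketch under one common convention, penalty = λ · (α · Σ|w| + (1 – α) · Σw^2), chosen so that α = 0 gives the ridge term and α = 1 gives the lasso term; some libraries scale the L2 part by 1/2 or name the parameters differently. The weight values are made up for illustration.

```python
def l1_penalty(weights):
    """Sum of absolute values (lasso term)."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Sum of squares (ridge term)."""
    return sum(w * w for w in weights)

def elastic_net_penalty(weights, lam, alpha):
    """Blend of the L1 and L2 terms; alpha = 0 is pure ridge, alpha = 1 is pure lasso."""
    return lam * (alpha * l1_penalty(weights) + (1.0 - alpha) * l2_penalty(weights))

weights = [0.5, -2.0, 0.0]  # hypothetical model weights

for alpha in (0.0, 0.5, 1.0):
    print(alpha, elastic_net_penalty(weights, lam=1.0, alpha=alpha))
```

The penalty value is added to the sum-of-squares error to form the cost that training minimizes.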
Early stopping
◦ Early stopping is a kind of cross-validation strategy where we keep one part of the training set as the
validation set. When we see that the performance on the validation set is getting worse, we immediately stop
training the model. This is known as early stopping.
◦ In the accompanying figure, we stop training at the dotted line, since beyond that point the model starts
overfitting the training data.
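The stopping rule can be sketched as a small helper that scans per-epoch validation losses. The loss values below are made up for illustration, and the `patience` parameter is an assumption added for flexibility: `patience=1` corresponds to stopping immediately when the validation loss stops improving, as described above.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation loss improved: remember this epoch and reset the counter
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    # Never triggered: training ran through every epoch
    return len(val_losses) - 1, best_epoch

val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.60]  # hypothetical values
print(early_stopping_epoch(val_losses))
```

In practice the weights saved at `best_epoch` are usually the ones kept, since they gave the lowest validation loss.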
NORMALIZATION
Some algorithms are sensitive to the scale of feature values. If the feature values are too high or too low,
the algorithms may not perform well. Therefore, it is essential to normalize the features.
For example, Min-Max normalization is defined as follows:
Min – Max normalization = (X – min)/(max - min)
Data (x)    Min-Max normalized x
144         0.68
101         0
120         0.30
112         0.17
164         1.00
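The Min-Max formula can be applied to the five data values from this section with a few lines of code. This is a minimal sketch; the function name `min_max_normalize` is hypothetical, and returning zeros when all values are equal is an assumption made to avoid division by zero.

```python
def min_max_normalize(values):
    """Scale values to the range [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: the formula is undefined, so return zeros (assumption)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

data = [144, 101, 120, 112, 164]
print([round(v, 2) for v in min_max_normalize(data)])
# prints [0.68, 0.0, 0.3, 0.17, 1.0]
```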
In backpropagation, the forward pass performs two operations: summation and activation function. Normalization is
performed either before or after the activation function, so the output of each step is normalized.
BATCH NORMALIZATION
=> Batch normalization is not performed on a single row of training data.
=> If the training data contains 100 samples, it can be normalized in batches, such as 32, 32, etc., without
distorting the shape of the neural network.
=> This process helps to stabilize the numerical data.
=> In this normalization method, outliers have less impact.
=> By adding an extra batch normalization layer, training becomes faster and more stable.
1. Normalization is a data pre-processing technique that scales numerical data without altering its shape, helping machine
learning models generalize better.
2. Batch normalization, used in deep neural networks, improves speed and stability by adding layers that standardize and
normalize the input from previous layers.
3. In a feedforward network, it normalizes the activations before passing them to the activation function.
4. During training, the mean and standard deviation of the activations are computed over each mini-batch of data, and the
learnable parameters (gamma and beta) are updated using gradient descent to minimize the loss function of the network.
ADVANTAGES:
•Improved Stability: Reduces internal covariate shift, enhancing training stability and convergence speed.
•Improved Performance: Reduces overfitting, improves generalization, and allows for larger learning rates.
•Faster Convergence: Speeds up convergence by decreasing gradient dependence on parameter scales, making
optimization more efficient.
DRAWBACKS:
•Increased Computational Cost: Requires additional calculations for the mean and standard deviation of each layer
in each mini-batch, along with scaling and shifting using gamma and beta parameters.
•Limited by Batch Size: Most effective with larger batch sizes; small batches may yield noisy estimates of mean and
standard deviation, affecting network performance.
Apply a batch normalization,
1. μ = 1/m Σ x
2. σ^2 = 1/m Σ(x-μ)^2
3. x^ = (x – μ)/ √(σ^2 + ε)
4. y = γx^ + β = BN γ,β (x)
Example :
Mini batch of 4 examples with 1 feature
X= [1,2,3,4]
1. μ = (1+2+3+4)/4 = 2.5
2. σ^2 = 1.25, so σ = √1.25 = 1.118
3. x^ = (x – μ)/ σ (taking ε ≈ 0)
= [-1.3416,-0.4472,0.4472,1.3416]
4. γ = 1, β = 0
y = γ * x^ + β
y = [-1.3416,-0.4472,0.4472,1.3416]
Pass to next layer
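The batch normalization steps and the X = [1,2,3,4] example from this section can be reproduced with a short script. This is a sketch of the training-time forward computation only, for a single feature; the function name `batch_norm` and the default ε = 1e-5 are assumptions, and the gradient updates of γ and β are not shown.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale by gamma and shift by beta."""
    m = len(batch)
    mean = sum(batch) / m                          # step 1: mini-batch mean
    var = sum((x - mean) ** 2 for x in batch) / m  # step 2: mini-batch variance
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in batch]  # step 3: normalize
    out = [gamma * xh + beta for xh in x_hat]      # step 4: scale and shift
    return out, mean, var

out, mean, var = batch_norm([1.0, 2.0, 3.0, 4.0])
print(mean, var, [round(v, 4) for v in out])
```

The variance of this mini-batch is 1.25, and its square root, about 1.118, is the standard deviation used to divide the centered values.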
THANKING YOU