21CS743 | DEEP LEARNING | SEARCH CREATORS.
Module-02
Feedforward Networks and Deep Learning
Introduction to Feedforward Neural Networks
1.1 Basic Concepts
• A feedforward neural network is the simplest form of artificial neural network (ANN)
• Information moves in only one direction: forward, from the input nodes through the hidden nodes to the output nodes
• No cycles or loops exist in the network structure
1.2 Historical Context
1. Origins
o Inspired by biological neural networks
o First proposed by Warren McCulloch and Walter Pitts (1943)
o Significant advancement with perceptron by Frank Rosenblatt (1958)
2. Evolution
o Single-layer to multi-layer networks
o Development of backpropagation in 1986
o Modern deep learning revolution (2012-present)
1.3 Network Architecture
1. Input Layer
o Receives raw input data
o No computation performed
o Number of neurons equals number of input features
o Standardization/normalization often applied here
2. Hidden Layers
o Perform intermediate computations
o Can have multiple hidden layers
o Each neuron connected to all neurons in previous layer
o Feature extraction and transformation occur here
3. Output Layer
o Produces final network output
o Number of neurons depends on problem type
o Multi-class classification: typically one neuron per class (binary classification: a single sigmoid neuron)
o Regression: usually one neuron
1.4 Activation Functions
1. Sigmoid (Logistic)
o Formula: σ(x) = 1/(1 + e^(-x))
o Range: (0, 1)
o Used in binary classification
o Properties:
▪ Smooth gradient
▪ Output can be read as a probability
▪ Suffers from vanishing gradient
2. Hyperbolic Tangent (tanh)
o Formula: tanh(x) = (e^x - e^(-x))/(e^x + e^(-x))
o Range: (-1, 1)
o Often performs better than sigmoid
o Properties:
▪ Zero-centered
▪ Stronger gradients
▪ Still has vanishing gradient issue
3. ReLU (Rectified Linear Unit)
o Formula: f(x) = max(0,x)
o Most commonly used
o Helps solve vanishing gradient problem
o Properties:
▪ Computationally efficient
▪ No saturation in positive region
▪ Can suffer from the dying ReLU problem
4. Leaky ReLU
o Formula: f(x) = max(0.01x, x)
o Addresses dying ReLU problem
o Small negative slope
o Properties:
▪ Never completely dies
▪ Allows for negative values
▪ More robust than standard ReLU
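A minimal NumPy sketch of the four activation functions above (the function names and the 0.01 leak factor are illustrative, taken from the formulas in this section):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # output in (0, 1)

def tanh(x):
    return np.tanh(x)                   # output in (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)           # max(0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)     # small slope for negative inputs

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x), leaky_relu(x))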
2. Gradient-Based Learning
2.1 Understanding Gradients
1. Definition
o Gradient is a vector of partial derivatives
o Points in direction of steepest increase
o Used to minimize loss function
2. Properties
o Direction indicates fastest increase
o Magnitude indicates steepness
o Negative gradient used for minimization
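As a concrete illustration, a few steps of gradient descent on the one-dimensional loss J(w) = w² (gradient 2w); the starting point and learning rate are illustrative:

w = 4.0                   # initial parameter
alpha = 0.1               # learning rate
for step in range(5):
    grad = 2 * w          # dJ/dw for J(w) = w**2
    w = w - alpha * grad  # step against the gradient (direction of steepest decrease)
    print(step, round(w, 4))
# w shrinks toward the minimum at 0: 3.2, 2.56, 2.048, ...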
2.2 Cost Functions
1. Mean Squared Error (MSE)
o Used for regression problems
o Formula: MSE = (1/n)Σ(y_true - y_pred)²
o Properties:
▪ Always positive
▪ Penalizes larger errors more
▪ Differentiable
2. Cross-Entropy Loss
o Used for classification problems
o Formula: -Σ(y_true * log(y_pred))
o Properties:
▪ Measures probability distribution difference
▪ Better for classification than MSE
▪ Provides stronger gradients
3. Huber Loss
o Combines MSE and MAE
o Less sensitive to outliers
o Formula:
▪ L = 0.5(y - f(x))² if |y - f(x)| ≤ δ
▪ L = δ|y - f(x)| - 0.5δ² otherwise
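A short NumPy sketch of the three losses above; the Huber δ of 1.0 is an illustrative choice:

import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: (1/n) * sum((y_true - y_pred)**2)
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true is one-hot, y_pred holds predicted probabilities; eps avoids log(0)
    return -np.sum(y_true * np.log(y_pred + eps))

def huber(y_true, y_pred, delta=1.0):
    err = np.abs(y_true - y_pred)
    quadratic = 0.5 * err ** 2                  # MSE-like region, |error| <= delta
    linear = delta * err - 0.5 * delta ** 2     # MAE-like region, |error| > delta
    return np.mean(np.where(err <= delta, quadratic, linear))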
2.3 Gradient Descent Types
1. Batch Gradient Descent
o Uses entire dataset for each update
o More stable but slower
o Formula: θ = θ - α∇J(θ)
o Memory intensive for large datasets
2. Stochastic Gradient Descent (SGD)
o Updates parameters after each sample
o Faster but less stable
o Better for large datasets
o High variance in parameter updates
3. Mini-batch Gradient Descent
o Compromise between batch and SGD
o Updates parameters after small batches
o Most commonly used in practice
o Typical batch sizes: 32, 64, 128
4. Advanced Optimizers
a) Adam (Adaptive Moment Estimation)
o Combines momentum and RMSprop
o Adaptive learning rates
o Formula includes first and second moments
b) RMSprop
o Adaptive learning rates
o Divides by running average of gradient magnitudes
c) Momentum
o Adds fraction of previous update
o Helps escape local minima
o Reduces oscillation
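A minimal sketch tying these ideas together: one epoch of mini-batch training whose parameter update uses Adam (the β values are the commonly cited defaults; theta, X, y, and compute_gradient are hypothetical placeholders, not defined in this module):

import numpy as np

alpha, beta1, beta2, eps, batch_size = 0.001, 0.9, 0.999, 1e-8, 32
m, v, t = np.zeros_like(theta), np.zeros_like(theta), 0   # Adam state

for start in range(0, len(X), batch_size):                # mini-batch loop
    X_b, y_b = X[start:start + batch_size], y[start:start + batch_size]
    grad = compute_gradient(theta, X_b, y_b)              # hypothetical helper
    t += 1
    m = beta1 * m + (1 - beta1) * grad                    # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * grad ** 2               # second moment (RMSprop term)
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)   # bias correction
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)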
3. Backpropagation and Chain Rule
3.1 Chain Rule Fundamentals
1. Mathematical Basis
o df/dx = df/dy * dy/dx
o Allows computation of composite function derivatives
o Essential for neural network training
2. Application in Neural Networks
o Computes gradients layer by layer
o Propagates error backwards
o Updates weights based on contribution to error
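As a small worked example of the chain rule, take f(y) = y² with y = 3x, so df/dx = (df/dy)(dy/dx) = 2y · 3 = 18x; the snippet checks this numerically at an illustrative x = 2:

def f(x):
    return (3 * x) ** 2                        # composite function: f(y) = y**2 with y = 3x

x, h = 2.0, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)      # central finite difference
analytic = 18 * x                              # chain rule result
print(numeric, analytic)                       # both are approximately 36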
3.2 Forward Pass
1. Input Processing
o Data normalization
o Weight initialization
o Bias addition
2. Layer Computation
# Pseudo-code for the forward pass: propagate activations layer by layer
A = X                          # activations start as the input
for layer in network:
    Z = layer.W @ A + layer.b  # linear transformation
    A = layer.activation(Z)    # apply activation function
3. Output Generation
o Final layer activation
o Prediction computation
o Error calculation
3.3 Backward Pass
1. Error Calculation
o Compare output with target
o Calculate loss using cost function
o Initialize gradient computation
2. Weight Updates
o Calculate gradients using chain rule
o Update weights: w_new = w_old - learning_rate * gradient
o Update biases similarly
3. Detailed Steps
# Pseudo-code for backward pass (m = number of samples in the batch)
# Output layer (linear output with MSE, or softmax/sigmoid with cross-entropy)
dZ = A - Y
dW = (1 / m) * dZ @ A_prev.T
db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
# Hidden layers (repeated backwards through the network)
dA = W_next.T @ dZ_next               # error flowing back from the next layer
dZ = dA * activation_derivative(Z)
dW = (1 / m) * dZ @ A_prev.T
db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
4. Regularization for Deep Learning
4.1 L1 Regularization
1. Mathematical Form
o Adds absolute value of weights to loss
o Formula: L1 = λΣ|w|
o Promotes sparsity
2. Properties
o Feature selection capability
o Produces sparse models
o Less sensitive to outliers
4.2 L2 Regularization
1. Mathematical Form
o Adds squared weights to loss
o Formula: L2 = λΣw²
o Prevents large weights
2. Properties
o Smooth weight decay
o No sparse solutions
o More stable training
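A sketch of adding either penalty to a base loss; the λ value, the weights list, and data_loss are illustrative placeholders:

import numpy as np

lam = 1e-4                                                    # regularization strength λ
l1_penalty = lam * sum(np.sum(np.abs(W)) for W in weights)    # L1 = λ Σ|w|
l2_penalty = lam * sum(np.sum(W ** 2) for W in weights)       # L2 = λ Σ w²
loss = data_loss + l2_penalty                                 # or + l1_penalty for sparsity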
4.3 Dropout
1. Basic Concept
o Randomly deactivate neurons
o Probability p of keeping neurons
o Different network for each training batch
2. Implementation Details
# Pseudo-code for inverted dropout (p = probability of keeping a neuron)
mask = np.random.binomial(1, p, size=A.shape)
A = A * mask   # randomly deactivate neurons
A = A / p      # scale to maintain the expected activation
3. Training vs. Testing
o Used only during training
o Scaled appropriately during inference
o Acts as model ensemble
4.4 Early Stopping
1. Implementation
o Monitor validation error
o Save best model
o Stop when validation error increases
2. Benefits
o Prevents overfitting
o Reduces training time
o Automatic model selection
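A minimal early-stopping loop, assuming hypothetical train_one_epoch, evaluate, get_weights, and set_weights helpers and an illustrative patience of 5 epochs:

best_val, best_weights, patience, wait = float("inf"), None, 5, 0
for epoch in range(max_epochs):
    train_one_epoch(model)                    # hypothetical training step
    val_loss = evaluate(model, val_data)      # hypothetical validation helper
    if val_loss < best_val:                   # validation error improved
        best_val, best_weights, wait = val_loss, model.get_weights(), 0
    else:
        wait += 1
        if wait >= patience:                  # no improvement for `patience` epochs
            break
model.set_weights(best_weights)               # restore the best saved model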
5. Advanced Concepts
5.1 Batch Normalization
1. Purpose
o Normalizes layer inputs
o Reduces internal covariate shift
o Speeds up training
2. Algorithm
# Pseudo-code for batch normalization (per feature, over the mini-batch)
mean = np.mean(x, axis=0)
var = np.var(x, axis=0)
x_norm = (x - mean) / np.sqrt(var + eps)   # eps avoids division by zero
out = gamma * x_norm + beta                # learnable scale and shift
5.2 Weight Initialization
1. Xavier/Glorot Initialization
o Variance = 2/(n_in + n_out)
o Suitable for tanh activation
2. He Initialization
o Variance = 2/n_in
o Better for ReLU activation
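Both schemes in NumPy, where n_in and n_out are the fan-in and fan-out of a layer:

import numpy as np

def xavier_init(n_in, n_out):
    # Glorot/Xavier: variance 2 / (n_in + n_out), suited to tanh layers
    std = np.sqrt(2.0 / (n_in + n_out))
    return np.random.randn(n_out, n_in) * std

def he_init(n_in, n_out):
    # He: variance 2 / n_in, suited to ReLU layers
    std = np.sqrt(2.0 / n_in)
    return np.random.randn(n_out, n_in) * std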
6. Practical Implementation
6.1 Network Design Considerations
1. Architecture Choices
o Number of layers
o Neurons per layer
o Activation functions
2. Hyperparameter Selection
o Learning rate
o Batch size
o Regularization strength
6.2 Training Process
1. Data Preparation
o Splitting data
o Normalization
o Augmentation
2. Training Loop
o Forward pass
o Loss computation
o Backward pass
o Parameter updates
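Putting the four steps together, a schematic training loop; forward, compute_loss, backward, update_parameters, and iterate_minibatches are hypothetical helpers built from the pseudo-code earlier in this module:

for epoch in range(num_epochs):
    for X_batch, y_batch in iterate_minibatches(X_train, y_train, batch_size):
        A, cache = forward(network, X_batch)               # forward pass
        loss = compute_loss(A, y_batch)                    # loss computation
        grads = backward(network, cache, y_batch)          # backward pass
        update_parameters(network, grads, learning_rate)   # parameter updates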
Practice Problems and Exercises
1. Basic Concepts
o Explain the role of activation functions in neural networks
o Compare and contrast different types of gradient descent
o Describe the vanishing gradient problem
2. Mathematical Problems
o Calculate gradients for a simple 2-layer network
o Implement batch normalization equations
o Compute different loss functions
3. Implementation Challenges
o Design a network for MNIST classification
o Implement dropout in Python
o Create a custom loss function
Key Formulas Reference Sheet
1. Activation Functions
o Sigmoid: σ(x) = 1/(1 + e^(-x))
o tanh(x) = (e^x - e^(-x))/(e^x + e^(-x))
o ReLU: f(x) = max(0,x)
2. Loss Functions
o MSE = (1/n)Σ(y_true - y_pred)²
o Cross-Entropy = -Σ(y_true * log(y_pred))
3. Regularization
o L1 = λΣ|w|
o L2 = λΣw²
4. Gradient Descent
o Update: w = w - α∇J(w)
o Momentum: v = βv - α∇J(w), then w = w + v
Common Issues and Solutions
1. Vanishing Gradients
o Use ReLU activation
o Implement batch normalization
o Try residual connections
2. Overfitting
o Add dropout
o Use regularization
o Implement early stopping
3. Poor Convergence
o Adjust learning rate
o Try different optimizers
o Check data normalization