MFDS Assignment 2


MFDS ASSIGNMENT

Q1

Gradient Descent: Theory and Implementation

Introduction

Gradient descent is a fundamental optimization algorithm used to minimize a cost function by iteratively adjusting the model's parameters. It aims to find the minimum of a function and is often applied in machine learning tasks to optimize model weights. There are three popular variants of gradient descent:

1. Vanilla Gradient Descent (Batch Gradient Descent)

2. Mini-batch Gradient Descent

3. Stochastic Gradient Descent (SGD)

Below is an explanation of these methods along with the code required to implement them.

---

1. Vanilla Gradient Descent

Process: Uses the entire dataset to compute the gradient at each step.

Advantages: Provides a more stable and accurate gradient.

Disadvantages: Computationally expensive for large datasets.


Update rule:

\theta = \theta - \alpha \cdot \nabla J(\theta)

where:

\theta = model parameters

\alpha = learning rate

\nabla J(\theta) = gradient of the cost function with respect to the parameters

---

2. Mini-batch Gradient Descent

Process: Uses random subsets (batches) of the data to compute the gradient.

Advantages: Strikes a balance between speed and stability.

Optimal batch size: Experimentally chosen, often between 32 and 256 depending on the problem.

Update rule:

\theta = \theta - \alpha \cdot \nabla J(\theta; \text{Batch})

---

3. Stochastic Gradient Descent (SGD)


Process: Updates the parameters for each data point individually.

Advantages: Faster convergence and suitable for online learning.

Disadvantages: High variance in gradient updates; may fluctuate around the minimum.

Update rule:

\theta = \theta - \alpha \cdot \nabla J(\theta; x_i, y_i)

---

Implementation

Below is a Python implementation of the three variants using NumPy and a synthetic dataset with 1500 rows and 15 features.

import numpy as np

# Generate a synthetic dataset with 1500 samples and 15 features
np.random.seed(42)
X = np.random.randn(1500, 15)
y = X @ np.random.randn(15) + np.random.randn(1500)  # Linear relationship with noise

# Initialize parameters
theta = np.zeros(15)  # Model weights
alpha = 0.01          # Learning rate
epochs = 1000         # Number of iterations

# Helper function to compute the cost
def compute_cost(X, y, theta):
    m = len(y)
    predictions = X @ theta
    cost = (1 / (2 * m)) * np.sum((predictions - y) ** 2)
    return cost

# 1. Vanilla Gradient Descent
def vanilla_gradient_descent(X, y, theta, alpha, epochs):
    m = len(y)
    for epoch in range(epochs):
        gradient = (1 / m) * X.T @ (X @ theta - y)
        theta -= alpha * gradient
    return theta

# 2. Mini-batch Gradient Descent
def mini_batch_gradient_descent(X, y, theta, alpha, epochs, batch_size):
    m = len(y)
    for epoch in range(epochs):
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        for i in range(0, m, batch_size):
            X_batch = X_shuffled[i:i + batch_size]
            y_batch = y_shuffled[i:i + batch_size]
            gradient = (1 / len(y_batch)) * X_batch.T @ (X_batch @ theta - y_batch)
            theta -= alpha * gradient
    return theta

# 3. Stochastic Gradient Descent
def stochastic_gradient_descent(X, y, theta, alpha, epochs):
    m = len(y)
    for epoch in range(epochs):
        for i in range(m):
            xi = X[i:i + 1]
            yi = y[i:i + 1]
            gradient = xi.T @ (xi @ theta - yi)
            theta -= alpha * gradient
    return theta

# Experimenting with the methods
theta_vanilla = vanilla_gradient_descent(X, y, theta.copy(), alpha, epochs)
theta_mini_batch = mini_batch_gradient_descent(X, y, theta.copy(), alpha, epochs, batch_size=64)
theta_sgd = stochastic_gradient_descent(X, y, theta.copy(), alpha, epochs)

# Compute and print the final costs for comparison
print("Vanilla Gradient Descent Cost:", compute_cost(X, y, theta_vanilla))
print("Mini-batch Gradient Descent Cost:", compute_cost(X, y, theta_mini_batch))
print("Stochastic Gradient Descent Cost:", compute_cost(X, y, theta_sgd))


---

Experiment on Convergence using Mini-batch Gradient Descent

To determine the optimal batch size, we can experiment with different batch
sizes (e.g., 16, 32, 64, 128, 256) and observe the convergence behavior.

import matplotlib.pyplot as plt

# Function to track convergence for different batch sizes
def track_convergence(X, y, alpha, epochs, batch_sizes):
    for batch_size in batch_sizes:
        theta = np.zeros(15)
        costs = []
        m = len(y)
        for epoch in range(epochs):
            indices = np.random.permutation(m)
            X_shuffled = X[indices]
            y_shuffled = y[indices]
            for i in range(0, m, batch_size):
                X_batch = X_shuffled[i:i + batch_size]
                y_batch = y_shuffled[i:i + batch_size]
                gradient = (1 / len(y_batch)) * X_batch.T @ (X_batch @ theta - y_batch)
                theta -= alpha * gradient
            costs.append(compute_cost(X, y, theta))
        plt.plot(costs, label=f'Batch Size {batch_size}')
    plt.xlabel('Epochs')
    plt.ylabel('Cost')
    plt.legend()
    plt.title('Convergence for Different Batch Sizes')
    plt.show()

# Experiment with different batch sizes
batch_sizes = [16, 32, 64, 128, 256]
track_convergence(X, y, alpha, epochs=200, batch_sizes=batch_sizes)

---

Conclusion

Vanilla Gradient Descent: Takes longer to converge due to processing the entire dataset in every iteration.

Stochastic Gradient Descent: Faster but may oscillate around the minimum.

Mini-batch Gradient Descent: Strikes a balance between speed and stability.


The optimal batch size is problem-dependent but often found between 32
and 128.

This code demonstrates how different gradient descent variants behave and
how batch size influences convergence in mini-batch gradient descent.
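
As an additional sanity check (not part of the assignment code above), the closed-form least-squares solution gives a lower bound on the cost that all three variants should approach; a minimal sketch, reusing X, y, and compute_cost from the implementation:

# Sanity-check sketch: the closed-form least-squares solution minimizes the cost
# exactly, so its cost is a floor for the three costs printed above.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Closed-form Least-Squares Cost:", compute_cost(X, y, theta_lstsq))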

Q2

Below are the implementations of the Bisection and Golden Section methods for line search to optimize the learning rate for the dataset used in Q1. These methods aim to find an optimal step size (learning rate) that minimizes the cost function along a search direction.

1. Bisection Method

The Bisection method finds the minimum by repeatedly halving the search interval, keeping the half in which the function value just beside the midpoint is smaller. It is useful when the cost function is unimodal in the given interval.

def bisection_line_search(f, a, b, tol=1e-5):
    """
    Bisection method to find the optimal step size.
    f: cost function to minimize
    a, b: interval [a, b] within which the minimum lies
    tol: tolerance for stopping the search
    """
    while abs(b - a) > tol:
        mid = (a + b) / 2
        if f(mid - tol) < f(mid + tol):
            b = mid
        else:
            a = mid
    return (a + b) / 2

# Example usage: minimize the cost function along the gradient direction
optimal_alpha_bisection = bisection_line_search(
    lambda alpha: compute_cost(X, y, theta - alpha * X.T @ (X @ theta - y)),
    a=0.0001, b=1
)

print("Optimal learning rate (Bisection):", optimal_alpha_bisection)

2. Golden Section Method

The Golden Section method uses the golden ratio to place the interior points of the search interval so that one function evaluation can be reused at every iteration. It therefore performs fewer function evaluations than the Bisection method and can be faster for some problems.

def golden_section_line_search(f, a, b, tol=1e-5):
    """
    Golden section method to find the optimal step size.
    f: cost function to minimize
    a, b: interval [a, b] within which the minimum lies
    tol: tolerance for stopping the search
    """
    phi = (1 + np.sqrt(5)) / 2   # Golden ratio
    resphi = 2 - phi
    x1 = a + resphi * (b - a)
    x2 = b - resphi * (b - a)
    f1, f2 = f(x1), f(x2)        # Cache evaluations so each iteration needs only one new one
    while abs(b - a) > tol:
        if f1 < f2:
            b, x2, f2 = x2, x1, f1
            x1 = a + resphi * (b - a)
            f1 = f(x1)
        else:
            a, x1, f1 = x1, x2, f2
            x2 = b - resphi * (b - a)
            f2 = f(x2)
    return (a + b) / 2

# Example usage: minimize the cost function along the gradient direction
optimal_alpha_golden = golden_section_line_search(
    lambda alpha: compute_cost(X, y, theta - alpha * X.T @ (X @ theta - y)),
    a=0.0001, b=1
)

print("Optimal learning rate (Golden Section):", optimal_alpha_golden)

Comparison of Time and Convergence

Both methods are efficient at finding the optimal learning rate, but the
Golden Section method tends to perform fewer iterations due to its more
efficient interval reduction. Here’s what you may observe:
1. Time Comparison:

Bisection: Slightly slower, since each iteration requires two fresh function evaluations on either side of the midpoint.

Golden Section: Faster, since it reuses one of the previous evaluations and therefore performs fewer function evaluations overall.

2. Convergence Behavior:

Both methods converge to similar optimal learning rates.

For very small tolerances, both methods may take more time, but Golden
Section converges more quickly due to its strategic interval division.
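
To make this comparison concrete, the sketch below (an addition; count_evaluations is a hypothetical helper reusing compute_cost, X, y, and theta from Q1) counts how many times each method evaluates the objective at different tolerances:

# Sketch: count objective evaluations for each line-search method at several tolerances.
def count_evaluations(search_fn, tol):
    calls = {"n": 0}
    def objective(alpha):
        calls["n"] += 1
        return compute_cost(X, y, theta - alpha * X.T @ (X @ theta - y))
    search_fn(objective, 0.0001, 1, tol=tol)
    return calls["n"]

for tol in [1e-3, 1e-5, 1e-7]:
    n_bis = count_evaluations(bisection_line_search, tol)
    n_gold = count_evaluations(golden_section_line_search, tol)
    print(f"tol={tol:.0e}: Bisection evals = {n_bis}, Golden Section evals = {n_gold}")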

Use the print statements to observe the time taken for each method when
applied to the dataset. You can further profile the time using Python’s time
module:

import time

start = time.time()
bisection_result = bisection_line_search(
    lambda alpha: compute_cost(X, y, theta - alpha * X.T @ (X @ theta - y)), 0.0001, 1)
print(f"Bisection Time: {time.time() - start:.5f} seconds")

start = time.time()
golden_result = golden_section_line_search(
    lambda alpha: compute_cost(X, y, theta - alpha * X.T @ (X @ theta - y)), 0.0001, 1)
print(f"Golden Section Time: {time.time() - start:.5f} seconds")

Conclusion

The Golden Section method tends to be faster and more efficient, particularly for smaller tolerances.

Both methods provide a reliable way to find the optimal learning rate for
gradient descent.

Use the Golden Section method when computational speed is critical.

Q3

Variable Step Size for Faster Convergence

For the function f(x) = x^2 \cos(x) + \sin(x) - x, the goal is to derive a variable step size that accelerates convergence in gradient descent.

Gradient descent update rule:

x_{n+1} = x_n - \gamma \cdot \nabla f(x_n)

where \gamma is the step size and \nabla f(x_n) is the gradient of f evaluated at x_n.

Deriving the Gradient


First, let's calculate the gradient \nabla f(x) of

f(x) = x^2 \cos(x) + \sin(x) - x

Differentiating term by term:

\frac{d}{dx} \left( x^2 \cos(x) \right) = 2x \cos(x) - x^2 \sin(x)

\frac{d}{dx} \left( \sin(x) \right) = \cos(x), \quad \frac{d}{dx} \left( -x \right) = -1

Thus:

\nabla f(x) = 2x \cos(x) - x^2 \sin(x) + \cos(x) - 1

---

Choosing the Variable Step Size

To ensure faster convergence, the variable step size can be derived from the Lipschitz constant of the gradient \nabla f(x). The Lipschitz constant L bounds the change in the gradient as:

\| \nabla f(x_1) - \nabla f(x_2) \| \leq L \| x_1 - x_2 \|

A common choice for the step size is:

\gamma_n = \frac{1}{L_n}

where L_n is a local estimate of the Lipschitz constant, taken here as the magnitude of the second derivative at x_n.

---

Computing the Hessian


We need the second derivative of f to find the curvature.

\nabla^2 f(x) = \frac{d}{dx} \left( 2x \cos(x) - x^2 \sin(x) + \cos(x) - 1 \right)

\nabla^2 f(x) = 2 \cos(x) - 2x \sin(x) - 2x \sin(x) - x^2 \cos(x) - \sin(x) = 2 \cos(x) - 4x \sin(x) - x^2 \cos(x) - \sin(x)

Evaluating this at each iteration gives the local curvature and hence the adaptive step size:

\gamma_n = \frac{1}{\left| 2 \cos(x_n) - 4 x_n \sin(x_n) - x_n^2 \cos(x_n) - \sin(x_n) \right|}
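
As an optional check of the derivation above (an addition, not part of the original solution), sympy can confirm the gradient and the second derivative symbolically:

# Optional sketch: verify the hand-derived gradient and second derivative with sympy.
import sympy as sp

x = sp.symbols('x')
f_sym = x**2 * sp.cos(x) + sp.sin(x) - x

grad_sym = sp.simplify(sp.diff(f_sym, x))     # expected: 2x*cos(x) - x^2*sin(x) + cos(x) - 1
hess_sym = sp.simplify(sp.diff(f_sym, x, 2))  # expected: 2*cos(x) - 4x*sin(x) - x^2*cos(x) - sin(x)

print("f'(x)  =", grad_sym)
print("f''(x) =", hess_sym)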

---

Constant Step Size and Decaying Step Size

1. Constant Step Size:

\gamma = \alpha_0

2. Decaying Step Size:

\gamma_n = \frac{\alpha_0}{1 + k \cdot n}

where \alpha_0 is the initial step size, k is the decay rate, and n is the iteration number.

---

Code for Comparison

Below is Python code to implement and compare the three methods.


import numpy as np
import matplotlib.pyplot as plt

# Function, its gradient, and its second derivative
def f(x):
    return x**2 * np.cos(x) + np.sin(x) - x

def gradient_f(x):
    return 2 * x * np.cos(x) - x**2 * np.sin(x) + np.cos(x) - 1

def hessian_f(x):
    return 2 * np.cos(x) - 4 * x * np.sin(x) - x**2 * np.cos(x) - np.sin(x)

# Gradient descent with a pluggable step size strategy
def gradient_descent(f, grad_f, step_size_func, x_init, epochs):
    x = x_init
    history = [x]
    for _ in range(epochs):
        gamma = step_size_func(x)
        x = x - gamma * grad_f(x)
        history.append(x)
    return np.array(history)

# Step size strategies
def constant_step_size(alpha):
    return lambda x: alpha

def decaying_step_size(alpha, k):
    # Closure with an iteration counter so that gamma_n = alpha / (1 + k * n)
    counter = {"n": 0}
    def step(x):
        gamma = alpha / (1 + k * counter["n"])
        counter["n"] += 1
        return gamma
    return step

def variable_step_size():
    # Inverse of the local curvature; the small epsilon avoids division by zero
    return lambda x: 1 / (np.abs(hessian_f(x)) + 1e-8)

# Initialize and run
x_init = 1.0   # Initial point
epochs = 50

# Run gradient descent with different step size strategies
history_constant = gradient_descent(f, gradient_f, constant_step_size(0.1), x_init, epochs)
history_decaying = gradient_descent(f, gradient_f, decaying_step_size(0.1, 0.01), x_init, epochs)
history_variable = gradient_descent(f, gradient_f, variable_step_size(), x_init, epochs)

# Plot the iterates for each strategy
plt.plot(history_constant, label='Constant Step Size')
plt.plot(history_decaying, label='Decaying Step Size')
plt.plot(history_variable, label='Variable Step Size')
plt.xlabel('Iterations')
plt.ylabel('x')
plt.legend()
plt.title('Comparison of Step Size Strategies')
plt.show()

---

Observations

1. Constant Step Size:

Converges only if the step size is carefully chosen. Too large, and it diverges;
too small, and it slows down.

2. Decaying Step Size:

Works well for non-convex functions, but convergence can be slow if the step size decays too quickly.

3. Variable Step Size:

Adapts to the function's curvature, providing faster convergence and better stability. This method performs well for most functions, especially when gradients vary significantly across the domain.
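
To complement the plot, a small addition (reusing the histories computed above) prints the final iterate and objective value reached by each strategy:

# Print the final iterate and objective value for each step size strategy
for name, hist in [("Constant", history_constant),
                   ("Decaying", history_decaying),
                   ("Variable", history_variable)]:
    print(f"{name:>8} step size: x_final = {hist[-1]:.5f}, f(x_final) = {f(hist[-1]):.5f}")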

---

Conclusion

Variable step size offers faster convergence due to its adaptability, especially
when the function has varying curvature.

Constant step size can be effective but requires careful tuning.


Decaying step size promotes steady convergence but can be slow, especially in the later stages of optimization.
