MFD S Assignment 2
Q1
Introduction
---
1. Batch Gradient Descent
Process: Uses the entire dataset to compute the gradient at each step.
Update rule:
\theta := \theta - \alpha \nabla J(\theta)
where
\theta = model parameters
\alpha = learning rate
---
2. Mini-Batch Gradient Descent
Process: Uses random subsets (batches) of the data to compute the gradient.
Update rule:
\theta := \theta - \alpha \nabla J(\theta; X_{\text{batch}}, y_{\text{batch}})
---
3. Stochastic Gradient Descent (SGD)
Process: Uses one randomly chosen training example to compute the gradient at each step.
Update rule:
\theta := \theta - \alpha \nabla J(\theta; x^{(i)}, y^{(i)})
---
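All three variants are implemented below for the mean-squared-error cost of linear regression (an assumption consistent with the compute_cost function that follows), for which the cost and its gradient have the closed forms
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big( x^{(i)} \theta - y^{(i)} \big)^2
\nabla J(\theta) = \frac{1}{m} X^{T} (X\theta - y)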
Implementation
import numpy as np

np.random.seed(42)
X = np.random.randn(1500, 15)
# Synthetic target vector (assumed here; the original does not show how y is built)
y = X @ np.random.randn(15) + 0.5 * np.random.randn(1500)
# Initialize parameters
theta_init = np.zeros(15)

def compute_cost(X, y, theta):
    m = len(y)
    predictions = X @ theta
    cost = (1 / (2 * m)) * np.sum((predictions - y) ** 2)
    return cost

def batch_gradient_descent(X, y, theta, alpha=0.01, epochs=100):
    m = len(y)
    for _ in range(epochs):
        theta = theta - (alpha / m) * X.T @ (X @ theta - y)
    return theta

def mini_batch_gradient_descent(X, y, theta, alpha=0.01, epochs=100, batch_size=32):
    m = len(y)
    for _ in range(epochs):
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        for start in range(0, m, batch_size):
            X_b = X_shuffled[start:start + batch_size]
            y_b = y_shuffled[start:start + batch_size]
            theta = theta - (alpha / len(y_b)) * X_b.T @ (X_b @ theta - y_b)
    return theta

def stochastic_gradient_descent(X, y, theta, alpha=0.01, epochs=10):
    m = len(y)
    for _ in range(epochs):
        for i in range(m):
            xi = X[i:i + 1]
            yi = y[i:i + 1]
            theta = theta - alpha * (xi.T @ (xi @ theta - yi))
    return theta
To determine the optimal batch size, we can experiment with different batch
sizes (e.g., 16, 32, 64, 128, 256) and observe the convergence behavior.
import matplotlib.pyplot as plt

batch_sizes = [16, 32, 64, 128, 256]
m = len(y)
for batch_size in batch_sizes:
    theta = np.zeros(15)
    costs = []
    for _ in range(100):  # epochs
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        for start in range(0, m, batch_size):
            X_b = X_shuffled[start:start + batch_size]
            y_b = y_shuffled[start:start + batch_size]
            theta = theta - (0.01 / len(y_b)) * X_b.T @ (X_b @ theta - y_b)
        costs.append(compute_cost(X, y, theta))
    plt.plot(costs, label=f'batch size = {batch_size}')
plt.xlabel('Epoch')
plt.ylabel('Cost')
plt.legend()
plt.show()
---
Conclusion
Batch Gradient Descent: Stable updates but computationally expensive per step on large datasets.
Stochastic Gradient Descent: Faster but may oscillate around the minimum.
Mini-Batch Gradient Descent: Balances the stability of batch updates with the speed of SGD.
This code demonstrates how different gradient descent variants behave and
how batch size influences convergence in mini-batch gradient descent.
Q2.
Below are the implementations of Bisection and Golden Section methods for
line search to optimize the learning rate for the dataset used in Q1. These
methods aim to find an optimal step size (learning rate) that minimizes the
cost function along a search direction.
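Both searches operate on the one-dimensional cost along the steepest-descent direction. A standard formulation (assumed here, since the original does not write it out) is
\phi(\alpha) = J\big(\theta - \alpha \nabla J(\theta)\big), \quad \alpha \in [a, b]
and the code below evaluates this function as phi_alpha.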
1. Bisection Method
The Bisection method finds the minimum by repeatedly halving the interval [a, b] according to the sign of the derivative of the cost at the midpoint. It is useful when the cost function is unimodal in the given interval.
def bisection_line_search(phi_alpha, a=0.0001, b=1, tol=1e-6):
    """
    Bisection line search: halve [a, b] using the sign of a numerical
    derivative of the one-dimensional cost phi_alpha at the midpoint.
    """
    eps = 1e-8
    while abs(b - a) > tol:
        mid = (a + b) / 2
        d_phi = (phi_alpha(mid + eps) - phi_alpha(mid - eps)) / (2 * eps)
        if d_phi > 0:   # cost increasing at mid: minimum lies to the left
            b = mid
        else:           # cost decreasing at mid: minimum lies to the right
            a = mid
    return (a + b) / 2

# Example usage: minimize the cost function along the negative gradient direction
grad = (1 / len(y)) * X.T @ (X @ theta_init - y)
def phi_alpha(alpha):
    return compute_cost(X, y, theta_init - alpha * grad)
optimal_alpha_bisection = bisection_line_search(phi_alpha, a=0.0001, b=1)
2. Golden Section Method
The Golden Section method uses the golden ratio to divide the search interval, reducing the search space efficiently. It performs fewer function evaluations than the Bisection method and can be faster for some problems.
def golden_section_line_search(phi_alpha, a=0.0001, b=1, tol=1e-6):
    """
    Golden-section line search: shrink [a, b] with the golden ratio so that
    one interior evaluation is reused at every iteration.
    """
    phi = (1 + np.sqrt(5)) / 2   # golden ratio
    resphi = 2 - phi
    x1 = a + resphi * (b - a)
    x2 = b - resphi * (b - a)
    f1, f2 = phi_alpha(x1), phi_alpha(x2)
    while abs(b - a) > tol:
        if f1 < f2:              # minimum lies in [a, x2]
            b = x2
            x2, f2 = x1, f1
            x1 = a + resphi * (b - a)
            f1 = phi_alpha(x1)
        else:                    # minimum lies in [x1, b]
            a = x1
            x1, f1 = x2, f2
            x2 = b - resphi * (b - a)
            f2 = phi_alpha(x2)
    return (a + b) / 2

# Example usage: minimize the cost function along the negative gradient direction
optimal_alpha_golden = golden_section_line_search(phi_alpha, a=0.0001, b=1)
Both methods are effective at finding a good learning rate, but the Golden Section method needs fewer cost-function evaluations because it reuses one interior point at every iteration, while the bisection step above spends two evaluations per iteration on the numerical derivative. Here is what you may observe (see the interval-reduction comparison below):
1. Time Comparison: Golden Section usually finishes in less wall-clock time for the same tolerance, since each iteration adds only one new cost evaluation.
2. Convergence Behavior: For very small tolerances, both methods take more iterations, but Golden Section stays cheaper per iteration due to its strategic interval division.
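A rough comparison, assuming the two implementations above (bisection driven by a two-sided numerical derivative, golden section reusing one interior evaluation), after n iterations:
\text{Bisection: } |b_n - a_n| = (b - a) \left( \tfrac{1}{2} \right)^{n}, \quad \text{cost evaluations} \approx 2n
\text{Golden Section: } |b_n - a_n| = (b - a) \, (0.618)^{n}, \quad \text{cost evaluations} \approx n + 1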
Use the print statements to observe the time taken for each method when
applied to the dataset. You can further profile the time using Python’s time
module:
import time
start = time.time()
bisection_line_search(phi_alpha, a=0.0001, b=1)
print("Bisection time:", time.time() - start)
start = time.time()
golden_section_line_search(phi_alpha, a=0.0001, b=1)
print("Golden Section time:", time.time() - start)
Conclusion
The Golden Section method tends to be faster and more efficient, particularly for smaller tolerances.
Both methods provide a reliable way to find the optimal learning rate for
gradient descent.
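As a small usage sketch (assuming the Q1 data and functions defined above), the step size returned by either search can be plugged straight back into gradient descent:

# Illustrative only: rerun batch gradient descent with the searched learning rate
theta_opt = batch_gradient_descent(X, y, theta_init, alpha=optimal_alpha_golden, epochs=100)
print("Final cost with searched alpha:", compute_cost(X, y, theta_opt))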
Q3
For the given function f(x), the goal is to derive a variable step size \gamma_n that accelerates convergence in gradient descent. The update at iteration n is
x_{n+1} = x_n - \gamma_n \nabla f(x_n)
where \nabla f(x_n) is the gradient.
---
To ensure faster convergence, the variable step size can be derived from the Lipschitz constant of the gradient \nabla f. The Lipschitz constant L bounds the change in the gradient as
\| \nabla f(x) - \nabla f(y) \| \le L \| x - y \|
and a step size that keeps the descent stable at iteration n is
\gamma_n = \frac{1}{L_n}
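As a quick worked check (an illustrative example, not part of the original derivation): for the quadratic f(x) = \frac{L}{2} x^2, the gradient f'(x) = Lx has Lipschitz constant exactly L, and the step \gamma = \frac{1}{L} reaches the minimizer in a single update:
x_{n+1} = x_n - \frac{1}{L} \cdot L x_n = 0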
---
For a twice-differentiable one-dimensional function, the local Lipschitz constant can be estimated by the second derivative, L_n = |f''(x_n)|. Evaluating this at each iteration gives the local curvature and hence the adaptive step size:
\gamma_n = \frac{1}{|f''(x_n)|}
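Substituting this step size into the update gives a Newton-like iteration (written here for the one-dimensional case used in the code below):
x_{n+1} = x_n - \frac{f'(x_n)}{|f''(x_n)|}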
---
For comparison, the constant-step-size baseline uses the same value at every iteration:
\gamma = \alpha_0
---
import numpy as np
import matplotlib.pyplot as plt
# Illustrative one-dimensional objective (an assumption; the original function is not shown)
def f(x):
    return x ** 4 - 3 * x ** 3 + 2
def gradient_f(x):
    return 4 * x ** 3 - 9 * x ** 2
def hessian_f(x):
    return 12 * x ** 2 - 18 * x
def gradient_descent(x_init, grad_f, step_size_func, epochs):
    x = x_init
    history = [x]
    for _ in range(epochs):
        gamma = step_size_func(x)
        x = x - gamma * grad_f(x)
        history.append(x)
    return np.array(history)
def constant_step_size(alpha):
    return lambda x: alpha
def variable_step_size():
    # gamma_n = 1 / |f''(x_n)|: take small steps where the curvature is high
    return lambda x: 1.0 / (abs(hessian_f(x)) + 1e-8)

epochs = 50
history_constant = gradient_descent(3.0, gradient_f, constant_step_size(0.01), epochs)
history_variable = gradient_descent(3.0, gradient_f, variable_step_size(), epochs)
plt.plot(history_constant, label='Constant step size')
plt.plot(history_variable, label='Variable step size')
plt.xlabel('Iterations')
plt.ylabel('x')
plt.legend()
plt.show()
---
Observations
Constant step size: Converges only if the step size is carefully chosen. Too large, and it diverges; too small, and it slows down.
Variable step size: Works well for non-convex functions, but convergence can be slow when high curvature makes the step size decrease too quickly.
---
Conclusion
A variable step size offers faster convergence due to its adaptability, especially when the function has varying curvature.