MFDS Assignment 2


MFDS ASSIGNMENT

Q1

Gradient Descent: Theory and Implementation

Introduction

Gradient descent is a fundamental optimization algorithm used to minimize a cost function by iteratively adjusting the model's parameters. It aims to find the minimum of a function and is often applied in machine learning tasks to optimize model weights. There are three popular variants of gradient descent:

1. Vanilla Gradient Descent (Batch Gradient Descent)

2. Mini-batch Gradient Descent

3. Stochastic Gradient Descent (SGD)

Below is an explanation of these methods along with the code required to implement them.

---

1. Vanilla Gradient Descent

Process: Uses the entire dataset to compute the gradient at each step.

Advantages: Provides a more stable and accurate gradient.

Disadvantages: Computationally expensive for large datasets.


Update rule:

\theta = \theta - \alpha \cdot \nabla J(\theta)

where:

\theta = model parameters

\alpha = learning rate

\nabla J(\theta) = gradient of the cost function with respect to the parameters

---

2. Mini-batch Gradient Descent

Process: Uses random subsets (batches) of the data to compute the gradient.

Advantages: Strikes a balance between speed and stability.

Optimal batch size: Experimentally chosen, often between 32 and 256 depending on the problem.

Update rule:

\theta = \theta - \alpha \cdot \nabla J(\theta; \text{Batch})

---

3. Stochastic Gradient Descent (SGD)


Process: Updates the parameters for each data point individually.

Advantages: Faster convergence and suitable for online learning.

Disadvantages: High variance in gradient updates; may fluctuate around the minimum.

Update rule:

\theta = \theta - \alpha \cdot \nabla J(\theta; x_i, y_i)

---

Implementation

Below is a Python implementation of the three variants using NumPy and a synthetic dataset with 1500 rows and 15 features.

import numpy as np

# Generate a synthetic dataset with 1500 samples and 15 features
np.random.seed(42)
X = np.random.randn(1500, 15)
y = X @ np.random.randn(15) + np.random.randn(1500)  # Linear relationship with noise

# Initialize parameters
theta = np.zeros(15)  # Model weights
alpha = 0.01          # Learning rate
epochs = 1000         # Number of iterations

# Helper function to compute the cost
def compute_cost(X, y, theta):
    m = len(y)
    predictions = X @ theta
    cost = (1 / (2 * m)) * np.sum((predictions - y) ** 2)
    return cost

# 1. Vanilla Gradient Descent
def vanilla_gradient_descent(X, y, theta, alpha, epochs):
    m = len(y)
    for epoch in range(epochs):
        gradient = (1 / m) * X.T @ (X @ theta - y)
        theta -= alpha * gradient
    return theta

# 2. Mini-batch Gradient Descent
def mini_batch_gradient_descent(X, y, theta, alpha, epochs, batch_size):
    m = len(y)
    for epoch in range(epochs):
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        for i in range(0, m, batch_size):
            X_batch = X_shuffled[i:i + batch_size]
            y_batch = y_shuffled[i:i + batch_size]
            gradient = (1 / len(y_batch)) * X_batch.T @ (X_batch @ theta - y_batch)
            theta -= alpha * gradient
    return theta

# 3. Stochastic Gradient Descent
def stochastic_gradient_descent(X, y, theta, alpha, epochs):
    m = len(y)
    for epoch in range(epochs):
        for i in range(m):
            xi = X[i:i + 1]
            yi = y[i:i + 1]
            gradient = xi.T @ (xi @ theta - yi)
            theta -= alpha * gradient
    return theta

# Experimenting with the methods
theta_vanilla = vanilla_gradient_descent(X, y, theta.copy(), alpha, epochs)
theta_mini_batch = mini_batch_gradient_descent(X, y, theta.copy(), alpha, epochs, batch_size=64)
theta_sgd = stochastic_gradient_descent(X, y, theta.copy(), alpha, epochs)

# Compute and print the final costs for comparison
print("Vanilla Gradient Descent Cost:", compute_cost(X, y, theta_vanilla))
print("Mini-batch Gradient Descent Cost:", compute_cost(X, y, theta_mini_batch))
print("Stochastic Gradient Descent Cost:", compute_cost(X, y, theta_sgd))


---

Experiment on Convergence using Mini-batch Gradient Descent

To determine the optimal batch size, we can experiment with different batch
sizes (e.g., 16, 32, 64, 128, 256) and observe the convergence behavior.

import matplotlib.pyplot as plt

# Function to track convergence for different batch sizes
def track_convergence(X, y, alpha, epochs, batch_sizes):
    for batch_size in batch_sizes:
        theta = np.zeros(15)
        costs = []
        m = len(y)
        for epoch in range(epochs):
            indices = np.random.permutation(m)
            X_shuffled = X[indices]
            y_shuffled = y[indices]
            for i in range(0, m, batch_size):
                X_batch = X_shuffled[i:i + batch_size]
                y_batch = y_shuffled[i:i + batch_size]
                gradient = (1 / len(y_batch)) * X_batch.T @ (X_batch @ theta - y_batch)
                theta -= alpha * gradient
            costs.append(compute_cost(X, y, theta))
        plt.plot(costs, label=f'Batch Size {batch_size}')
    plt.xlabel('Epochs')
    plt.ylabel('Cost')
    plt.legend()
    plt.title('Convergence for Different Batch Sizes')
    plt.show()

# Experiment with different batch sizes
batch_sizes = [16, 32, 64, 128, 256]
track_convergence(X, y, alpha, epochs=200, batch_sizes=batch_sizes)

---

Conclusion

Vanilla Gradient Descent: Takes longer to converge due to processing the entire dataset in every iteration.

Stochastic Gradient Descent: Faster but may oscillate around the minimum.

Mini-batch Gradient Descent: Strikes a balance between speed and stability.


The optimal batch size is problem-dependent but often found between 32
and 128.

This code demonstrates how different gradient descent variants behave and
how batch size influences convergence in mini-batch gradient descent.
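
As an additional sanity check (not part of the assignment code above), the closed-form least-squares solution gives a lower bound on the cost that all three variants should approach; a minimal sketch, reusing X, y, and compute_cost from the implementation:

# Sanity-check sketch: the closed-form least-squares solution minimizes the cost
# exactly, so its cost is a floor for the three costs printed above.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Closed-form Least-Squares Cost:", compute_cost(X, y, theta_lstsq))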

Q2

Below are the implementations of the Bisection and Golden Section methods for line search to optimize the learning rate for the dataset used in Q1. These methods aim to find an optimal step size (learning rate) that minimizes the cost function along a search direction.

1. Bisection Method

The Bisection method finds the minimum by repeatedly halving the search interval, keeping the half in which the function value just beside the midpoint is smaller. It is useful when the cost function is unimodal in the given interval.

def bisection_line_search(f, a, b, tol=1e-5):
    """
    Bisection method to find the optimal step size.
    f: cost function to minimize
    a, b: interval [a, b] within which the minimum lies
    tol: tolerance for stopping the search
    """
    while abs(b - a) > tol:
        mid = (a + b) / 2
        if f(mid - tol) < f(mid + tol):
            b = mid
        else:
            a = mid
    return (a + b) / 2

# Example usage: minimize the cost function along the gradient direction
optimal_alpha_bisection = bisection_line_search(
    lambda alpha: compute_cost(X, y, theta - alpha * X.T @ (X @ theta - y)),
    a=0.0001, b=1
)

print("Optimal learning rate (Bisection):", optimal_alpha_bisection)

2. Golden Section Method

The Golden Section method uses the golden ratio to place the interior points of the search interval so that one function evaluation can be reused at every iteration. It therefore performs fewer function evaluations than the Bisection method and can be faster for some problems.

def golden_section_line_search(f, a, b, tol=1e-5):
    """
    Golden section method to find the optimal step size.
    f: cost function to minimize
    a, b: interval [a, b] within which the minimum lies
    tol: tolerance for stopping the search
    """
    phi = (1 + np.sqrt(5)) / 2   # Golden ratio
    resphi = 2 - phi
    x1 = a + resphi * (b - a)
    x2 = b - resphi * (b - a)
    f1, f2 = f(x1), f(x2)        # Cache evaluations so each iteration needs only one new one
    while abs(b - a) > tol:
        if f1 < f2:
            b, x2, f2 = x2, x1, f1
            x1 = a + resphi * (b - a)
            f1 = f(x1)
        else:
            a, x1, f1 = x1, x2, f2
            x2 = b - resphi * (b - a)
            f2 = f(x2)
    return (a + b) / 2

# Example usage: minimize the cost function along the gradient direction
optimal_alpha_golden = golden_section_line_search(
    lambda alpha: compute_cost(X, y, theta - alpha * X.T @ (X @ theta - y)),
    a=0.0001, b=1
)

print("Optimal learning rate (Golden Section):", optimal_alpha_golden)

Comparison of Time and Convergence

Both methods are efficient at finding the optimal learning rate, but the
Golden Section method tends to perform fewer iterations due to its more
efficient interval reduction. Here’s what you may observe:
1. Time Comparison:

Bisection: Slightly slower, since each iteration requires two fresh function evaluations on either side of the midpoint.

Golden Section: Faster, since it reuses one of the previous evaluations and therefore performs fewer function evaluations overall.

2. Convergence Behavior:

Both methods converge to similar optimal learning rates.

For very small tolerances, both methods may take more time, but Golden
Section converges more quickly due to its strategic interval division.
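
To make this comparison concrete, the sketch below (an addition; count_evaluations is a hypothetical helper reusing compute_cost, X, y, and theta from Q1) counts how many times each method evaluates the objective at different tolerances:

# Sketch: count objective evaluations for each line-search method at several tolerances.
def count_evaluations(search_fn, tol):
    calls = {"n": 0}
    def objective(alpha):
        calls["n"] += 1
        return compute_cost(X, y, theta - alpha * X.T @ (X @ theta - y))
    search_fn(objective, 0.0001, 1, tol=tol)
    return calls["n"]

for tol in [1e-3, 1e-5, 1e-7]:
    n_bis = count_evaluations(bisection_line_search, tol)
    n_gold = count_evaluations(golden_section_line_search, tol)
    print(f"tol={tol:.0e}: Bisection evals = {n_bis}, Golden Section evals = {n_gold}")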

Use the print statements to observe the time taken for each method when
applied to the dataset. You can further profile the time using Python’s time
module:

import time

start = time.time()
bisection_result = bisection_line_search(
    lambda alpha: compute_cost(X, y, theta - alpha * X.T @ (X @ theta - y)), 0.0001, 1)
print(f"Bisection Time: {time.time() - start:.5f} seconds")

start = time.time()
golden_result = golden_section_line_search(
    lambda alpha: compute_cost(X, y, theta - alpha * X.T @ (X @ theta - y)), 0.0001, 1)
print(f"Golden Section Time: {time.time() - start:.5f} seconds")

Conclusion

The Golden Section method tends to be faster and more efficient, particularly for smaller tolerances.

Both methods provide a reliable way to find the optimal learning rate for
gradient descent.

Use the Golden Section method when computational speed is critical.

Q3

Variable Step Size for Faster Convergence

For the function f(x) = x^2 \cos(x) + \sin(x) - x, the goal is to derive a variable step size that accelerates convergence in gradient descent.

Gradient descent update rule:

x_{n+1} = x_n - \gamma \cdot \nabla f(x_n)

where \gamma is the step size and \nabla f(x_n) is the gradient of f evaluated at x_n.

Deriving the Gradient


First, let's calculate the gradient \nabla f(x) of

f(x) = x^2 \cos(x) + \sin(x) - x

Differentiating term by term:

\frac{d}{dx} \left( x^2 \cos(x) \right) = 2x \cos(x) - x^2 \sin(x)

\frac{d}{dx} \left( \sin(x) \right) = \cos(x), \quad \frac{d}{dx} \left( -x \right) = -1

Thus:

\nabla f(x) = 2x \cos(x) - x^2 \sin(x) + \cos(x) - 1

---

Choosing the Variable Step Size

To ensure faster convergence, the variable step size can be derived from the Lipschitz constant of the gradient \nabla f(x). The Lipschitz constant L bounds the change in the gradient as:

\| \nabla f(x_1) - \nabla f(x_2) \| \leq L \| x_1 - x_2 \|

A common choice for the step size is:

\gamma_n = \frac{1}{L_n}

where L_n is a local estimate of the Lipschitz constant, taken here as the magnitude of the second derivative at x_n.

---

Computing the Hessian


We need the second derivative of f to find the curvature.

\nabla^2 f(x) = \frac{d}{dx} \left( 2x \cos(x) - x^2 \sin(x) + \cos(x) - 1 \right)

\nabla^2 f(x) = 2 \cos(x) - 2x \sin(x) - 2x \sin(x) - x^2 \cos(x) - \sin(x) = 2 \cos(x) - 4x \sin(x) - x^2 \cos(x) - \sin(x)

Evaluating this at each iteration gives the local curvature and hence the adaptive step size:

\gamma_n = \frac{1}{\left| 2 \cos(x_n) - 4 x_n \sin(x_n) - x_n^2 \cos(x_n) - \sin(x_n) \right|}
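
As an optional check of the derivation above (an addition, not part of the original solution), sympy can confirm the gradient and the second derivative symbolically:

# Optional sketch: verify the hand-derived gradient and second derivative with sympy.
import sympy as sp

x = sp.symbols('x')
f_sym = x**2 * sp.cos(x) + sp.sin(x) - x

grad_sym = sp.simplify(sp.diff(f_sym, x))     # expected: 2x*cos(x) - x^2*sin(x) + cos(x) - 1
hess_sym = sp.simplify(sp.diff(f_sym, x, 2))  # expected: 2*cos(x) - 4x*sin(x) - x^2*cos(x) - sin(x)

print("f'(x)  =", grad_sym)
print("f''(x) =", hess_sym)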

---

Constant Step Size and Decaying Step Size

1. Constant Step Size:

\gamma = \alpha_0

2. Decaying Step Size:

\gamma_n = \frac{\alpha_0}{1 + k \cdot n}

where \alpha_0 is the initial step size, k is the decay rate, and n is the iteration number.

---

Code for Comparison

Below is Python code to implement and compare the three methods.


import numpy as np
import matplotlib.pyplot as plt

# Function, its gradient, and its second derivative
def f(x):
    return x**2 * np.cos(x) + np.sin(x) - x

def gradient_f(x):
    return 2 * x * np.cos(x) - x**2 * np.sin(x) + np.cos(x) - 1

def hessian_f(x):
    return 2 * np.cos(x) - 4 * x * np.sin(x) - x**2 * np.cos(x) - np.sin(x)

# Gradient descent with a pluggable step size strategy
def gradient_descent(f, grad_f, step_size_func, x_init, epochs):
    x = x_init
    history = [x]
    for _ in range(epochs):
        gamma = step_size_func(x)
        x = x - gamma * grad_f(x)
        history.append(x)
    return np.array(history)

# Step size strategies
def constant_step_size(alpha):
    return lambda x: alpha

def decaying_step_size(alpha, k):
    # Closure with an iteration counter so that gamma_n = alpha / (1 + k * n)
    counter = {"n": 0}
    def step(x):
        gamma = alpha / (1 + k * counter["n"])
        counter["n"] += 1
        return gamma
    return step

def variable_step_size():
    # Inverse of the local curvature; the small epsilon avoids division by zero
    return lambda x: 1 / (np.abs(hessian_f(x)) + 1e-8)

# Initialize and run
x_init = 1.0   # Initial point
epochs = 50

# Run gradient descent with different step size strategies
history_constant = gradient_descent(f, gradient_f, constant_step_size(0.1), x_init, epochs)
history_decaying = gradient_descent(f, gradient_f, decaying_step_size(0.1, 0.01), x_init, epochs)
history_variable = gradient_descent(f, gradient_f, variable_step_size(), x_init, epochs)

# Plot the iterates for each strategy
plt.plot(history_constant, label='Constant Step Size')
plt.plot(history_decaying, label='Decaying Step Size')
plt.plot(history_variable, label='Variable Step Size')
plt.xlabel('Iterations')
plt.ylabel('x')
plt.legend()
plt.title('Comparison of Step Size Strategies')
plt.show()

---

Observations

1. Constant Step Size:

Converges only if the step size is carefully chosen. Too large, and it diverges;
too small, and it slows down.

2. Decaying Step Size:

Works well for non-convex functions, but convergence can be slow if the step size decays too quickly.

3. Variable Step Size:

Adapts to the function's curvature, providing faster convergence and better stability. This method performs well for most functions, especially when gradients vary significantly across the domain.
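
To complement the plot, a small addition (reusing the histories computed above) prints the final iterate and objective value reached by each strategy:

# Print the final iterate and objective value for each step size strategy
for name, hist in [("Constant", history_constant),
                   ("Decaying", history_decaying),
                   ("Variable", history_variable)]:
    print(f"{name:>8} step size: x_final = {hist[-1]:.5f}, f(x_final) = {f(hist[-1]):.5f}")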

---

Conclusion

Variable step size offers faster convergence due to its adaptability, especially
when the function has varying curvature.

Constant step size can be effective but requires careful tuning.


Decaying step size promotes steady convergence but can be slow, especially in the later stages of optimization.
