Commit c9b8146

Merge pull request animator#559 from manishh12/main

Added types of optimizers

2 parents 930a906 + 1496dac commit c9b8146

File tree: 2 files changed, +358 −0 lines changed

Lines changed: 357 additions & 0 deletions
---

# Optimizers in Machine Learning

Optimizers are algorithms used to adjust the attributes of a neural network, such as its weights and learning rate, in order to reduce the loss. They minimize (or maximize) an objective function (also called a loss function), which is a mathematical function of the model's internal learnable parameters, the values used to compute predictions from the input features.
## Types of Optimizers

### 1. Gradient Descent

**Explanation:**
Gradient Descent is the simplest and most commonly used optimization algorithm. It works by iteratively updating the model parameters in the opposite direction of the gradient of the objective function with respect to those parameters. The idea is to find the minimum of the function by taking steps proportional to the negative of the gradient at the current point.

**Mathematical Formulation:**

The update rule for the parameter vector θ in gradient descent is:

- $$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \cdot \nabla J(\theta)$$

Where:
- $\theta_{\text{old}}$ is the old parameter vector.
- $\theta_{\text{new}}$ is the updated parameter vector.
- $\alpha$ is the learning rate.
- $\nabla J(\theta)$ is the gradient of the objective function with respect to the parameters.

**Intuition:**
- At each iteration, we calculate the gradient of the cost function over the full dataset.
- The parameters are updated in the opposite direction of the gradient.
- The size of the step is controlled by the learning rate $\alpha$.

**Advantages:**
- Simple to implement.
- Suitable for convex problems.

**Disadvantages:**
- Can be slow for large datasets.
- May get stuck in local minima for non-convex problems.
- Requires careful tuning of the learning rate.

**Python Implementation:**
```python
import numpy as np

def gradient_descent(X, y, lr=0.01, epochs=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for epoch in range(epochs):
        # Gradient of the mean squared error of a linear model
        gradient = np.dot(X.T, (np.dot(X, theta) - y)) / m
        theta -= lr * gradient
    return theta
```
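
As a quick sanity check, the implementation can be exercised on synthetic linear-regression data. This is a minimal sketch: the random data, noise level, and `true_theta` vector below are illustrative assumptions, not part of the original article.

```python
# Minimal usage sketch -- the data and true_theta are assumed for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))        # 200 samples, 2 features
true_theta = np.array([3.0, -1.5])   # assumed ground-truth weights
y = X @ true_theta + 0.1 * rng.normal(size=200)

theta = gradient_descent(X, y, lr=0.1, epochs=500)
print(theta)  # should land close to [3.0, -1.5]
```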

### 2. Stochastic Gradient Descent (SGD)

**Explanation:**
SGD is a variation of gradient descent in which only one training example is used to calculate the gradient and update the parameters at each step. This introduces noise into the parameter updates, which can help escape local minima but may cause the loss to fluctuate.

**Mathematical Formulation:**

- $$\theta = \theta - \alpha \cdot \frac{\partial J(\theta; x_i, y_i)}{\partial \theta}$$

Where:
- $x_i, y_i$ are a single training example and its target.

**Intuition:**
- At each iteration, a random training example is selected.
- The gradient is calculated and the parameters are updated for this single example.
- This process is repeated for a specified number of epochs.

**Advantages:**
- Faster updates compared to batch gradient descent.
- Can handle large datasets.
- Helps escape local minima due to the noise in the updates.

**Disadvantages:**
- The loss may fluctuate from step to step.
- Requires more iterations to converge.

**Python Implementation:**
```python
def stochastic_gradient_descent(X, y, lr=0.01, epochs=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for epoch in range(epochs):
        for _ in range(m):
            # Sample one training example (with replacement)
            rand_index = np.random.randint(0, m)
            xi = X[rand_index:rand_index+1]
            yi = y[rand_index:rand_index+1]
            gradient = np.dot(xi.T, (np.dot(xi, theta) - yi))
            theta -= lr * gradient
    return theta
```

### 3. Mini-Batch Gradient Descent

**Explanation:**
Mini-Batch Gradient Descent is a variation in which a mini-batch of examples, rather than a single example or the whole dataset, is used to compute the gradient. This reduces the variance of the parameter updates, leading to more stable convergence.

**Mathematical Formulation:**

- $$\theta = \theta - \alpha \cdot \frac{1}{k} \sum_{i=1}^{k} \frac{\partial J(\theta; x_i, y_i)}{\partial \theta}$$

Where:
- $k$ is the batch size.

**Intuition:**
- At each iteration, a mini-batch of training examples is selected.
- The gradient is calculated for this mini-batch.
- The parameters are updated based on the average gradient of the mini-batch.

**Advantages:**
- More stable updates compared to SGD.
- Faster convergence than batch gradient descent.
- Efficient on large datasets.

**Disadvantages:**
- Requires tuning of the batch size.
- Computationally more expensive than SGD per iteration.

**Python Implementation:**
```python
def mini_batch_gradient_descent(X, y, lr=0.01, epochs=1000, batch_size=32):
    m, n = X.shape
    theta = np.zeros(n)
    for epoch in range(epochs):
        # Shuffle each epoch so batches differ between epochs
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        for i in range(0, m, batch_size):
            X_i = X_shuffled[i:i+batch_size]
            y_i = y_shuffled[i:i+batch_size]
            # Average over the actual batch size; the last batch may be smaller
            gradient = np.dot(X_i.T, (np.dot(X_i, theta) - y_i)) / len(X_i)
            theta -= lr * gradient
    return theta
```

### 4. Momentum

**Explanation:**
Momentum helps accelerate gradient descent in the relevant directions, leading to faster convergence. It accumulates a velocity vector in directions of persistent reduction in the objective function, which smooths the path towards the minimum.

**Mathematical Formulation:**

- $$v_t = \gamma \cdot v_{t-1} + \alpha \cdot \nabla J(\theta)$$
- $$\theta = \theta - v_t$$

Where:

- $v_t$ is the velocity.
- $\gamma$ is the momentum term, typically set between 0.9 and 0.99.

**Intuition:**
- At each iteration, the gradient is calculated.
- The velocity is updated based on the current gradient and the previous velocity.
- The parameters are updated based on the velocity.

**Advantages:**
- Faster convergence.
- Reduces oscillations in the parameter updates.

**Disadvantages:**
- Requires tuning of the momentum term.

**Python Implementation:**
```python
def momentum_gradient_descent(X, y, lr=0.01, epochs=1000, gamma=0.9):
    m, n = X.shape
    theta = np.zeros(n)
    v = np.zeros(n)  # velocity vector
    for epoch in range(epochs):
        gradient = np.dot(X.T, (np.dot(X, theta) - y)) / m
        # Blend the previous velocity with the current gradient step
        v = gamma * v + lr * gradient
        theta -= v
    return theta
```

### 5. Nesterov Accelerated Gradient (NAG)

**Explanation:**
NAG is a variant of gradient descent with momentum. It looks ahead by a step and calculates the gradient at that point, providing more accurate updates. This helps correct the overshooting seen in standard momentum.

**Mathematical Formulation:**

- $$v_t = \gamma \cdot v_{t-1} + \alpha \cdot \nabla J(\theta - \gamma \cdot v_{t-1})$$
- $$\theta = \theta - v_t$$

**Intuition:**
- At each iteration, the parameters are temporarily updated using the previous velocity.
- The gradient is calculated at this lookahead position.
- The velocity and parameters are then updated based on this gradient.

**Advantages:**
- More accurate updates compared to standard momentum.
- Faster convergence.

**Disadvantages:**
- Requires tuning of the momentum term.

**Python Implementation:**
```python
def nesterov_accelerated_gradient(X, y, lr=0.01, epochs=1000, gamma=0.9):
    m, n = X.shape
    theta = np.zeros(n)
    v = np.zeros(n)
    for epoch in range(epochs):
        # Evaluate the gradient at the lookahead position, not at theta
        lookahead_theta = theta - gamma * v
        gradient = np.dot(X.T, (np.dot(X, lookahead_theta) - y)) / m
        v = gamma * v + lr * gradient
        theta -= v
    return theta
```

### 6. AdaGrad

**Explanation:**
AdaGrad adapts the learning rate to the parameters, performing larger updates for infrequently updated parameters and smaller updates for frequently updated ones. It scales the learning rate inversely proportional to the square root of the sum of all historical squared values of the gradient.

**Mathematical Formulation:**

- $$G_t = G_{t-1} + \left(\frac{\partial J(\theta)}{\partial \theta}\right)^2$$

- $$\theta = \theta - \frac{\alpha}{\sqrt{G_t + \epsilon}} \cdot \nabla J(\theta)$$

Where:
- $G_t$ is the sum of squares of the gradients up to time step $t$.
- $\epsilon$ is a small constant to avoid division by zero.

**Intuition:**
- Accumulates the sum of the squares of the gradients for each parameter.
- Uses this accumulated sum to scale the learning rate.
- Parameters with large gradients in the past have smaller learning rates.

**Advantages:**
- Effective for sparse data.
- Automatically adjusts the learning rate.

**Disadvantages:**
- The learning rate decreases continuously, which can lead to premature convergence.

**Python Implementation:**
```python
def adagrad(X, y, lr=0.01, epochs=1000, epsilon=1e-8):
    m, n = X.shape
    theta = np.zeros(n)
    G = np.zeros(n)  # running sum of squared gradients, per parameter
    for epoch in range(epochs):
        gradient = np.dot(X.T, (np.dot(X, theta) - y)) / m
        G += gradient**2
        # Per-parameter learning rate shrinks as gradients accumulate
        adjusted_lr = lr / np.sqrt(G + epsilon)
        theta -= adjusted_lr * gradient
    return theta
```

### 7. RMSprop

**Explanation:**
RMSprop modifies AdaGrad to perform well in non-convex settings by using a moving average of squared gradients to scale the learning rate. It keeps the learning rate in check, especially in the presence of noisy gradients.

**Mathematical Formulation:**

- $$E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta)\left(\frac{\partial J(\theta)}{\partial \theta}\right)^2$$

- $$\theta = \theta - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} \cdot \nabla J(\theta)$$

Where:
- $E[g^2]_t$ is the exponentially decaying average of past squared gradients.
- $\beta$ is the decay rate.

**Intuition:**
- Keeps a running average of the squared gradients.
- Uses this average to scale the learning rate.
- Parameters with large gradients have their learning rates reduced.

**Advantages:**
- Effective for non-convex problems.
- Reduces oscillations in parameter updates.

**Disadvantages:**
- Requires tuning of the decay rate.

**Python Implementation:**
```python
def rmsprop(X, y, lr=0.01, epochs=1000, beta=0.9, epsilon=1e-8):
    m, n = X.shape
    theta = np.zeros(n)
    E_g = np.zeros(n)  # exponentially decaying average of squared gradients
    for epoch in range(epochs):
        gradient = np.dot(X.T, (np.dot(X, theta) - y)) / m
        E_g = beta * E_g + (1 - beta) * gradient**2
        adjusted_lr = lr / np.sqrt(E_g + epsilon)
        theta -= adjusted_lr * gradient
    return theta
```

### 8. Adam

**Explanation:**
Adam (Adaptive Moment Estimation) combines the advantages of both RMSprop and AdaGrad by keeping an exponentially decaying average of past gradients and of past squared gradients.

**Mathematical Formulation:**

- $$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\frac{\partial J(\theta)}{\partial \theta}$$
- $$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\left(\frac{\partial J(\theta)}{\partial \theta}\right)^2$$
- $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$
- $$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
- $$\theta = \theta - \frac{\alpha \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Where:
- $m_t$ is the first moment (mean) of the gradient.
- $v_t$ is the second moment (uncentered variance) of the gradient.
- $\beta_1, \beta_2$ are the decay rates for the moment estimates.

**Intuition:**
- Keeps track of both the mean and the variance of the gradients.
- Uses these to adaptively scale the learning rate.
- Provides a balance between AdaGrad and RMSprop.

**Advantages:**
- Efficient for large datasets.
- Well-suited for non-convex optimization.
- Handles sparse gradients well.

**Disadvantages:**
- Requires careful tuning of hyperparameters.
- Can be computationally intensive.

**Python Implementation:**
```python
def adam(X, y, lr=0.01, epochs=1000, beta1=0.9, beta2=0.999, epsilon=1e-8):
    m, n = X.shape
    theta = np.zeros(n)
    m_t = np.zeros(n)  # first moment estimate
    v_t = np.zeros(n)  # second moment estimate
    for epoch in range(1, epochs + 1):
        gradient = np.dot(X.T, (np.dot(X, theta) - y)) / m
        m_t = beta1 * m_t + (1 - beta1) * gradient
        v_t = beta2 * v_t + (1 - beta2) * gradient**2
        # Bias correction compensates for the zero initialization of the moments
        m_t_hat = m_t / (1 - beta1**epoch)
        v_t_hat = v_t / (1 - beta2**epoch)
        theta -= lr * m_t_hat / (np.sqrt(v_t_hat) + epsilon)
    return theta
```
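
Since all of the implementations above share the same `(X, y, lr=..., epochs=...)` calling convention, they can be compared side by side. A minimal sketch, assuming the functions defined earlier are in scope; the synthetic data and seed are illustrative assumptions, and note that AdaGrad's continuously decaying step may stop short of the optimum here, illustrating its caveat above.

```python
# Illustrative comparison -- data and settings are assumptions, not benchmarks
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.05 * rng.normal(size=500)

for optimizer in (gradient_descent, momentum_gradient_descent,
                  nesterov_accelerated_gradient, adagrad, rmsprop, adam):
    theta = optimizer(X, y)  # defaults: lr=0.01, epochs=1000
    print(f"{optimizer.__name__:32s} {theta}")
```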

These implementations are basic examples of how these optimizers can be implemented in Python using NumPy. In practice, libraries like TensorFlow and PyTorch provide highly optimized and more sophisticated implementations of these and other optimization algorithms.
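
For reference, here is a sketch of how the same update rules are selected in PyTorch via `torch.optim`. The model, data, and hyperparameter values are placeholder assumptions, and PyTorch's exact update formulas differ in small details from the textbook versions above.

```python
import torch

# Placeholder model and data, assumed purely for illustration
model = torch.nn.Linear(3, 1)
X = torch.randn(500, 3)
y = torch.randn(500, 1)

# The update rules above map onto torch.optim classes, e.g.:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)                     # plain SGD
# torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)                 # Momentum
# torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)  # NAG
# torch.optim.Adagrad(model.parameters(), lr=0.01)                           # AdaGrad
# torch.optim.RMSprop(model.parameters(), lr=0.01, alpha=0.9)                # RMSprop
# torch.optim.Adam(model.parameters(), lr=0.01, betas=(0.9, 0.999))          # Adam

for epoch in range(100):
    loss = torch.nn.functional.mse_loss(model(X), y)
    optimizer.zero_grad()  # clear accumulated gradients
    loss.backward()        # backpropagate
    optimizer.step()       # apply the optimizer's update rule
```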
---

contrib/machine-learning/index.md

Lines changed: 1 addition & 0 deletions

- [Support Vector Machine Algorithm](support-vector-machine.md)
- [Artificial Neural Network from the Ground Up](ArtificialNeuralNetwork.md)
- [TensorFlow.md](tensorFlow.md)
- [Types of optimizers](Types_of_optimizers.md)
