# Optimizers in Machine Learning

Optimizers are algorithms that adjust the attributes of a neural network, such as its weights and learning rate, in order to reduce the loss. They minimize (or maximize) an objective function, also called a loss function: a mathematical function of the model's internal learnable parameters, which are used to compute predictions from the input features.

## Types of Optimizers

### 1. Gradient Descent

**Explanation:**
Gradient Descent is the simplest and most commonly used optimization algorithm. It works by iteratively updating the model parameters in the opposite direction of the gradient of the objective function with respect to those parameters. The idea is to find a minimum of the function by taking steps proportional to the negative of the gradient at the current point.

**Mathematical Formulation:**

The update rule for the parameter vector \( \theta \) in gradient descent is:

- \( \theta_{\text{new}} = \theta_{\text{old}} - \alpha \, \nabla_\theta J(\theta) \)

Where:
- \( \theta_{\text{old}} \) is the current parameter vector.
- \( \theta_{\text{new}} \) is the updated parameter vector.
- \( \alpha \) is the learning rate.
- \( \nabla_\theta J(\theta) \) is the gradient of the objective function with respect to the parameters.

**Intuition:**
- At each iteration, we calculate the gradient of the cost function.
- The parameters are updated in the opposite direction of the gradient.
- The size of the step is controlled by the learning rate \( \alpha \).

**Advantages:**
- Simple to implement.
- Suitable for convex problems.

**Disadvantages:**
- Can be slow for large datasets.
- May get stuck in local minima for non-convex problems.
- Requires careful tuning of the learning rate.

**Python Implementation:**
```python
import numpy as np

def gradient_descent(X, y, lr=0.01, epochs=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for epoch in range(epochs):
        # Gradient of the mean squared error over the full dataset
        gradient = np.dot(X.T, (np.dot(X, theta) - y)) / m
        theta -= lr * gradient
    return theta
```
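
To make the calling convention concrete, here is a minimal usage sketch on synthetic linear-regression data; the shapes, noise level, and hyperparameters are illustrative assumptions, and the same pattern applies to every optimizer below.

```python
# Illustrative usage: fit y = X @ theta on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # 200 samples, 3 features
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=200)    # targets with small noise

theta = gradient_descent(X, y, lr=0.1, epochs=500)
print(theta)  # should be close to [2.0, -1.0, 0.5]
```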

### 2. Stochastic Gradient Descent (SGD)

**Explanation:**
SGD is a variation of gradient descent in which a single training example is used to calculate the gradient and update the parameters at each step. This introduces noise into the parameter updates, which can help to escape local minima but may cause the loss to fluctuate.

**Mathematical Formulation:**

- \( \theta = \theta - \alpha \, \nabla_\theta J(\theta; x_i, y_i) \)

Where \( (x_i, y_i) \) is a single training example and its target.

**Intuition:**
- At each iteration, a random training example is selected.
- The gradient is calculated and the parameters are updated for this single example.
- This process is repeated for a specified number of epochs.

**Advantages:**
- Faster updates compared to batch gradient descent.
- Can handle large datasets.
- Helps to escape local minima due to the noise in the updates.

**Disadvantages:**
- The loss may fluctuate rather than decrease monotonically.
- Requires more iterations to converge.

**Python Implementation:**
```python
def stochastic_gradient_descent(X, y, lr=0.01, epochs=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for epoch in range(epochs):
        for _ in range(m):
            # Pick one training example at random for this update
            rand_index = np.random.randint(0, m)
            xi = X[rand_index:rand_index+1]
            yi = y[rand_index:rand_index+1]
            gradient = np.dot(xi.T, (np.dot(xi, theta) - yi))
            theta -= lr * gradient
    return theta
```

### 3. Mini-Batch Gradient Descent

**Explanation:**
Mini-Batch Gradient Descent is a variation in which, instead of a single training example or the whole dataset, a mini-batch of examples is used to compute the gradient. This reduces the variance of the parameter updates, leading to more stable convergence.

**Mathematical Formulation:**

- \( \theta = \theta - \alpha \cdot \frac{1}{k} \sum_{i=1}^{k} \nabla_\theta J(\theta; x_i, y_i) \)

Where:
- \( k \) is the batch size.

**Intuition:**
- At each iteration, a mini-batch of training examples is selected.
- The gradient is calculated for this mini-batch.
- The parameters are updated based on the average gradient of the mini-batch.

**Advantages:**
- More stable updates compared to SGD.
- Faster convergence than batch gradient descent.
- Efficient on large datasets.

**Disadvantages:**
- Requires tuning of the batch size.
- Computationally more expensive than SGD per iteration.

**Python Implementation:**
```python
def mini_batch_gradient_descent(X, y, lr=0.01, epochs=1000, batch_size=32):
    m, n = X.shape
    theta = np.zeros(n)
    for epoch in range(epochs):
        # Shuffle the data once per epoch
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        for i in range(0, m, batch_size):
            X_i = X_shuffled[i:i+batch_size]
            y_i = y_shuffled[i:i+batch_size]
            # Divide by the actual batch length so the final,
            # possibly smaller, batch is averaged correctly
            gradient = np.dot(X_i.T, (np.dot(X_i, theta) - y_i)) / len(X_i)
            theta -= lr * gradient
    return theta
```

### 4. Momentum

**Explanation:**
Momentum helps accelerate gradient descent in the relevant directions, leading to faster convergence. It accumulates a velocity vector in directions of persistent reduction in the objective function, which smooths the path towards the minimum.

**Mathematical Formulation:**

- \( v_t = \gamma v_{t-1} + \alpha \, \nabla_\theta J(\theta) \)
- \( \theta = \theta - v_t \)

Where:
- \( v_t \) is the velocity.
- \( \gamma \) is the momentum term, typically set between 0.9 and 0.99.

**Intuition:**
- At each iteration, the gradient is calculated.
- The velocity is updated based on the current gradient and the previous velocity.
- The parameters are updated based on the velocity.

**Advantages:**
- Faster convergence.
- Reduces oscillations in the parameter updates.

**Disadvantages:**
- Requires tuning of the momentum term.

**Python Implementation:**
```python
def momentum_gradient_descent(X, y, lr=0.01, epochs=1000, gamma=0.9):
    m, n = X.shape
    theta = np.zeros(n)
    v = np.zeros(n)  # velocity vector
    for epoch in range(epochs):
        gradient = np.dot(X.T, (np.dot(X, theta) - y)) / m
        # Blend the previous velocity with the current gradient step
        v = gamma * v + lr * gradient
        theta -= v
    return theta
```

### 5. Nesterov Accelerated Gradient (NAG)

**Explanation:**
NAG is a variant of gradient descent with momentum. It looks ahead by a step and calculates the gradient at that point, providing more accurate updates. This helps to correct the overshooting problem seen in standard momentum.

**Mathematical Formulation:**

- \( v_t = \gamma v_{t-1} + \alpha \, \nabla_\theta J(\theta - \gamma v_{t-1}) \)
- \( \theta = \theta - v_t \)

**Intuition:**
- At each iteration, the parameters are temporarily advanced using the previous velocity.
- The gradient is calculated at this lookahead position.
- The velocity and parameters are then updated based on this gradient.

**Advantages:**
- More accurate updates compared to standard momentum.
- Faster convergence.

**Disadvantages:**
- Requires tuning of the momentum term.

**Python Implementation:**
```python
def nesterov_accelerated_gradient(X, y, lr=0.01, epochs=1000, gamma=0.9):
    m, n = X.shape
    theta = np.zeros(n)
    v = np.zeros(n)
    for epoch in range(epochs):
        # Evaluate the gradient at the lookahead position
        lookahead_theta = theta - gamma * v
        gradient = np.dot(X.T, (np.dot(X, lookahead_theta) - y)) / m
        v = gamma * v + lr * gradient
        theta -= v
    return theta
```

### 6. AdaGrad

**Explanation:**
AdaGrad adapts the learning rate to the parameters, performing larger updates for infrequently updated parameters and smaller updates for frequently updated ones. It scales the learning rate inversely proportional to the square root of the sum of all historical squared values of the gradient.

**Mathematical Formulation:**

- \( G_t = G_{t-1} + \nabla_\theta J(\theta) \odot \nabla_\theta J(\theta) \)
- \( \theta = \theta - \frac{\alpha}{\sqrt{G_t + \epsilon}} \odot \nabla_\theta J(\theta) \)

Where:
- \( G_t \) is the element-wise sum of squares of the gradients up to time step \( t \).
- \( \epsilon \) is a small constant to avoid division by zero.

**Intuition:**
- Accumulates the sum of the squares of the gradients for each parameter.
- Uses this accumulated sum to scale the learning rate.
- Parameters with large gradients in the past have smaller learning rates.

**Advantages:**
- Effective for sparse data.
- Automatically adjusts the learning rate.

**Disadvantages:**
- The learning rate decreases continuously, which can lead to premature convergence.

**Python Implementation:**
```python
def adagrad(X, y, lr=0.01, epochs=1000, epsilon=1e-8):
    m, n = X.shape
    theta = np.zeros(n)
    G = np.zeros(n)  # running sum of squared gradients, per parameter
    for epoch in range(epochs):
        gradient = np.dot(X.T, (np.dot(X, theta) - y)) / m
        G += gradient**2
        # Per-parameter learning rate shrinks as gradients accumulate
        adjusted_lr = lr / np.sqrt(G + epsilon)
        theta -= adjusted_lr * gradient
    return theta
```

### 7. RMSprop

**Explanation:**
RMSprop modifies AdaGrad to perform well in non-convex settings by using a moving average of squared gradients to scale the learning rate. This keeps the learning rate in check, especially in the presence of noisy gradients.

**Mathematical Formulation:**

- \( E[g^2]_t = \beta \, E[g^2]_{t-1} + (1 - \beta) \, g_t \odot g_t \)
- \( \theta = \theta - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} \odot g_t \)

Where:
- \( g_t = \nabla_\theta J(\theta) \) is the current gradient.
- \( E[g^2]_t \) is the exponentially decaying average of past squared gradients.
- \( \beta \) is the decay rate.

**Intuition:**
- Keeps a running average of the squared gradients.
- Uses this average to scale the learning rate.
- Parameters with large gradients have their learning rates reduced.

**Advantages:**
- Effective for non-convex problems.
- Reduces oscillations in parameter updates.

**Disadvantages:**
- Requires tuning of the decay rate.

**Python Implementation:**
```python
def rmsprop(X, y, lr=0.01, epochs=1000, beta=0.9, epsilon=1e-8):
    m, n = X.shape
    theta = np.zeros(n)
    E_g = np.zeros(n)  # moving average of squared gradients
    for epoch in range(epochs):
        gradient = np.dot(X.T, (np.dot(X, theta) - y)) / m
        E_g = beta * E_g + (1 - beta) * gradient**2
        adjusted_lr = lr / np.sqrt(E_g + epsilon)
        theta -= adjusted_lr * gradient
    return theta
```

### 8. Adam

**Explanation:**
Adam (Adaptive Moment Estimation) combines the advantages of both RMSprop and AdaGrad by keeping an exponentially decaying average of past gradients and of past squared gradients.

**Mathematical Formulation:**

- \( m_t = \beta_1 m_{t-1} + (1 - \beta_1) \, g_t \)
- \( v_t = \beta_2 v_{t-1} + (1 - \beta_2) \, g_t^2 \)
- \( \hat{m}_t = m_t / (1 - \beta_1^t) \)
- \( \hat{v}_t = v_t / (1 - \beta_2^t) \)
- \( \theta = \theta - \frac{\alpha \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \)

Where:
- \( g_t = \nabla_\theta J(\theta) \) is the current gradient, squared element-wise in \( v_t \).
- \( m_t \) is the first moment (mean) of the gradient.
- \( v_t \) is the second moment (uncentered variance) of the gradient.
- \( \beta_1, \beta_2 \) are the decay rates for the moment estimates.

**Intuition:**
- Keeps track of both the mean and the variance of the gradients.
- Uses these to adaptively scale the learning rate.
- Provides a balance between AdaGrad and RMSprop.

**Advantages:**
- Efficient for large datasets.
- Well-suited for non-convex optimization.
- Handles sparse gradients well.

**Disadvantages:**
- Requires careful tuning of hyperparameters.
- Can be computationally intensive.

**Python Implementation:**
```python
def adam(X, y, lr=0.01, epochs=1000, beta1=0.9, beta2=0.999, epsilon=1e-8):
    m, n = X.shape
    theta = np.zeros(n)
    m_t = np.zeros(n)  # first moment estimate
    v_t = np.zeros(n)  # second moment estimate
    for epoch in range(1, epochs+1):
        gradient = np.dot(X.T, (np.dot(X, theta) - y)) / m
        m_t = beta1 * m_t + (1 - beta1) * gradient
        v_t = beta2 * v_t + (1 - beta2) * gradient**2
        # Bias correction counteracts the zero initialization of the moments
        m_t_hat = m_t / (1 - beta1**epoch)
        v_t_hat = v_t / (1 - beta2**epoch)
        theta -= lr * m_t_hat / (np.sqrt(v_t_hat) + epsilon)
    return theta
```
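
As a quick plausibility check, on a linear least-squares problem Adam should approach the closed-form solution; the sketch below, with illustrative synthetic data and hyperparameters, compares the two.

```python
# Sanity-check sketch (illustrative data and settings)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.3]) + 0.05 * rng.normal(size=200)

theta_adam = adam(X, y, lr=0.05, epochs=2000)
theta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)  # closed-form least squares
print(theta_adam, theta_exact)  # the two estimates should be close
```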

These implementations are basic NumPy examples of each optimizer. In practice, libraries like TensorFlow and PyTorch provide highly optimized and more sophisticated implementations of these and other optimization algorithms.
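
For example, a minimal sketch of the same linear-regression fit using PyTorch's built-in `torch.optim.Adam` might look like the following; the data, model, and hyperparameters are illustrative assumptions.

```python
import torch

# Synthetic data, analogous to the NumPy examples above
X = torch.randn(200, 3)
true_theta = torch.tensor([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * torch.randn(200)

theta = torch.zeros(3, requires_grad=True)
optimizer = torch.optim.Adam([theta], lr=0.05)

for epoch in range(1000):
    optimizer.zero_grad()                     # clear accumulated gradients
    loss = torch.mean((X @ theta - y) ** 2)   # mean squared error
    loss.backward()                           # compute gradients
    optimizer.step()                          # apply the Adam update

print(theta.detach())  # should be close to true_theta
```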