Optimizers are algorithms or methods used to change the attributes of your neural network, such as its weights and learning rate, in order to reduce the losses.

## Types of Optimizers

### 1. Gradient Descent

**Explanation:**

Gradient Descent is the simplest and most commonly used optimization algorithm.

The update rule for the parameter vector \( \theta \) in gradient descent is represented by the equation:

- $$ \theta_{\text{new}} = \theta_{\text{old}} - \alpha \cdot \nabla J(\theta) $$

Where:

- \( \theta_{\text{old}} \) is the old parameter vector.
- \( \theta_{\text{new}} \) is the updated parameter vector.
- \( \alpha \) is the learning rate.
- \( \nabla J(\theta) \) is the gradient of the objective function with respect to the parameters.

**Intuition:**

- At each iteration, we calculate the gradient of the cost function.
- The parameters are updated in the opposite direction of the gradient.
- The size of the step is controlled by the learning rate \( \alpha \) (see the sketch below).
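
As a concrete illustration, here is a minimal NumPy sketch of this update on a toy least-squares problem; the data, model, and hyperparameters are assumed for the example:

```python
import numpy as np

# Toy least-squares problem (illustrative): recover the weights [2, -3] from noisy data.
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 1))]   # design matrix with a bias column
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=100)

theta = np.zeros(2)   # parameter vector theta
alpha = 0.1           # learning rate

for _ in range(500):
    grad = 2 / len(y) * X.T @ (X @ theta - y)   # gradient of the MSE cost
    theta = theta - alpha * grad                # step opposite the gradient

print(theta)   # approaches [2, -3]
```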

**Advantages:**

- Simple to implement.

### 2. Stochastic Gradient Descent (SGD)

**Explanation:**

SGD is a variation of gradient descent where we use only one training example at each step to compute the gradient and update the parameters.

**Mathematical Formulation:**

- $$ \theta = \theta - \alpha \cdot \frac{\partial J(\theta; x_i, y_i)}{\partial \theta} $$

Where:

- \( x_i, y_i \) are a single training example and its target.

**Intuition:**

- At each iteration, a random training example is selected (see the sketch below).
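
Continuing the illustrative setup from the gradient-descent sketch (data and hyperparameters remain assumptions), the SGD loop updates on one random example at a time:

```python
import numpy as np

# Same illustrative least-squares setup as the gradient-descent sketch above.
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 1))]
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=100)

theta = np.zeros(2)
alpha = 0.05

for _ in range(5000):
    i = rng.integers(len(y))                  # pick one training example at random
    grad = 2 * X[i] * (X[i] @ theta - y[i])   # gradient on that single example
    theta = theta - alpha * grad

print(theta)   # noisy, but close to [2, -3]
```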

### 3. Mini-Batch Gradient Descent

**Explanation:**

Mini-Batch Gradient Descent is a variation where, instead of a single training example, a small batch of \( k \) examples is used to compute each update.

**Mathematical Formulation:**

- $$ \theta = \theta - \alpha \cdot \frac{1}{k} \sum_{i=1}^{k} \frac{\partial J(\theta; x_i, y_i)}{\partial \theta} $$

Where:

- \( k \) is the batch size (see the sketch below).
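
A minimal NumPy sketch of the mini-batch update, on the same assumed least-squares problem as the earlier sketches:

```python
import numpy as np

# Same illustrative least-squares setup as the earlier sketches.
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 1))]
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=100)

theta = np.zeros(2)
alpha, k = 0.1, 16   # learning rate and batch size k

for _ in range(1000):
    idx = rng.choice(len(y), size=k, replace=False)   # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 / k * Xb.T @ (Xb @ theta - yb)           # average gradient over the batch
    theta = theta - alpha * grad

print(theta)   # close to [2, -3]
```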

### 4. Momentum

**Explanation:**

Momentum helps accelerate gradient vectors in the right directions, thus leading to faster convergence.

**Mathematical Formulation:**

- $$ v_t = \gamma \cdot v_{t-1} + \alpha \cdot \nabla J(\theta) $$
- $$ \theta = \theta - v_t $$

Where:

- \( v_t \) is the velocity.
- \( \gamma \) is the momentum term, typically set between 0.9 and 0.99.

**Intuition:**

- At each iteration, the gradient is calculated (see the sketch below).
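
A minimal NumPy sketch of the momentum update, again on the assumed least-squares problem from the earlier sketches:

```python
import numpy as np

# Same illustrative least-squares setup as the earlier sketches.
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 1))]
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=100)

theta = np.zeros(2)
alpha, gamma = 0.01, 0.9
v = np.zeros_like(theta)   # velocity v_t

for _ in range(500):
    grad = 2 / len(y) * X.T @ (X @ theta - y)
    v = gamma * v + alpha * grad   # v_t = gamma * v_{t-1} + alpha * grad
    theta = theta - v

print(theta)   # approaches [2, -3]
```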

### 5. Nesterov Accelerated Gradient (NAG)

**Explanation:**

NAG is a variant of the gradient descent with momentum. It looks ahead by a step in the direction of the accumulated velocity and evaluates the gradient at that look-ahead point.

**Mathematical Formulation:**

- $$ v_t = \gamma \cdot v_{t-1} + \alpha \cdot \nabla J(\theta - \gamma \cdot v_{t-1}) $$
- $$ \theta = \theta - v_t $$

**Intuition:**

- The gradient is evaluated at the look-ahead point \( \theta - \gamma v_{t-1} \), so the step is corrected before it is taken (see the sketch below).
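
A minimal NumPy sketch of the look-ahead step, on the same assumed least-squares problem:

```python
import numpy as np

# Same illustrative least-squares setup as the earlier sketches.
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 1))]
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=100)

def grad(theta):
    return 2 / len(y) * X.T @ (X @ theta - y)   # gradient of the MSE cost

theta = np.zeros(2)
alpha, gamma = 0.01, 0.9
v = np.zeros_like(theta)

for _ in range(500):
    lookahead = theta - gamma * v             # peek one momentum step ahead
    v = gamma * v + alpha * grad(lookahead)   # gradient at the look-ahead point
    theta = theta - v

print(theta)   # approaches [2, -3]
```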

### 6. AdaGrad

**Explanation:**

AdaGrad adapts the learning rate to the parameters, performing larger updates for infrequently updated parameters and smaller updates for frequently updated ones.

**Mathematical Formulation:**

- $$ G_t = G_{t-1} + (\nabla J(\theta))^2 $$
- $$ \theta = \theta - \frac{\alpha}{\sqrt{G_t + \epsilon}} \cdot \nabla J(\theta) $$

Where:

- \( G_t \) is the sum of squares of the gradients up to time step \( t \) (the square is taken elementwise).
- \( \epsilon \) is a small constant to avoid division by zero.

**Intuition:**

- Accumulates the sum of the squares of the gradients for each parameter (see the sketch below).
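
A minimal NumPy sketch of AdaGrad, on the same assumed least-squares problem as the earlier sketches:

```python
import numpy as np

# Same illustrative least-squares setup as the earlier sketches.
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 1))]
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=100)

theta = np.zeros(2)
alpha, eps = 0.5, 1e-8
G = np.zeros_like(theta)   # accumulated sum of squared gradients G_t

for _ in range(2000):
    grad = 2 / len(y) * X.T @ (X @ theta - y)
    G = G + grad ** 2                                 # elementwise square
    theta = theta - alpha / np.sqrt(G + eps) * grad   # per-parameter step size

print(theta)   # approaches [2, -3]
```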

### 7. RMSprop

**Explanation:**

RMSprop modifies AdaGrad to perform well in non-convex settings by using a moving average of squared gradients instead of an ever-growing sum.

**Mathematical Formulation:**

- $$ E[g^2]_t = \beta \cdot E[g^2]_{t-1} + (1 - \beta)(\nabla J(\theta))^2 $$
- $$ \theta = \theta - \frac{\alpha}{\sqrt{E[g^2]_t + \epsilon}} \cdot \nabla J(\theta) $$

Where:

- \( E[g^2]_t \) is the exponentially decaying average of past squared gradients.
- \( \beta \) is the decay rate.

**Intuition:**

- Keeps a running average of the squared gradients (see the sketch below).
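
A minimal NumPy sketch of RMSprop, on the same assumed least-squares problem:

```python
import numpy as np

# Same illustrative least-squares setup as the earlier sketches.
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 1))]
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=100)

theta = np.zeros(2)
alpha, beta, eps = 0.05, 0.9, 1e-8
Eg2 = np.zeros_like(theta)   # moving average E[g^2]_t

for _ in range(1000):
    grad = 2 / len(y) * X.T @ (X @ theta - y)
    Eg2 = beta * Eg2 + (1 - beta) * grad ** 2
    theta = theta - alpha / np.sqrt(Eg2 + eps) * grad

print(theta)   # close to [2, -3]
```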

### 8. Adam

**Explanation:**

Adam (Adaptive Moment Estimation) combines the advantages of both RMSprop and AdaGrad, keeping exponentially decaying averages of both the gradient and its square.

**Mathematical Formulation:**

- $$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla J(\theta) $$
- $$ v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla J(\theta))^2 $$
- $$ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} $$
- $$ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} $$
- $$ \theta = \theta - \frac{\alpha \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} $$

Where:

- \( m_t \) is the first moment (mean) of the gradient.
- \( v_t \) is the second moment (uncentered variance) of the gradient.
- \( \beta_1, \beta_2 \) are the decay rates for the moment estimates.

**Intuition:**

- Keeps track of both the mean and the variance of the gradients (see the implementation sketch below).
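
As a worked example, here is a minimal NumPy sketch of an `adam` routine matching the signature used in the implementation section; the linear model and MSE loss inside the loop are illustrative assumptions:

```python
import numpy as np

def adam(X, y, lr=0.01, epochs=1000, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Fit linear-model weights to (X, y) with Adam, minimizing mean squared error."""
    theta = np.zeros(X.shape[1])
    m = np.zeros_like(theta)   # first moment estimate m_t
    v = np.zeros_like(theta)   # second moment estimate v_t
    for t in range(1, epochs + 1):
        grad = 2 / len(y) * X.T @ (X @ theta - y)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + epsilon)
    return theta

# Illustrative usage on synthetic data:
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 1))]
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=100)
print(adam(X, y, lr=0.05, epochs=2000))   # approaches [2, -3]
```
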
These implementations are basic examples of how these optimizers can be implemented in Python using NumPy. In practice, libraries like TensorFlow and PyTorch provide highly optimized and more sophisticated implementations of these and other optimization algorithms.
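
For comparison, the same update is available off the shelf in those libraries; a minimal PyTorch usage sketch (the tiny linear model and synthetic data here are illustrative) looks like:

```python
import torch

# Minimal sketch: training a one-feature linear model with PyTorch's built-in Adam.
model = torch.nn.Linear(1, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

x = torch.randn(100, 1)
y = 2 * x - 3   # synthetic target

for _ in range(100):
    optimizer.zero_grad()          # clear accumulated gradients
    loss = loss_fn(model(x), y)
    loss.backward()                # backpropagate
    optimizer.step()               # apply the Adam update
```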

---