Commit 1496dac

Updated maths formulas
1 parent a51621d commit 1496dac


contrib/machine-learning/Types_of_optimizers.md

Lines changed: 38 additions & 36 deletions
@@ -6,6 +6,8 @@ Optimizers are algorithms or methods used to change the attributes of your neura
 
 ## Types of Optimizers
 
+
+
 ### 1. Gradient Descent
 
 **Explanation:**
@@ -15,19 +17,20 @@ Gradient Descent is the simplest and most commonly used optimization algorithm.
 
 The update rule for the parameter vector θ in gradient descent is represented by the equation:
 
-- \(theta_new = theta_old - alpha * gradient/)
+- $$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \cdot \nabla J(\theta)$$
 
 Where:
-- theta_old is the old parameter vector.
-- theta_new is the updated parameter vector.
-- alpha is the learning rate.
-- gradient is the gradient of the objective function with respect to the parameters.
+- θ_old is the old parameter vector.
+- θ_new is the updated parameter vector.
+- alpha (α) is the learning rate.
+- ∇J(θ) is the gradient of the objective function with respect to the parameters.
+
 
 
 **Intuition:**
 - At each iteration, we calculate the gradient of the cost function.
 - The parameters are updated in the opposite direction of the gradient.
-- The size of the step is controlled by the learning rate \( \alpha \).
+- The size of the step is controlled by the learning rate α.
 
 **Advantages:**
 - Simple to implement.
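
For reference, a minimal NumPy sketch of the batch gradient-descent update in the hunk above, assuming a mean-squared-error objective on inputs `X` and targets `y` (the function name and data are illustrative, not part of the file):

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, epochs=1000):
    """Batch gradient descent for a linear model with an assumed MSE objective."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Gradient of J(θ) = (1/2n)·||Xθ - y||² with respect to θ
        grad = X.T @ (X @ theta - y) / len(y)
        # θ_new = θ_old - α · ∇J(θ)
        theta -= lr * grad
    return theta
```

Here `lr` plays the role of α and controls the step size at every iteration.
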
@@ -58,9 +61,10 @@ SGD is a variation of gradient descent where we use only one training example to
 
 **Mathematical Formulation:**
 
-- \(theta = theta - alpha * dJ(theta; x_i, y_i) / d(theta)/)
+- $$θ = θ - α \cdot \frac{∂J (θ; xᵢ, yᵢ)}{∂θ}$$
+
 
-\( x_i, y_i \) are a single training example and its target.
+- xᵢ, yᵢ are a single training example and its target.
 
 **Intuition:**
 - At each iteration, a random training example is selected.
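
A minimal NumPy sketch of the per-example update above, under the same assumed MSE objective; the single-example gradient is xᵢ·(xᵢᵀθ - yᵢ):

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=100, seed=0):
    """Stochastic gradient descent: one randomly chosen example per update (assumed MSE loss)."""
    theta = np.zeros(X.shape[1])
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            # Gradient of the loss on the single example (xᵢ, yᵢ)
            grad = X[i] * (X[i] @ theta - y[i])
            theta -= lr * grad
    return theta
```
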
@@ -98,7 +102,8 @@ Mini-Batch Gradient Descent is a variation where instead of a single training ex
 
 **Mathematical Formulation:**
 
-- theta = theta - alpha * (1/k) * sum(dJ(theta; x_i, y_i) / d(theta))
+- $$θ = θ - α \cdot \frac{1}{k} \sum_{i=1}^{k} \frac{∂J (θ; xᵢ, yᵢ)}{∂θ}$$
+
 
 Where:
 - \( k \) is the batch size.
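
A minimal NumPy sketch of the averaged mini-batch update above, with the batch size `k` and the MSE gradient assumed for illustration:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.01, epochs=100, k=32, seed=0):
    """Mini-batch gradient descent: average the gradient over k examples per step."""
    theta = np.zeros(X.shape[1])
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), k):
            batch = idx[start:start + k]
            Xb, yb = X[batch], y[batch]
            # (1/k) · Σ ∂J(θ; xᵢ, yᵢ)/∂θ over the batch
            grad = Xb.T @ (Xb @ theta - yb) / len(batch)
            theta -= lr * grad
    return theta
```
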
@@ -141,14 +146,13 @@ Momentum helps accelerate gradient vectors in the right directions, thus leading
 
 **Mathematical Formulation:**
 
-- v_t = gamma * v_{t-1} + alpha * dJ(theta) / d(theta)
-
-- theta = theta - v_t
+- $$v_t = γ \cdot v_{t-1} + α \cdot ∇J(θ)$$
+- $$θ = θ - v_t$$
 
 where:
 
 - \( v_t \) is the velocity.
-- \( \gamma \) is the momentum term, typically set between 0.9 and 0.99.
+- γ is the momentum term, typically set between 0.9 and 0.99.
 
 **Intuition:**
 - At each iteration, the gradient is calculated.
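
A minimal NumPy sketch of the two momentum equations above (velocity, then parameter update), again with an assumed MSE gradient:

```python
import numpy as np

def momentum_gd(X, y, lr=0.01, epochs=1000, gamma=0.9):
    """Gradient descent with momentum: v_t = γ·v_{t-1} + α·∇J(θ), then θ = θ - v_t."""
    theta = np.zeros(X.shape[1])
    v = np.zeros_like(theta)
    for _ in range(epochs):
        grad = X.T @ (X @ theta - y) / len(y)
        v = gamma * v + lr * grad   # accumulate velocity
        theta -= v                  # θ = θ - v_t
    return theta
```
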
@@ -182,9 +186,11 @@ NAG is a variant of the gradient descent with momentum. It looks ahead by a step
 
 **Mathematical Formulation:**
 
-- v_t = gamma * v_{t-1} + alpha * dJ(theta - gamma * v_{t-1}) / d(theta)
+- $$v_t = γv_{t-1} + α \cdot ∇J(θ - γ \cdot v_{t-1})$$
+
+- $$θ = θ - v_t$$
+
 
-- theta = theta - v_t
 
 
 **Intuition:**
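
A minimal NumPy sketch of the Nesterov look-ahead update in the hunk above, with an assumed MSE gradient evaluated at the look-ahead point θ - γ·v_{t-1}:

```python
import numpy as np

def nesterov_gd(X, y, lr=0.01, epochs=1000, gamma=0.9):
    """Nesterov accelerated gradient: take the gradient at the look-ahead point."""
    theta = np.zeros(X.shape[1])
    v = np.zeros_like(theta)
    for _ in range(epochs):
        lookahead = theta - gamma * v                 # θ - γ·v_{t-1}
        grad = X.T @ (X @ lookahead - y) / len(y)     # ∇J at the look-ahead point
        v = gamma * v + lr * grad
        theta -= v
    return theta
```
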
@@ -220,13 +226,13 @@ AdaGrad adapts the learning rate to the parameters, performing larger updates fo
 
 **Mathematical Formulation:**
 
-- G_t = G_{t-1} + (dJ(theta) / d(theta)) ⊙ (dJ(theta) / d(theta))
+- $$G_t = G_{t-1} + (∂J(θ)/∂θ)^2$$
 
-- theta = theta - (alpha / sqrt(G_t + epsilon)) * (dJ(theta) / d(theta))
+- $$θ = θ - \frac{α}{\sqrt{G_t + ε}} \cdot ∇J(θ)$$
 
 Where:
-- \( G_t \) is the sum of squares of the gradients up to time step \( t \).
-- \( \epsilon \) is a small constant to avoid division by zero.
+- \(G_t\) is the sum of squares of the gradients up to time step \( t \).
+- ε is a small constant to avoid division by zero.
 
 **Intuition:**
 - Accumulates the sum of the squares of the gradients for each parameter.
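
A minimal NumPy sketch of the accumulated-squared-gradient update above; `G` holds the running sum G_t elementwise, and the MSE gradient is assumed:

```python
import numpy as np

def adagrad(X, y, lr=0.1, epochs=1000, epsilon=1e-8):
    """AdaGrad: per-parameter steps scaled by accumulated squared gradients."""
    theta = np.zeros(X.shape[1])
    G = np.zeros_like(theta)
    for _ in range(epochs):
        grad = X.T @ (X @ theta - y) / len(y)
        G += grad ** 2                               # G_t = G_{t-1} + (∂J/∂θ)²
        theta -= lr / np.sqrt(G + epsilon) * grad    # θ = θ - α/√(G_t + ε) · ∇J(θ)
    return theta
```
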
@@ -263,13 +269,13 @@ RMSprop modifies AdaGrad to perform well in non-convex settings by using a movin
 
 **Mathematical Formulation:**
 
-E[g^2]_t = beta * E[g^2]_{t-1} + (1 - beta) * (dJ(theta) / d(theta)) ⊙ (dJ(theta) / d(theta))
+- $$E[g^2]_t = β \cdot E[g^2]_{t-1} + (1 - β)(∂J(θ)/∂θ)^2$$
 
-theta = theta - (alpha / sqrt(E[g^2]_t + epsilon)) * (dJ(theta) / d(theta))
+- $$θ = θ - \frac{α}{\sqrt{E[g^2]_t + ε}} \cdot ∇J(θ)$$
 
 Where:
-- \( E[g^2]_t \) is the exponentially decaying average of past squared gradients.
-- \( \beta \) is the decay rate.
+- \( E[g^2]_t \) is the exponentially decaying average of past squared gradients.
+- β is the decay rate.
 
 **Intuition:**
 - Keeps a running average of the squared gradients.
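
A minimal NumPy sketch of the decaying-average update above; `Eg2` stands for E[g²]_t and the MSE gradient is assumed:

```python
import numpy as np

def rmsprop(X, y, lr=0.01, epochs=1000, beta=0.9, epsilon=1e-8):
    """RMSprop: exponentially decaying average of squared gradients."""
    theta = np.zeros(X.shape[1])
    Eg2 = np.zeros_like(theta)
    for _ in range(epochs):
        grad = X.T @ (X @ theta - y) / len(y)
        Eg2 = beta * Eg2 + (1 - beta) * grad ** 2    # E[g²]_t
        theta -= lr / np.sqrt(Eg2 + epsilon) * grad  # θ = θ - α/√(E[g²]_t + ε) · ∇J(θ)
    return theta
```
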
@@ -304,20 +310,16 @@ Adam (Adaptive Moment Estimation) combines the advantages of both RMSprop and Ad
 
 **Mathematical Formulation:**
 
-- m_t = beta1 * m_{t-1} + (1 - beta1) * (dJ(theta) / d(theta))
-
-- v_t = beta2 * v_{t-1} + (1 - beta2) * ((dJ(theta) / d(theta))^2)
-
-- hat_m_t = m_t / (1 - beta1^t)
-
-- hat_v_t = v_t / (1 - beta2^t)
-
-- theta = theta - (alpha * hat_m_t) / (sqrt(hat_v_t) + epsilon)
+- $$m_t = β_1m_{t-1} + (1 - β_1)(∂J(θ)/∂θ)$$
+- $$v_t = β_2v_{t-1} + (1 - β_2)(∂J(θ)/∂θ)^2$$
+- $$\hat{m}_t = \frac{m_t}{1 - β_1^t}$$
+- $$\hat{v}_t = \frac{v_t}{1 - β_2^t}$$
+- $$θ = θ - \frac{α\hat{m}_t}{\sqrt{\hat{v}_t} + ε}$$
 
 Where:
-- \( m_t \) is the first moment (mean) of the gradient.
-- \( v_t \) is the second moment (uncentered variance) of the gradient.
-- \( \beta_1, \beta_2 \) are the decay rates for the moment estimates.
+- m_t is the first moment (mean) of the gradient.
+- v_t is the second moment (uncentered variance) of the gradient.
+- β_1, β_2 are the decay rates for the moment estimates.
 
 **Intuition:**
 - Keeps track of both the mean and the variance of the gradients.
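
A minimal NumPy sketch of the five Adam equations above, following the `adam(X, y, lr=0.01, epochs=1000, beta1=0.9, beta2=0.999, epsilon=1e-8)` signature visible in the file's final hunk; the MSE gradient is assumed for illustration:

```python
import numpy as np

def adam(X, y, lr=0.01, epochs=1000, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Adam: bias-corrected first and second moment estimates of the gradient."""
    theta = np.zeros(X.shape[1])
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, epochs + 1):
        grad = X.T @ (X @ theta - y) / len(y)
        m = beta1 * m + (1 - beta1) * grad          # first moment (mean)
        v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (uncentered variance)
        m_hat = m / (1 - beta1 ** t)                # bias correction, matters most for small t
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + epsilon)
    return theta
```
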
@@ -352,4 +354,4 @@ def adam(X, y, lr=0.01, epochs=1000, beta1=0.9, beta2=0.999, epsilon=1e-8):
 
 These implementations are basic examples of how these optimizers can be implemented in Python using NumPy. In practice, libraries like TensorFlow and PyTorch provide highly optimized and more sophisticated implementations of these and other optimization algorithms.
 
----
+---
