Gradient Descent Optimization
What is Gradient Descent?
• An optimization algorithm for finding a local minimum of a
differentiable function.
• Advantages
• Computationally efficient
• Produces a stable convergence
• Disadvantages
• Requires that the entire training dataset be in memory and available to the algorithm, since every update uses a full pass over the data (see the sketch below)
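A minimal sketch of batch gradient descent for a least-squares linear model; the model, data shapes, and hyperparameters are illustrative assumptions, not taken from the slides.

import numpy as np

def batch_gradient_descent(X, y, lr=0.1, epochs=100):
    """Fit a linear model y ~ X @ w by minimizing mean squared error."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # The gradient is averaged over the ENTIRE dataset before each update,
        # which is why the whole training set must be available in memory.
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad  # one stable, averaged update per full pass
    return w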
Stochastic Gradient Descent (SGD)
• SGD performs an update for each training example in the dataset, i.e. it updates the parameters one example at a time (see the sketch after this list).
• Advantages:
• frequent updates give immediate, fine-grained feedback on how quickly the model is improving.
• Disadvantages:
• more computationally expensive than the batch gradient descent
approach
• Frequent updates produce noisy gradient estimates, which can cause the error rate to jump around instead of decreasing steadily.
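A sketch of the same illustrative least-squares model trained with SGD; the learning rate and epoch count are arbitrary assumptions.

import numpy as np

def sgd(X, y, lr=0.01, epochs=10):
    """Update the parameters after every individual training example."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):  # visit examples in random order
            xi, yi = X[i], y[i]
            grad = (xi @ w - yi) * xi  # gradient from ONE example: cheap but noisy
            w -= lr * grad             # frequent, noisy updates
    return w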
Mini-Batch Gradient Descent
• A combination of the concepts of SGD and batch gradient descent: the training set is split into small batches and the parameters are updated once per batch (see the sketch below).
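A sketch of mini-batch gradient descent for the same illustrative least-squares model; the batch size of 32 and other hyperparameters are assumptions.

import numpy as np

def minibatch_gd(X, y, lr=0.05, epochs=10, batch_size=32):
    """Average the gradient over a small batch: less noisy than SGD, cheaper than full batch."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb) / len(idx)  # gradient averaged over the batch
            w -= lr * grad
    return w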
Pathological Curvature
• Loss surfaces with long, narrow valleys make plain gradient descent oscillate back and forth across the steep walls while making slow progress along the shallow valley floor; this is the main motivation for momentum.
Gradient Descent with Momentum
• Momentum is a method that helps accelerate SGD in the
relevant direction and dampens oscillations
• VdW = β * VdW + (1 - β) * dW
• Vdb = β * Vdb + (1 - β) * db
• where dW and db are the current gradients of the cost with respect to W and b, VdW and Vdb are their exponentially weighted moving averages, and β is the momentum coefficient (commonly around 0.9); a code sketch follows.
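A sketch of a single momentum update implementing the two formulas above; the parameter step W = W - lr * VdW is the standard companion update and is an assumption here, since the slide does not show it.

def momentum_step(W, b, dW, db, VdW, Vdb, lr=0.01, beta=0.9):
    """One parameter update using gradient descent with momentum (works on floats or NumPy arrays)."""
    VdW = beta * VdW + (1 - beta) * dW  # exponentially weighted average of dW
    Vdb = beta * Vdb + (1 - beta) * db  # exponentially weighted average of db
    W = W - lr * VdW                    # step along the smoothed gradient (assumed update rule)
    b = b - lr * Vdb
    return W, b, VdW, Vdb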
Adagrad's per-parameter update
• Adagrad adapts the learning rate to each parameter individually, dividing it by the square root of the sum of that parameter's past squared gradients (see the sketch below).
• Cons:
• The accumulation of the squared gradients in the denominator keeps growing during training, so the effective learning rate shrinks and eventually becomes vanishingly small.
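A sketch of Adagrad's per-parameter update; the epsilon value and learning rate are conventional defaults, not taken from the slides.

import numpy as np

def adagrad_step(w, grad, cache, lr=0.01, eps=1e-8):
    """One Adagrad update: each parameter is scaled by the root of its accumulated squared gradients."""
    cache = cache + grad ** 2                   # running sum of squared gradients only ever grows...
    w = w - lr * grad / (np.sqrt(cache) + eps)  # ...so the effective step size keeps shrinking
    return w, cache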