AI33
Gradient descent is one of the most commonly used optimization algorithms for training machine
learning models. It works by minimizing the error between actual and expected results.
Gradient descent is also used to train neural networks.
Whether we reach a local minimum or a local maximum of a function depends on the direction in
which we move relative to the gradient:
o If we move towards the negative gradient, that is, away from the gradient of the function at
the current point, we approach a local minimum of that function.
o If we move towards the positive gradient, that is, along the gradient of the function at the
current point, we approach a local maximum of that function.
The second procedure is known as gradient ascent. The first is gradient descent, which is also
known as steepest descent. The main objective of the gradient descent algorithm is to minimize
the cost function iteratively. To achieve this goal, it performs two steps repeatedly:
o Calculate the first-order derivative of the function to obtain the gradient, or slope, of that
function at the current point.
o Move in the direction opposite to the gradient, that is, opposite to the direction in which the
slope increases, by alpha times the gradient, where alpha is the learning rate. It is a tuning
parameter in the optimization process that determines the length of the steps.
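The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name gradient_descent, the example function f(x) = (x - 3)^2, and the values of alpha and steps are all assumptions chosen for the example.

```python
def gradient_descent(grad, x0, alpha=0.1, steps=100):
    """Repeat the two steps: evaluate the gradient, then step against it."""
    x = x0
    for _ in range(steps):
        g = grad(x)        # step 1: first-order derivative at the current point
        x = x - alpha * g  # step 2: move against the gradient, scaled by alpha
    return x


# Hypothetical example: f(x) = (x - 3)^2 has derivative 2 * (x - 3)
# and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)
```

Replacing the minus sign in step 2 with a plus sign would turn the same loop into gradient ascent.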
What is Cost-function?
The cost function measures the difference, or error, between actual values and expected
values at the current position, expressed as a single real number. It improves the model by
providing feedback, so that the model can adjust its parameters to reduce the error and find
the local or global minimum. Gradient descent iterates along the direction of the negative
gradient until the gradient approaches zero and the cost function stops decreasing. At this
point, the model stops learning. Although cost function and loss function are often treated as
synonyms, there is a minor difference between them: the loss function refers to the error of
one training example, while the cost function calculates the average error across the entire
training set.
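The distinction between loss and cost can be illustrated with a short sketch. It assumes a squared-error loss, which is only one possible choice, and the function names and sample values are made up for the example.

```python
def squared_loss(y_true, y_pred):
    """Loss: the error of a single training example (squared error assumed)."""
    return (y_true - y_pred) ** 2


def cost(y_true_list, y_pred_list):
    """Cost: the average of the per-example losses over the whole set."""
    losses = [squared_loss(t, p) for t, p in zip(y_true_list, y_pred_list)]
    return sum(losses) / len(losses)


# Made-up actual and predicted values.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

per_example = [squared_loss(t, p) for t, p in zip(actual, predicted)]
total_cost = cost(actual, predicted)
print(per_example, total_cost)
```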
The cost function is calculated after making a hypothesis with initial parameters. These
parameters are then modified using gradient descent over known data to reduce the cost
function.
Hypothesis:
Parameters:
Cost function:
Goal:
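These four items can be written in several ways. One common formulation for simple linear regression, assuming a mean squared error cost over n training pairs (x_i, y_i) with slope m and intercept c, is sketched below. The symbols h, J, n, x_i, and y_i are assumptions introduced here for illustration.

```latex
\begin{aligned}
\text{Hypothesis:}\quad & h(x) = m x + c \\
\text{Parameters:}\quad & m,\ c \\
\text{Cost function:}\quad & J(m, c) = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - h(x_i)\bigr)^2 \\
\text{Goal:}\quad & \min_{m,\, c}\ J(m, c)
\end{aligned}
```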
Before looking at the working principle of gradient descent, we should review how to find the
slope of a line in linear regression. The equation for simple linear regression is
given as: Y=mX+c
where 'm' represents the slope of the line, and 'c' represents the intercept on the y-axis.
The starting point (shown in the figure above) is just an arbitrary point used to evaluate the
initial performance. At this starting point, we compute the first derivative and use the tangent
line to measure the steepness of the slope. This slope then informs the updates to the
parameters (weights and bias).
The slope is steep at the arbitrary starting point, but as new parameters are generated the
steepness gradually reduces until the curve reaches its lowest point, which is called the
point of convergence.
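As a minimal sketch of this behavior, the loop below starts from an arbitrary point and records how the steepness shrinks as the parameters approach the point of convergence. It assumes a mean squared error cost and a small made-up dataset lying exactly on Y = 2X + 1. The data, the learning rate, and the helper name mse_gradients are illustrative assumptions.

```python
def mse_gradients(xs, ys, m, c):
    """Partial derivatives of the mean squared error with respect to m and c."""
    n = len(xs)
    dm = sum(-2 * x * (y - (m * x + c)) for x, y in zip(xs, ys)) / n
    dc = sum(-2 * (y - (m * x + c)) for x, y in zip(xs, ys)) / n
    return dm, dc


# Made-up data generated from Y = 2X + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

m, c = 0.0, 0.0   # arbitrary starting point
alpha = 0.05      # learning rate (assumed value)
slope_sizes = []  # steepness recorded at every iteration

for _ in range(2000):
    dm, dc = mse_gradients(xs, ys, m, c)
    slope_sizes.append(abs(dm) + abs(dc))
    m -= alpha * dm  # update the slope against its partial derivative
    c -= alpha * dc  # update the intercept against its partial derivative

print(m, c, slope_sizes[0], slope_sizes[-1])
```

The first recorded steepness is large and the last is close to zero, which is what reaching the point of convergence looks like numerically.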
The main objective of gradient descent is to minimize the cost function, that is, the error
between expected and actual values. To minimize the cost function, two quantities are required:
the direction and the learning rate.
These two factors determine the partial derivative calculations of future iterations and allow
the algorithm to reach the point of convergence (a local or global minimum). Let's discuss the
learning rate in brief.
Learning Rate:
It is defined as the step size taken to reach the minimum or lowest point. This is typically a small
value that is evaluated and updated based on the behavior of the cost function. A high learning
rate results in larger steps but risks overshooting the minimum. A low learning rate results in
small steps, which reduces overall efficiency because more iterations are needed, but gives the
advantage of more precision.
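The effect of the learning rate can be seen in a small experiment. The sketch below assumes the simple function f(x) = x^2, whose gradient is 2x and whose minimum is at x = 0. The function, the step count, and the three rates are illustrative choices.

```python
def run(alpha, steps=20, x0=1.0):
    """Run gradient descent on f(x) = x^2 and return the final x."""
    x = x0
    for _ in range(steps):
        x = x - alpha * 2 * x  # gradient of x^2 is 2x
    return x


small = run(0.01)    # tiny steps: still far from the minimum after 20 steps
moderate = run(0.1)  # reasonable steps: close to the minimum
large = run(1.1)     # steps too large: every update overshoots and diverges

print(small, moderate, large)
```

With the small rate, x has barely moved after 20 steps. With the large rate, each step jumps past the minimum to a point farther away than before, so the iterate grows instead of shrinking.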
Based on how much of the training data is used to compute the error for each update,
gradient descent can be divided into batch gradient descent, stochastic gradient descent, and
mini-batch gradient descent. Let's look at these types of gradient descent:
Batch gradient descent (BGD) computes the error for each point in the training set and
updates the model only after evaluating all training examples. One such pass over the training
set is known as a training epoch. In simple words, it is a greedy approach in which we have to
sum over all examples for each update.
Stochastic gradient descent (SGD) is a type of gradient descent that processes one training
example per iteration. In other words, within each training epoch it goes through the examples
of the dataset one at a time and updates the model's parameters after each example. As it
requires only one training example at a time, it is easier to fit in allocated memory. However,
it loses some computational efficiency in comparison to batch gradient descent, because its
frequent updates add computational overhead. Due to these frequent updates, its gradient is
also noisy. However, this noise can sometimes be helpful in escaping a local minimum and
finding the global minimum.
In stochastic gradient descent (SGD), learning happens on every example, and it has a few
advantages over the other types of gradient descent.
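A minimal sketch of the per-example update is shown below, assuming a squared-error loss for the line Y = mX + c and made-up data lying exactly on Y = 2X + 1. The function name, learning rate, epoch count, and fixed random seed are illustrative assumptions.

```python
import random


def stochastic_gradient_descent(xs, ys, alpha=0.02, epochs=1000, seed=0):
    """Fit Y = mX + c, updating the parameters after every single example."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    m, c = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a random order
        for i in indices:
            error = ys[i] - (m * xs[i] + c)
            # Gradient of the squared error of this one example only.
            m -= alpha * (-2 * xs[i] * error)
            c -= alpha * (-2 * error)
    return m, c


# Made-up data generated from Y = 2X + 1.
sgd_m, sgd_c = stochastic_gradient_descent(
    [0.0, 1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 5.0, 7.0, 9.0]
)
print(sgd_m, sgd_c)
```

Only one example is held in the inner loop at a time, which is the memory advantage described above. On real, noisy data the individual updates would point in varying directions, which is the source of the noisy gradient.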
Mini-batch gradient descent is a combination of batch gradient descent and
stochastic gradient descent. It divides the training dataset into small batches and then performs
an update on each batch separately. Splitting the training dataset into smaller batches strikes a
balance between the computational efficiency of batch gradient descent and the speed of
stochastic gradient descent. Hence, we get a type of gradient descent with higher computational
efficiency and a less noisy gradient.
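The batching idea can be sketched as follows, again assuming a mean squared error cost for Y = mX + c and made-up data lying on Y = 2X + 1. The function name, batch size, learning rate, and seed are illustrative assumptions.

```python
import random


def minibatch_gradient_descent(xs, ys, batch_size, alpha=0.02, epochs=1000, seed=0):
    """Fit Y = mX + c with one parameter update per mini-batch."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    m, c = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)  # new random split into batches every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average the gradient over the examples in this batch only.
            dm = sum(-2 * x * (y - (m * x + c)) for x, y in batch) / len(batch)
            dc = sum(-2 * (y - (m * x + c)) for x, y in batch) / len(batch)
            m -= alpha * dm
            c -= alpha * dc
    return m, c


# Made-up data generated from Y = 2X + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]

mb_m, mb_c = minibatch_gradient_descent(xs, ys, batch_size=2)
print(mb_m, mb_c)
```

In this sketch, setting batch_size to len(xs) gives one update per epoch, as in batch gradient descent, while setting it to 1 gives one update per example, as in stochastic gradient descent.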