DS 2
Human minds are more adaptive to visual representations of data than to textual data. We can easily understand things when they are visualized. It is better to represent data through graphs, where we can analyze it more efficiently and make specific decisions based on the analysis. Before learning Matplotlib, we need to understand what data visualization is and why it is important.
Data Visualization
Graphics provide an excellent approach for exploring data, which is essential for presenting results. Data visualization is a relatively new term. It expresses the idea that involves more than just representing data in graphical form (instead of textual form).
This can be very helpful when discovering and getting to know a dataset, and can help with identifying patterns, corrupt data, outliers, and much more. With a little domain knowledge, data visualizations can be used to express and demonstrate key relationships in plots and charts. Statistics focuses on quantitative descriptions and estimations of data; data visualization provides an important set of tools for gaining a qualitative understanding.
There are five key plots that are used for data visualization.
The following phases are essential for making decisions for the organization:
o Visualize: We analyze the raw data, which means making complex data more accessible, understandable, and usable. Tabular representation is used where the user will look up a specific measurement, while charts of several types are used to show patterns or relationships in the data for one or more variables.
o Analysis: Data analysis is defined as cleaning, inspecting, transforming, and modeling data to derive useful information. Whenever we make a decision, for business or in daily life, it is based on past experience: anticipating what will happen if we choose a particular option is nothing but analyzing our past. Because the past may affect the future, proper analysis is necessary for better decisions in any business or organization.
o Document Insight: Document insight is the process where useful data or information is organized in a document in a standard format.
o Transform Data Set: Standardized data is used to make decisions more effectively.
Here are some benefits of data visualization, which help organizations and businesses make effective decisions:
Data visualization allows users to absorb vast amounts of information regarding operational and business conditions. It helps decision-makers see the relationships between multi-dimensional data sets. It offers new ways to analyze data through the use of maps, fever charts, and other rich graphical representations.
Visual data discovery makes it more likely that the organization finds the information it needs, and thus ends up being more productive than competing companies.
The crucial advantage of data visualization is that it makes it possible to find the correlation between operating conditions and business performance in today's highly competitive business environment. The ability to make these types of correlations enables executives to identify the root cause of a problem and act quickly to resolve it.
Suppose a food company is reviewing its monthly customer data, and the data is presented with bar charts which show that the company's score has dropped by five points in the previous months in a particular region; the data suggests that there is a problem with customer satisfaction in that area. Data visualization allows the decision-maker to grasp shifts in customer behavior and market conditions across multiple data sets more efficiently.
Having an idea of customer sentiment and other data discloses emerging opportunities, allowing the company to act on new business opportunities ahead of its competitors.
Geo-spatial visualization has emerged because many websites providing web services attract visitors' interest. Such websites can take advantage of location-specific information, which is already present in the customer details.
Matplotlib
Matplotlib is a Python library defined as a multi-platform data visualization library built on NumPy arrays. It can be used in Python scripts, the shell, web application servers, and other graphical user interface toolkits.
John D. Hunter originally conceived Matplotlib in 2002. It has an active development community and is distributed under a BSD-style license. Its first version was released in 2003, and version 3.1.1 was released on 1 July 2019.
Matplotlib 2.0.x supports Python versions 2.7 to 3.6. Python 3 support started with Matplotlib 1.2, and Matplotlib 1.4 was the last version to support Python 2.6.
There are various toolkits available that are used to enhance the functionality of Matplotlib. Some of these tools are downloaded separately; others are shipped with the Matplotlib source code but have external dependencies.
o Basemap: It is a map plotting toolkit with several map projections, coastlines, and political boundaries.
o Cartopy: It is a mapping library consisting of object-oriented map projection definitions, and arbitrary point, line, polygon, and image transformation abilities.
o Excel tools: Matplotlib provides utilities for exchanging data with Microsoft Excel.
o Natgrid: It is an interface to the natgrid library for gridding irregularly spaced data.
Matplotlib Architecture
There are three different layers in the architecture of Matplotlib, which are the following:
o Backend Layer
o Artist Layer
o Scripting Layer
Backend layer
The backend layer is the bottom layer of the stack, which consists of the implementations of the various functions that are necessary for plotting. There are three essential classes in the backend layer: FigureCanvas (the surface on which the figure will be drawn), Renderer (the class that takes care of the drawing on the surface), and Event (which handles mouse and keyboard events).
Artist Layer
The artist layer is the second layer in the architecture. It is responsible for the various plotting functions, like axis, which coordinate how the renderer is used on the figure canvas.
Scripting layer
The scripting layer is the topmost layer, on which most of our code will run. The methods in the scripting layer almost automatically take care of the other layers, and all we need to care about is the current state (figure & subplot).
Axes: A Figure can contain several Axes. Each Axes consists of two (or three, in the case of 3D) Axis objects and has a title, an x-label, and a y-label.
Axis: Axis objects are the number-line-like objects responsible for generating the graph limits.
Artist: An Artist is everything we can see on the graph, like Text objects, Line2D objects, and collection objects. Most Artists are tied to Axes.
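As a minimal sketch of this anatomy (using the standard pyplot API; the data values here are arbitrary), the following creates a Figure containing a single Axes and sets its title and labels:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()             # a Figure containing a single Axes
ax.plot([1, 2, 3], [2, 4, 9])        # a Line2D Artist drawn on the Axes
ax.set_title('Anatomy of a figure')  # the Axes title
ax.set_xlabel('x')                   # x-axis label
ax.set_ylabel('y')                   # y-axis label
plt.show()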
Installing Matplotlib
Before working with Matplotlib or its plotting functions, it first needs to be installed. The installation of Matplotlib depends on the distribution that is installed on your computer. The installation methods are the following:
The easiest way to install Matplotlib is to download the Anaconda distribution of Python. Matplotlib is pre-installed in the Anaconda distribution; no further installation steps are necessary.
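If you are not using Anaconda, Matplotlib can also be installed from PyPI with pip; the usual command (assuming a working Python and pip are already on your PATH) is:

pip install matplotlib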
1. Line graph
The line graph is one of the charts which shows information as a series of data points connected by straight line segments. The graph is plotted with the plot() function. The line graph is simple to plot; let's consider the following example:

import matplotlib.pyplot as plt

x = [4,8,9]
y = [10,12,15]

plt.plot(x,y)
plt.xlabel('X axis')
plt.ylabel('Y axis')
plt.show()
Output:
We can customize the graph by importing the style module. The style module is built into the Matplotlib installation. It contains various styles to make plots more attractive. In the program below, we are using the style module (the two plot() calls with labels are the natural completion of this example, since legend() requires labeled lines):

import matplotlib.pyplot as plt
from matplotlib import style

style.use('ggplot')
x = [16, 8, 10]
y = [8, 16, 6]
y2 = [6, 15, 7]

fig = plt.figure()
plt.plot(x, y, 'g', label='line one', linewidth=3)
plt.plot(x, y2, 'r', label='line two', linewidth=3)
plt.legend()
plt.show()
Output:
In Matplotlib, the figure (an instance of the class plt.Figure) can be thought of as a single container that contains all the objects representing axes, graphics, text, and labels.
2. Bar graphs
Bar graphs are one of the most common types of graphs and are used to show data associated with categorical variables. Matplotlib provides a bar() function to make bar graphs, which accepts arguments such as the categorical variables, their values, and color.
import matplotlib.pyplot as plt

players = ['Virat','Rohit','Shikhar','Hardik']
runs = [51,87,45,67]
plt.bar(players,runs,color = 'green')
plt.xlabel('Players')
plt.ylabel('Runs')
plt.show()
Output:
Another function, barh(), is used to make horizontal bar graphs. The yerr argument (for vertical graphs) or xerr (for horizontal graphs) can be passed to depict the variance in our data. A horizontal version of the previous example looks as follows:
import matplotlib.pyplot as plt

players = ['Virat','Rohit','Shikhar','Hardik']
runs = [51,87,45,67]
plt.barh(players,runs,color = 'green')
plt.xlabel('Runs')
plt.ylabel('Players')
plt.show()
Output:
3. Scatter plot
Scatter plots are mostly used for comparing variables, for example when we need to determine how much one variable is affected by another. The data is displayed as a collection of points. Each point has the value of one variable determining its position on the horizontal axis, and the value of the other variable determining its position on the vertical axis.
Example:

import matplotlib.pyplot as plt
from matplotlib import style

style.use('ggplot')

x = [5,7,10]
y = [18,10,6]

x2 = [6,9,11]
y2 = [7,14,17]

plt.scatter(x, y)
plt.scatter(x2, y2)

plt.show()
Output:
Linear algebra is a key tool in data science. It helps data scientists manage and analyze large datasets. By using vectors and matrices, linear algebra simplifies operations. This makes data easier to work with and understand.
In this article, we are going to learn about the importance of linear algebra in data science, including its applications and challenges.
Linear algebra in data science refers to the use of mathematical concepts involving vectors, matrices, and linear transformations to manipulate and analyze data. It provides useful tools for most algorithms and processes in data science, such as machine learning, statistics, and big data analytics.
In the field of data science, linear algebra supports various tasks. These include algorithm design, data processing, and machine learning. With linear algebra, complex problems become simpler. It turns theoretical data models into practical solutions that can be applied in real-world situations.
Understanding linear algebra is key to becoming a skilled data scientist. Linear algebra is important in data science for the following reasons:
Many data science algorithms rely on linear algebra to work fast and accurately.
It supports major machine learning techniques, like regression and classification.
Techniques like Principal Component Analysis for reducing data dimensionality depend on it.
It solves optimization problems, helping find the best solutions in complex data scenarios.
Linear algebra is a branch of mathematics useful for understanding and working with arrays of numbers known as matrices and vectors. Let us understand some of the key concepts of linear algebra below:
o Matrix Operations: Operations such as addition, subtraction, multiplication, and inversion that are crucial for various data transformations and algorithms.
o Eigenvalues and Eigenvectors: These are used to understand data distributions and are crucial in methods such as Principal Component Analysis (PCA), which reduces dimensionality.
o Singular Value Decomposition (SVD): A method for decomposing a matrix into singular values and vectors, useful for noise reduction and data compression in data science.
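As a brief illustration of these concepts (a sketch using NumPy; the 2×2 matrix is arbitrary), the following computes a matrix product, an inverse, eigenvalues/eigenvectors, and an SVD:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# Matrix operations: product and inverse
product = A @ A
A_inv = np.linalg.inv(A)

# Eigenvalues and eigenvectors (the machinery behind PCA)
eigvals, eigvecs = np.linalg.eig(A)

# Singular Value Decomposition: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A)

print(eigvals)  # the two eigenvalues of A
print(S)        # singular values, in descending order
```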
Linear algebra turns complex problems into manageable solutions. Here are some of the most common applications of linear algebra in data science:
Image Processing
In image processing, linear algebra streamlines tasks like scaling and rotating images. Matrices represent images as arrays of pixel values. This representation helps in transforming the images efficiently.
Natural Language Processing (NLP)
NLP uses vectors to represent words. This technique is known as word embedding. Vectors help in modeling word relationships and meanings. For example, vector space models can determine synonyms based on proximity.
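To make this concrete, here is a small sketch with made-up 3-dimensional embedding vectors (real embeddings are learned and much higher-dimensional) that measures word similarity by cosine similarity:

```python
import numpy as np

# Hypothetical word embeddings (toy 3-d vectors, for illustration only)
king = np.array([0.80, 0.65, 0.10])
queen = np.array([0.75, 0.70, 0.12])
apple = np.array([0.10, 0.20, 0.90])

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| |v|)
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(king, queen))  # close to 1: similar words
print(cosine_similarity(king, apple))  # much smaller: dissimilar words
```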
Regression Analysis
Linear algebra is used to fit data into models. This process predicts future trends from past data. Least squares, a method that minimizes the difference between observed and predicted values, relies heavily on matrix operations.
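A minimal least-squares fit in NumPy (with small made-up data points) looks like this; np.linalg.lstsq solves for the coefficients that minimize the squared error:

```python
import numpy as np

# Made-up observations: y is roughly 2*x + 1 plus noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix with a column of ones for the intercept
A = np.column_stack([x, np.ones_like(x)])

# Solve the least-squares problem A @ [slope, intercept] ~= y
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
print(slope, intercept)  # approximately 2 and 1
```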
Network Analysis
In network analysis, matrices store and manage data about connections. For instance, adjacency matrices can represent social networks. They show connections between persons or items, aiding in understanding network structures.
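For example, a sketch with a tiny made-up friendship network: the adjacency matrix encodes who is connected to whom, and matrix powers count paths between nodes:

```python
import numpy as np

# Hypothetical 4-person network: entry [i, j] = 1 if persons i and j are connected
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

degrees = A.sum(axis=1)                 # number of connections per person
paths_2 = np.linalg.matrix_power(A, 2)  # entry [i, j] counts 2-step paths from i to j

print(degrees)
print(paths_2)
```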
Optimization Problems
Linear algebra solves optimization problems in data science. It helps find values that minimize or maximize some function. Linear programming problems often use matrix notations for constraints and objectives, streamlining the solution process.
Some techniques in linear algebra can be applied to solve complex and high-dimensional data problems effectively in data science. Some of the advanced techniques in linear algebra for data science are:
Singular Value Decomposition (SVD)
Singular Value Decomposition breaks down a matrix into three key components. These components make it easier to analyze data. For example, SVD is used in recommender systems. It helps in identifying patterns that connect user preferences with products.
Tensor Decompositions
Tensor decompositions extend matrix techniques to multi-dimensional data. They are vital in handling data from multiple sources or categories. For instance, in healthcare, tensor decompositions analyze patient data across various conditions and treatments to find hidden patterns.
Conjugate Gradient Method
The conjugate gradient method is used for solving large systems of linear equations that are common in simulations. It is faster than traditional methods when dealing with sparse matrices. This is important in physics simulations where space and time variables interact.
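A minimal sketch using SciPy's conjugate gradient solver (scipy.sparse.linalg.cg); the small symmetric positive-definite matrix here stands in for a large sparse system:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import cg

# Small symmetric positive-definite system A x = b
A = csr_matrix(np.array([[4.0, 1.0, 0.0],
                         [1.0, 3.0, 1.0],
                         [0.0, 1.0, 2.0]]))
b = np.array([1.0, 2.0, 3.0])

x, info = cg(A, b)  # info == 0 means the solver converged
print(x, info)
```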
There are some difficulties that one faces in learning linear algebra for data science. These challenges show the complexities involved in mastering linear algebra for effective use in data science. Overcoming them requires structured learning and practical application.
Let us learn about some of the most common challenges in learning linear algebra for data science.
Abstract Concepts
Linear algebra involves many abstract concepts like vectors, matrices, and transformations. These can be hard to visualize. For beginners, understanding how these concepts translate to solving real-world data problems is often challenging. A common struggle is seeing how theoretical matrix operations apply to practical tasks like image recognition.
Steep Learning Curve
The learning curve for linear algebra is steep, especially for those without a strong mathematical background. Learning to perform operations like matrix inversion and eigenvalue decomposition can be daunting. For instance, mastering eigenvalues and eigenvectors is crucial for PCA, but understanding their importance and computations takes some effort.
Bridging Theory and Practice
Applying linear algebra in data science requires bridging theory with practical application. Learners often find it difficult to connect the dots between abstract mathematical theories and their practical implementation in software like Python's NumPy or MATLAB. This gap makes it hard to apply learned concepts directly to data science projects.
Breadth of Applications
Linear algebra is used in a wide range of data science applications, from natural language processing to computer vision. For learners, understanding where to apply specific linear algebra techniques across different domains can be overwhelming. Each field may use the same mathematical tools in subtly different ways.
When we speak of a data set, we refer to either a sample or a population. If statistical inference is our goal, we'll wish ultimately to use sample numerical descriptive measures to make inferences about the corresponding measures for the population. Two aspects of a data set are of particular interest:
1. The central tendency of the set of measurements: the tendency of the data to cluster, or center, about certain numerical values
2. The variability, or spread, of the measurements about those central values
1. Mean
The **mean** of a set of quantitative data is the sum of the measurements divided by the number of measurements contained in the data set.
The sample mean, x̄, will play an important role in accomplishing our objective of making inferences about populations based on sample information.
For this reason, we need to use a different symbol for the mean of a population.
We'll often use the sample mean x̄ to estimate (make an inference about) the population mean, μ.
Example
For example, the percentages of revenues spent on R&D by the population consisting of all U.S. companies has a mean equal to some value, μ.
Our sample of 50 companies yielded percentages with a mean of x̄ = 8.492. If, as is usually the case, we don't have access to the measurements for the entire population, we could use x̄ as an estimator or approximator for μ.
Then we'd need to know something about the reliability of our inference; that is, we'd need to know how accurately we might expect x̄ to estimate μ.
In the next tutorial, we'll find that this accuracy depends on two factors:
1. The size of the sample. The larger the sample, the more accurate the estimate will tend to be.
2. The variability, or spread, of the data. All other factors remaining constant, the more variable the data, the less accurate the estimate.
Median
The median of a quantitative data set is the middle number when the measurements are arranged in ascending (or descending) order.
In certain situations, the median may be a better measure of central tendency than the mean. In particular, the median is less sensitive than the mean to extremely large or small measurements.
A data set is said to be skewed if one tail of the distribution has more extreme observations than the other tail.
Mode
The mode is the measurement that occurs most frequently in the data set.
Measures of central tendency provide only a partial description of a quantitative data set. The description is incomplete without a measure of the variability, or spread, of the data set. Knowledge of the data's variability along with its center can help us visualize the shape of a data set as well as its extreme values.
The sample variance, s², for a sample of n measurements is equal to the sum of the squared deviations from the mean divided by (n − 1):
s² = Σ(xᵢ − x̄)² / (n − 1)
The sample standard deviation, s, is defined as the positive square root of the sample variance.
Notice that, unlike the variance, the standard deviation is expressed in the original units of measurement. For example, if the original measurements are in dollars, the variance is expressed in the peculiar units "dollars squared", but the standard deviation is expressed in dollars.
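As a quick sketch in NumPy (the measurement values are made up), note that ddof=1 selects the (n − 1) divisor used for the sample variance and standard deviation:

```python
import numpy as np

data = np.array([8.1, 7.9, 9.2, 8.5, 8.7])  # hypothetical measurements

mean = data.mean()
sample_var = data.var(ddof=1)  # divides by (n - 1), not n
sample_std = data.std(ddof=1)  # positive square root of the sample variance

print(mean, sample_var, sample_std)
```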
You may wonder why we use the divisor (n − 1) instead of n when calculating the sample variance. Wouldn't using n be more logical, so that the sample variance would be the average squared deviation from the mean?
The trouble is that using n tends to produce an underestimate of the population variance, so we use (n − 1) in the denominator to provide the appropriate correction for this tendency.
You now know that the standard deviation measures the variability of a set of data: the larger the standard deviation, the more variable the data; the smaller the standard deviation, the less variable the data.
We've seen that if we are comparing the variability of two samples selected from a population, the sample with the larger standard deviation is the more variable of the two. Thus, we know how to interpret the standard deviation on a relative or comparative basis, but we haven't explained how it provides a measure of variability for a single sample.
To understand how the standard deviation provides a measure of variability of a data set, consider a specific data set and ask how much of the data falls within 1, 2, or 3 standard deviations of the mean.
The Empirical Rule is a rule of thumb that applies to data sets with frequency distributions that are mound-shaped and symmetric (bell-shaped):
Approximately 68% of the measurements will fall within 1 standard deviation of the mean.
Approximately 95% of the measurements will fall within 2 standard deviations of the mean.
Approximately 99.7% (essentially all) of the measurements will fall within 3 standard deviations of the mean.
Example
A manufacturer of automobile batteries claims that the average length of life for its grade A battery is 60 months. However, the guarantee on this brand is for just 36 months. Suppose the standard deviation of the life length is known to be 10 months, and the frequency distribution of the life-length data is known to be mound-shaped.
1. Approximately what percentage of the manufacturer's grade A batteries will last more than 50 months, assuming the manufacturer's claim is true?
2. Approximately what percentage of the manufacturer's batteries will last less than 40 months, assuming the manufacturer's claim is true?
3. Suppose your battery lasts 37 months. What could you infer about the manufacturer's claim?
Answer
1. Since 50 months is 1 standard deviation below the mean, the percentage of batteries lasting more than 50 months is approximately 34% (between 50 and 60 months) plus 50% (greater than 60 months). Thus, approximately 84% of the batteries should have a life length exceeding 50 months.
2. The percentage of batteries that last less than 40 months can also be easily determined: 40 months is 2 standard deviations below the mean, so approximately 2.5% of the batteries should fail prior to 40 months, assuming the manufacturer's claim is true.
3. If you are so unfortunate that your grade A battery fails at 37 months, you can make one of two inferences: either your battery was one of the approximately 2.5% that fail prior to 40 months, or something about the manufacturer's claim is not true. Because the chances are so small that a battery fails before 40 months, you would have good reason to have serious doubts about the manufacturer's claim. A mean smaller than 60 months and/or a standard deviation larger than 10 months would both increase the likelihood of failure prior to 40 months.
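These Empirical Rule approximations can be checked against the exact normal distribution (a sketch using scipy.stats, assuming life lengths are normally distributed with mean 60 and standard deviation 10):

```python
from scipy.stats import norm

mean, sd = 60, 10

# P(life > 50 months): from 1 standard deviation below the mean upward
print(norm.sf(50, loc=mean, scale=sd))   # ~0.84

# P(life < 40 months): more than 2 standard deviations below the mean
print(norm.cdf(40, loc=mean, scale=sd))  # ~0.023
```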
Another measure of relative standing in popular use is the z-score. As you can see in the definition of the z-score below, the z-score makes use of the mean and standard deviation of the data set in order to specify the relative location of a measurement. Note that the z-score is calculated by subtracting x̄ (or μ) from the measurement x and then dividing the result by s (or σ). The final result, the z-score, represents the distance between a given measurement x and the mean, expressed in standard deviations.
z = (x − x̄) / s   (sample z-score)
z = (x − μ) / σ   (population z-score)
Example
A random sample of 2,000 students who sat for the Graduate Management Admission Test (GMAT) is selected.
For this sample, the mean GMAT score is x̄ = 540 points and the standard deviation is s = 100 points.
One student from the sample, Kara Smith, had a GMAT score of x = 440 points. What is Kara's sample z-score?
(440-540)/100
## [1] -1
This z-score implies that Kara Smith's GMAT score is 1.0 standard deviation below the sample mean GMAT score, or, in short, her sample z-score is -1.0.
Restated in terms of z-scores, the Empirical Rule says that approximately 99.7% (almost all) of the measurements will have a z-score between -3 and 3.
Simpson's Paradox is a phenomenon in statistics where a trend observed within multiple groups of data reverses or disappears when these groups are combined. This paradox can lead to misleading conclusions if the data is not carefully analyzed, taking into account the potential confounding factors. Understanding Simpson's Paradox is crucial in data science for proper data analysis and interpretation.
Consider success rates recorded separately for two groups, Group A and Group B, broken down by gender. If we look at the data separately, males have a higher success rate in both Group A and Group B. However, when the groups are combined, females have a higher overall success rate, which seems to contradict the separate group findings. This is Simpson's Paradox.
Two common causes of Simpson's Paradox are:
1. **Data Aggregation**: Combining data without considering the underlying groupings can mask the true relationships.
2. **Sample Size Variability**: Differences in sample sizes among groups can lead to misleading overall trends.
To detect the paradox:
1. **Group Analysis**: Always analyze data within relevant subgroups before aggregating.
2. **Statistical Tests**: Use statistical tests to check for consistency across groups.
3. **Visualization**: Plotting the data can help identify if trends differ across groups.
Here's how you can visualize Simpson's Paradox using Python and Matplotlib (the counts in the example data are hypothetical, chosen so that the paradox appears):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Example data: hypothetical counts chosen to exhibit the paradox
data = {'Group': ['A', 'A', 'B', 'B'],
        'Gender': ['Male', 'Female', 'Male', 'Female'],
        'Success': [9, 80, 30, 2],
        'Total': [10, 100, 100, 10]}
df = pd.DataFrame(data)
df['Rate'] = df['Success'] / df['Total']

fig, ax = plt.subplots(1, 2, figsize=(10, 4))
df.pivot(index='Group', columns='Gender', values='Rate').plot(kind='bar', ax=ax[0])
ax[0].set_ylabel('Success Rate')
ax[0].legend()

# Combined data: aggregate the counts, then recompute the rate
combined_data = df.groupby('Gender')[['Success', 'Total']].sum()
(combined_data['Success'] / combined_data['Total']).plot(kind='bar', ax=ax[1])
ax[1].set_ylabel('Success Rate')
plt.tight_layout()
plt.show()
```
To avoid falling into the trap of Simpson's Paradox, consider the following best practices:
1. **Confounder Identification**: Identify and account for potential confounding variables.
2. **Contextual Understanding**: Understand the context and domain-specific factors that may influence the data.
### Applications in Data Science
Recognizing and addressing Simpson's Paradox is essential in data science to ensure accurate analysis and avoid erroneous conclusions.
### Correlation
Correlation measures the statistical relationship between two variables. It quantifies the degree to which the variables move in relation to each other. The most common measure of correlation is the Pearson correlation coefficient, which ranges from -1 to 1.
- **Positive Correlation**: As one variable increases, the other variable also increases.
- **Negative Correlation**: As one variable increases, the other variable decreases.
#### Example
```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
x = np.random.rand(100)
# A hypothetical positively correlated variable: linear in x plus noise
y = 2 * x + np.random.normal(0, 0.2, 100)
print(np.corrcoef(x, y)[0, 1])  # Pearson correlation coefficient, near 1

# Plotting
plt.scatter(x, y)
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
```
### Causation
Causation indicates that one event is the result of the occurrence of the other event; there is a cause-and-effect relationship. Establishing causation requires more rigorous analysis beyond observing correlations.
#### Example
Ways to investigate causal relationships include:
1. **Longitudinal Studies**: Observing subjects over time to see if changes in one variable cause changes in another.
2. **Statistical Methods**: Using techniques like regression analysis, instrumental variables, and Granger causality tests to infer causal relationships.
Common pitfalls when inferring causation from correlation:
- **Confounding Variables**: A third variable may influence both correlated variables, creating a spurious relationship.
- **Reverse Causation**: It might be that variable Y causes variable X rather than the other way around.
Approaches for establishing causation include:
1. **Randomized Controlled Trials (RCTs)**: Randomly assign subjects to treatment and control groups to isolate the effect of the treatment.
2. **Natural Experiments**: Utilize naturally occurring events that mimic random assignment.
3. **Regression Analysis**: Use statistical models to control for confounding variables and isolate the effect of interest.
4. **Instrumental Variables**: Use external variables that affect the independent variable but not the dependent variable directly.
5. **Granger Causality**: Test whether one time series can predict another time series.
For example, a simple regression analysis with statsmodels (the study-hours values here are hypothetical, paired with the scores):

```python
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Example data (the 'Study Hours' values are hypothetical)
data = {
    'Study Hours': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'Scores': [70, 75, 80, 85, 90, 95, 100, 105, 110]
}
df = pd.DataFrame(data)
X = df['Study Hours']
y = df['Scores']
X = sm.add_constant(X)  # add an intercept column

# Fitting the regression model
model = sm.OLS(y, X).fit()
print(model.summary())

# Plotting
plt.scatter(df['Study Hours'], df['Scores'])
plt.plot(df['Study Hours'], model.predict(X), color='red')
plt.xlabel('Study Hours')
plt.ylabel('Scores')
plt.show()
```
### Conclusion
- **Correlation** is useful for identifying relationships between variables but does not imply that one variable causes changes in another.
- **Causation** indicates a cause-and-effect relationship and requires more rigorous methods to establish.
- Understanding the distinction and using appropriate methods to infer causation are essential in data science to avoid misleading conclusions and make informed decisions.
Gradient Descent
Gradient Descent is known as one of the most commonly used optimization algorithms to train machine learning models by means of minimizing errors between actual and expected results. Further, gradient descent is also used to train neural networks.
In this tutorial on Gradient Descent in Machine Learning, we will learn in detail about gradient descent, the role of cost functions (specifically as a barometer within machine learning), types of gradient descent, learning rates, etc.
What is Gradient Descent or Steepest Descent?
Gradient descent was initially discovered by Augustin-Louis Cauchy in the mid-19th century. Gradient descent is defined as one of the most commonly used iterative optimization algorithms of machine learning, used to train machine learning and deep learning models. It helps in finding the local minimum of a function.
The best way to define the local minimum or local maximum of a function using gradient descent is as follows:
o If we move towards a negative gradient, or away from the gradient of the function at the current point, we will reach the local minimum of that function.
o Whenever we move towards a positive gradient, or towards the gradient of the function at the current point, we will get the local maximum of that function.
The first procedure, moving against the gradient, is known as gradient descent, which is also known as steepest descent (the second is gradient ascent). The main objective of using a gradient descent algorithm is to minimize the cost function using iteration. To achieve this goal, it performs two steps iteratively (a small sketch follows this list):
o Calculate the first-order derivative of the function to compute the gradient, or slope, of that function.
o Move away from the direction of the gradient, which means moving from the current point by alpha times the gradient, where alpha is defined as the learning rate. It is a tuning parameter in the optimization process which helps to decide the length of the steps.
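As a minimal sketch of these two steps (assuming, for illustration, the toy function f(w) = (w − 3)², whose derivative is 2(w − 3), and an arbitrary learning rate):

```python
def grad(w):
    # first-order derivative of the toy function f(w) = (w - 3)**2
    return 2 * (w - 3)

w = 0.0      # arbitrary starting point
alpha = 0.1  # learning rate: controls the step length

for _ in range(100):
    w = w - alpha * grad(w)  # step against the gradient direction

print(w)  # converges towards the minimizer w = 3
```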
The cost function is defined as the measurement of the difference, or error, between actual values and predicted values at the current position, presented in the form of a single real number. It helps to improve machine learning efficiency by providing feedback to the model so that it can minimize error and find the local or global minimum. Further, gradient descent continuously iterates along the direction of the negative gradient until the cost function approaches zero. At this point, the model will stop learning further. Although cost function and loss function are considered synonymous, there is a minor difference between them: a loss function refers to the error of one training example, while a cost function calculates the average error across an entire training set.
The cost function is calculated after making a hypothesis with initial parameters, and these parameters are modified using the gradient descent algorithm over known data to reduce the cost function. For simple linear regression, the standard setup is:
Hypothesis: h(x) = θ₀ + θ₁x
Parameters: θ₀, θ₁
Goal: minimize the cost function J(θ₀, θ₁)
Before examining the working principle of gradient descent, we should know some basic concepts for finding the slope of a line from linear regression. The equation for simple linear regression is given as:
Y = mX + c
where 'm' represents the slope of the line and 'c' represents the intercept on the y-axis.
The starting point is just an arbitrary point used to evaluate the performance. At this starting point, we derive the first derivative, or slope, and then use a tangent line to calculate the steepness of this slope. Further, this slope will inform the updates to the parameters (weights and bias).
The slope is steep at the starting point, but whenever new parameters are generated, the steepness gradually reduces, until the curve reaches its lowest point, which is called the point of convergence.
The main objective of gradient descent is to minimize the cost function, or the error between expected and actual values. To minimize the cost function, two data points are required: the direction (the gradient) and the learning rate. These two factors are used to determine the partial derivative calculations of future iterations, allowing gradient descent to arrive at the point of convergence, the local or global minimum. Let's discuss the learning rate in brief:
Learning Rate:
It is defined as the step size taken to reach the minimum or lowest point. It is typically a small value that is evaluated and updated based on the behavior of the cost function. If the learning rate is high, it results in larger steps, but there is a risk of overshooting the minimum. A low learning rate results in small step sizes, which compromises overall efficiency but gives the advantage of more precision.
Based on the error in various training models, the gradient descent learning algorithm can be divided into batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. Let's understand these different types of gradient descent:
Batch gradient descent (BGD) is used to find the error for each point in the training set and update the model after evaluating all training examples. This procedure is known as a training epoch. In simple words, it is a greedy approach where we have to sum over all examples for each update.
Stochastic gradient descent (SGD) is a type of gradient descent that processes one training example per iteration. In other words, it processes a training epoch for each example within a dataset and updates the parameters one training example at a time. As it requires only one training example at a time, it is easier to store in allocated memory. However, it shows some computational efficiency losses in comparison to batch gradient descent, because its frequent updates cost more computation. Further, due to the frequent updates, the gradient is noisy. However, sometimes this noise can be helpful in finding the global minimum and escaping local minima.
In stochastic gradient descent (SGD), learning happens on every example, and it has a few advantages over the other types of gradient descent.
Mini-batch gradient descent is the combination of both batch gradient descent and stochastic gradient descent. It divides the training dataset into small batches and then performs the updates on those batches separately. Splitting the training dataset into smaller batches strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent. Hence, we can achieve a special type of gradient descent with higher computational efficiency and a less noisy gradient.
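The three variants differ only in how many examples feed each update. Here is a compact sketch (NumPy, on a made-up linear-regression problem; the data, learning rate, and epoch count are illustrative): batch_size = 1 gives SGD, batch_size = len(X) gives batch gradient descent, and anything in between is mini-batch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))        # made-up features
y = X @ np.array([2.0, -1.0]) + 0.5  # made-up linear target

w, b = np.zeros(2), 0.0
alpha, batch_size = 0.1, 16          # learning rate, mini-batch size

for epoch in range(50):
    idx = rng.permutation(len(X))    # shuffle the examples each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        err = X[batch] @ w + b - y[batch]           # prediction error on the batch
        w -= alpha * X[batch].T @ err / len(batch)  # gradient step for w (squared-error loss)
        b -= alpha * err.mean()                     # gradient step for b

print(w, b)  # approaches [2.0, -1.0] and 0.5
```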
Although gradient descent is one of the most popular methods for optimization problems, it still has some challenges. There are a few challenges as follows:
For convex problems, gradient descent can find the global minimum easily, while for non-convex problems, it is sometimes difficult to find the global minimum, which is where the machine learning models achieve the best results.
Whenever the slope of the cost function is at or near zero, the model stops learning further. Apart from the global minimum, there are other scenarios that can exhibit this slope: saddle points and local minima. Local minima generate a shape similar to the global minimum, where the slope of the cost function increases on both sides of the current point.
In contrast, at saddle points the negative gradient only occurs on one side of the point, which reaches a local maximum on one side and a local minimum on the other side. The name saddle point is taken from that of a horse's saddle.
The name local minimum is used because the value of the loss function is minimum at that point in a local region. In contrast, the name global minimum is used because the value of the loss function is minimum there globally, across the entire domain of the loss function.
In a deep neural network, if the model is trained with gradient descent and backpropagation, two more issues can occur besides local minima and saddle points.
Vanishing Gradients:
A vanishing gradient occurs when the gradient is smaller than expected. During backpropagation, the gradient becomes smaller and smaller, causing the earlier layers of the network to learn more slowly than the later layers. When this happens, the weight updates become so small that they are insignificant, and the model stops improving.
Exploding Gradient:
The exploding gradient is just the opposite of the vanishing gradient: it occurs when the gradient is too large and creates an unstable model. In this scenario, the model weights grow too large and will eventually be represented as NaN. This problem can be addressed using dimensionality reduction techniques, which help to minimize complexity within the model.
Gradient Descent (extra)
What is Gradient?
A gradient is nothing but a derivative that defines the effect on the output of a function of a small variation in its input.
Gradient Descent (GD) is a widely used optimization algorithm in machine learning and deep learning that minimizes the cost function of a neural network model during training. It works by iteratively adjusting the weights or parameters of the model in the direction of the negative gradient of the cost function until the minimum of the cost function is reached.
The learning happens during backpropagation while training a neural network-based model. Gradient descent is used to optimize the weights and biases based on the cost function. The cost function evaluates the difference between the actual and predicted outputs.
It iteratively adjusts model parameters by moving in the direction of the steepest decrease in the cost function.
The algorithm calculates gradients, representing the partial derivatives of the cost function with respect to each parameter.
These gradients guide the updates, ensuring convergence towards the optimal parameter values that yield the lowest possible cost.
Gradient descent is versatile and applicable to various machine learning models, including linear regression and neural networks. Its strength lies in navigating the parameter space efficiently, enabling models to learn patterns and make accurate predictions. Adjusting the learning rate is crucial to balance convergence speed and avoid overshooting the optimal solution.
Diving further into the concept, let's understand it in depth with a practical implementation.
Python3
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
Python3
torch.manual_seed(42)
num_samples = 1000
x = torch.randn(num_samples, 2)

# create random weights and bias for the linear regression model
true_weights = torch.tensor([[1.3, -1.0]])  # assumed values; shape (1, 2) matches the two features
true_bias = torch.tensor([-3.5])

# Target variable
y = x @ true_weights.T + true_bias

# Visualize each input feature against the target
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].scatter(x[:, 0], y)
ax[1].scatter(x[:, 1], y)
ax[0].set_xlabel('X1')
ax[0].set_ylabel('Y')
ax[1].set_xlabel('X2')
ax[1].set_ylabel('Y')
plt.show()
Output:
X vs Y
Python3
class LinearRegression(nn.Module):
    def __init__(self, input_size, output_size):
        super(LinearRegression, self).__init__()
        # a single linear layer: y = x @ W.T + b
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, x):
        out = self.linear(x)
        return out

input_size = x.shape[1]
output_size = 1
model = LinearRegression(input_size, output_size)
Note:
The number of weight values will be equal to the input size of the model, and the input size in deep learning is the number of independent input features, i.e., the features we are feeding into the model. In our case, there are two input features, so the input size will be two, and the corresponding weight values will also be two.
Python3
# initialize random parameter values
# (weight shape matches nn.Linear: (output_size, input_size))
weight = torch.rand(1, 2)
bias = torch.rand(1)

weight_param = nn.Parameter(weight)
bias_param = nn.Parameter(bias)

# replace the model's randomly initialized parameters with ours
model.linear.weight = weight_param
model.linear.bias = bias_param

print('Weight :', weight)
print('bias :', bias)
Output:
tensor([0.5710], requires_grad=True)
Prediction
Python3
y_p = model(x)
y_p[:5]
Output:
tensor([[ 0.7760],
[-0.8944],
[-0.3369],
[-0.3095],
[ 1.7338]], grad_fn=<SliceBackward0>)
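The setup above ends with an untrained prediction; a gradient descent training loop would come next. Below is a minimal sketch of such a loop (the MSE loss, SGD optimizer, learning rate, and epoch count are illustrative assumptions, following the standard PyTorch pattern of backward() plus an optimizer step):
Python3
# a minimal sketch of training the model above with gradient descent;
# the loss, learning rate, and number of epochs are illustrative choices
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(1000):
    optimizer.zero_grad()      # reset accumulated gradients
    y_pred = model(x)          # forward pass
    loss = loss_fn(y_pred, y)  # mean squared error against the targets
    loss.backward()            # backpropagation: compute gradients
    optimizer.step()           # gradient descent update of weight and bias

print(model.linear.weight, model.linear.bias)  # should approach true_weights and true_bias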