DS 2


Matplotlib (Python Plotting Library)

The human mind is more adaptive to visual representations of data than to textual data: we understand things more easily when they are visualized. It is better to represent data graphically, where we can analyze it more efficiently and make specific decisions according to that analysis. Before learning Matplotlib, we need to understand data visualization and why it is important.

Data Visualization

Graphics provide an excellent approach for exploring data, which is essential for presenting results. Data visualization involves more than just representing data in graphical form (instead of textual form).

Visualization can be very helpful when discovering and getting to know a dataset, and can help with identifying patterns, corrupt data, outliers, and much more. With a little domain knowledge, data visualizations can be used to express and demonstrate key relationships in plots and charts. Statistics focuses on quantitative description and estimation of data; visualization adds an important set of tools for gaining a qualitative understanding.

Several phases are essential to making decisions for an organization:

o Visualize: We analyze the raw data; visualization makes complex data more accessible, understandable, and usable. A tabular representation is used when the user needs to look up a specific measurement, while charts of various types are used to show patterns or relationships in the data for one or more variables.

o Analysis: Data analysis is defined as cleaning, inspecting, transforming, and modeling data to derive useful information. Whenever we make a decision, in business or in daily life, we draw on past experience: judging what will happen if we choose a particular option is nothing but analyzing our past. Proper analysis is therefore necessary for better decisions in any business or organization.

o Document Insight: Document insight is the process of organizing useful data or information in a document in a standard format.

o Transform Data Set: Standardized data is used to make decisions more effectively.

Why do we need data visualization?


Data visualization can perform the following tasks:

o It identifies areas that need improvement and attention.

o It clarifies the factors that drive outcomes.

o It helps to understand which product to place where.

o It helps predict sales volumes.

Benefits of Data Visualization

Here are some benefits of data visualization that help organizations and businesses make effective decisions:

1. Building ways of absorbing information

Data visualization allows users to absorb vast amounts of information about operational and business conditions. It helps decision-makers see the relationships between multi-dimensional data sets, and it offers new ways to analyze data through maps, fever charts, and other rich graphical representations.

Organizations practicing visual data discovery are more likely to find the information they need and tend to end up more productive than competing companies.

2. Visualizing relationships and patterns in business

A crucial advantage of data visualization is that it reveals correlations between operating conditions and business performance in today's highly competitive business environment. The ability to make these correlations enables executives to identify the root cause of a problem and act quickly to resolve it.

Suppose a food company reviewing its monthly customer data presents it as bar charts and sees that its score dropped by five points in the previous month in a particular region; the data suggest a problem with customer satisfaction in that area.

3. Acting on emerging trends faster

Data visualization allows decision-makers to grasp shifts in customer behavior and market conditions across multiple data sets more efficiently.

Insight into customer sentiment and other data can disclose emerging opportunities, letting the company act on new business ahead of its competitors.

4. Geography-based visualization

Geospatial visualization has grown with the many websites that provide web services and attract visitors' interest. Such websites can take advantage of the location-specific information that is already present in customer details.

Matplotlib is a multi-platform data visualization library for Python, built on NumPy arrays. It can be used in Python scripts, the shell, web applications, and other graphical user interface toolkits.

John D. Hunter originally conceived Matplotlib in 2002. It has an active development community and is distributed under a BSD-style license. Its first version was released in 2003; version 3.1.1 was released on 1 July 2019.

Matplotlib 2.0.x supports Python versions 2.7 to 3.6. Python 3 support started with Matplotlib 1.2, and Matplotlib 1.4 is the last version that supports Python 2.6.

Various toolkits are available to extend the functionality of Matplotlib. Some of these tools are downloaded separately; others ship with the Matplotlib source code but have external dependencies.

o Basemap: A map-plotting toolkit with several map projections, coastlines, and political boundaries.

o Cartopy: A mapping library featuring object-oriented map projection definitions, and arbitrary point, line, polygon, and image transformation capabilities.

o Excel tools: Utilities for exchanging data with Microsoft Excel.

o Mplot3d: Used for 3D plots.

o Natgrid: An interface to the natgrid library for gridding irregularly spaced data.

Matplotlib Architecture

There are three different layers in the architecture of Matplotlib:

o Backend layer

o Artist layer

o Scripting layer

Backend layer

The backend layer is the bottom layer of the stack; it implements the various functions necessary for plotting. It has three essential classes: FigureCanvas (the surface on which the figure is drawn), Renderer (the class that does the drawing on that surface), and Event (which handles mouse and keyboard events).

Artist Layer

The artist layer is the second layer in the architecture. It is responsible for the various plotting elements, such as the axes, and coordinates how the renderer is used on the figure canvas.

Scripting layer

The scripting layer is the topmost layer, on which most of our code runs. Its methods take care of the other layers almost automatically; all we need to track is the current state (figure and subplot).

The General Concept of Matplotlib

A Matplotlib figure can be divided into the following parts:

Figure: The whole figure, which may hold one or more Axes (plots). We can think of a Figure as a canvas that holds plots.

Axes: A Figure can contain several Axes. Each Axes consists of two (or three, in the case of 3D) Axis objects and has a title, an x-label, and a y-label.

Axis: Axis objects are the number-line-like objects responsible for setting the graph limits and generating the ticks.

Artist: An Artist is everything we see on the graph, such as Text objects, Line2D objects, and collection objects. Most Artists are tied to an Axes.
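These parts can be inspected directly from code; a minimal sketch using the standard pyplot interface (the plotted values are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the example runs headless
import matplotlib.pyplot as plt

# Figure: the whole canvas; Axes: one plot inside it
fig, ax = plt.subplots()

# plot() creates a Line2D artist; set_title/set_xlabel create Text artists
line, = ax.plot([1, 2, 3], [2, 4, 1])
ax.set_title("Anatomy demo")
ax.set_xlabel("x")
ax.set_ylabel("y")

# Every visible element is an Artist tied to the Figure or an Axes
print(type(fig).__name__)   # Figure
print(type(line).__name__)  # Line2D
```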

Installing Matplotlib

Before working with Matplotlib and its plotting functions, it needs to be installed. The installation method depends on the Python distribution installed on your computer.

Use the Anaconda distribution of Python

The easiest way to get Matplotlib is to download the Anaconda distribution of Python. Matplotlib comes pre-installed with Anaconda, so no further installation steps are necessary.

o Visit the official Anaconda site and click on the Download button.

o Choose the download that matches your Python interpreter configuration.
Creating different types of graph

1. Line graph

A line graph shows information as a series of data points connected by straight lines. The graph is plotted with the plot() function. A line graph is simple to plot; consider the following example:

```python
from matplotlib import pyplot as plt

x = [4, 8, 9]
y = [10, 12, 15]

plt.plot(x, y)

plt.title("Line graph")
plt.ylabel('Y axis')
plt.xlabel('X axis')
plt.show()
```

Output:

We can customize the graph by importing the style module, which is built into the Matplotlib installation and contains various styles to make plots more attractive. The program below uses the style module:

```python
from matplotlib import pyplot as plt
from matplotlib import style

style.use('ggplot')
x = [16, 8, 10]
y = [8, 16, 6]
x2 = [8, 15, 11]
y2 = [6, 15, 7]
plt.plot(x, y, 'r', label='line one', linewidth=5)
plt.plot(x2, y2, 'm', label='line two', linewidth=5)
plt.title('Epic Info')
plt.ylabel('Y axis')
plt.xlabel('X axis')
plt.legend()
plt.grid(True, color='k')
plt.show()
```

Output:

In Matplotlib, the figure (an instance of the class plt.Figure) can be thought of as a single container that holds all the objects denoting axes, graphics, text, and labels.

2. Bar graphs

Bar graphs are one of the most common types of graphs and are used to show data associated with categorical variables. Matplotlib provides bar(), which accepts arguments such as the categorical variables, their values, and a color.

```python
from matplotlib import pyplot as plt

players = ['Virat', 'Rohit', 'Shikhar', 'Hardik']
runs = [51, 87, 45, 67]
plt.bar(players, runs, color='green')
plt.title('Score Card')
plt.xlabel('Players')
plt.ylabel('Runs')
plt.show()
```

Output:

Another function, barh(), makes horizontal bar graphs. Error bars can be added with yerr (for vertical graphs) or xerr (for horizontal graphs) to depict the variance in the data:

```python
from matplotlib import pyplot as plt

players = ['Virat', 'Rohit', 'Shikhar', 'Hardik']
runs = [51, 87, 45, 67]
plt.barh(players, runs, color='green')
plt.title('Score Card')
plt.xlabel('Runs')
plt.ylabel('Players')
plt.show()
```

Output:
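To illustrate the error-bar arguments mentioned above, the vertical bar graph can be drawn with yerr; the spread values here are illustrative, not real match data:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the example runs headless
from matplotlib import pyplot as plt

players = ['Virat', 'Rohit', 'Shikhar', 'Hardik']
runs = [51, 87, 45, 67]
spread = [5, 8, 4, 6]  # illustrative variability in each player's scores

# yerr adds vertical error bars to bar(); for barh(), use xerr instead
plt.bar(players, runs, yerr=spread, color='green', capsize=4)
plt.title('Score Card')
plt.xlabel('Players')
plt.ylabel('Runs')
plt.show()
```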
3. Scatter plot

Scatter plots are mostly used for comparing variables, when we need to see how much one variable is affected by another. The data is displayed as a collection of points; each point's value of one variable determines its position on the horizontal axis, and its value of the other variable determines its position on the vertical axis.

Let's consider the following simple example:

Example-1:

```python
from matplotlib import pyplot as plt
from matplotlib import style

style.use('ggplot')

x = [5, 7, 10]
y = [18, 10, 6]

x2 = [6, 9, 11]
y2 = [7, 14, 17]

plt.scatter(x, y)

plt.scatter(x2, y2, color='g')

plt.title('Epic Info')
plt.ylabel('Y axis')
plt.xlabel('X axis')

plt.show()
```

Output:

Linear Algebra Required for Data Science

Linear algebra is a key tool in data science. It helps data scientists manage and analyze large datasets. By using vectors and matrices, linear algebra simplifies operations. This makes data easier to work with and understand.

In this article, we are going to learn about the importance of linear algebra in data science, including its applications and challenges.

Linear Algebra in Data Science

Linear algebra in data science refers to the use of mathematical concepts involving vectors, matrices, and linear transformations to manipulate and analyze data. It provides useful tools for most algorithms and processes in data science, such as machine learning, statistics, and big data analytics.

In the field of data science, linear algebra supports various tasks. These include algorithm design, data processing, and machine learning. With linear algebra, complex problems become simpler. It turns theoretical data models into practical solutions that can be applied in real-world situations.

Importance of Linear Algebra in Data Science

Understanding linear algebra is key to becoming a skilled data scientist. Linear algebra is important in data science because of the following reasons:

 It helps in organizing and manipulating large data sets with efficiency.

 Many data science algorithms rely on linear algebra to work fast and accurately.

 It supports major machine learning techniques, like regression and classification.

 Techniques like Principal Component Analysis for reducing data dimensionality depend on it.

 Linear algebra is used to alter and analyze images and signals.

 It solves optimization problems, helping find the best solutions in complex data scenarios.

Key Concepts in Linear Algebra

Linear algebra is a branch of mathematics useful for understanding and working with arrays of numbers known as matrices and vectors. The table below summarizes some of the key concepts:

| Concept | Description |
| --- | --- |
| Vectors | Fundamental entities in linear algebra representing quantities with both magnitude and direction, used extensively to model data in data science. |
| Matrices | Rectangular arrays of numbers, which are essential for representing and manipulating data sets. |
| Matrix Operations | Operations such as addition, subtraction, multiplication, and inversion that are crucial for various data transformations and algorithms. |
| Eigenvalues and Eigenvectors | These are used to understand data distributions and are crucial in methods such as Principal Component Analysis (PCA), which reduces dimensionality. |
| Singular Value Decomposition (SVD) | A method for decomposing a matrix into singular values and vectors, useful for noise reduction and data compression in data science. |
| Principal Component Analysis (PCA) | A statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables. |

Applications of Linear Algebra in Data Science

Linear algebra turns complex problems into manageable solutions. Here are some of the most common applications of linear algebra in data science:

Machine Learning Algorithms

Linear algebra is vital for machine learning. It helps in creating and training models. For instance, in regression analysis, matrices represent data sets. This simplifies calculations across vast numbers of data points.

Image Processing

In image processing, linear algebra streamlines tasks like scaling and rotating images. Matrices represent images as arrays of pixel values. This representation helps in transforming the images efficiently.

Natural Language Processing (NLP)

NLP uses vectors to represent words. This technique is known as word embedding. Vectors help in modeling word relationships and meanings. For example, vector space models can determine synonyms based on proximity.
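The word-embedding idea can be sketched with toy vectors; the numbers below are illustrative, not from a trained model, and proximity is measured with cosine similarity:

```python
import numpy as np

# Toy 3-dimensional "embeddings" (illustrative numbers, not a trained model)
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.85, 0.82, 0.15]),
    "apple": np.array([0.1, 0.2, 0.95]),
}

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| |b|); near 1 means "close" in vector space
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["king"], vectors["queen"]))  # high
print(cosine_similarity(vectors["king"], vectors["apple"]))  # low
```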

Data Fitting and Predictions

Linear algebra is used to fit data into models. This process predicts future trends from past data. Least squares, a method that minimizes the difference between observed and predicted values, relies heavily on matrix operations.
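Least squares can be carried out with matrix operations in NumPy; a small sketch fitting a line to illustrative data points:

```python
import numpy as np

# Fit y = m*x + c to noisy points by least squares (illustrative data)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])  # roughly y = 2x + 1

# Design matrix with a column of ones for the intercept
A = np.column_stack([x, np.ones_like(x)])

# lstsq minimizes ||A @ [m, c] - y||^2 using matrix operations
(m, c), residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
print(m, c)  # close to 2 and 1
```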

Network Analysis

In network analysis, matrices store and manage data about connections. For instance, adjacency matrices can represent social networks. They show connections between persons or items, aiding in understanding network structures.

Optimization Problems

Linear algebra solves optimization problems in data science. It helps find values that minimize or maximize some function. Linear programming problems often use matrix notation for constraints and objectives, streamlining the solution process.

Advanced Techniques in Linear Algebra for Data Science

Some techniques in linear algebra can be applied to solve complex, high-dimensional data problems effectively in data science. Some of these advanced techniques are:

1. Singular Value Decomposition (SVD)

2. Principal Component Analysis (PCA)

3. Tensor Decompositions

4. Conjugate Gradient Method

Singular Value Decomposition (SVD)

Singular Value Decomposition breaks down a matrix into three key components. These components make it easier to analyze data. For example, SVD is used in recommender systems. It helps in identifying patterns that connect user preferences with products.
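A minimal sketch of SVD with NumPy, using an illustrative ratings-style matrix (the data is made up for the example):

```python
import numpy as np

# A tiny user x item ratings matrix (illustrative data)
R = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0]])

# SVD factors R into U (user patterns), s (strengths), Vt (item patterns)
U, s, Vt = np.linalg.svd(R)

# Reconstructing from all singular values recovers R exactly
R_full = U @ np.diag(s) @ Vt

# Keeping only the top component gives a low-rank approximation
R_rank1 = s[0] * np.outer(U[:, 0], Vt[0, :])
print(np.round(R_full, 6))
```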

Principal Component Analysis (PCA)

Principal Component Analysis reduces the dimensionality of data while keeping the most important information. It simplifies complex data sets. In face recognition technology, PCA helps in isolating the features that distinguish one face from another efficiently.
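PCA can be sketched by eigendecomposing the covariance matrix with NumPy; the data here is synthetic, generated only for illustration:

```python
import numpy as np

# Illustrative 2-D data that mostly varies along one direction
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 0.5 * t + 0.05 * rng.normal(size=200)])

# PCA: center the data, then eigendecompose the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues

pc1 = eigvecs[:, -1]  # first principal component: direction of max variance
explained = eigvals[-1] / eigvals.sum()
print(round(float(explained), 3))  # close to 1: one direction dominates
```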

Tensor Decompositions

Tensor decompositions extend matrix techniques to multi-dimensional data. They are vital in handling data from multiple sources or categories. For instance, in healthcare, tensor decompositions analyze patient data across various conditions and treatments to find hidden patterns.

Conjugate Gradient Method

The conjugate gradient method is used for solving large systems of linear equations that are common in simulations. It is faster than traditional methods when dealing with sparse matrices. This is important in physics simulations where space and time variables interact.

Challenges in Learning Linear Algebra for Data Science

There are some difficulties that one faces in learning linear algebra for data science. These challenges show the complexities involved in mastering linear algebra for effective use in data science. Overcoming them requires structured learning and practical application.

Let us learn about some of the most common challenges in learning linear algebra for data science.

Abstract Concepts

Linear algebra involves many abstract concepts like vectors, matrices, and transformations. These can be hard to visualize. For beginners, understanding how these concepts translate to solving real-world data problems is often challenging. A common struggle is seeing how theoretical matrix operations apply to practical tasks like image recognition.

Steep Learning Curve

The learning curve for linear algebra is steep, especially for those without a strong mathematical background. Learning to perform operations like matrix inversion and eigenvalue decomposition can be daunting. For instance, mastering eigenvalues and eigenvectors is crucial for PCA, but understanding their importance and computation takes some effort.

Bridging Theory and Practice

Applying linear algebra in data science requires bridging theory with practical application. Learners often find it difficult to connect the dots between abstract mathematical theories and their practical implementation in software like Python's NumPy or MATLAB. This gap makes it hard to apply learned concepts directly to data science projects.

Overwhelming Range of Applications

Linear algebra is used in a wide range of data science applications, from natural language processing to computer vision. For learners, understanding where to apply specific linear algebra techniques across different domains can be overwhelming. Each field may use the same mathematical tools in subtly different ways.

Representation of Problems in Linear Algebra

Problems in linear algebra can be represented in various ways, depending on the specific context and application. Here are some common representations:

 System of Linear Equations

 Matrix Operations

 Eigenvalue Problems

 Geometric Interpretation

 Application-Specific Representations
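The first representation above, a system of linear equations, can be solved directly with NumPy; a minimal sketch with illustrative coefficients:

```python
import numpy as np

# Solve the system (illustrative coefficients):
#   2x +  y = 5
#    x + 3y = 10
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])

solution = np.linalg.solve(A, b)
print(solution)  # x = 1, y = 3
```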

Methods for Describing a Set of Data


1.1 Numerical Measures of Central Tendency

When we speak of a data set, we refer to either a sample or a population. If statistical inference is our goal, we'll ultimately wish to use sample numerical descriptive measures to make inferences about the corresponding measures for the population. Two properties of a data set matter most:

1. The central tendency of the set of measurements: the tendency of the data to cluster, or center, about certain numerical values.

2. The variability of the set of measurements: the spread of the data.

1.1.1 Measures of central tendency

1. Mean

The **mean** of a set of quantitative data is the sum of the measurements divided by the number of measurements contained in the data set.
The sample mean, x̄, will play an important role in accomplishing our objective of making inferences about populations based on sample information.

For this reason, we need to use a different symbol for the mean of a population:

 x̄: sample mean

 μ: population mean

We'll often use the sample mean x̄ to estimate (make an inference about) the population mean, μ.

Example

For example, the percentages of revenues spent on R&D by the population consisting of all U.S. companies has a mean equal to some value, μ.

Our sample of 50 companies yielded percentages with a mean of x̄ = 8.492. If, as is usually the case, we don't have access to the measurements for the entire population, we could use x̄ as an estimator, or approximator, for μ.

Then we'd need to know something about the reliability of our inference; that is, we'd need to know how accurately we might expect x̄ to estimate μ.

In the next tutorial, we'll find that this accuracy depends on two factors:

1. The size of the sample. The larger the sample, the more accurate the estimate will tend to be.

2. The variability, or spread, of the data. All other factors remaining constant, the more variable the data, the less accurate the estimate.

Median

The median of a quantitative data set is the middle number when the measurements are arranged in ascending (or descending) order.

In certain situations, the median may be a better measure of central tendency than the mean. In particular, the median is less sensitive than the mean to extremely large or small measurements.

A data set is said to be skewed if one tail of the distribution has more extreme observations than the other tail.

Mode

The mode is the measurement that occurs most frequently in the data set.
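The three measures above can be computed with Python's standard `statistics` module; the data set is illustrative:

```python
import statistics

data = [8, 5, 8, 9, 3, 8, 6]  # illustrative sample

print(statistics.mean(data))    # sum of measurements / number of measurements
print(statistics.median(data))  # middle value of the sorted data: 8
print(statistics.mode(data))    # most frequent value: 8
```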

1.1.2 Numerical Measures of Variability

Measures of central tendency provide only a partial description of a quantitative data set. The description is incomplete without a measure of the variability, or spread, of the data set.

Knowledge of the data's variability along with its center can help us visualize the shape of a data set as well as its extreme values.

The sample variance for a sample of n measurements is equal to the sum of the squared deviations from the mean divided by (n − 1).

Formula

The sample variance:

s² = Σ(xᵢ − x̄)² / (n − 1)

Note that the population variance is:

σ² = Σ(xᵢ − μ)² / N

The second step in finding a meaningful measure of data variability is to calculate the standard deviation of the data set.

The sample standard deviation is defined as the positive square root of the sample variance. Notice that, unlike the variance, the standard deviation is expressed in the original units of measurement. For example, if the original measurements are in dollars, the variance is expressed in the peculiar units "dollars squared", but the standard deviation is expressed in dollars.

You may wonder why we use the divisor (n − 1) instead of n when calculating the sample variance. Wouldn't using n be more logical, so that the sample variance would be the average squared deviation from the mean?

The trouble is that using n tends to produce an underestimate of the population variance, so we use (n − 1) in the denominator to provide the appropriate correction for this tendency.

You now know that the standard deviation measures the variability of a set of data:

 The larger the standard deviation, the more variable the data.

 The smaller the standard deviation, the less variable the data.
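The (n − 1) versus n distinction is built into Python's `statistics` module, which can be used to check the definitions above (the sample is illustrative):

```python
import statistics

sample = [10.0, 12.0, 15.0, 11.0, 12.0]  # illustrative measurements

# statistics.variance / stdev use the (n - 1) divisor (sample statistics);
# pvariance / pstdev use n (population statistics)
s2 = statistics.variance(sample)     # sum of squared deviations / (n - 1)
s = statistics.stdev(sample)         # positive square root of s2
sigma2 = statistics.pvariance(sample)  # sum of squared deviations / n

print(s2, s, sigma2)  # the sample variance is larger than the population variance
```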

1.1.3 Using the Mean and Standard Deviation to Describe Data

We've seen that if we are comparing the variability of two samples selected from a population, the sample with the larger standard deviation is the more variable of the two. Thus, we know how to interpret the standard deviation on a relative or comparative basis, but we haven't explained how it provides a measure of variability for a single sample.

To understand how the standard deviation provides a measure of variability of a data set, consider a specific data set and answer the following questions:

 How many measurements are within 1 standard deviation of the mean?

 How many measurements are within 2 standard deviations?

The Empirical Rule is a rule of thumb that applies to data sets with frequency distributions that are mound-shaped and symmetric:

 Approximately 68% of the measurements will fall within 1 standard deviation of the mean.

 Approximately 95% of the measurements will fall within 2 standard deviations of the mean.

 Approximately 99.7% (essentially all) of the measurements will fall within 3 standard deviations of the mean.
Example

A manufacturer of automobile batteries claims that the average length of life for its grade A battery is 60 months. However, the guarantee on this brand is for just 36 months. Suppose the standard deviation of the life length is known to be 10 months, and the frequency distribution of the life-length data is known to be mound-shaped.

1. Approximately what percentage of the manufacturer's grade A batteries will last more than 50 months, assuming the manufacturer's claim is true?

2. Approximately what percentage of the manufacturer's batteries will last less than 40 months, assuming the manufacturer's claim is true?

3. Suppose your battery lasts 37 months. What could you infer about the manufacturer's claim?

Answer

1. The percentage of batteries lasting more than 50 months is approximately 34% (between 50 and 60 months) plus 50% (greater than 60 months). Thus, approximately 84% of the batteries should have a life length exceeding 50 months.

2. The percentage of batteries that last less than 40 months can also be easily determined. Approximately 2.5% of the batteries should fail prior to 40 months, assuming the manufacturer's claim is true.

3. If you are so unfortunate that your grade A battery fails at 37 months, you can make one of two inferences: either your battery was one of the approximately 2.5% that fail prior to 40 months, or something about the manufacturer's claim is not true. Because the chance that a battery fails before 40 months is so small, you would have good reason for serious doubts about the manufacturer's claim. A mean smaller than 60 months and/or a standard deviation larger than 10 months would both increase the likelihood of failure prior to 40 months.
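Assuming the life lengths are exactly normal with mean 60 and standard deviation 10, the Empirical Rule figures used above can be cross-checked with Python's `statistics.NormalDist`:

```python
from statistics import NormalDist

life = NormalDist(mu=60, sigma=10)  # manufacturer's claimed distribution

# P(life > 50): one standard deviation below the mean
more_than_50 = 1 - life.cdf(50)

# P(life < 40): two standard deviations below the mean
less_than_40 = life.cdf(40)

print(round(more_than_50, 3))  # about 0.841 (Empirical Rule: ~84%)
print(round(less_than_40, 3))  # about 0.023 (Empirical Rule: ~2.5%)
```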

1.2 Numerical Measures of Relative Standing

Another measure of relative standing in popular use is the z-score. As the definition below shows, the z-score makes use of the mean and standard deviation of the data set in order to specify the relative location of a measurement. The z-score is calculated by subtracting x̄ (or μ) from the measurement x and then dividing the result by s (or σ). The final result, the z-score, represents the distance between a given measurement x and the mean, expressed in standard deviations.

z = (x − x̄) / s, where s is the sample standard deviation

z = (x − μ) / σ, for a population

Example

A random sample of 2,000 students who sat for the Graduate Management Admission Test (GMAT) is selected.

For this sample, the mean GMAT score is x̄ = 540 points and the standard deviation is s = 100 points.

One student from the sample, Kara Smith, had a GMAT score of x = 440 points. What is Kara's sample z-score?

z = (440 − 540) / 100 = −1

This z-score implies that Kara Smith's GMAT score is 1.0 standard deviation below the sample mean GMAT score; in short, her sample z-score is −1.0.

Interpretation of z-Scores for Mound-Shaped Distributions of Data

1. Approximately 68% of the measurements will have a z-score between -1 and 1.

2. Approximately 95% of the measurements will have a z-score between -2 and 2.

3. Approximately 99.7% (almost all) of the measurements will have a z-score between -3 and 3.
Simpson's Paradox

Simpson's Paradox is a phenomenon in statistics where a trend observed within multiple groups of data reverses or disappears when these groups are combined. This paradox can lead to misleading conclusions if the data is not carefully analyzed, taking into account the potential confounding factors. Understanding Simpson's Paradox is crucial in data science for proper data analysis and interpretation.

### Explanation with an Example

Consider the following scenario:

#### Data of Two Groups

Group A:

- Male: 70 out of 100 (70%)

- Female: 30 out of 50 (60%)

Group B:

- Male: 20 out of 30 (66.7%)

- Female: 80 out of 100 (80%)

#### Combined Data

When combined, the data might look like this:

- Male: 90 out of 130 (69.2%)

- Female: 110 out of 150 (73.3%)

Here, looking at the groups separately, males have the higher success rate in Group A while females have the higher success rate in Group B. When the groups are combined, however, females have the higher overall success rate: because the group sizes differ, aggregation distorts the picture given by the separate group findings. This reversal or distortion of within-group trends under aggregation is Simpson's Paradox.

### Causes of Simpson's Paradox


1. **Confounding Variables**: Hidden variables that affect the results.

2. **Data Aggregation**: Combining data without considering the underlying groupings can mask the true relationships.

3. **Sample Size Variability**: Differences in sample sizes among groups can lead to misleading overall trends.

### Detecting Simpson's Paradox

1. **Group Analysis**: Always analyze data within relevant subgroups before aggregating.

2. **Statistical Tests**: Use statistical tests to check for consistency across groups.

3. **Visualization**: Plotting the data can help identify whether trends differ across groups.

### Example Using Python

Here's how you can visualize Simpson's Paradox using Python and Matplotlib:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Example data
data = {
    'Group': ['A', 'A', 'B', 'B'],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Successes': [70, 30, 20, 80],
    'Total': [100, 50, 30, 100]
}

df = pd.DataFrame(data)
df['Success Rate'] = df['Successes'] / df['Total']

# Plotting the success rates
fig, ax = plt.subplots(1, 2, figsize=(12, 6))

# Plotting for each group
for group in df['Group'].unique():
    group_data = df[df['Group'] == group]
    ax[0].bar(group_data['Gender'], group_data['Success Rate'], label=f'Group {group}')

ax[0].set_title('Success Rates by Gender in Each Group')
ax[0].set_ylabel('Success Rate')
ax[0].legend()

# Combined data (sum only the numeric columns)
combined_data = df.groupby('Gender')[['Successes', 'Total']].sum()
combined_data['Success Rate'] = combined_data['Successes'] / combined_data['Total']

ax[1].bar(combined_data.index, combined_data['Success Rate'], color=['blue', 'orange'])
ax[1].set_title('Combined Success Rates by Gender')
ax[1].set_ylabel('Success Rate')

plt.tight_layout()
plt.show()
```

### Avoiding Misinterpretation

To avoid falling into the trap of Simpson's Paradox, consider the following best practices:

1. **Detailed Analysis**: Perform detailed analysis of subgroups before drawing conclusions.

2. **Confounder Identification**: Identify and account for potential confounding variables.

3. **Contextual Understanding**: Understand the context and domain-specific factors that may influence the data.
### Applications in Data Science

1. **Healthcare**: Evaluating treatment effectiveness across different patient groups.

2. **Economics**: Analyzing economic indicators across different regions or demographics.

3. **Marketing**: Understanding customer behavior across various segments.

Recognizing and addressing Simpson's Paradox is essential in data science to ensure accurate analysis and avoid erroneous conclusions.

In data science, understanding the difference between correlation and causation is critical for accurate data analysis and drawing meaningful conclusions. Here's a detailed look at these concepts:

### Correlation

Correlation measures the statistical relationship between two variables. It quantifies the degree to which the variables move in relation to each other. The most common measure of correlation is the Pearson correlation coefficient, which ranges from -1 to 1.

- **Positive Correlation**: As one variable increases, the other variable also increases.

- **Negative Correlation**: As one variable increases, the other variable decreases.

- **Zero Correlation**: No linear relationship between the variables.

#### Example

```python

import numpy as np
import matplotlib.pyplot as plt

# Generating sample data
np.random.seed(0)
x = np.random.rand(100)
y = 2 * x + np.random.normal(0, 0.1, 100)

# Calculating correlation
correlation = np.corrcoef(x, y)[0, 1]

# Plotting
plt.scatter(x, y)
plt.title(f'Scatter Plot with Correlation = {correlation:.2f}')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

```

### Causation

Causation indicates that one event is the result of the occurrence of the other event; there is a cause-and-effect relationship. Establishing causation requires more rigorous analysis than simply observing correlations.

#### Example

To establish causation, one might use:

1. **Controlled Experiments**: Randomly assigning subjects to different groups to isolate the effect of a variable.

2. **Longitudinal Studies**: Observing subjects over time to see if changes in one variable cause changes in another.

3. **Statistical Methods**: Using techniques like regression analysis, instrumental variables, and Granger causality tests to infer causal relationships.

### Why Correlation Does Not Imply Causation

- **Confounding Variables**: A third variable may influence both correlated variables, creating a spurious relationship.

- **Reverse Causation**: It might be that variable Y causes variable X rather than the other way around.

- **Coincidence**: The correlation may be coincidental, especially in small datasets.

#### Example of Misleading Correlation

Consider ice cream sales and drowning incidents. Both may increase in the summer, leading to a positive correlation. However, buying ice cream does not cause drowning; instead, the warm weather (a confounding variable) influences both.
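The role of the confounder can be made concrete with a small simulation. The sketch below uses made-up numbers, not real sales or incident data: temperature drives both series, so they correlate strongly, yet the correlation largely disappears once temperature is controlled for.

```python
import numpy as np

# Hypothetical illustration: temperature (the confounder) drives both series.
rng = np.random.default_rng(0)
temperature = rng.uniform(10, 35, 365)                      # daily temperature
ice_cream_sales = 5 * temperature + rng.normal(0, 10, 365)  # made-up sales
drownings = 0.3 * temperature + rng.normal(0, 1, 365)       # made-up incidents

# Strong correlation, even though neither variable causes the other
r = np.corrcoef(ice_cream_sales, drownings)[0, 1]
print(f"raw correlation: {r:.2f}")

# Control for the confounder: correlate the residuals left after regressing
# each series on temperature; the spurious relationship largely vanishes
res_ice = ice_cream_sales - np.polyval(np.polyfit(temperature, ice_cream_sales, 1), temperature)
res_drown = drownings - np.polyval(np.polyfit(temperature, drownings, 1), temperature)
r_partial = np.corrcoef(res_ice, res_drown)[0, 1]
print(f"partial correlation: {r_partial:.2f}")
```

Residualizing on the confounder is a simple form of "controlling for" a variable; regression analysis (discussed below) generalizes the same idea.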

### Identifying Causation

To identify causation, consider the following approaches:

1. **Randomized Controlled Trials (RCTs)**: Randomly assign subjects to treatment and control groups to isolate the effect of the treatment.

2. **Natural Experiments**: Utilize naturally occurring events that mimic random assignment.

3. **Regression Analysis**: Use statistical models to control for confounding variables and isolate the effect of interest.

4. **Instrumental Variables**: Use external variables that affect the independent variable but not the dependent variable directly.

5. **Granger Causality**: Test whether one time series can predict another time series.

### Practical Example Using Regression

```python

import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Example data
data = {
    'Study Hours': [2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Scores': [70, 75, 80, 85, 90, 95, 100, 105, 110]
}

df = pd.DataFrame(data)

X = df['Study Hours']
y = df['Scores']

# Adding a constant for the intercept term
X = sm.add_constant(X)

# Fitting the regression model
model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Summary of the regression model
print(model.summary())

# Plotting
plt.scatter(df['Study Hours'], df['Scores'])
plt.plot(df['Study Hours'], predictions, color='red')
plt.title('Regression: Study Hours vs. Scores')
plt.xlabel('Study Hours')
plt.ylabel('Scores')
plt.show()
```

### Conclusion

- **Correlation** is useful for identifying relationships between variables but does not imply that one variable causes changes in another.

- **Causation** indicates a cause-and-effect relationship and requires more rigorous methods to establish.

- Understanding the distinction and using appropriate methods to infer causation are essential in data science to avoid misleading conclusions and make informed decisions.

Gradient Descent

Gradient Descent is one of the most commonly used optimization algorithms for training machine learning models: it minimizes the error between actual and expected results. Gradient descent is also used to train neural networks.

In mathematical terminology, an optimization algorithm refers to the task of minimizing or maximizing an objective function f(x) parameterized by x. Similarly, in machine learning, optimization is the task of minimizing the cost function parameterized by the model's parameters. The main objective of gradient descent is to minimize the convex function through iterative parameter updates. Once machine learning models are optimized, they can be used as powerful tools for artificial intelligence and various computer science applications.

In this tutorial on Gradient Descent in Machine Learning, we will learn in detail about gradient descent, the role of the cost function as a barometer within machine learning, types of gradient descent, learning rates, and more.
What is Gradient Descent or Steepest Descent?

Gradient descent was first proposed by Augustin-Louis Cauchy in 1847, in the mid-19th century. Gradient descent is one of the most commonly used iterative optimization algorithms in machine learning, used to train machine learning and deep learning models. It helps in finding the local minimum of a function.

The local minimum or local maximum of a function can be approached using the gradient as follows:

o If we move towards the negative gradient, i.e., away from the gradient of the function at the current point, we approach the local minimum of that function.

o If we move towards the positive gradient, i.e., towards the gradient of the function at the current point, we approach the local maximum of that function.

Moving towards the positive gradient is known as gradient ascent; gradient descent itself is also known as steepest descent. The main objective of the gradient descent algorithm is to minimize the cost function through iteration. To achieve this goal, it performs two steps iteratively:

o Calculate the first-order derivative of the function to compute the gradient, or slope, at the current point.

o Move in the direction opposite to the gradient by alpha times the gradient, where alpha is the learning rate: a tuning parameter in the optimization process that decides the length of the steps.

What is a Cost Function?

The cost function is defined as the measurement of the difference, or error, between actual values and predicted values at the current position, expressed as a single real number. It improves machine learning efficiency by providing feedback to the model so that it can minimize the error and find the local or global minimum. The model iterates along the direction of the negative gradient until the cost function approaches its minimum; at that point, the model stops learning further. Although cost function and loss function are often treated as synonymous, there is a minor difference between them: the loss function refers to the error of one training example, while the cost function calculates the average error across an entire training set.

The cost function is calculated after making a hypothesis with initial parameters; these parameters are then modified using the gradient descent algorithm over known data to reduce the cost function.

For simple linear regression, these pieces are:

Hypothesis: h(x) = θ0 + θ1·x

Parameters: θ0 (intercept), θ1 (slope)

Cost function: J(θ0, θ1) = (1 / 2m) · Σ (h(x(i)) − y(i))², summed over the m training examples

Goal: minimize J(θ0, θ1) with respect to θ0 and θ1
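The mean-squared-error cost above can be computed directly. The following sketch (illustrative data, not taken from the text) evaluates J for the linear hypothesis h(x) = θ0 + θ1·x and shows that better parameters give a lower cost:

```python
import numpy as np

def cost(theta0, theta1, x, y):
    predictions = theta0 + theta1 * x           # hypothesis h(x)
    return np.mean((predictions - y) ** 2) / 2  # J(theta0, theta1)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])  # data generated by y = 2x exactly

print(cost(0.0, 2.0, x, y))  # 0.0 -- parameters reproduce the data exactly
print(cost(0.0, 1.0, x, y))  # positive -- worse parameters, higher cost
```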

How does Gradient Descent work?

Before examining the working principle of gradient descent, we should review how to find the slope of a line from linear regression. The equation for simple linear regression is given as:

Y = mX + c

where 'm' represents the slope of the line, and 'c' represents the intercept on the y-axis.

The starting point is an arbitrary point used to evaluate performance. At this starting point, we take the first derivative and use a tangent line to measure the steepness of the slope. This slope informs the updates to the parameters (weights and bias). The slope is steepest at the starting point; as new parameters are generated, the steepness gradually reduces until the algorithm reaches the lowest point, called the point of convergence.

The main objective of gradient descent is to minimize the cost function, i.e., the error between expected and actual values. Minimizing the cost function depends on two factors:

o Direction & Learning Rate

These two factors determine the partial-derivative calculation of each future iteration and steer the algorithm towards the point of convergence, a local minimum or the global minimum. Let's discuss the learning rate in brief:

Learning Rate:

The learning rate is the step size taken to reach the minimum or lowest point. It is typically a small value that is evaluated and updated based on the behavior of the cost function. A high learning rate results in larger steps but risks overshooting the minimum; a low learning rate takes small steps, which compromises overall efficiency but gives the advantage of more precision.
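This trade-off can be seen on a toy function. The sketch below (an illustration, not from the text) minimizes f(x) = x², whose gradient is f'(x) = 2x, from the same starting point using three learning rates:

```python
def gradient_descent(lr, start=5.0, steps=50):
    x = start
    for _ in range(steps):
        x = x - lr * 2 * x  # step against the gradient of f(x) = x**2
    return x

good = gradient_descent(lr=0.1)      # converges close to the minimum at 0
slow = gradient_descent(lr=0.001)    # right direction, but barely moves
diverged = gradient_descent(lr=1.1)  # overshoots: |x| grows every step
print(good, slow, diverged)
```

With lr=0.1 the iterate shrinks by a constant factor each step and lands near 0; with lr=0.001 it is still far from the minimum after 50 steps; with lr=1.1 each step overshoots the minimum and the iterate diverges.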

Types of Gradient Descent

Based on how much training data is used per parameter update, the gradient descent learning algorithm can be divided into batch gradient descent, stochastic gradient descent, and mini-batch gradient descent. Let's understand these different types of gradient descent:

1. Batch Gradient Descent:

Batch gradient descent (BGD) computes the error for each point in the training set and updates the model only after evaluating all training examples. One full pass over the training set is known as a training epoch. In simple words, it is a greedy approach where we sum over all examples for each update.

Advantages of batch gradient descent:

o It produces less noise in comparison to the other gradient descent variants.

o It produces stable gradient descent convergence.

o It is computationally efficient, as all resources are used to process all training samples together.

2. Stochastic Gradient Descent:

Stochastic gradient descent (SGD) is a type of gradient descent that processes one training example per iteration. In other words, it updates the model's parameters for each training example, one at a time. As it requires only one training example at a time, it is easier to fit in allocated memory. However, it loses some computational efficiency compared to batch gradient descent, because its frequent updates require more processing. Due to these frequent updates, the gradient is also noisy; this noise, however, can sometimes help the algorithm escape local minima and find the global minimum.

Advantages of stochastic gradient descent:

In stochastic gradient descent (SGD), learning happens on every example, and it has a few advantages over other forms of gradient descent.

o It is easier to fit in the available memory.

o It is relatively fast to compute compared to batch gradient descent.

o It is more efficient for large datasets.

3. Mini-Batch Gradient Descent:

Mini-batch gradient descent is a combination of batch gradient descent and stochastic gradient descent. It divides the training dataset into small batches and performs an update for each of those batches. Splitting the training dataset into smaller batches strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent. Hence, we achieve a form of gradient descent with higher computational efficiency and a less noisy gradient.

Advantages of mini-batch gradient descent:

o It is easier to fit in allocated memory.

o It is computationally efficient.

o It produces stable gradient descent convergence.
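The three variants above differ only in how many examples feed each update. A minimal sketch for the simple linear model Y = mX + c (illustrative code, not from the text): setting batch_size to the dataset size recovers batch gradient descent, and batch_size = 1 recovers stochastic gradient descent.

```python
import numpy as np

def fit(x, y, lr=0.05, epochs=200, batch_size=4, seed=0):
    rng = np.random.default_rng(seed)
    m, c = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        order = rng.permutation(n)  # shuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = m * x[idx] + c - y[idx]
            m -= lr * 2 * np.mean(error * x[idx])  # dJ/dm on this mini-batch
            c -= lr * 2 * np.mean(error)           # dJ/dc on this mini-batch
    return m, c

x = np.linspace(0, 1, 20)
y = 3 * x + 1  # true slope 3, intercept 1
m, c = fit(x, y)
print(m, c)  # close to 3 and 1
```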

Challenges with Gradient Descent

Although gradient descent is one of the most popular methods for optimization problems, it still has some challenges:

1. Local Minima and Saddle Points:

For convex problems, gradient descent can find the global minimum easily, while for non-convex problems it is sometimes difficult to find the global minimum, where machine learning models achieve the best results.

Whenever the slope of the cost function is at or close to zero, the model stops learning. Apart from the global minimum, two other scenarios can produce this zero slope: saddle points and local minima. A local minimum has a shape similar to the global minimum, where the slope of the cost function increases on both sides of the current point.

A saddle point, in contrast, is a local maximum along one direction and a local minimum along another, so the gradient is zero without the point being a minimum overall. The name comes from the shape of a horse's saddle.

A local minimum is so called because the value of the loss function is minimal at that point within a local region. In contrast, the global minimum is so called because the value of the loss function is minimal globally, across the entire domain of the loss function.

2. Vanishing and Exploding Gradients

In a deep neural network trained with gradient descent and backpropagation, two more issues can occur besides local minima and saddle points.

Vanishing Gradients:

A vanishing gradient occurs when the gradient is smaller than expected. During backpropagation, the gradient becomes progressively smaller, so the earlier layers of the network learn more slowly than the later layers. When this happens, the weight updates become insignificant and the earlier layers effectively stop learning.

Exploding Gradient:

The exploding gradient is the opposite of the vanishing gradient: it occurs when the gradient is too large, which creates an unstable model. In this scenario, the model weights grow too large and are eventually represented as NaN. This problem can be mitigated with techniques such as gradient clipping, which caps the magnitude of the gradient.
Gradient Descent (extra)

What is a Gradient?

A gradient is a derivative: it describes how the output of a function changes in response to a small variation in its inputs.
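This definition can be checked numerically. The sketch below (illustrative, not from the text) approximates the derivative with a central finite difference:

```python
def numerical_gradient(f, x, h=1e-6):
    # central difference: how much f changes for a small change in x
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 2  # analytic derivative: f'(x) = 2x
g = numerical_gradient(f, 3.0)
print(g)  # close to 6.0
```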

What is Gradient Descent?

Gradient Descent stands as a cornerstone orchestrating the intricate dance of model optimization. At its core, it is a numerical optimization algorithm that aims to find the optimal parameters, the weights and biases, of a neural network by minimizing a defined cost function.

Gradient Descent (GD) is a widely used optimization algorithm in machine learning and deep learning that minimizes the cost function of a neural network model during training. It works by iteratively adjusting the weights or parameters of the model in the direction of the negative gradient of the cost function until the minimum of the cost function is reached.

The learning happens during backpropagation while training the neural-network-based model. Gradient descent optimizes the weights and biases based on the cost function, which evaluates the difference between the actual and predicted outputs.

Gradient Descent is a fundamental optimization algorithm in machine learning used to minimize the cost or loss function during model training.

 It iteratively adjusts model parameters by moving in the direction of the steepest decrease in the cost function.

 The algorithm calculates gradients, representing the partial derivatives of the cost function with respect to each parameter.

These gradients guide the updates, ensuring convergence towards the optimal parameter values that yield the lowest possible cost.

Gradient Descent is versatile and applicable to various machine learning models, including linear regression and neural networks. It navigates the parameter space efficiently, enabling models to learn patterns and make accurate predictions. Adjusting the learning rate is crucial to balance convergence speed against the risk of overshooting the optimal solution.

Gradient Descent Python Implementation

Diving further into the concept, let's understand it in depth with a practical implementation.

Import the necessary libraries

 Python3

import torch

import torch.nn as nn
import matplotlib.pyplot as plt

Set the input and output data

 Python3

# set random seed for reproducibility
torch.manual_seed(42)

# set number of samples
num_samples = 1000

# create random features with 2 dimensions
x = torch.randn(num_samples, 2)

# create random weights and bias for the linear regression model
true_weights = torch.tensor([1.3, -1])
true_bias = torch.tensor([-3.5])

# Target variable (no transpose needed: true_weights is a 1-D tensor)
y = x @ true_weights + true_bias

# Plot the dataset
fig, ax = plt.subplots(1, 2, sharey=True)
ax[0].scatter(x[:, 0], y)
ax[1].scatter(x[:, 1], y)
ax[0].set_xlabel('X1')
ax[0].set_ylabel('Y')
ax[1].set_xlabel('X2')
ax[1].set_ylabel('Y')
plt.show()

Output:

Two scatter plots: Y against X1 (left panel) and Y against X2 (right panel).

Let’s first try with a linear model:

 Python3

# Define the model
class LinearRegression(nn.Module):
    def __init__(self, input_size, output_size):
        super(LinearRegression, self).__init__()
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, x):
        out = self.linear(x)
        return out

# Define the input and output dimensions
input_size = x.shape[1]
output_size = 1

# Instantiate the model
model = LinearRegression(input_size, output_size)

Note:

The number of weight values equals the input size of the model, and in deep learning the input size is the number of independent input features we feed into the model.

In our case there are two input features, so the input size is two, and there are two corresponding weight values.

We can manually set the model parameter

 Python3

# create a random weight & bias tensor
weight = torch.randn(1, input_size)
bias = torch.rand(1)

# create nn.Parameter objects from the weight & bias tensors
weight_param = nn.Parameter(weight)
bias_param = nn.Parameter(bias)

# assign the weight & bias parameters to the linear layer
model.linear.weight = weight_param
model.linear.bias = bias_param

weight, bias = model.parameters()
print('Weight :', weight)
print('bias :', bias)

Output:

Weight : Parameter containing:

tensor([[-0.3239, 0.5574]], requires_grad=True)

bias : Parameter containing:

tensor([0.5710], requires_grad=True)

Prediction

 Python3

y_p = model(x)

y_p[:5]

Output:

tensor([[ 0.7760],

[-0.8944],

[-0.3369],

[-0.3095],

[ 1.7338]], grad_fn=<SliceBackward0>)
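The untrained model's predictions above are far from the targets. A minimal sketch of what a training loop could look like, using plain gradient descent with manual parameter updates (this continuation is an assumption, not part of the original listing; it recreates the same synthetic data so the snippet is self-contained):

```python
import torch
import torch.nn as nn

# Recreate the synthetic dataset from earlier
torch.manual_seed(42)
x = torch.randn(1000, 2)
true_weights = torch.tensor([1.3, -1.0])
true_bias = torch.tensor([-3.5])
y = (x @ true_weights + true_bias).unsqueeze(1)

model = nn.Linear(2, 1)
loss_fn = nn.MSELoss()
lr = 0.1

for epoch in range(200):
    loss = loss_fn(model(x), y)
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad  # gradient descent step: move against the gradient

print(model.weight.data, model.bias.data)  # approach [1.3, -1.0] and -3.5
print(loss.item())
```

In practice the manual update loop is usually replaced by torch.optim.SGD, which performs the same parameter update.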
