Data Visualization


Data Visualization
• Data visualization lets us interpret data quickly and
adjust different variables to see their effect.

• It lets us observe existing patterns.

• It helps identify extreme values that could be anomalies.
Why Data Visualization Matters
• Visual perception vs. numerical data: humans are better
at processing visual information.

• Communicating insights effectively to stakeholders.

• Making data-driven decisions.



• Python provides several libraries for visualizing data.

• Each library comes with different features and supports
various types of graphs. Four such Python libraries:

• Matplotlib
• Seaborn
• ggplot
• Plotly

• Matplotlib is a 2D plotting library.

• It is an easy-to-use, low-level data visualization library that
uses NumPy arrays and other extension code to perform well
even on large arrays.
• It supports various plots such as scatter plots, line plots,
histograms, etc.

Matplotlib workflow
Types of Data Visualizations
• Charts and Graphs:
– Bar charts
– Line charts
– Pie charts
– Scatter plots
– Histograms

• Interactive Visualizations:
– Dashboards
– Interactive plots
– Drill-down charts

• Geospatial Visualizations:
– Maps
– Choropleth maps
– Heatmaps

• Advanced Visualizations:
– Tree maps
– Sankey diagrams
– Word clouds
– Parallel coordinates
Bar Chart
• A bar chart represents categories of data with rectangular
bars whose lengths or heights are proportional to the values
they represent.
• It is used to show the frequency distribution of categorical
variables.
• It can be created using the bar() method.
Horizontal bar chart
• Used when there are more than seven categories.
• Used to show rankings (e.g., election results) and performance.
Stacked bar chart
• These are used to show part-to-whole relationships, i.e., how
much the different elements contribute to the total.
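One way to draw a stacked bar chart in Matplotlib is to pass the bottom argument to bar(), so that the second series starts where the first one ends. The sketch below is only an illustration; the store names and sales figures are made up for the example.

```python
import matplotlib.pyplot as plt

# Hypothetical data: cookie sales (units) in three stores
stores = ["Store A", "Store B", "Store C"]
almond = [10, 15, 7]
cashew = [5, 8, 12]

fig, ax = plt.subplots()
# First layer sits on the x-axis
bottom_bars = ax.bar(stores, almond, label="almond cookies")
# Second layer starts where the first layer ends
top_bars = ax.bar(stores, cashew, bottom=almond, label="cashew cookies")
ax.set(title="Cookie sales by store", ylabel="Units sold")
ax.legend()
plt.show()
```

The total height of each stacked bar is the sum of the two series, which is what makes the part-to-whole comparison visible.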
Histogram
• A histogram is a type of bar chart in which the X-axis
represents the bin ranges and the Y-axis (bar height) gives
the frequency in each bin.

• It shows the frequency distribution of continuous data,
e.g., time, age, weight.

• The hist() function is used to compute and draw a
histogram. If we pass categorical data, it automatically
computes the frequency of each value, i.e., how often each
value occurred.
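As a minimal sketch of hist(), the example below bins a continuous variable into 10 ranges. The age data is randomly generated for illustration and is not taken from the slides.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sample: 200 ages drawn from a normal distribution
rng = np.random.default_rng(seed=0)
ages = rng.normal(loc=35, scale=10, size=200)

fig, ax = plt.subplots()
# hist() returns the bin counts, the bin edges, and the drawn bars
counts, bin_edges, bars = ax.hist(ages, bins=10, edgecolor="black")
ax.set(title="Age distribution", xlabel="Age (years)", ylabel="Frequency")
plt.show()
```

With 10 bins there are 11 bin edges, and the counts add up to the number of observations.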
Line Chart
• A line chart represents the relationship between two
variables, X and Y, plotted on different axes.

• E.g., stock market price changes over time.

• It is plotted using the plot() function.
Pie chart
• The pie chart is usually the least-used chart in data
analysis.
Scatter plots
• A scatter plot is a set of points that represent the values of
two different variables plotted on the horizontal and vertical
axes.

• It is used to observe the relationship or correlation between
two numerical variables, with each dot representing one
observation.

• It can also show clustering trends or outliers.

• The scatter() method in the matplotlib library is used to
draw a scatter plot.
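To make the link between a scatter plot and correlation concrete, the sketch below plots two made-up data sets, one positively and one negatively correlated, and reports the Pearson correlation coefficient computed with np.corrcoef(). The data and variable names are assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)
x = rng.uniform(0, 10, size=100)
noise = rng.normal(0, 1, size=100)
y_pos = 2 * x + noise    # tends to rise as x rises
y_neg = -2 * x + noise   # tends to fall as x rises

# Pearson correlation coefficient, always between -1 and +1
r_pos = np.corrcoef(x, y_pos)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 4))
ax1.scatter(x, y_pos)
ax1.set(title=f"Positive correlation (r = {r_pos:.2f})")
ax2.scatter(x, y_neg)
ax2.set(title=f"Negative correlation (r = {r_neg:.2f})")
plt.show()
```

A coefficient close to +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or no linear relationship.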
Types of correlation
Comparisons of graphs
Line chart
import matplotlib.pyplot as plt

# initializing the data
x = [1, 2, 3, 4]
y = [20, 30, 40, 50]

# plotting the data
plt.plot(x, y)  # pyplot (state-based) way of plotting

# Adding the title
plt.title("Simple Line")

# Adding the labels
plt.ylabel("y-axis")
plt.xlabel("x-axis")
plt.show()
Another way (object-oriented interface)

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [11, 22, 33, 44]
# setup plot
fig, ax = plt.subplots(figsize=(10, 10))  # width, height
# plot data
ax.plot(x, y)
# customize plot
ax.set(title="Simple plot",
       xlabel="X-axis",
       ylabel="Y-axis")
# save the figure (the images/ folder must already exist)
fig.savefig("images/sample-plot.png")
Scatter plot code
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)  # create some data
fig, ax = plt.subplots()  # add figure and axes
ax.scatter(x, np.sin(x), c='red')
plt.show()
Make a bar plot from dictionary
import matplotlib.pyplot as plt

# category names are the keys, prices are the values
cookies_price = {"almond cookies": 10,
                 "cashew cookies": 15,
                 "plain cookies": 5}
fig, ax = plt.subplots()  # add figure and axes
ax.bar(list(cookies_price.keys()), list(cookies_price.values()))
ax.set(title="Cookies store",
       ylabel="Price ($)")
Horizontal bar plot
(used when more than 7 categories need to be compared)
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [11, 22, 33, 44]
# setup plot
fig, ax = plt.subplots(figsize=(10, 10))  # width, height
# plot data
ax.barh(x, y)
# customize plot
ax.set(title="Simple plot",
       xlabel="X-axis",
       ylabel="Y-axis")
# save the figure (the images/ folder must already exist)
fig.savefig("images/sample-plot.png")
Horizontal bar plot with dictionary

import matplotlib.pyplot as plt

cookies_price = {"almond cookies": 10,
                 "cashew cookies": 15,
                 "plain cookies": 5}
fig, ax = plt.subplots()  # add figure and axes
ax.barh(list(cookies_price.keys()), list(cookies_price.values()))
# in a horizontal bar plot the values run along the x-axis
ax.set(title="Cookies store",
       xlabel="Price ($)")
Subplot
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)  # create some data
cookies_price = {"almond cookies": 10,
                 "cashew cookies": 15,
                 "plain cookies": 5}

# add figure and a 2x2 grid of axes
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, figsize=(10, 5))
ax1.plot(x, x/2)
ax2.scatter(np.random.random(10), np.random.random(10))
ax3.bar(list(cookies_price.keys()), list(cookies_price.values()))
ax4.hist(np.random.randn(1000))
ax1.set(title="Simple Plot", xlabel="X-axis", ylabel="Y-axis")
Output graph
Box or whisker plot
• A box plot is a graphical representation of the distribution
of a dataset.
• It displays key summary statistics such as the median,
quartiles, and potential outliers in a concise, visual manner,
which makes it easy to compare different datasets.
• Box plots help identify the central value (median) of the data,
how dispersed the data is, and whether the data is skewed
(skewness).
Data distribution and skewness

A box plot consists of five things:

• Minimum
• First quartile (Q1), or 25%
(the (n+1)/4 th term for an odd number of data points)
• Median (second quartile, Q2), or 50%
• Third quartile (Q3), or 75%
(the 3(n+1)/4 th term for an odd number of data points)
• Maximum

Q. Create a box plot for 12 values: 10, 12, 11, 15, 11, 14, 13, 17, 12,
22, 14, 11.
Sol. Arrange the values in ascending order:
10, 11, 11, 11, 12, 12, 13, 14, 14, 15, 17, 22
Median (Q2) = (12+13)/2 = 12.5 (mean of the two middle values, since the number of values is even)
Q1 = (11+11)/2 = 11 (median of the first 6 values)
Q3 = (14+15)/2 = 14.5 (median of the next 6 values)
IQR (interquartile range) = Q3-Q1 = 14.5-11 = 3.5
Lower limit = Q1-1.5*IQR = 11-1.5*3.5 = 5.75
Upper limit = Q3+1.5*IQR = 14.5+1.5*3.5 = 19.75
The whiskers cover the values within [5.75, 19.75], so the minimum is 10 and the
maximum is 17; 22 lies above the upper limit and is an outlier.
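The hand calculation above can be checked with a few lines of Python. This is a sketch of the median-of-halves method used in the solution, and it assumes an even number of values; the variable names are chosen for illustration.

```python
from statistics import median

values = [10, 12, 11, 15, 11, 14, 13, 17, 12, 22, 14, 11]
data = sorted(values)
half = len(data) // 2        # 12 values, so two halves of 6

q2 = median(data)            # median of the full data set
q1 = median(data[:half])     # median of the lower half
q3 = median(data[half:])     # median of the upper half
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Values outside the limits are outliers; the whiskers end at the
# smallest and largest values that lie inside the limits
outliers = [v for v in data if v < lower_limit or v > upper_limit]
inside = [v for v in data if lower_limit <= v <= upper_limit]

print("Q1, Q2, Q3, IQR:", q1, q2, q3, iqr)
print("Limits:", lower_limit, upper_limit)
print("Whiskers:", min(inside), max(inside), "Outliers:", outliers)
```

Library functions such as NumPy's percentile() interpolate between data points by default, so the quartiles drawn by a plotting library can differ slightly from this hand calculation.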
Box plot
import matplotlib.pyplot as plt
import numpy as np
y=[10, 11, 11, 11, 12, 12, 13, 14, 14, 15, 17, 22]
plt.boxplot(y)
plt.show()
Box plot with random data set
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(2)
x=np.random.normal(70,10,200)
plt.boxplot(x)
plt.show()
Plotting functions with Pandas
Call plot() on the DataFrame and pass the plot
type as the kind parameter:
• df.plot(kind='bar')
• df.plot(kind='scatter') (a scatter plot also needs the x and y column names)

• pandas.DataFrame.plot - pandas 2.2.1 documentation (pydata.org)
Plotting functions with Pandas
import pandas as pd
import matplotlib.pyplot as plt

data_dict = {'name': ['s1', 's2', 's3', 's4', 's5', 's6'],
             'age': [20, 20, 21, 20, 21, 20],
             'math_marks': [100, 90, 91, 98, 92, 95],
             'physics_marks': [90, 100, 91, 92, 98, 95],
             'chem_marks': [93, 89, 99, 92, 94, 92]
             }
df = pd.DataFrame(data_dict)
df.head()
# bar plot of chemistry marks for each student
df.plot(kind='bar',
        x='name',
        y='chem_marks',
        color='blue')
plt.title('Bar Plot')
plt.show()
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("C:/Users/HP/Desktop/monthsales.csv")
df.head()
# plot data
df.plot(kind='line', color=['red', 'blue', 'brown', 'yellow'])  # pandas plot function
# title(), xlabel() and ylabel() are functions, so call them instead of assigning to them
plt.title("Year-wise sales")
plt.xlabel("Months")
plt.ylabel("Sales")
plt.show()
Time series Analysis
• A time series is a series of data points indexed in time
order. The time order can be daily, monthly, or even yearly.
• Equivalently, a time series is a set of observations taken at
specified times, usually at equal intervals.
• Time series forecasting is the process of using a
statistical model to predict future values of a time
series based on previously observed values.
Components of time series
• Trend (T): shows the general direction of the time series data
over a long period of time. A trend can be increasing (upward),
decreasing (downward), or horizontal (stationary).
• Seasonality (S): a pattern that repeats with regular timing,
direction, and magnitude.
• Cycle (C): fluctuations that repeat but have no fixed pattern and
can occur at any time, so they are harder to predict.
• Irregularity (I) (noise or residual): the fluctuations in the time
series data that become evident once the trend and cyclical
variations are removed. These variations are unpredictable,
erratic, and may or may not be random.
Example of a Time Series that illustrates the number of passengers of an airline
per month from the year 1949 to 1960. (seasonal data)
When not to apply time series

• When the values are constant

• When the values are in the form of functions

Types of Time Series Data

Continuous Time Series Data

Discrete Time Series Data


Decomposition
• TSI decomposition is used to separate the different
components of a time series.

• TSI stands for Trend, Seasonality, and Irregularity.

• The decomposition model can be additive or multiplicative.

Additive model: Y = T + S + I (when the seasonal variation S is constant)

Multiplicative model: Y = T * S * I (when the seasonal variation S changes with the level of the series)
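The difference between the two models can be seen by building a synthetic series each way. The sketch below is an illustration under assumed values (a linear trend, a 12-month sine-shaped season, small random noise); none of the numbers come from the slides.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
t = np.arange(48)                      # 4 years of monthly data
trend = 100 + 2.0 * t                  # T: steadily rising level

# Additive model: Y = T + S + I, the seasonal swing has a fixed size
season_add = 10 * np.sin(2 * np.pi * t / 12)
noise_add = rng.normal(0, 0.5, t.size)
y_add = trend + season_add + noise_add

# Multiplicative model: Y = T * S * I, the swing is a proportion of the level
season_mul = 1 + 0.1 * np.sin(2 * np.pi * t / 12)
noise_mul = 1 + rng.normal(0, 0.005, t.size)
y_mul = trend * season_mul * noise_mul

def yearly_swing(y):
    """Peak-to-trough size of the detrended series in each year."""
    detrended = (y - trend).reshape(-1, 12)
    return detrended.max(axis=1) - detrended.min(axis=1)

swing_add = yearly_swing(y_add)
swing_mul = yearly_swing(y_mul)
print("additive swing per year:      ", swing_add.round(1))
print("multiplicative swing per year:", swing_mul.round(1))
```

In the additive series the yearly swing stays roughly the same, while in the multiplicative series it grows as the trend rises, which is the usual cue for choosing between the two models.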
Stationarity
• A time series is stationary when it shows a particular behavior
over time, so there is a very high probability that it will follow
the same behavior in the future.
• A stationary series has a constant mean, a constant variance, and
an autocovariance that does not depend on time.

• Tests to check stationarity:

Rolling statistics (plotting the moving average and moving
variance) and the Augmented Dickey-Fuller (ADF) test, whose null
hypothesis is that the series is non-stationary.
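The rolling-statistics check can be sketched with pandas rolling(). The example below compares white noise (stationary) with a random walk (non-stationary); both series are synthetic, and the window size of 50 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
steps = rng.normal(0, 1, size=500)

white_noise = pd.Series(steps)            # stationary: constant mean and variance
random_walk = pd.Series(steps.cumsum())   # non-stationary: the level drifts over time

window = 50
noise_mean = white_noise.rolling(window).mean()
walk_mean = random_walk.rolling(window).mean()
noise_var = white_noise.rolling(window).var()
walk_var = random_walk.rolling(window).var()

# How far the rolling mean wanders over the whole series
noise_spread = noise_mean.max() - noise_mean.min()
walk_spread = walk_mean.max() - walk_mean.min()
print(f"rolling-mean spread, white noise: {noise_spread:.2f}")
print(f"rolling-mean spread, random walk: {walk_spread:.2f}")
```

For a stationary series the rolling mean and variance stay roughly flat when plotted (for example with noise_mean.plot()), whereas for the random walk the rolling mean wanders far from its starting level.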
Non stationary process
Time series plot: Example : 1
import matplotlib.pyplot as plt
import pandas as pd
dates = pd.date_range('2024-01-01', periods=10) # Sample time series data
values = [5, 7, 8, 9, 6, 3, 5, 8, 7, 6]
# Creating a pandas DataFrame
df = pd.DataFrame(values, index=dates, columns=['Value'])
# Plotting the time series
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Value'], marker='o', linestyle='-')
plt.title('TS Plot')
plt.xlabel('Date')
plt.ylabel('Value')
plt.grid(True)
plt.show()
Time series plot: Example : 2
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.stattools import adfuller
df = pd.read_csv("C:/Users/HP/Desktop/stock_data.csv", parse_dates=True,
index_col="Date")
# displaying the first five rows of dataset
df.head()
df.drop(columns='Unnamed: 0', inplace =True)
df.head()
sns.set(style="whitegrid") # Setting the style to whitegrid for a clean background
plt.figure(figsize=(12, 6)) # Setting the figure size
sns.lineplot(data=df, x='Date', y='High', label='High Price', color='blue')
plt.xlabel('Date')
plt.ylabel('High')
plt.title('Share Highest Price Over Time')
plt.show()
Output graph
References
• Matplotlib documentation - Matplotlib 3.8.3 documentation
• pandas.DataFrame.plot - pandas 2.2.1 documentation (pydata.org)
• Matplotlib Tutorial - GeeksforGeeks
• NPTEL :: Computer Science and Engineering - NOC:Python for Data Science
