Data Visualization
• Data visualization allows us to quickly interpret the data
and adjust different variables to see their effect.
• Common plotting libraries:
– Matplotlib
– seaborn
– ggplot
– plotly
• Matplotlib is a 2D plotting library.
Matplotlib workflow
Types of Data Visualizations
• Charts and graphs:
– Bar charts
– Line charts
– Pie charts
– Scatter plots
– Histograms
• Interactive visualizations:
– Dashboards
– Interactive plots
– Drill-down charts
Box plot
A box plot summarizes a data set with five numbers:
• Minimum
• First quartile (Q1), the 25% point
(the (n+1)/4-th term of the sorted data, for an odd number of data points)
• Median (second quartile, Q2), the 50% point
• Third quartile (Q3), the 75% point
(the 3(n+1)/4-th term of the sorted data, for an odd number of data points)
• Maximum
Q. Create a box plot for the 12 values 10, 12, 11, 15, 11, 14, 13, 17, 12,
22, 14, 11.
Sol. Arrange the values in ascending order:
10, 11, 11, 11, 12, 12, 13, 14, 14, 15, 17, 22
Median (Q2) = (12+13)/2 = 12.5 (average of the 6th and 7th values, since the count is even)
Q1 = (11+11)/2 = 11 (median of the first 6 values)
Q3 = (14+15)/2 = 14.5 (median of the last 6 values)
IQR (interquartile range) = Q3 - Q1 = 14.5 - 11 = 3.5
Lower limit = Q1 - 1.5*IQR = 11 - 1.5*3.5 = 5.75
Upper limit = Q3 + 1.5*IQR = 14.5 + 1.5*3.5 = 19.75
The whiskers extend to the smallest and largest values within [5.75, 19.75], that is 10 and 17.
The value 22 lies above the upper limit, so it is plotted as an outlier.
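The arithmetic in the solution above can be checked with a few lines of plain Python. This is only a sketch of the median-of-halves method used in the worked example; the variable names are illustrative and not from the slides.

```python
from statistics import median

values = [10, 12, 11, 15, 11, 14, 13, 17, 12, 22, 14, 11]
data = sorted(values)
n = len(data)

# Median-of-halves method: Q1 is the median of the lower half and
# Q3 the median of the upper half (n is even here, so each half has n // 2 values).
half = n // 2
q1 = median(data[:half])
q2 = median(data)
q3 = median(data[-half:])

iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme values that still lie inside the limits;
# anything outside the limits is treated as an outlier.
inside = [v for v in data if lower_limit <= v <= upper_limit]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [v for v in data if v < lower_limit or v > upper_limit]

print("Q1, Q2, Q3, IQR:", q1, q2, q3, iqr)
print("Limits:", lower_limit, upper_limit)
print("Whiskers:", whisker_low, whisker_high, "Outliers:", outliers)
```

Plotting libraries may compute quartiles by interpolation rather than by medians of halves, so the box edges they draw can differ slightly from the hand-calculated values.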
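Box plot with Matplotlib
import matplotlib.pyplot as plt

# The 12 values from the worked example (boxplot does not need them sorted)
y = [10, 11, 11, 11, 12, 12, 13, 14, 14, 15, 17, 22]
plt.boxplot(y)  # 22 lies beyond the upper limit and is drawn as an outlier point
plt.show()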
Box plot with random data set
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(2)                  # fix the seed so the plot is reproducible
x = np.random.normal(70, 10, 200)  # 200 samples, mean 70, standard deviation 10
plt.boxplot(x)
plt.show()
Plotting functions with Pandas
Call the plot() with data frame and Pass the plot
type as parameter kind
• df.plot(kind='bar')
• df.plot(kind=‘scatter')
• pandas.DataFrame.plot
— pandas 2.2.1 documentation (pydata.org)
import pandas as pd
import matplotlib.pyplot as plt

data_dict = {'name': ['s1', 's2', 's3', 's4', 's5', 's6'],
             'age': [20, 20, 21, 20, 21, 20],
             'math_marks': [100, 90, 91, 98, 92, 95],
             'physics_marks': [90, 100, 91, 92, 98, 95],
             'chem_marks': [93, 89, 99, 92, 94, 92]
             }
df = pd.DataFrame(data_dict)
df.head()
df.plot(kind='bar',
        x='name',
        y='chem_marks',
        color='blue')
plt.title('Bar Plot')
plt.show()
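The same plot() call handles the other chart kinds. As a hedged sketch, a scatter plot of two of the mark columns from the bar-plot example could look like this; the choice of columns and the title are illustrative only.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Two of the mark columns from the bar-plot example
df = pd.DataFrame({
    'math_marks': [100, 90, 91, 98, 92, 95],
    'physics_marks': [90, 100, 91, 92, 98, 95],
})

# kind='scatter' needs both an x and a y column name
ax = df.plot(kind='scatter', x='math_marks', y='physics_marks', color='blue')
ax.set_title('Math vs. Physics marks')
plt.show()
```

df.plot() returns the Matplotlib Axes it drew on, so titles and labels can be set either on that object or through the plt functions.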
import matplotlib.pyplot as plt
import pandas as pd

# Load the monthly sales data
df = pd.read_csv("C:/Users/HP/Desktop/monthsales.csv")
df.head()

# Plot the data with the pandas plot function
df.plot(kind='line', color=['red', 'blue', 'brown', 'yellow'])
# Call title/xlabel/ylabel as functions; assigning to them would overwrite them
plt.title("Year-wise sales")
plt.xlabel("Months")
plt.ylabel("Sales")
plt.show()
Time series Analysis
• A time series is a series of data points indexed in time
order. The interval can be daily, monthly, or even yearly.
• Equivalently, a time series is a set of observations taken
at specified times, usually at equal intervals.
• Time series forecasting is the process of using a
statistical model to predict future values of a time
series from previously observed values.
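To make the idea of forecasting from previously observed values concrete, here is a minimal sketch of a moving-average forecast. It is only one very simple baseline, not a method named in these notes; the function name, the window size, and the data are made up for illustration.

```python
# Hypothetical observations, oldest first (made-up numbers)
observed = [20, 22, 21, 24, 26, 25]

def moving_average_forecast(series, window=3, steps=1):
    """Forecast `steps` future values; each one is the mean of the last `window` values."""
    history = list(series)
    forecasts = []
    for _ in range(steps):
        next_value = sum(history[-window:]) / window
        forecasts.append(next_value)
        history.append(next_value)  # reuse the forecast for the following step
    return forecasts

forecasts = moving_average_forecast(observed, window=3, steps=2)
print(forecasts)
```

Each forecast depends only on the most recent observations, which is why this baseline reacts slowly to trend and ignores seasonality.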
Components of time series
• Trend (T): the general direction of the time series data
over a long period of time. A trend can be increasing (upward),
decreasing (downward), or horizontal (stationary).
• Seasonality (S): a pattern that repeats regularly with respect
to timing, direction, and magnitude.
• Cycle (C): rises and falls that recur but have no fixed pattern
and can occur at any time, so they are harder to predict.
• Irregularity (I), also called noise or residual: the fluctuations in
the time series data that become evident when the trend and
cyclical variations are removed. These variations are
unpredictable and erratic, and may or may not be random.
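The components above can be separated numerically. The sketch below assumes an additive model (series = trend + seasonality + irregularity), which the notes do not specify, folds any cycle into the trend, and uses a synthetic series with a made-up period of 4. It estimates the trend with a centered moving average, the seasonal effect by averaging the detrended values at each position in the period, and treats the remainder as the irregular part.

```python
period = 4
true_seasonal = [2.0, -1.0, -3.0, 2.0]  # synthetic pattern, sums to 0 over one period
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3,
         -0.3, 0.2, -0.1, 0.1, 0.4, -0.2, 0.0, -0.3]  # hand-picked irregular values
# Synthetic series: linear trend + repeating seasonal pattern + irregular part
series = [10 + 0.5 * t + true_seasonal[t % period] + noise[t] for t in range(16)]

def centered_moving_average(y, period):
    """Centered moving average for an even period; None where the window does not fit."""
    half = period // 2
    trend = [None] * len(y)
    for t in range(half, len(y) - half):
        window = y[t - half: t + half + 1]
        # The two end points get half weight so the window covers exactly one period
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period
    return trend

trend = centered_moving_average(series, period)

# Seasonal effect: average detrended value at each position within the period
buckets = [[] for _ in range(period)]
for t, (value, tr) in enumerate(zip(series, trend)):
    if tr is not None:
        buckets[t % period].append(value - tr)
seasonal_raw = [sum(b) / len(b) for b in buckets]
mean_effect = sum(seasonal_raw) / period
seasonal = [s - mean_effect for s in seasonal_raw]  # centre so the effects sum to 0

# Irregular part: what is left after removing trend and seasonality
residual = [None if tr is None else value - tr - seasonal[t % period]
            for t, (value, tr) in enumerate(zip(series, trend))]

print("Estimated seasonal effects:", [round(s, 2) for s in seasonal])
```

The first and last two points have no trend estimate because the averaging window does not fit there, so the residual is also undefined at those positions.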
Example: a time series of the number of airline passengers per month
from 1949 to 1960 (seasonal data).
When not to apply time series