DSBDAL - Assignment No 9
No: 9
Prerequisite:
1. Basics of Python Programming
2. Seaborn Library, Concept of Data Visualization.
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Exploratory Data Analysis
2. Univariate Analysis
---------------------------------------------------------------------------------------------------------------
There are various techniques for understanding data. The basic requirements are knowledge of
NumPy for mathematical operations and Pandas for data manipulation. The Titanic dataset is used
throughout. To demonstrate some of the techniques, we also use the tips dataset built into
seaborn, which records the tips each waiter receives from different customers.
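import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Titanic dataset (expects titanic_train.csv in the working directory)
data = pd.read_csv("titanic_train.csv")
# tips dataset bundled with seaborn
tips = sns.load_dataset("tips")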
Univariate Analysis
Univariate analysis is the simplest form of analysis, in which we explore a single variable at
a time. It is performed to describe the data better. We analyze numerical and categorical
variables differently because each kind of variable calls for different plots.
Categorical Data:
A variable that holds text-based (label) information is referred to as a categorical variable.
The following plots can be used to visualize categorical data.
1) CountPlot:
A countplot is a frequency plot in the form of a bar graph: it plots the count of each
category as a separate bar. It is the visual form of the pandas value_counts function applied
to a column. In our data the target variable is Survived, and it is categorical, so we plot a
countplot of it.
sns.countplot(x=data['Survived'])
plt.show()
OUTPUT:
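The link between a countplot and value_counts can be checked directly. The following is a minimal sketch that uses a small hypothetical sample (an assumption, not rows from the Titanic file) in place of the Survived column:

```python
import pandas as pd

# Hypothetical sample standing in for the Titanic 'Survived' column
sample = pd.DataFrame({"Survived": [0, 1, 1, 0, 0, 1, 0, 0]})

# value_counts() returns the frequency of each category;
# these are the bar heights a countplot of the same column would draw
counts = sample["Survived"].value_counts()
print(counts)
```

Running the same call on data['Survived'] should give the heights of the two bars in the plot above.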
2) Pie Chart:
The pie chart shows the same information as the countplot, with the addition of the percentage
share of each category, that is, how much weight each category carries in the data. Here we
check the Sex column to see what percentage of the passengers are male and what percentage are
female.
data['Sex'].value_counts().plot(kind="pie", autopct="%.2f")
plt.show()
OUTPUT:
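The percentages printed on the pie slices can also be computed without plotting. This is a minimal sketch on a hypothetical four-row sample (an assumption, not the Titanic data), using the normalize option of value_counts:

```python
import pandas as pd

# Hypothetical sample standing in for the Titanic 'Sex' column
sample = pd.DataFrame({"Sex": ["male", "female", "male", "male"]})

# normalize=True returns fractions instead of counts; multiply by 100 for percent
shares = sample["Sex"].value_counts(normalize=True) * 100
print(shares.round(2))
```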
Numerical Data:
1) Histogram:
plt.hist(data['Age'], bins=5)
plt.show()
OUTPUT:
2) Distplot:
Distplot is sometimes called the second histogram because it is a slightly improved version of
the histogram. It overlays a KDE (Kernel Density Estimate) on the histogram. The KDE is a
smooth estimate of the PDF (Probability Density Function), which describes how likely the
values of the column are to fall in each range.
# distplot is deprecated since seaborn 0.11; histplot with kde=True draws the same kind of plot
sns.histplot(x=data['Age'], kde=True, stat="density")
plt.show()
OUTPUT:
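To make the idea of a kernel density estimate concrete, the sketch below builds one by hand with NumPy. The function name gaussian_kde_1d, the sample ages, and the bandwidth of 5 are all illustrative assumptions; seaborn chooses its own bandwidth internally, so its curve will not match this one exactly.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance between every grid point and every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Hypothetical ages, not taken from the Titanic file
ages = np.array([4.0, 19.0, 22.0, 24.0, 28.0, 30.0, 35.0, 41.0, 54.0, 62.0])
grid = np.linspace(-40.0, 110.0, 1501)
density = gaussian_kde_1d(ages, grid, bandwidth=5.0)

# The curve is a density: the total area under it is approximately 1, and the
# probability of an age range is the area under the curve over that range
step = grid[1] - grid[0]
area = density.sum() * step
print(round(area, 3))
```

A larger bandwidth gives a smoother curve and a smaller one follows the individual samples more closely.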
3) Boxplot:
A boxplot plots the five-number summary of a variable: minimum, first quartile (Q1), median,
third quartile (Q3), and maximum. A few terms are needed to describe it.
• Percentile – The p-th percentile is the value below which p percent of the observations
fall. For example, if 50 values lie below the 25th percentile, those 50 values make up 25
percent of the data.
• Minimum and Maximum – In a boxplot these are not the smallest and largest values in the
data. They are the lower and upper whisker boundaries, calculated from the interquartile
range (IQR) as Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. Points outside them are drawn as outliers.
IQR = Q3 - Q1
Here Q1 and Q3 are the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
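The quantities behind a boxplot can be computed directly. The sketch below uses hypothetical ages (an assumption, not the Titanic data) and the common 1.5 × IQR rule for the whisker boundaries, which is also the default in matplotlib and seaborn:

```python
import numpy as np

# Hypothetical ages, not taken from the Titanic file
ages = np.array([2, 18, 21, 24, 27, 30, 33, 36, 40, 80])

# Quartiles and interquartile range
q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

# Boundaries beyond which a point counts as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points that lie inside the boundaries
inside = ages[(ages >= lower_fence) & (ages <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

On the real data, sns.boxplot(x=data['Age']) followed by plt.show() would draw the corresponding plot.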
We have studied various plots for exploring a single categorical or numerical variable.
Bivariate analysis is used to explore the relationship between two different variables. This
matters because, in the end, our main task is to understand the relationships between
variables in order to build a powerful model. When we analyze more than two variables
together, it is known as multivariate analysis. We will work through different plots for both
bivariate and multivariate analysis.
1) Scatter Plot:
A scatter plot is the simplest way to plot the relationship between two numerical variables.
Let us look at the relationship between the total bill and the tip using a scatter plot.
sns.scatterplot(x=tips["total_bill"], y=tips["tip"])
Multivariate analysis with scatter plot:
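We can also plot relationships among three or four variables with a scatter plot. Suppose we
want to see the total bill and tip separately for male and female customers.
# hue colors the points by gender
sns.scatterplot(x=tips["total_bill"], y=tips["tip"], hue=tips["sex"])
plt.show()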
OUTPUT:
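We can also carry out a four-variable multivariate analysis with a scatter plot by using the
style argument. Suppose that, along with gender, we also want to know whether the customer was
a smoker.
# style changes the marker shape by smoker status
sns.scatterplot(x=tips["total_bill"], y=tips["tip"], hue=tips["sex"], style=tips["smoker"])
plt.show()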
OUTPUT:
If one variable is numerical and one is categorical then there are various plots that we can use
for Bivariate and Multivariate analysis.
1) Bar Plot:
A bar plot places a categorical variable on the x-axis and a numerical variable on the y-axis
so that we can explore the relationship between them. By default each bar shows the mean of
the numerical variable for that category, and the black line on top of each bar shows the
confidence interval. Let us explore Pclass with Age.
sns.barplot(x=data['Pclass'], y=data['Age'])
plt.show()
OUTPUT:
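By default seaborn's barplot draws the mean of the numerical variable within each category, so the bar heights can be reproduced with a groupby. This is a minimal sketch on a hypothetical six-row sample (an assumption, not the Titanic data):

```python
import pandas as pd

# Hypothetical sample standing in for the Titanic 'Pclass' and 'Age' columns
sample = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [38.0, 54.0, 27.0, 35.0, 22.0, 26.0],
})

# Mean age per passenger class: the height of each bar
mean_age = sample.groupby("Pclass")["Age"].mean()
print(mean_age)
```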
Multivariate analysis using Bar plot:
The hue argument is very useful because it lets us analyze more than two variables. Along with
the above relationship, we now also want to see the effect of gender.
sns.barplot(x=data['Pclass'], y=data['Age'], hue=data["Sex"])
plt.show()
OUTPUT:
2) Boxplot:
We have already studied boxplots in the univariate analysis above. Here we draw a separate
boxplot of the numerical variable for each category. Let us explore gender with age using a
boxplot.
sns.boxplot(x=data['Sex'], y=data["Age"])
plt.show()
OUTPUT:
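Along with age and gender, let us see who survived and who did not.
# assumed reconstruction: hue splits each box by survival
sns.boxplot(x=data['Sex'], y=data['Age'], hue=data['Survived'])
plt.show()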
OUTPUT:
3) Distplot:
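Distplot shows the PDF estimated by kernel density estimation. It does not have a hue
parameter, but we can get the same effect by drawing one curve per group. Suppose we want to
compare the age distribution of passengers who survived with that of passengers who died, and
find the age ranges in which survival was more likely than death.
# one KDE curve per outcome: blue for died, orange for survived
sns.kdeplot(x=data.loc[data['Survived'] == 0, 'Age'], color="blue")
sns.kdeplot(x=data.loc[data['Survived'] == 1, 'Age'], color="orange")
plt.show()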
OUTPUT:
In the above graph, the blue curve shows the age distribution of passengers who died and the
orange curve shows that of passengers who survived. We can see that for children the survival
curve is higher than the death curve, while the opposite holds for older people. A small
analysis like this can reveal important things about the data, and it helps when preparing
data stories.
1) Heatmap:
If you have ever used the crosstab function of pandas, a heatmap is a visual representation of
the same table. It shows how often each category of one variable occurs together with each
category of another variable in the dataset. Let us look first at the crosstab and then at the
heatmap.
pd.crosstab(data['Pclass'], data['Survived'])
Now with a heatmap, we can see how many people survived and died in each passenger class.
sns.heatmap(pd.crosstab(data['Pclass'], data['Survived']))
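To see what the heatmap is coloring, it helps to inspect a crosstab on a tiny example. This is a minimal sketch using a hypothetical six-row sample (an assumption, not the Titanic data):

```python
import pandas as pd

# Hypothetical sample standing in for the Titanic 'Pclass' and 'Survived' columns
sample = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# Rows are passenger classes, columns are survival outcomes, cells are counts
table = pd.crosstab(sample["Pclass"], sample["Survived"])
print(table)
```

Each cell of this table becomes one colored tile in the heatmap. Passing annot=True to sns.heatmap also writes the count inside each tile.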
2) Cluster map:
sns.clustermap(pd.crosstab(data['Parch'], data['Survived']))
plt.show()
OUTPUT:
Conclusion-