DSBDAL - Assignment No 9

This document discusses exploratory data analysis techniques for univariate and bivariate/multivariate analysis using Python. For univariate analysis of categorical data, it demonstrates using countplots and pie charts to visualize the distribution of variables. For numerical data, it shows histograms, distribution plots, and boxplots. For bivariate/multivariate analysis, it provides examples of scatter plots, bar plots, boxplots, and distribution plots to explore relationships between two or more variables that can be either numerical, categorical, or mixed. Specific techniques demonstrated include adding hue and style parameters to plots to analyze additional variables.


Assignment No: 9

Title of the Assignment: Data Visualization II

1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for the distribution
of age with respect to each gender, along with the information about whether they survived or
not. (Column names: 'sex' and 'age')
2. Write observations on the inference from the above statistics.

Objective of the Assignment: Students should be able to perform data visualization using
Python on any open source dataset.

Prerequisite:
1. Basics of Python Programming
2. Seaborn Library, Concept of Data Visualization.
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Exploratory Data Analysis

2. Univariate Analysis

---------------------------------------------------------------------------------------------------------------

Exploratory Data Analysis

There are various techniques for understanding data. The basic prerequisites are NumPy for
mathematical operations and Pandas for data manipulation. The Titanic dataset is used throughout.
To demonstrate some of the techniques, we also use seaborn's inbuilt tips dataset, which records
the tips each waiter receives from different customers.

Importing libraries and loading data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from seaborn import load_dataset

# Titanic dataset, read from a local CSV file
data = pd.read_csv("titanic_train.csv")

# tips dataset bundled with seaborn
tips = load_dataset("tips")

Univariate Analysis

Univariate analysis is the simplest form of analysis, in which we explore a single variable.
It is performed to describe the data better. We analyze numerical and categorical variables
differently because each calls for different plots.

Categorical Data:

A variable that holds text-based information is referred to as a categorical variable. The
following plots can be used to visualize categorical data.

1) CountPlot:

A countplot is a frequency plot in the form of a bar graph: it plots the count of each
category as a separate bar. It is the visual equivalent of applying the pandas value_counts
function to a column. In our data the target variable, Survived, is categorical, so we plot
a countplot of it.

sns.countplot(x='Survived', data=data)

plt.show()
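The numbers behind a countplot can be checked directly with value_counts. The following is a minimal sketch; the short Series is hypothetical data standing in for the Survived column, not the real Titanic values.

```python
import pandas as pd

# Hypothetical stand-in for the 'Survived' column (0 = died, 1 = survived)
survived = pd.Series([0, 1, 1, 0, 0, 1, 0, 0], name="Survived")

# One count per category; a countplot draws one bar of this height per category
counts = survived.value_counts().sort_index()
print(counts)
```

Each entry of counts corresponds to the height of one bar in the countplot.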

2) Pie Chart:

A pie chart conveys the same information as a countplot, but additionally shows the
percentage share of each category, i.e. how much weight each category carries in the data.
Here we check the Sex column to see what percentage of the passengers were male and
female.

data['Sex'].value_counts().plot(kind="pie", autopct="%.2f")

plt.show()
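The percentages that autopct prints on the pie slices can also be computed as a table. A minimal sketch, assuming a small hypothetical Series in place of the Sex column:

```python
import pandas as pd

# Hypothetical stand-in for the 'Sex' column
sex = pd.Series(["male", "female", "male", "male"], name="Sex")

# normalize=True returns fractions instead of counts; multiply by 100 for percentages
shares = sex.value_counts(normalize=True) * 100
print(shares.round(2))
```

The shares always sum to 100, which is what makes a pie chart a reasonable choice for this kind of data.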


Numerical Data:

Analyzing numerical data is important because understanding the distribution of variables
helps in further processing the data. Numerical data often contains inconsistencies, so
numerical variables need to be explored.

1) Histogram:

A histogram is a value distribution plot for numerical columns. It creates bins over
ranges of values and plots the number of values that fall in each bin, so we can visualize
how the values are distributed. We can see where most values lie: toward the low end, toward
the high end, or around the center (mean). Let's have a look at the Age column.

plt.hist(data['Age'].dropna(), bins=5)  # drop missing ages before binning

plt.show()
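To see what the bins parameter does, the binning itself can be reproduced with NumPy. This is a sketch on a hypothetical array of ages, not the actual Age column.

```python
import numpy as np

# Hypothetical ages standing in for the 'Age' column
ages = np.array([2, 8, 19, 22, 24, 28, 30, 35, 40, 52, 61, 80])

# bins=5 splits the range [min, max] into 5 equal-width intervals
counts, edges = np.histogram(ages, bins=5)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {n} values")
```

The counts array holds the bar heights of the histogram and edges holds the bin boundaries, so five bins produce six edges.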

2) Distplot:

Distplot is also known as the second histogram because it is a slightly improved version of
the histogram. It overlays a KDE (kernel density estimate) on the histogram. The KDE
approximates the PDF (probability density function), which describes how likely values in
each range of the column are.

sns.distplot(data['Age'])

plt.show()

3) Boxplot:

A boxplot plots a five-number summary. To read it, a few terms need to be defined.

• Median – The middle value of the series after sorting.

• Percentile – The value below which a given percentage of the observations fall. For
example, if the 25th percentile is 50, then 25% of the values lie below 50.

• Minimum and Maximum – In a boxplot these are not the actual minimum and maximum
values; rather, they are the lower and upper boundaries beyond which points are treated as
outliers, calculated using the interquartile range (IQR).

IQR = Q3 - Q1

Lower_boundary = Q1 - 1.5 * IQR

Upper_boundary = Q3 + 1.5 * IQR

Here Q1 and Q3 are the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
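The formulas above can be sketched with NumPy as follows. The ages array is a hypothetical sample chosen so that it contains one low and one high outlier.

```python
import numpy as np

# Hypothetical sample of ages (already sorted for readability)
ages = np.array([2, 19, 22, 24, 28, 30, 35, 40, 71])

# Quartiles and median
q1, median, q3 = np.percentile(ages, [25, 50, 75])

# Interquartile range and the boundaries used to flag outliers
iqr = q3 - q1
lower_boundary = q1 - 1.5 * iqr
upper_boundary = q3 + 1.5 * iqr

# Points outside the boundaries are drawn as individual outlier markers
outliers = ages[(ages < lower_boundary) | (ages > upper_boundary)]
print(q1, median, q3, iqr, lower_boundary, upper_boundary, outliers)
```

For this sample Q1 = 22, the median is 28 and Q3 = 35, so IQR = 13 and the boundaries are 2.5 and 54.5; the values 2 and 71 fall outside them.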

Bivariate/ Multivariate Analysis:

We have studied various plots for exploring a single categorical or numerical variable.
Bivariate analysis is used to explore the relationship between two different variables. We
do this because, in the end, our main task is to explore the relationships between
variables in order to build a powerful model. When we analyze more than two variables
together, it is known as multivariate analysis. We will work through different plots for
both bivariate and multivariate analysis.

First, explore the plots used when both variables are numerical.

1) Scatter Plot:

A scatter plot is a simple way to plot the relationship between two numerical variables. Let
us see the relationship between the total bill and the tip using a scatter plot.

sns.scatterplot(x="total_bill", y="tip", data=tips)

plt.show()

Multivariate analysis with scatter plot:

We can also plot relationships among 3 or 4 variables with a scatter plot. Suppose we want
to see how the total bill and the tip differ between male and female customers.

sns.scatterplot(x="total_bill", y="tip", hue="sex", data=tips)

plt.show()


We can also carry out a 4-variable multivariate analysis with a scatter plot using the style
argument. Suppose that, along with gender, we also want to know whether the customer was a
smoker.

sns.scatterplot(x="total_bill", y="tip", hue="sex", style="smoker", data=tips)

plt.show()

Numerical and Categorical:

If one variable is numerical and one is categorical then there are various plots that we can use
for Bivariate and Multivariate analysis.

1) Bar Plot:

A bar plot places a categorical variable on the x-axis and a numerical variable on the
y-axis to explore the relationship between the two. By default the height of each bar is
the mean of the numerical variable within that category, and the black line on top of each
bar shows the confidence interval. Let us explore Pclass against Age.

sns.barplot(x='Pclass', y='Age', data=data)

plt.show()
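Since seaborn's barplot uses the mean as its default estimator, the bar heights can be reproduced with a groupby. A minimal sketch on a small hypothetical DataFrame (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical rows standing in for the Titanic 'Pclass' and 'Age' columns
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [38.0, 54.0, 30.0, 26.0, 22.0, 18.0],
})

# Mean age per passenger class: one value per bar in the bar plot
mean_age = df.groupby("Pclass")["Age"].mean()
print(mean_age)
```

Comparing this table with the plot is a quick way to confirm that the bars are being read correctly.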

Multivariate analysis using Bar plot:

The hue argument is very useful for analyzing more than 2 variables. Here we plot Pclass
against Fare and use hue to split each bar by gender.

sns.barplot(x='Pclass', y='Fare', hue='Sex', data=data)

plt.show()

2) Boxplot:

We have already studied boxplots in the univariate analysis above. We can draw a separate
boxplot of the numerical variable for each category of the categorical variable. Let us
explore age by gender using a boxplot.

sns.boxplot(x='Sex', y='Age', data=data)

plt.show()

Multivariate analysis with boxplot:

Along with age and gender, let's see who survived and who did not.

sns.boxplot(x='Sex', y='Age', hue='Survived', data=data)

plt.show()
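When writing observations from such a boxplot, it can help to tabulate the statistics behind each box. The following is a minimal sketch using the lowercase column names of seaborn's inbuilt titanic dataset ('sex', 'age', 'survived'); the rows themselves are hypothetical, so the numbers are not real Titanic statistics.

```python
import pandas as pd

# Hypothetical rows with the column names used by seaborn's inbuilt titanic dataset
df = pd.DataFrame({
    "sex": ["male"] * 4 + ["female"] * 4,
    "survived": [0, 0, 1, 1, 0, 0, 1, 1],
    "age": [20.0, 40.0, 4.0, 30.0, 18.0, 50.0, 24.0, 36.0],
})

# Five-number summary of age for every (sex, survived) group,
# i.e. one row per box in sns.boxplot(x="sex", y="age", hue="survived", data=df)
summary = df.groupby(["sex", "survived"])["age"].describe()
summary = summary[["min", "25%", "50%", "75%", "max"]]
print(summary)
```

Each row of summary describes one box: the 25% and 75% columns are the box edges and the 50% column is the median line.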

3) Distplot:

Distplot estimates the PDF using kernel density estimation. It does not have a hue
parameter, but we can get the same effect by drawing one curve per group. Suppose we want to
compare the age distribution of passengers who survived with that of passengers who died,
and find the age ranges in which survival was more likely than death.

sns.distplot(data[data['Survived'] == 0]['Age'], hist=False, color="blue")

sns.distplot(data[data['Survived'] == 1]['Age'], hist=False, color="orange")

plt.show()


In the above graph, the blue curve shows the age distribution of passengers who died and
the orange curve shows that of passengers who survived. We can see that for children the
survival density is higher than the death density, while the opposite holds for older
people. A small analysis like this can reveal important things about the data, and it
helps when preparing data stories.

Categorical and Categorical:

Now we will work with two categorical columns.

1) Heatmap:

If you have ever used the crosstab function of pandas, a heatmap is the visual
representation of the same table. It shows how often each category of one variable occurs
together with each category of the other. Let us look first at the crosstab and then at
the heatmap.

pd.crosstab(data['Pclass'], data['Survived'])
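To make the shape of the crosstab concrete, here is a minimal sketch on a small hypothetical DataFrame; the counts are made up and are not the real Titanic figures.

```python
import pandas as pd

# Hypothetical rows standing in for the 'Pclass' and 'Survived' columns
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# Rows are passenger classes, columns are survival outcomes,
# and each cell counts the passengers with that combination
table = pd.crosstab(df["Pclass"], df["Survived"])
print(table)
```

Each cell of this table becomes one colored cell when the table is passed to a heatmap.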

Now, with a heatmap, we can see how many people survived and died in each class.

sns.heatmap(pd.crosstab(data['Pclass'], data['Survived']))

plt.show()

2) Cluster map:

We can also use a cluster map to understand the relationship between two categorical
variables. A cluster map plots a heatmap together with dendrograms that group categories
with similar behavior.

sns.clustermap(pd.crosstab(data['Parch'], data['Survived']))

plt.show()

Conclusion-

In this way we have explored the functions of the Python libraries for data visualization,
covering univariate, bivariate and multivariate analysis on the Titanic and tips datasets.
