Study Notes
Data Visualization:
Bar Plots
Count Plots
Histograms
Cat Plots (Box, Violin, Swarm, Boxen)
Multiple Plots using FacetGrid
Joint Plots
KDE Plots
Pairplots
Heatmaps
Scatter Plots
Study Notes - Data Visualization using Seaborn
1. Bar Plots
Bar plots are an effective way to visualize various data types, including counts,
frequencies, percentages, or averages. They are particularly valuable for comparing data
across different categories.
Use Cases:
1. Categorical Comparison: In a bar plot, each bar represents a specific category, and the
height of the bar reflects the aggregated value associated with that category (such as
count, sum, or mean).
For instance, you can use a bar plot to show the average age of Titanic passengers by
person type (man, woman, or child, as recorded in the 'who' column).
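# Setup (assumes seaborn's bundled example datasets are available)
import matplotlib.pyplot as plt
import seaborn as sns
titanic = sns.load_dataset('titanic')
# Simple barplot: one bar per 'who' category, height = mean age
sns.barplot(data=titanic, x="who", y="age", estimator='mean',
            errorbar=None, palette='viridis')
plt.title('Simple Barplot')
plt.xlabel('Person')
plt.ylabel('Average Age')
plt.show()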
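The aggregation behind each bar can be sketched in plain Python. This is only an illustration of the idea (group by category, then average), not seaborn's implementation; the records below are hypothetical and the helper name bar_heights is made up for this sketch.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (category, age) records standing in for the 'who' and 'age' columns.
records = [
    ("man", 30.0), ("man", 40.0),
    ("woman", 28.0), ("woman", 32.0),
    ("child", 4.0), ("child", 8.0),
]

def bar_heights(rows):
    """Group values by category and average them: one bar height per category."""
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)
    return {category: mean(values) for category, values in groups.items()}

heights = bar_heights(records)
print(heights)
```

Swapping mean for sum or len in the sketch corresponds to choosing a different estimator for the bar height.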
2. Proportional Representation with Stacked Bar Charts:
Bar plots can also be used to visualize proportions or percentages. By adjusting the
height of each bar to reflect the proportion of observations within a category, stacked
bar charts allow for a comparison of the relative distribution across different categories.
For example, a stacked bar chart could show the proportion of males from various towns
aboard the Titanic.
#Prepare data for next plot
data = titanic.groupby('embark_town').agg({'who':'count','sex': lambda x: (x=='male').sum()}).reset_index()
data.rename(columns={'who':'total', 'sex':'male'}, inplace=True)
data.sort_values('total', inplace=True)
# Barplot Showing Part of Total
sns.set_color_codes("pastel")
sns.barplot(x="total", y="embark_town", data=data,
label="Female", color="b")
sns.set_color_codes("muted")
sns.barplot(x="male", y="embark_town", data=data,
label="Male", color="b")
plt.title('Barplot Showing Part of Total')
plt.xlabel('Number of Persons')
plt.legend(loc='upper right')
plt.show()
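The plot above overlays raw counts. If proportions are wanted instead, the male share per town can be computed first. A minimal sketch, assuming per-town totals are already available; the town names and counts are invented for illustration and male_share is a hypothetical helper.

```python
# Hypothetical per-town counts; the numbers are made up for illustration.
counts = {
    "Town A": {"total": 200, "male": 120},
    "Town B": {"total": 50, "male": 20},
}

def male_share(town_counts):
    """Return the fraction of passengers who are male for each town."""
    return {town: c["male"] / c["total"] for town, c in town_counts.items()}

shares = male_share(counts)
for town, share in shares.items():
    print(f"{town}: {share:.0%} male")
```

Plotting these fractions rather than the counts makes towns of very different sizes directly comparable.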
3. Comparing Subcategories within Categories using Clustered Bar Plots:
Clustered bar plots group multiple bars within each category to represent different
subcategories, making it easier to compare and analyze data across them.
For instance, you could use a clustered bar plot to compare the average age of males and
females within each class.
# Clustered barplot
sns.barplot(data=titanic, x='class', y='age', hue='sex',
estimator='mean', errorbar=None, palette='viridis')
plt.title('Clustered Barplot')
plt.xlabel('Class')
plt.ylabel('Average Age')
plt.show();
2. Count Plots
A count plot visualizes the frequency of occurrences for each category within a
categorical variable. The x-axis shows the categories, while the y-axis indicates the count
or frequency of each category.
Use Cases:
Frequency Distribution of Categorical Variables: Each bar in the plot represents a
category, and its height reflects the number of observations in that category, helping
identify the most and least common categories.
For example, a count plot can show the survival status of passengers on the Titanic.
# Simple Countplot
sns.countplot(data=titanic, x='alive', palette='viridis')
plt.title('Simple Countplot')
plt.show();
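What a count plot tallies can be reproduced with a simple frequency count. A small sketch using only the standard library; the labels below are hypothetical stand-ins for the 'alive' column.

```python
from collections import Counter

# Hypothetical survival labels standing in for the 'alive' column.
alive = ["no", "yes", "no", "no", "yes", "no"]

# One bar per category; the bar height is the number of occurrences.
counts = Counter(alive)
most_common = counts.most_common(1)[0]

print(dict(counts))
print("Most common category:", most_common)
```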
Relationship Between Categorical Variables: Adding a second categorical variable as the
hue splits each bar into one bar per subcategory.
For example, examining the survival status of Titanic passengers by person type (man,
woman, or child).
# Clustered Countplot
sns.countplot(data=titanic, y="who",
hue="alive", palette='viridis')
plt.title('Clustered Countplot')
plt.show();
3. Histograms
Histograms are visual representations that display the distribution of a dataset, helping
to uncover key characteristics such as normality, skewness, or multiple peaks. They
show the frequency or count of data points within specific intervals or "bins." The x-axis
represents the range of values in the dataset, divided into equal bins, while the y-axis
shows the frequency or count of observations within each bin. The height of each bar
corresponds to the number of data points in that bin.
Use Cases:
1. Visualize the distribution, central tendency, range, and spread of a continuous or
numeric variable, and identify any patterns or outliers.
# Histogram with KDE (assumes seaborn's bundled iris dataset)
iris = sns.load_dataset('iris')
sns.histplot(data=iris, x='sepal_width', kde=True)
plt.title('Histogram with KDE')
plt.show()
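The binning step described above can be sketched by hand. This assumes a fixed number of equal-width bins, whereas seaborn picks the bin count automatically unless told otherwise; the measurements are hypothetical and histogram_counts is a made-up helper name.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("values must not all be equal")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin rather than a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts

# Hypothetical measurements, not taken from the iris dataset.
widths = [2.0, 2.5, 3.0, 3.0, 3.1, 3.4, 3.5, 4.0]
edges, counts = histogram_counts(widths, bins=4)
print(edges)
print(counts)
```

Each entry of counts is the height of one bar, and consecutive entries of edges give that bar's interval on the x-axis.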
2. Compare the distribution of multiple continuous variables
For example, comparing the distribution of sepal length and sepal width in iris flowers.
# Histogram with multiple features
sns.histplot(data=iris[['sepal_length','sepal_width']])
plt.title('Multi-Column Histogram')
plt.show()
3. Compare the distribution of a continuous variable across different categories
For example, comparing the distribution of sepal length among various flower species.
#Stacked Histogram
sns.histplot(iris, x='sepal_length', hue='species', multiple='stack',
linewidth=0.5)
plt.title('Stacked Histogram')
plt.show()
4. Cat Plots (Box, Violin, Swarm, Boxen)
A catplot is a high-level, flexible function that integrates several categorical seaborn
plots, such as boxplots, violinplots, swarmplots, pointplots, barplots, and countplots.
Use Cases:
Analyze the relationship between categorical and continuous variables
Obtain a statistical summary of a continuous variable
Examples:
# Boxplot (assumes seaborn's bundled tips dataset)
tips = sns.load_dataset('tips')
sns.boxplot(data=tips, x='time', y='total_bill', hue='sex', palette='viridis')
plt.title('Boxplot')
plt.show()
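The statistical summary a box plot draws can be approximated in plain Python. This sketch assumes the common convention of whiskers reaching the farthest points within 1.5 times the interquartile range; the bill amounts are hypothetical and box_summary is a made-up helper name.

```python
from statistics import quantiles

def box_summary(values, whis=1.5):
    """Quartiles, whisker ends, and outliers under the 1.5 * IQR convention."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": outliers,
    }

# Hypothetical bill amounts, not taken from the tips dataset.
bills = [10, 12, 14, 15, 16, 18, 20, 22, 50]
summary = box_summary(bills)
print(summary)
```

The box spans q1 to q3 with a line at the median, and any value listed under outliers would be drawn as an individual point beyond the whiskers.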
# Violinplot
sns.violinplot(data=tips, x='day', y='total_bill', palette='viridis')
plt.title('Violinplot')
plt.show()
#Swarmplot
sns.swarmplot(data=tips, x='time', y='tip', dodge=True, palette='viridis', hue='sex', s=6)
plt.title('SwarmPlot')
plt.show()
#StripPlot
sns.stripplot(data=tips, x='tip', hue='size', y='day', s=25, alpha=0.2,
jitter=False, marker='D',palette='viridis')
plt.title('StripPlot')
plt.show()
5. Multiple Plots using FacetGrid
FacetGrid is a Seaborn class that draws multiple plots, one per subset of the data, arranged
in a grid. Each plot in the grid represents one category, and the subsets are defined by the
column names passed to the 'col' and 'row' parameters of FacetGrid(). The plots in the grid
can be of any type supported by Seaborn, such as scatter plots, line plots, bar plots, or
histograms.
Use Cases:
Compare and analyze different groups or categories within a dataset
Create subplots efficiently
Example: Boxplots for pulse rate during various activities
# Creating subplots using FacetGrid (assumes seaborn's bundled exercise dataset)
exercise = sns.load_dataset('exercise')
g = sns.FacetGrid(exercise, col='kind', palette='Paired')
# Drawing a plot on every facet
g.map(sns.boxplot, 'pulse')
g.set_titles(col_template="Pulse rate for {col_name}")
g.add_legend()
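Conceptually, faceting splits the rows into one subset per category before drawing. A rough sketch of that partitioning step in plain Python, with hypothetical rows standing in for the 'kind' and 'pulse' columns and a made-up helper named facet; it does not draw anything.

```python
from collections import defaultdict

# Hypothetical rows standing in for the 'kind' and 'pulse' columns.
rows = [
    {"kind": "rest", "pulse": 85},
    {"kind": "walking", "pulse": 95},
    {"kind": "running", "pulse": 130},
    {"kind": "rest", "pulse": 88},
    {"kind": "running", "pulse": 126},
]

def facet(data, col):
    """Split rows into one subset per distinct value of the `col` column."""
    panels = defaultdict(list)
    for row in data:
        panels[row[col]].append(row)
    return dict(panels)

panels = facet(rows, col="kind")
for name, subset in panels.items():
    # Each subset would be drawn on its own axes in the grid.
    print(name, [r["pulse"] for r in subset])
```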
Example: Scatter plots of flipper length against body mass for penguins from different islands
# Creating subplots using FacetGrid
g = sns.FacetGrid(penguins, col='island',hue='sex', palette='Paired')
# Drawing a plot on every facet
g.map(sns.scatterplot, 'flipper_length_mm', 'body_mass_g')
g.set_titles(template="Penguins of {col_name} Island")
g.add_legend();
6. Joint Plots
A joint plot combines univariate and bivariate visualizations in one figure. The central plot
typically features a scatter plot or hexbin plot to represent the joint distribution of two
variables. Additional plots, such as histograms or Kernel Density Estimates (KDEs), are displayed
along the axes to show the individual distributions of each variable.
Use Cases:
Analyzing the relationship between two variables
Comparing the individual distributions of two variables
Example: Comparing displacement and miles per gallon (MPG) for cars
# Hex Plot with Histogram margins (assumes seaborn's bundled mpg dataset)
mpg = sns.load_dataset('mpg')
sns.jointplot(x="mpg", y="displacement", data=mpg,
              height=5, kind='hex', ratio=2, marginal_ticks=True)
Example: Comparing acceleration and horsepower for cars from different countries of origin
# Scatter Plot with KDE Margins
sns.jointplot(x="horsepower", y="acceleration", data=mpg,
hue="origin", height=5, ratio=2, marginal_ticks=True);
7. KDE Plots
A KDE (Kernel Density Estimate) plot provides a smooth, continuous representation of the
probability density function for a continuous random variable. The y-axis represents the density
or likelihood of observing specific values, while the x-axis displays the variable's values.
Use Cases:
Visualizing the distribution of a single variable (univariate analysis)
Gaining insights into the shape, peaks, and skewness of the distribution
Example: Comparing the horsepower of cars in relation to the number of cylinders
#Overlapping KDE Plots
sns.kdeplot(data=mpg, x='horsepower', hue='cylinders', fill=True,
palette='viridis', alpha=.5, linewidth=0)
plt.title('Overlapping KDE Plot')
plt.show()
Example: Comparing the weight of cars across different countries
#Stacked KDE Plots
sns.kdeplot(data=mpg, x="weight", hue="origin", multiple="stack")
plt.title('Stacked KDE Plot')
plt.show();
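The smooth curve in a KDE plot can be understood as an average of small Gaussian bumps, one centred on each observation. A minimal sketch of that idea, assuming a Gaussian kernel and a hand-picked bandwidth (seaborn chooses the bandwidth automatically); the sample values are hypothetical and gaussian_kde here is a local helper, not a library call.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density as an average of Gaussian bumps."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Hypothetical values; the bandwidth is hand-picked rather than estimated.
samples = [70.0, 90.0, 95.0, 100.0, 150.0]
density = gaussian_kde(samples, bandwidth=10.0)

# Riemann sum over a wide grid: the estimated density should integrate to about 1.
step = 0.5
grid = [i * step for i in range(0, 501)]  # covers 0 to 250
area = sum(density(x) for x in grid) * step
print(round(area, 3))
```

A larger bandwidth gives a smoother, flatter curve; a smaller one follows the individual observations more closely.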
8. Pairplots
A pair plot is a visualization technique that helps explore relationships between multiple
variables in a dataset. It creates a grid of scatter plots where each variable is plotted against
every other variable, with diagonal entries displaying histograms or density plots to show the
distribution of values for each variable.
Use Cases:
Identifying correlations or patterns between variables, such as linear or non-linear
relationships, clusters, or outliers
Example: Visualizing the relationships between different features of penguins
#Simple Pairplot
sns.pairplot(data=penguins, corner=True);
# Pairplot with hues
sns.pairplot(data=penguins, hue='species');
By adding hue to the plot, we can clearly distinguish key differences between the various
species of penguins.
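The layout of a pair plot grid can be enumerated directly: every variable is paired with every other, and corner=True keeps only the lower triangle plus the diagonal. A small sketch of that bookkeeping; the pair_grid helper is hypothetical and only lists panels, it does not plot.

```python
def pair_grid(variables, corner=False):
    """List the (row, column) variable pairs that get a panel in the grid."""
    panels = []
    for i, row_var in enumerate(variables):
        for j, col_var in enumerate(variables):
            if corner and j > i:
                continue  # skip the upper triangle
            panels.append((row_var, col_var))
    return panels

# Numeric penguin measurements, listed here for illustration.
variables = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
full = pair_grid(variables)
lower = pair_grid(variables, corner=True)
diagonal = [p for p in full if p[0] == p[1]]
print(len(full), len(lower), len(diagonal))
```

Panels where both names match are the diagonal entries, which show a single variable's distribution instead of a scatter plot.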
9. Heatmaps
Heatmaps are visualizations that use color-coded cells to represent the values within a matrix
or table of data. In a heatmap, the rows and columns correspond to two different variables, and
the color intensity of each cell indicates the value or magnitude of the data point at their
intersection.
Use Cases:
Correlation analysis and visualizing pivot tables that aggregate data by rows and
columns.
Example: Visualizing the correlation between all the numerical columns in the mpg
dataset.
# Select the numeric columns from the dataset
num_cols = list(mpg.select_dtypes(include='number'))
fig = plt.figure(figsize=(12,7))
#Correlation Heatmap
sns.heatmap(data=mpg[num_cols].corr(),
annot=True, cmap=sns.cubehelix_palette(as_cmap=True))
plt.title('Heatmap of Correlation matrix');
plt.show();
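The values annotated in a correlation heatmap are pairwise Pearson coefficients, which is what DataFrame.corr() computes by default. A sketch of that calculation in plain Python; the column values are invented for illustration and pearson is a local helper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical columns; values are invented, not taken from the mpg dataset.
columns = {
    "mpg": [30.0, 25.0, 20.0, 15.0],
    "weight": [2000.0, 2500.0, 3000.0, 3500.0],
    "horsepower": [70.0, 90.0, 100.0, 150.0],
}

names = list(columns)
matrix = [[pearson(columns[a], columns[b]) for b in names] for a in names]
for name, row in zip(names, matrix):
    print(name, [round(v, 2) for v in row])
```

Each cell of the heatmap corresponds to one entry of matrix: the diagonal is 1, and the matrix is symmetric.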
10. Scatter Plots
A scatter plot visualizes the relationship between two continuous variables by displaying
individual data points on a graph. The x-axis represents one variable, and the y-axis represents
the other, creating a pattern of scattered points that illustrates their interaction.
Use Cases:
1. Relationship Analysis: Scatter plots help identify the relationship between two variables, such
as positive correlation (both increase together), negative correlation (one increases as the other
decreases), or no correlation.
Example: A scatter plot can show that the horsepower and weight of cars are positively
correlated.
# Simple Scatterplot
sns.scatterplot(data=mpg, x='weight', y='horsepower', s=150, alpha=0.7)
plt.title('Simple Scatterplot')
plt.show();
2. Outlier Detection: Scatter plots effectively highlight outliers, which are data points that
significantly deviate from the general trend or pattern.
3. Clustering and Group Identification: By analyzing the distribution of points, scatter plots
can reveal natural groupings or patterns among the variables.
Example: Comparing the horsepower and weight of cars manufactured in different countries.
# Scatterplot with Hue
sns.scatterplot(data=mpg, x='weight', y='horsepower', s=150, alpha=0.7,
hue='origin', palette='viridis')
plt.title('Scatterplot with Hue')
plt.show()
# Scatterplot with Hue and Markers
sns.scatterplot(data=mpg, x='weight', y='horsepower', s=150, alpha=0.7,
style='origin',palette='viridis', hue='origin')
plt.title('Scatterplot with Hue and Markers')
plt.show()
# Scatterplot with Hue & Size
sns.scatterplot(data=mpg, x='weight', y='horsepower', sizes=(40, 400), alpha=.5,
palette='viridis', hue='origin', size='cylinders')
plt.title('Scatterplot with Hue & Size')
plt.show()
4. Trend Analysis: Scatter plots can illustrate the progression or changes in variables over
time by plotting data points in chronological order, making it easier to identify trends or
shifts in behavior.
5. Model Validation: Scatter plots are useful for assessing a model's accuracy by
comparing predicted values against actual values, highlighting any deviations or
patterns in the model's predictions.
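The outlier detection and model validation uses above both come down to inspecting residuals, the vertical distances between points and a fitted trend. A rough sketch, assuming a straight-line trend fitted by least squares and a threshold of two standard deviations; the data, the threshold, and the helper names are all hypothetical choices for illustration.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Least-squares slope and intercept of y on x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def flag_outliers(xs, ys, threshold=2.0):
    """Return points whose residual exceeds `threshold` residual standard deviations."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    return [(x, y) for x, y, r in zip(xs, ys, residuals)
            if abs(r) > threshold * spread]

# Hypothetical weight/horsepower pairs; the last point is deliberately off-trend.
weight = [2000.0, 2500.0, 3000.0, 3500.0, 4000.0, 4500.0, 3200.0]
horsepower = [70.0, 90.0, 110.0, 130.0, 150.0, 170.0, 230.0]

outliers = flag_outliers(weight, horsepower)
print(outliers)
```

For model validation, the same residuals would be computed between predicted and actual values instead of against a fitted line.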