CS2 2 Study Unit 7 Introduction To Data Visualization
7.0 Introduction
According to an English adage, a picture is worth a thousand words. This means that
complex and sometimes multiple ideas can be conveyed by a single still image, which conveys its
meaning more effectively than data analysis presented in the form of ordinary tables or verbal
descriptions.
In the same way, data visualization provides an accessible way to see and understand trends
(upward or downward direction), outliers (extreme values), and patterns in data. Our eyes are
easily drawn to colours and patterns. Data visualization helps grab the readers' interest and keeps
their eyes on the message of the data.
It can be hard for the audience to grasp the true meaning of the findings without data
visualization.
Presenting data and information is not just about picking any data visualization design.
Matching data to the right visualization begins by answering the following five (5) key questions:
With those questions (and your answers) in mind, we'll dive into the different visualization
techniques with which we can represent our data and bring our data story to life.
CS2: Computer Science Level 2
In this section, you will learn about various data visualization techniques.
Name: Bar chart
Presents categorical data with rectangular bars with heights or lengths proportional to the
values that they represent. The bars can be plotted vertically or horizontally.
A bar graph shows comparisons among discrete categories. One axis of the chart shows the
specific categories being compared, and the other axis represents a measured value.
Some bar graphs present bars clustered in groups of more than one, showing the values of more
than one measured variable. These clustered groups can be differentiated using colour.
Name: Histogram
Visual dimensions:
bin limits
count/length
color
An approximate representation of the distribution of numerical data. To build one, divide the
entire range of values into a series of intervals and then count how many values fall into each
interval; this is called binning. The bins are usually specified as consecutive, non-overlapping
intervals of a variable. The bins (intervals) must be adjacent, and are often (but not required
to be) of equal size.
The height of each bar represents the number of penguins whose flipper length (mm) falls
within the respective bin (range).
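The binning step described above can be sketched with NumPy. The flipper lengths and bin edges below are made-up illustrative values, not taken from the course dataset.

```python
import numpy as np

# Hypothetical flipper lengths in mm (illustrative values only)
flipper_length_mm = np.array([172, 181, 186, 190, 193, 195, 198, 201, 210, 215, 221, 230])

# Four adjacent, equal-width bins covering 170 mm to 230 mm
bin_edges = np.array([170, 185, 200, 215, 230])

# counts[i] is the number of values that fall into the i-th bin
counts, edges = np.histogram(flipper_length_mm, bins = bin_edges)

# Print each bin with its count; these counts are the bar heights of the histogram
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right} mm: {count}")
```

Every value lands in exactly one bin, so the counts always add up to the number of observations.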
Name: Scatter plot
Visual dimensions:
x position
y position
shape
color
size
Uses Cartesian coordinates to display values for typically two variables for a set of data.
Points can be coded via color, shape and/or size to display additional variables.
Each point on the plot has associated x and y values that determine its location on the
Cartesian plane.
Scatter plots are often used to highlight the correlation between two variables (x and y).
Name: Pie chart
Visual dimensions:
color
Represents one categorical variable, which is divided into slices to illustrate numerical
proportion. In a pie chart, the arc length of each slice is proportional to the quantity it
represents.
For example, a pie chart can show the proportion of each penguin species.
Name: Box plot
Visual dimensions:
x axis
y axis
A method for graphically depicting groups of numerical data through their quartiles.
Box plots may also have lines extending from the boxes (whiskers) indicating variability
outside the upper and lower quartiles.
Outliers may be plotted as individual points.
The two boxes graphed on top of each other represent the middle 50% of the data. The
line separating the two boxes identifies the median data value, and the top and bottom
edges of the boxes represent the 75th and 25th percentile data points respectively.
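As a small worked example of the quantities a box plot draws, the sketch below computes the quartiles of some made-up values with NumPy. The 1.5 x IQR rule used to flag outliers is a common convention (and the default whisker rule in Matplotlib and Seaborn), assumed here for illustration.

```python
import numpy as np

# Hypothetical measurements (illustrative values only)
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# 25th percentile, median and 75th percentile: the three lines that define the box
q1, median, q3 = np.percentile(values, [25, 50, 75])

# Interquartile range: the height of the box, covering the middle 50% of the data
iqr = q3 - q1

# Assumed convention: points more than 1.5 * IQR beyond the box are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, outliers)
```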
Shows the distribution of quantitative data in a way that facilitates comparisons across
levels of a categorical variable. For example, comparing the distribution of sepal length
among flower species (e.g. Carnation, Lily, and Rose).
Summary
Additional resources
For additional resources on data visualization, check the following:
https://www.klipfolio.com/resources/articles/what-is-data-visualization
https://www.import.io/post/what-is-data-visualization/
There are many tools/packages for data visualization in Python programming. Some of them are
as follows:
Plotly generates the most interactive graphs; it allows saving them offline and creating very
rich web-based visualizations.
Pandas also has its own data visualization functionality based on Matplotlib.
In this course, we will use Seaborn and Pandas for data visualization. We will also
use some functions from the matplotlib package, since Seaborn and Pandas base their
visualization on matplotlib.
7.1.4 Installation
The Anaconda distribution of Python comes with Matplotlib and Pandas pre-installed, so no
further installation steps are necessary for them. Seaborn may not be included, but it can easily
be installed with conda or pip. Open the Anaconda Prompt or your terminal and run:
conda install seaborn
or
pip install seaborn
The import statements for this section bring in NumPy, Pandas, Matplotlib and Seaborn. We assign
each of them an alias to make calling their methods easier: numpy is assigned the alias np, pandas
is assigned the alias pd, matplotlib.pyplot is assigned the alias plt, and seaborn is assigned the
alias sns.
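The import cell itself is not reproduced at this point in the unit. Based on the aliases listed above (and the imports used in the activity solutions), it presumably looks like this:

```python
import numpy as np                # numerical arrays
import pandas as pd               # DataFrames and data loading
import matplotlib.pyplot as plt   # axis labels, titles, saving figures
import seaborn as sns             # statistical plots built on Matplotlib
```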
For our data visualization in this section, we shall use the food servers' tips dataset.
tips_data.head()
tips_data.tail()
tips_data.columns
tips_data.shape
(744, 7)
tips_data.profile_report()
• darkgrid
• whitegrid
• dark
• white
• ticks
sns.set_style("whitegrid")
Scatter plots are used to plot data points on the horizontal and vertical axes. A scatter plot
shows how one variable changes with another, and therefore the extent of the correlation between
them. It is used to find the relationship between two continuous variables.
To draw a scatter plot we use the sns.relplot() function of the seaborn library. It can be done by using:
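A minimal sketch of a basic scatter plot is given here. The small DataFrame is a made-up stand-in for the tips dataset, and the column names total_bill and tip are assumptions about that dataset.

```python
import pandas as pd
import seaborn as sns

# Made-up rows standing in for the tips dataset (column names are assumptions)
tips_data = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
    "tip":        [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
})

# One point per row: total bill on the x axis, tip on the y axis
g = sns.relplot(x = "total_bill", y = "tip", data = tips_data)
```

sns.relplot() returns a FacetGrid; its default kind is a scatter plot.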
You can use any categorical variable to classify the points of a scatter plot. For this, there is
a parameter called hue. You can use hue as follows:
As you can see, the scatter plot is classified by sex: each point is coloured according to its category.
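The hue parameter can be sketched as follows. The data are made up, and the column name gender is an assumption (it follows the name used in the pie chart example of this unit).

```python
import pandas as pd
import seaborn as sns

# Made-up rows standing in for the tips dataset (column names are assumptions)
tips_data = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
    "tip":        [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
    "gender":     ["Female", "Male", "Male", "Male", "Female", "Male", "Female", "Male"],
})

# hue assigns a different colour to each category of the chosen column
g = sns.relplot(x = "total_bill", y = "tip", hue = "gender", data = tips_data)
```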
Important note
We can relabel the x and y axes and also give the plot a title by using matplotlib
functions as follows:
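A minimal sketch of relabelling, using made-up data and hypothetical label text:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up rows standing in for the tips dataset (column names are assumptions)
tips_data = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
    "tip":        [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
})

sns.relplot(x = "total_bill", y = "tip", data = tips_data)

# These matplotlib functions act on the current axes, i.e. the plot just drawn
plt.xlabel("Total bill")
plt.ylabel("Tip")
plt.title("Tip against total bill");
```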
Important note
We always put a semicolon (;) at the end of the last line of code to suppress the text that would
otherwise be printed before the visual is shown. For example:
Seaborn makes it easy to add a regression line to a scatter plot with the lmplot() function. For
example,
You can remove the confidence interval around the regression line by setting ci = None in the
sns.lmplot() function.
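A short sketch of sns.lmplot() with the confidence band switched off, again on made-up data with assumed column names:

```python
import pandas as pd
import seaborn as sns

# Made-up rows standing in for the tips dataset (column names are assumptions)
tips_data = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
    "tip":        [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
})

# Scatter plot plus a fitted straight line;
# ci = None drops the shaded confidence band around the line
g = sns.lmplot(x = "total_bill", y = "tip", data = tips_data, ci = None)
```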
Line chart
With some datasets, you may want to understand changes in one variable as a function of time.
In this situation, a good choice is to draw a line plot. In seaborn, this can be accomplished by
using the sns.lineplot() function. For example, let's import the Nigeria monthly COVID-19 cases
dataset.
Pair Plot
In a pair plot, each variable is plotted against every other variable in the same data rows.
That is, it draws a plot for every pairwise combination of the variables in the dataset. This plot
can only be drawn for numerical data. It can be plotted by using sns.pairplot() on the DataFrame:
sns.pairplot(tips_data);
Joint Plot
A joint plot helps to learn about the relationship between 2 numeric variables. It displays the
relationship (correlation) between the two variables in the centre and the univariate
distribution of each variable in the margins. You can plot a joint plot as:
Correlation Matrices
We can plot correlation matrices using a heatmap. A heatmap helps us find the
correlation (interrelation) between every pair of continuous variables in our dataset.
The basic requirement for finding correlation is that the variables should be numeric (continuous),
i.e. the data type must be int or float.
Correlation cannot be computed for categorical features because they are of object (or string)
data type. The values in a correlation matrix range from -1 to +1; by default this is the
Pearson correlation. Therefore, to find the correlation you can use:
tips_data.corr(numeric_only = True)
As you can see, we get only 3 features because only these are numerical; the rest
are categorical.
sns.heatmap(tips_data.corr(numeric_only = True));
Histogram Plot
A hist plot creates a histogram, that is, a frequency distribution of a continuous variable. It
can be created by using the sns.histplot() function on the variable of choice:
We can also show the distribution of a continuous variable using a box plot. The box plot shows
the quartile values of the distribution, computed from the actual observations in the data. It
also shows outliers. You can draw a box plot as:
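A minimal sketch of a vertical box plot of a single variable, on made-up data with an assumed column name:

```python
import pandas as pd
import seaborn as sns

# Made-up rows standing in for the tips dataset (column name is an assumption)
tips_data = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
})

# Passing the variable as y draws the box vertically
ax = sns.boxplot(y = "total_bill", data = tips_data)
```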
The distribution of total bill is roughly symmetric, i.e. it is not skewed. It looks approximately
normally distributed. The points above the upper whisker show some extreme values (total bills
that are abnormal), known as outliers.
Important note
We can change the orientation of a plot in seaborn, from vertical to horizontal or vice versa, by
swapping the variables assigned to the x and y axes. For example:
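The sketch below draws the same box plot in both orientations side by side. The data and column name are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up rows standing in for the tips dataset (column name is an assumption)
tips_data = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
})

# Two axes side by side so both orientations can be compared
fig, (ax_left, ax_right) = plt.subplots(1, 2)

sns.boxplot(y = "total_bill", data = tips_data, ax = ax_left)    # vertical
sns.boxplot(x = "total_bill", data = tips_data, ax = ax_right);  # horizontal
```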
Count Plot
It shows the counts of observations in each category using bars. It differs from a histogram
in that there is a gap (space) between the bars, and the count (length of each bar) is the
number of observations in that category.
As you can see, it plots as many bars as there are categories in the variable. We can use
.value_counts() to confirm that.
tips_data["day"].value_counts()
Important note
We can arrange the bars in the correct order of the week, i.e. from Monday to Sunday, using the
order parameter.
sns.countplot(x = "day", order = ["Mon", "Tues", "Wed", "Thur", "Fri", "Sat", "Sun"], data = tips_data)
plt.xlabel("Day of the week")
plt.ylabel("Number of visitors");
As you can see, more visitors came to the restaurant on Saturday than on the other days of
the week.
We can also compare day of the week and gender using the hue parameter as follows:
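A minimal sketch of a count plot split by a second categorical variable. The rows are made up, and the column names day and gender are assumptions about the dataset.

```python
import pandas as pd
import seaborn as sns

# Made-up rows standing in for the tips dataset (column names are assumptions)
tips_data = pd.DataFrame({
    "day":    ["Fri", "Sat", "Sat", "Sun", "Sat", "Sun", "Fri", "Sat"],
    "gender": ["Male", "Female", "Male", "Male", "Female", "Female", "Female", "Male"],
})

# One group of bars per day, one coloured bar per gender within each group
ax = sns.countplot(x = "day", hue = "gender", data = tips_data)

# The same counts as a table, useful for checking the bar heights
counts = pd.crosstab(tips_data["day"], tips_data["gender"])
print(counts)
```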
For both male and female, the most common size of the party in the restaurant is 2.
Bar Plot
A bar plot works like a count plot, but here we have to specify both x and y. For each category
of one feature (x), it displays a summary value (by default the mean) of the other feature (y).
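The sketch below draws a bar plot and computes the group means separately to show what the bar heights represent. The data are made up, and the column names smoker and total_bill are assumptions.

```python
import pandas as pd
import seaborn as sns

# Made-up rows standing in for the tips dataset (column names are assumptions)
tips_data = pd.DataFrame({
    "smoker":     ["Yes", "No", "No", "Yes", "No", "Yes", "No", "Yes"],
    "total_bill": [20, 10, 12, 24, 14, 22, 16, 18],
})

# One bar per category of x; the bar height is the mean of y in that category
ax = sns.barplot(x = "smoker", y = "total_bill", data = tips_data)

# The same means computed directly with pandas
means = tips_data.groupby("smoker")["total_bill"].mean()
print(means)
```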
On average, those who smoked paid a slightly higher bill than those who did not smoke during
their stay in the restaurant.
Important note
You can remove the error bars by setting ci = None in sns.barplot() (in seaborn 0.12 and later,
the equivalent setting is errorbar = None).
On average, those who did not smoke gave more tips than those who smoked.
We can also compare smoker and time of the day using the hue parameter as follows:
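A minimal sketch of a grouped bar plot. The data are made up (so the resulting pattern is not the one from the course dataset), and the column names smoker, time and tip are assumptions.

```python
import pandas as pd
import seaborn as sns

# Made-up rows standing in for the tips dataset (column names are assumptions)
tips_data = pd.DataFrame({
    "smoker": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "time":   ["Lunch", "Lunch", "Dinner", "Dinner", "Lunch", "Lunch", "Dinner", "Dinner"],
    "tip":    [2.0, 3.0, 3.0, 4.0, 1.0, 2.0, 4.0, 5.0],
})

# One group of bars per smoker category, one coloured bar per time of day
ax = sns.barplot(x = "smoker", y = "tip", hue = "time", data = tips_data)

# Mean tip for every (smoker, time) combination, matching the bar heights
means = tips_data.groupby(["smoker", "time"])["tip"].mean()
print(means)
```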
It seems that people give more tips at dinner than at lunch time, irrespective of whether they
smoke or not at the restaurant.
Pie Chart
A pie chart is a circular statistical graphic, which is divided into slices to illustrate numerical
proportion. Seaborn does not currently support pie charts, so we will use the .plot() method of
Pandas to draw one.
tips_data["gender"].value_counts().plot(kind = "pie");
To highlight the first and fourth values in the size of the party, use the explode parameter and
put a non-zero value in those positions.
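The explode parameter can be sketched as follows. The party sizes are made up, the column name size is an assumption, and the offset 0.1 is an arbitrary illustrative choice.

```python
import pandas as pd

# Made-up party sizes standing in for the tips dataset (column name is an assumption)
tips_data = pd.DataFrame({"size": [2, 2, 3, 4, 2, 1, 3, 2, 5, 4, 2, 3]})

# Number of observations for each party size, most frequent first
size_counts = tips_data["size"].value_counts()

# One offset per slice: a non-zero value pulls that slice away from the centre.
# Positions 0 and 3 are the first and fourth slices.
explode = [0.1 if i in (0, 3) else 0 for i in range(len(size_counts))]

ax = size_counts.plot(kind = "pie", explode = explode);
```

The explode list must have exactly one entry per slice, which is why it is built from the length of size_counts.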
Box Plot
A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that
facilitates comparisons between variables or across levels of a categorical variable.
Violin Plot
Violin plots take boxplots one step further by showing the kernel density distribution within each
category. You can plot violin plot as:
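A minimal sketch of sns.violinplot(), on made-up data with assumed column names day and total_bill:

```python
import pandas as pd
import seaborn as sns

# Made-up rows standing in for the tips dataset (column names are assumptions)
tips_data = pd.DataFrame({
    "day":        ["Sat", "Sat", "Sat", "Sat", "Sun", "Sun", "Sun", "Sun"],
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
})

# One violin per category of x; its width shows the estimated density of y
ax = sns.violinplot(x = "day", y = "total_bill", data = tips_data)
```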
We can save or export a plot by using the plt.savefig() function. First, we need to create our plot.
For example, visualize the size of the party and then use plt.savefig() to export the plot as a PNG
file, i.e. save it as size.png:
plt.ylabel("Number of attendees")
plt.savefig("size.png")
Pilot answer 2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Pilot answer 3
Solution 1
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Solution 2
penguins = pd.read_csv("activity_datasets/penguins.csv")
Solution 3
penguins.describe()
Solution 4
sns.boxplot(x = "bill_length_mm", data = penguins);
or
Solution 5
sns.histplot(penguins["body_mass_g"], kde = True);
Pilot answer 4
Solution 1
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Solution 2
penguins = pd.read_csv("activity_datasets/penguins.csv")
Solution 3
sns.boxplot(x = "species", y = "bill_length_mm", data = penguins)
Solution 4
sns.boxplot(x = "species", y = "bill_length_mm", data = penguins, order = ["Adelie", "Chinstrap", "Gentoo"])
Solution 5
sns.boxplot(x = "species", y = "bill_length_mm", data = penguins, order = ["Adelie", "Chinstrap", "Gentoo"], hue = "sex");
Solution 6
sns.lmplot(x = "bill_length_mm", y = "flipper_length_mm", data = penguins, ci = None);
Solution 7
sns.pairplot(penguins);
1. Data visualization is the graphical representation of data using visual elements such as charts,
infographics, and maps to help understand the data.
2. Before you visualize your dataset, you need to answer five (5) questions in section 4.1
3. We have different visualization designs or techniques such as bar chart, pie chart,
histogram, scatter diagram, etc.
4. Some visuals are good to represent categorical data e.g. bar chart while some are good for
continuous data e.g. histogram, scatter diagram, etc.
5. Scatter diagram is used to visualize the relationship between two continuous variables.
Additional resources
For additional resources on data visualization, check the following:
https://datagy.io/python-seaborn/
http://bit.ly/Introduction-to-seaborn-YouTube-video
https://seaborn.pydata.org/tutorial/function_overview.html
https://seaborn.pydata.org/tutorial/relational.html
https://seaborn.pydata.org/tutorial/categorical.html