L5 6 DataViz
Data Visualization
Practice
2
Contents
• Lecture 1: Overview of Data Science
• Lecture 2: Data crawling and preprocessing
• Lecture 3: Data cleaning and integration
• Lecture 4: Exploratory data analysis
• Lecture 5: Data visualization
• Lecture 6: Multivariate data visualization
• Lecture 7: Machine learning
• Lecture 8: Big data analysis
• Lecture 9: Capstone Project guidance
• Lecture 10+11: Text, image, graph analysis
• Lecture 12: Evaluation of analysis results
3
This lecture
1. How to choose the right chart?
2. Bar Chart – Column Chart
3. Line Chart
4. Histogram
5. Scatter Plot
6. Violin
7. Other charts
8. Multivariable Visualization
4
1. How to choose the right chart?
• Data visualization is a technique to communicate
insights from data through visual representation
• Main goal: to distill large datasets into visual
graphics that allow a straightforward understanding of
complex relationships within the data
• It is important to choose the right chart for visualizing
your data
5
What story do you want to tell?
• It is important to understand why we need a particular
kind of chart
• Graphs
• Plots
• Maps
• Diagrams
• ...
• Relationship
• Data over time
• Ranking
• Distribution
• Comparison
6
Relationship
• To display a connection or correlation between two or
more variables
• When assessing a relationship between data sets, we
are trying to understand how these data sets combine
and interact with each other
• The relationship or correlation can be positive or
negative
• Whether or not the variables might be supportive or working
against each other
7
Relationship
• Scatter plot
• Histogram
• Pair Plot
• Heat map
8
Data over time
• Goal: to explore the relationship between variables to
find trends or changes over time
• Date/time acts as the link between the variables,
so this is a special kind of relationship
• Line chart
• Area chart
• Stacked area chart
• Unstacked area chart
9
Ranking
• Goal: to display the relative order of data values
• Vertical bar chart (column chart)
• Horizontal bar chart
• Multi-set bar chart
• Stacked bar chart
• Lollipop Chart
10
Distribution
• Goal: to see how a variable is distributed
• Histogram
• Density Curve with Histogram
• Density plot
• Box plot
• Strip plot
• Violin Plot
• Population Pyramid
11
Comparison
• Goal: to display the trends between multiple variables in
a dataset or multiple categories within a single variable
• Bubble chart
• Bullet chart
• Pie chart
• Net pie chart
• Donut chart
• TreeMap
• Diverging bar
• Choropleth map
• Bubble map
12
2. Bar/Column Chart
• A series of bars illustrating a variable’s development
• 4 types of bar charts:
• Horizontal bar chart
• Vertical bar chart
• Grouped bar chart
• Stacked bar chart
• This kind of chart is appropriate when we want to
track the development of one or two variables over
time
• One axis shows the specific categories being
compared (independent variable)
• The other axis represents a measured value
(dependent variable)
13
Vertical Bar Chart (Column Chart)
• Distinguish it from a histogram:
• it does not display a continuous development over an interval
• it shows discrete data
• the data is categorical and answers the question of how
many items fall in each category
• Used to compare several items in a specific range of
values
• Ideal for comparing a single category of data between
individual sub-items
14
Vertical Bar Chart (Column Chart)
• Benefits from both position (top of bar) and length (size of bar)
• Quantitative dependent variable on the value axis
• Discrete/nominal independent variable on the category axis
15
Vertical Bar Chart (Column Chart)
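import numpy as np
import matplotlib.pyplot as plt
# Sample data (same values as on the line chart slides)
linear_data = np.array([1, 2, 3, 4, 5, 6, 7, 8])
exponential_data = linear_data**2
xvals = range(len(linear_data))
plt.bar(xvals, linear_data, width=0.3)
# Shift the second series by one bar width so the bars sit side by side
exp_xvals = []
for item in xvals:
    exp_xvals.append(item + 0.3)
plt.bar(exp_xvals, exponential_data, width=0.3, color='r')
plt.show()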
16
Vertical Bar Chart (Column Chart)
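import numpy as np
import matplotlib.pyplot as plt
# Sample data (same values as on the line chart slides)
linear_data = np.array([1, 2, 3, 4, 5, 6, 7, 8])
exponential_data = linear_data**2
xvals = np.arange(len(linear_data))
exp_xvals = xvals + 0.3  # second series shifted by one bar width
fig, ax = plt.subplots()
ax.bar(xvals, linear_data, width=0.3)
ax.bar(exp_xvals, exponential_data, width=0.3, color='r')
ax.legend(['Linear data', 'Exponential data'])
# Center each tick between the two bars of a pair
ax.set_xticks(xvals + 0.3 / 2)
ax.set_xticklabels(xvals)
plt.show()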
17
Horizontal Bar Chart
• Represent the data horizontally
• The data categories are shown on the y-axis
• The data values are shown on the x-axis
• The length of each bar is proportional to the value
corresponding to the data category
• All bars go across from left to right
• Use the barh() function
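A minimal sketch of a horizontal bar chart with barh(). The category names and values below are made up for illustration and do not come from the lecture data.

```python
import matplotlib.pyplot as plt

# Hypothetical categories and values, for illustration only
categories = ["A", "B", "C", "D"]
values = [23, 17, 35, 29]

fig, ax = plt.subplots()
# barh(y, width): categories go on the y-axis, bar length encodes the value
bars = ax.barh(categories, values, height=0.5)
ax.set_xlabel("Value")
ax.set_ylabel("Category")
plt.show()
```

Compared with bar(), only the roles of the axes are swapped, which tends to help when category labels are long.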
18
Stacked Bar Chart
• Stacked bar charts segment their bars
• Used to show how a broader category is divided into
smaller categories
• The relationship of each part to the total amount is
also shown
• Place each value for the segment after the previous
one
• The total value of the bar chart is all the segment
values added together
• Ideal for comparing the total amount across each
group/segmented bar
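One way to build a stacked bar chart in Matplotlib is to draw each segment with bar() and pass the running total as bottom. The sketch below assumes three made-up segments per group; the names and numbers are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical counts per group, for illustration only
groups = ["G1", "G2", "G3"]
part_a = np.array([5, 3, 4])
part_b = np.array([2, 6, 1])
part_c = np.array([3, 1, 5])

fig, ax = plt.subplots()
ax.bar(groups, part_a, label="Part A")
# bottom= starts each new segment where the previous one ended
ax.bar(groups, part_b, bottom=part_a, label="Part B")
ax.bar(groups, part_c, bottom=part_a + part_b, label="Part C")
ax.legend()

# The full bar height is the sum of its segments
totals = part_a + part_b + part_c
plt.show()
```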
19
Stacked Bar Chart
20
Stacked Bar Chart
21
3. Line Chart
• Line charts are used to display quantitative values over
a continuous interval or period
• Drawn by first plotting data points on a cartesian
coordinate grid and then connecting them
• Y-axis has a quantitative value
• X-axis is a timescale or a sequence of intervals
• Best for continuous data
• Most frequently used to show trends and analyze how
the data has changed over time
22
Line charts
Quantitative continuous
independent variable
23
Line chart (pylab vs pyplot)
# pylab version: the star import puts NumPy and pyplot names in one namespace
# (Matplotlib discourages this style)
from pylab import *
t = arange(0.0, 2.0, 0.01)
s = sin(2.5*pi*t)
plot(t,s)
xlabel('time (s)')
ylabel('voltage (mV)')
title('Sine Wave')
grid(True)
show()
# pyplot version: explicit namespaces
import numpy as np
import matplotlib.pyplot as plt
t = np.arange(0.0, 2.0, 0.01)
s = np.sin(2.5*np.pi*t)
plt.plot(t,s)
plt.xlabel('time (s)')
plt.ylabel('voltage (mV)')
plt.title('Sine Wave')
plt.grid(True)
plt.show()
24
Line chart (cont.)
import numpy as np
import matplotlib.pyplot as plt
linear_data = np.array([1, 2, 3, 4, 5, 6, 7, 8])
exponential_data = linear_data**2
plt.plot(linear_data, '-o', exponential_data, '-o')
plt.show()
25
Line chart (cont.)
import numpy as np
import matplotlib.pyplot as plt
linear_data = np.array([1, 2, 3, 4, 5, 6, 7, 8])
exponential_data = linear_data**2
plt.plot(linear_data, '-o', exponential_data, '-o')
# Shade the area between the two curves
plt.gca().fill_between(range(len(linear_data)),
                       linear_data, exponential_data,
                       facecolor='blue', alpha=0.25)
plt.show()
26
Area Chart
• Built on top of the line chart
• The area between the x-axis and the line is filled in
with color or shading
• Ideal for clearly illustrating the magnitude of change
between two or more data points
• Use the stackplot() function
• Or simply fill the area between two lines with color (fill_between())
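A minimal sketch of a stacked area chart with stackplot(). The three series are invented for illustration; any equally long sequences would do.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical yearly values for three series, for illustration only
years = np.arange(2015, 2021)
series_a = np.array([1, 2, 3, 4, 5, 6])
series_b = np.array([2, 2, 3, 3, 4, 4])
series_c = np.array([1, 1, 2, 2, 3, 3])

fig, ax = plt.subplots()
# stackplot() draws each series on top of the running total of the previous ones
layers = ax.stackplot(years, series_a, series_b, series_c,
                      labels=["A", "B", "C"], alpha=0.6)
ax.legend(loc="upper left")
ax.set_xlabel("Year")
plt.show()
```

For an unstacked area chart, calling fill_between() once per series with a low alpha is a common alternative.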
27
Area Chart
28
4. Histogram
• A histogram is an approximate representation of the
distribution of numerical data
• It estimates the probability distribution of a
continuous variable
• To construct a histogram, bin the range of values:
• Divide the entire range of values into a series of intervals
• Count how many values fall into each interval
• Bins are usually specified as consecutive, non-overlapping
intervals of the variable
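The binning and counting steps can be sketched with NumPy alone. The sample values and the choice of four bins are assumptions made for this illustration.

```python
import numpy as np

# Hypothetical sample, for illustration only
values = np.array([1.2, 1.9, 2.1, 2.5, 3.3, 3.8, 4.0, 4.4, 4.9, 5.0])

# Divide the range [1, 5] into 4 consecutive, non-overlapping bins
edges = np.linspace(1.0, 5.0, 5)   # bin edges: 1, 2, 3, 4, 5

# Count how many values fall into each bin
# (every bin is half-open [a, b) except the last, which includes its right edge)
counts, _ = np.histogram(values, bins=edges)
print(counts)
```

Passing the same values and edges to plt.hist() draws these counts as bars.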
29
Histogram example
30
Histogram example
import numpy as np
import matplotlib.pyplot as plt
plt.show()
31
Histogram example
import numpy as np
import matplotlib.pyplot as plt
plt.show()
32
Histogram example
33
5. Scatter plot
• A kind of chart that is often used in statistics and data
science
• It consists of multiple data points plotted across two
axes
• Each variable has many observations; each point pairs
an x value with a y value
• Used to identify the relationship between the
variables (e.g., correlation, trend patterns)
• In machine learning, scatter plots are often used in
regression, where x and y are continuous variables
• Also used to inspect clusters and to detect
outliers
34
Practice manipulating data with pandas and seaborn
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
iris = pd.read_csv("../input/Iris.csv")  # same file as on the following slides
iris.head()
35
Practice manipulating data with pandas and seaborn
36
Use scatter plot for Iris data
• Plot two variables: SepalLengthCm and SepalWidthCm
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
iris = pd.read_csv("../input/Iris.csv")
iris.head()
iris["Species"].value_counts()
iris.plot(kind="scatter", x="SepalLengthCm", y="SepalWidthCm")
plt.show()
37
Use scatter plot for Iris data
• Display color for each kind of Iris
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
iris = pd.read_csv("../input/Iris.csv")
iris.head()
iris["Species"].value_counts()
# Map each species name to a color code
col = iris['Species'].map({"Iris-setosa": 'r',
                           "Iris-virginica": 'g',
                           "Iris-versicolor": 'b'})
iris.plot(kind="scatter", x="SepalLengthCm", y="SepalWidthCm", c=col)
plt.show()
38
Marginal Histogram
• Histograms added to the margin of each axis of a
scatter plot for analyzing the distribution of each
measure
• Assess the relationship between two variables and
examine their distributions
39
Marginal Histogram
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
iris = pd.read_csv("../input/Iris.csv")
iris.head()
iris["Species"].value_counts()
# height sets the figure size (this argument was called size before seaborn 0.9)
sns.jointplot(x="SepalLengthCm", y="SepalWidthCm", data=iris, height=5)
plt.show()
40
6. Other kinds of chart
Box Plot
• A Box and Whisker Plot (or
Box Plot) is a convenient
way of visually displaying
the distribution of data
through its quartiles
41
Box Plot
• What a box plot lets you observe:
• The key values, such as the median and the 25th and 75th
percentiles (the mean is not drawn by default)
• Whether there are outliers and what their values are
• Whether the data is symmetrical
• How tightly the data is grouped
• Whether the data is skewed and, if so, in which direction
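The values a box plot draws can be computed directly. The sketch below uses a made-up sample and the common 1.5 × IQR rule for flagging outliers, which is also the default whisker setting in Matplotlib and seaborn.

```python
import numpy as np

# Hypothetical sample with one unusually large value, for illustration only
data = np.array([2, 3, 4, 4, 5, 5, 6, 7, 8, 20])

# Box edges and the line inside the box
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Points farther than 1.5 * IQR from the box are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(q1, median, q3, outliers)
```

A median that sits off-center inside the box, or whiskers of unequal length, is the visual hint of skew mentioned above.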
42
Box Plot
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
iris = pd.read_csv("../input/Iris.csv")
iris.head()
sns.boxplot(x="Species", y="PetalLengthCm", data=iris)
plt.show()
43
Box Plot
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
iris = pd.read_csv("../input/Iris.csv")
iris.head()
ax = sns.boxplot(x="Species", y="PetalLengthCm", data=iris)
# Overlay the individual observations on the boxes
ax = sns.stripplot(x="Species", y="PetalLengthCm", data=iris,
                   jitter=True, edgecolor="gray")
plt.show()
44
Violin Plot
• Combines a box plot with a kernel density plot
• Shows the same information as a box plot
45
Violin Plot
• Shows the entire distribution of the data
46
Violin Plot
• Histogram shows the symmetric shape of the distribution
47
Violin Plot
• The kernel density plot used for creating the violin plot is
the same as the one added on top of the histogram
48
Violin Plot
• Wider sections of the violin plot represent a higher
probability of observations taking a given value
• The thinner sections correspond to a lower probability.
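To make the link between density and violin width concrete, here is a small hand-written Gaussian kernel density estimate. The sample, the grid, and the bandwidth of 0.5 are all assumptions chosen for illustration; seaborn picks its own bandwidth.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    diffs = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * diffs**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Hypothetical sample: most values near 5, a few near 9
samples = np.array([4.6, 4.8, 5.0, 5.1, 5.2, 5.4, 8.8, 9.1])
grid = np.linspace(2, 12, 201)
density = gaussian_kde_1d(samples, grid, bandwidth=0.5)  # bandwidth chosen by hand

# The violin is widest where the estimated density is highest
widest_at = grid[np.argmax(density)]
print(round(float(widest_at), 2))
```

Mirroring the density curve around a vertical axis gives the violin outline.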
49
Violin Plot of Iris data
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
iris = pd.read_csv("../input/Iris.csv")
iris.head()
sns.violinplot(x="Species", y="PetalLengthCm", data=iris)
plt.show()
50
Regression Plot
• Fits a regression line between two variables and
helps to visualize their linear relationship
• Example: the tips data set that ships with seaborn contains
information about:
• the total bill and the tip left by customers of a restaurant
• the sex of the customer, whether they smoke, the day, the time
• Use seaborn’s lmplot() function to create a regression
plot
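The straight line in a regression plot is, by default, an ordinary least squares fit. The sketch below reproduces that fit with NumPy on a handful of invented (bill, tip) pairs; these numbers are hypothetical and are not the seaborn tips data.

```python
import numpy as np

# Hypothetical (bill, tip) pairs, for illustration only
total_bill = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 40.0])
tip = np.array([1.5, 2.5, 3.0, 3.5, 4.5, 6.0])

# Ordinary least squares fit of tip = slope * total_bill + intercept
slope, intercept = np.polyfit(total_bill, tip, deg=1)
predicted = slope * total_bill + intercept
print(f"tip = {slope:.3f} * total_bill + {intercept:.3f}")
```

With seaborn, the equivalent plot on the real data would be along the lines of tips = sns.load_dataset("tips") followed by sns.lmplot(x="total_bill", y="tip", data=tips); adding hue="sex" draws one line per category.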
51
Regression Plot example
52
Regression Plot Example
53
Regression Plot Example
Distinguish two
categories by sex
54
Heatmaps
• The underlying idea: replace numbers with colors
• The goal of heatmaps is to provide a colored visual
summary of information
• Heatmaps are useful for cross-examining multivariate
data, through placing variables in rows and columns
and coloring cells within the table
• All the rows are one category (labels displayed on the
left side)
• All the columns are another category (labels displayed
on the bottom)
• Data in a cell demonstrates the relationship between
two variables in the connecting row and column
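A frequent use of this layout is a correlation heatmap, where rows and columns are the same variables and each cell holds their correlation. The sketch below uses plain Matplotlib and a small invented table; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical measurements, for illustration only
df = pd.DataFrame({
    "length": [5.1, 4.9, 6.3, 5.8, 7.1, 6.5],
    "width":  [3.5, 3.0, 3.3, 2.7, 3.0, 3.0],
    "weight": [1.4, 1.4, 6.0, 5.1, 5.9, 5.8],
})

corr = df.corr()  # pairwise Pearson correlation, one row and one column per variable

fig, ax = plt.subplots()
# Each cell's color replaces the number it stands for
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(image, ax=ax, label="correlation")
plt.show()
```

seaborn's heatmap(corr, annot=True) produces a similar figure in one call and prints the value inside each cell.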
55
Heatmap Example
56
Heatmap with seaborn
57
Heatmap with seaborn
58
Graphs
edge
node
59
Graphs
edge
node
60
Directed Graphs and Hierarchies
• Directed vs Undirected
• Cyclic vs acyclic
• Tree
• Minimally connected
• n nodes, n-1 edges
• Single parent node can
have multiple child
nodes
• Hierarchy
• Acyclic directed graph
• Having a root node
61
Node Degree
• Degree of a node =
number of edges
• Directed graph nodes
have an in-degree and
an out-degree
• Social Networks
• Many low degree
nodes and fewer high
degree nodes
• Also called power-law
or scale-free graphs
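Degrees are easy to compute from an edge list. The small directed graph below is invented for illustration.

```python
from collections import Counter

# Hypothetical directed edges (source, target), for illustration only
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")]

out_degree = Counter(src for src, _ in edges)   # edges leaving each node
in_degree = Counter(dst for _, dst in edges)    # edges arriving at each node

# Ignoring direction, the degree is the number of edges touching the node
nodes = sorted(set(out_degree) | set(in_degree))
degree = {n: in_degree[n] + out_degree[n] for n in nodes}
print(degree)
```

Since every edge touches two endpoints, the degrees always sum to twice the number of edges, which is a handy sanity check.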
62
Graph Visualization
• For visualizing more abstract and non-quantitative data
• For example:
• The relationship/contacts of individuals in a population (also
called network of contacts)
• The hierarchical structure of classes in a module
• Matplotlib has no built-in support for this kind of visualization
63
Roassal: an agile visualization tool
• Roassal is a DSL, written in Smalltalk and integrated in
Pharo/Moose – an open source platform for software
and data analysis
• Installing from: http://www.moosetechnology.org
64
Hierarchy
| b |
b := RTMondrian new.
b shape circle size: 30.
b nodes: RTShape withAllSubclasses.
b shape arrowedLine withShorterDistanceAttachPoint.
b edgesFrom: #superclass.
b layout forceWithCharge: -500.
b build.
^ b view
65
Network structure
| b lb |
b := RTMondrian new.
b shape circle color: (Color red alpha: 0.4).
b nodes: Collection withAllSubclasses.
b edges connectFrom: #superclass.
b shape
bezierLineFollowing: #superclass;
color: (Color blue alpha: 0.1).
b edges
notUseInLayout;
connectToAll: #dependentClasses.
b normalizer normalizeSize: #numberOfMethods min: 5 max: 50.
b layout force.
b build.
lb := RTLegendBuilder new.
lb view: b view.
lb addText: 'Circle = classes, size = number of methods; gray links = inheritance;'.
lb addText: 'blue links = dependencies; layout = force based layout on the inheritance links'.
lb build.
^ b view @ RTZoomableView
66
Tree Map
67
Tree map layout
1 2 1 1 1 2 2 1 1 1 1 1 1
68
Tree map layout
16
11 5
4 4 3 3 2
1 2 1 1 1 2 2 1 1 1 1 1 1
69
Tree map layout
16
11 5
4 4 3 3 2
1 2 1 1 1 2 2 1 1 1 1 1 1
Treemap split of the root: 11/16 | 5/16
70
Tree map layout
16
11 5
4 4 3 3 2
1 2 1 1 1 2 2 1 1 1 1 1 1
Treemap splits: 11/16 and 5/16 of the root; 4/11 of the 11-block; 3/5 of the 5-block
71
Tree map layout
16
11 5
4 4 3 3 2
1 2 1 1 1 2 2 1 1 1 1 1 1
Treemap splits: 11/16 and 5/16 of the root; 4/11 of the 11-block; 3/5 of the 5-block; 1/4, 2/4 and 1/3 at the leaf level
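The fractions on these slides are each child's weight divided by its parent's weight. A short sketch that recomputes them; the grouping of the leaves under each parent is inferred from the sums on the slides and should be read as an assumption.

```python
from fractions import Fraction

# The tree from the slides: an inner node is a list of children, a leaf is an int
tree = [
    [[1, 2, 1], [1, 1, 2], [2, 1]],   # 4 + 4 + 3 = 11
    [[1, 1, 1], [1, 1]],              # 3 + 2 = 5
]

def weight(node):
    """Weight of a node: a leaf's own value, or the sum of its children."""
    return node if isinstance(node, int) else sum(weight(c) for c in node)

def shares(node):
    """Fraction of the parent's rectangle given to each child."""
    total = weight(node)
    return [Fraction(weight(c), total) for c in node]

print(weight(tree))      # total weight of the root
print(shares(tree))      # split of the root rectangle
print(shares(tree[0]))   # split of the first child's rectangle
```

A treemap layout applies these shares recursively, typically alternating between horizontal and vertical cuts at each level.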
72
8. Multivariable Visualization
• For data tables with n>3
variables: parallel
coordinates
• Each vertical line
corresponds to a variable
• A point in n-dimensional
space is represented as
a polyline with vertices on
the parallel axes
• the position of the vertex
on the i-th axis
corresponds to the value of
the i-th attribute for this
record
• It might be interesting to try
different axis arrangements
73
Parallel Plot
• Parallel coordinates plots allow comparing the features
of several individual observations on a set of numerical
variables
• Each vertical bar represents a variable and usually has
its own scale
• Values are plotted as series of lines connected across
each axis
• Color can be used to represent different groups of
individuals or highlight a specific one
• They make it easy to compare variation between adjacent axes
• Changing the order can lead to the discovery of new patterns
in the dataset
74
Parallel plot with pandas for Iris data
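A minimal sketch using pandas.plotting.parallel_coordinates. To stay self-contained it builds a six-row table typed in by hand in the shape of the Iris file; in the lecture setting the DataFrame would instead come from pd.read_csv("../input/Iris.csv").

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# A few hand-typed rows in the shape of the Iris table, for illustration only
iris_sample = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "SepalWidthCm":  [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "PetalLengthCm": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "PetalWidthCm":  [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
                "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# One vertical axis per numeric column, one polyline per row, colored by Species
ax = parallel_coordinates(iris_sample, "Species", colormap="viridis")
ax.set_ylabel("cm")
plt.show()
```

Note that this pandas function draws all axes on one shared scale. If the full CSV contains an identifier column (assumed here to be named Id), it is probably best dropped first, since it would otherwise be plotted as a variable.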
75
Practice
76
CitiesExt.csv
• Ten countries with the highest population, bar chart
showing populations
• Pie chart showing relative number of cities with
negative longitude and positive longitude. Label the
two slices “west” for west of the Prime Meridian
(negative longitude), and “east” for east of the Prime
Meridian (positive longitude)
• Is there any relationship between the latitude of
cities in a country (x-axis) and the population of that
country (y-axis)? (scatter plot)
77
PlayersExt.csv
• Create a bar chart showing the average number of minutes
played by players in each of the four positions.
• Create a stacked bar chart for teams that played more than
4 games, showing their number of wins, draws, and losses.
• Create a pie chart showing the relative percentage of teams
with 0, 1, and 2 red cards. Note: the pie should have three
slices.
• Create a scatterplot of players showing passes (y-axis)
versus minutes (x-axis). (Why are there some lines of dots?)
• Create a map of countries colored light to dark blue based
on how many goals their team made (“goalsFor”).
• Create a pie chart showing the relative percentage of
players making <= 0.25 passes per minute, >= 0.5 passes
per minute, and between 0.25 and 0.5.
78
Thank you
for your
attention!!!
79