Direktorat SITP DJP Kemenkeu RI
Data Science Training, July 2022
EDA & Data Visualization
Instructor: Siti Aminah, M.Kom
Faculty of Computer Science, Universitas Indonesia
Outline
Exploratory Data Analysis
Descriptive Statistics
Data Visualization
Data Visualization Principles
Basic Visualization Tools
Specialized Visualization Tools
Advanced Visualization Tools
CRISP-DM: Cross-industry standard process for data mining
Kenneth Jensen, CC BY-SA 3.0, via Wikimedia Commons
Exploratory Data Analysis
EDA: Take a peek at data
• EDA is a term for the initial analysis
performed on a dataset.
• It's basically taking a peek at the
data to understand more about
what it represents and how to use
it.
• It's often a precursor to more
advanced data analytics
techniques.
Exploratory data analysis (EDA)
Exploratory data analysis (EDA) is an approach:
• to analyzing datasets
• by summarizing their main characteristics
• often with visual methods.
The term EDA was coined by John W. Tukey in the book
"Exploratory Data Analysis" in 1977.
Why EDA?
• We need to familiarize ourselves with a new dataset: What does it look like?
• How many attributes, and of what kind?
• Are there any missing values?
• How are the values distributed?
• Is our dataset imbalanced? (If left untreated, our model can be biased.)
• Hunting for something interesting: What catches your eyes?
• Are there any outliers?
• Are there any correlations between attributes?
• How do the distributions compare between different samples?
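A few of the questions above can be answered with very little code. The following is a minimal sketch using only the Python standard library; the `records` list, its attribute names, and its values are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical toy dataset: one dict per row, None marks a missing value.
records = [
    {"age": 34, "income": 5200, "label": "yes"},
    {"age": 51, "income": None, "label": "yes"},
    {"age": 29, "income": 4100, "label": "yes"},
    {"age": None, "income": 8800, "label": "no"},
    {"age": 45, "income": 6100, "label": "yes"},
]

# How many attributes, and of what kind?
attributes = list(records[0].keys())
kinds = {}
for name in attributes:
    present = [r[name] for r in records if r[name] is not None]
    kinds[name] = type(present[0]).__name__

# Are there any missing values?
missing = {name: sum(r[name] is None for r in records) for name in attributes}

# Is the dataset imbalanced?
class_counts = Counter(r["label"] for r in records)

print(kinds)
print(missing)
print(class_counts)
```

Here four of the five rows carry the label "yes", which is the kind of imbalance worth noticing before any modelling.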
EDA approach
• Descriptive statistics
  • Central tendency
  • Measure of variation
  • Skewness & kurtosis
  • Correlations
• Data visualizations
  • Single attribute (univariate analysis):
    Bar charts, histograms, pie charts, donut charts
  • Multiple attributes (multivariate analysis):
    Scatter plots, bubble charts, line charts, heat maps
Descriptive Statistics
Know Your Data
• Categorical vs. Numerical
Basic Artificial Intelligence and Data Science | Even Semester 2021/2022 | Faculty of Computer Science, Universitas Indonesia
Know Your Data
• Nominal, Ordinal, Interval, Ratio
Central Tendency
Mean, Median, Mode
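As a quick illustration, the three measures can be computed with Python's standard `statistics` module. This is a minimal sketch; the sample values are the same seven numbers used in this deck's IQR exercise.

```python
import statistics

data = [5, 7, 4, 4, 6, 2, 8]

mean = statistics.mean(data)      # sum of values divided by their count (36 / 7)
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value

print(mean, median, mode)
```

The mean is about 5.14, the median is 5, and the mode is 4, so even on a tiny sample the three measures need not agree.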
Measure of Variation
Range, IQR, Variance, Standard Deviation
Measure of Variation
Range
Range = max - min
The simplest measure of variation, often reported by stating
the smallest and largest values separately.
Measure of Variation
Inter-Quartile Range (IQR)
Divides a dataset into quartiles:
• Q1 (lower quartile): 25th percentile
- Median of lower half
• Q2 (median): 50th percentile
• Q3 (upper quartile): 75th percentile
- Median of upper half
IQR = Q3-Q1
Measure of Variation
Inter-Quartile Range (IQR)
• From the data (n = 7):
5, 7, 4, 4, 6, 2, 8
• Q1 = ?
• Q2 = ?
• Q3 = ?
• IQR = ? Range = ?
Measure of Variation
Inter-Quartile Range (IQR)
• From the data (n = 7):
5, 7, 4, 4, 6, 2, 8 -> Sorted: 2, 4, 4, 5, 6, 7, 8
• Q1 = median of lower half = 4
• Q2 = 5
• Q3 = median of upper half = 7
• IQR = Q3-Q1 = 3 Range = 6
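The worked answer can be checked in code. The sketch below implements the median-of-halves rule described on the slides (when n is odd, the middle value is excluded from both halves); the function name `quartiles` is only illustrative. Other quartile conventions exist and may give slightly different values.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # values below the median position
    upper = ordered[half + n % 2:]  # skip the middle value when n is odd
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

data = [5, 7, 4, 4, 6, 2, 8]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
value_range = max(data) - min(data)

print(q1, q2, q3, iqr, value_range)  # 4 5 7 3 6
```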
Outliers
An outlier is an observation that lies an
abnormal distance from other values in a
random sample from a population.
Before abnormal observations can be
singled out, it is necessary to
characterize normal observations.
Outliers, according to IQR, are data
points whose values are:
• less than Q1-1.5*IQR, or
• more than Q3+1.5*IQR
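The IQR rule translates directly into code. This is a minimal sketch that assumes the median-of-halves quartiles; the helper name `iqr_outliers` and the extra value 20 appended to the sample are hypothetical, added only so that there is something to flag.

```python
import statistics

def iqr_outliers(values):
    """Return (outliers, (lower_fence, upper_fence)) using the 1.5 * IQR rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[half + n % 2:])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in ordered if v < low or v > high]
    return outliers, (low, high)

# Seven sample values plus one hypothetical extreme value (20).
outliers, fences = iqr_outliers([5, 7, 4, 4, 6, 2, 8, 20])

print(fences)    # (-1.25, 12.75)
print(outliers)  # [20]
```

Without the value 20, the same function returns an empty list: none of the original seven values falls outside the fences.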
Variance & Standard Deviation
Variance = average of the squared deviations of the observations from
the mean
Standard deviation s = square root of the variance
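The definition can be followed step by step in code. The sketch below uses illustrative values (not from the slides) and shows both common conventions: dividing by n (population variance, the literal "average") and dividing by n - 1 (sample variance). Which one the slide intends is an assumption left open here.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values with mean 5

mean = statistics.mean(data)
squared_deviations = [(x - mean) ** 2 for x in data]

population_variance = sum(squared_deviations) / len(data)    # divide by n
sample_variance = sum(squared_deviations) / (len(data) - 1)  # divide by n - 1

population_std = population_variance ** 0.5
sample_std = sample_variance ** 0.5

print(population_variance, population_std)  # 4.0 2.0
print(sample_variance, sample_std)
```

The standard library offers the same results through `statistics.pvariance`, `statistics.variance`, `statistics.pstdev`, and `statistics.stdev`.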
Skewness & Kurtosis
• Skewness
A measure of asymmetry
• Kurtosis
A measure of tail heaviness, i.e. how prone the data are to outliers
Skewness
• Skewness is a measure of asymmetry of the data around the mean.
• When a distribution is skewed, the mode remains the most
commonly occurring value, the median remains the middle value in
the distribution, but the mean is generally ‘pulled’ in the direction
of the tails.
Skewness
• Skewness = \frac{1}{n} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^3}{s^3}
• where x_i is each data point, \bar{x} is the arithmetic mean, n
is the size of the data, and s is the standard deviation.
• The skewness for a normal distribution is zero, and any symmetric
data should have skewness near zero. Negative values for the
skewness indicate data that are skewed left and positive values for
the skewness indicate data that are skewed right.
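The sign convention can be checked numerically. This is a minimal sketch of the moment-based skewness (mean cubed deviation divided by s cubed), assuming s is computed with n in the denominator; the sample lists are made up for illustration.

```python
def skewness(values):
    """Moment-based skewness: mean cubed deviation divided by s**3."""
    n = len(values)
    mean = sum(values) / n
    s = (sum((x - mean) ** 2 for x in values) / n) ** 0.5  # std with n in the denominator
    return sum((x - mean) ** 3 for x in values) / n / s ** 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]        # long tail towards large values
left_skewed = [-x for x in right_skewed]  # mirror image: long tail towards small values

print(skewness(symmetric))     # 0.0
print(skewness(right_skewed))  # positive
print(skewness(left_skewed))   # negative
```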
Kurtosis
• High kurtosis indicates heavy tails, which often signals the presence of outliers!
https://www.analyticsvidhya.com/blog/2021/05/shape-of-data-skewness-and-kurtosis/
Kurtosis
• Kurtosis = \frac{1}{n} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^4}{s^4}
• where x_i is each data point, \bar{x} is the arithmetic mean, n
is the size of the data, and s is the standard deviation.
• A normal distribution has kurtosis exactly 3 (mesokurtic).
• A distribution with kurtosis < 3 is called platykurtic.
• A distribution with kurtosis > 3 is called leptokurtic.
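To see how a single extreme value drives kurtosis up, here is a minimal sketch of the (non-excess) kurtosis, i.e. the mean fourth-power deviation divided by the squared variance, assuming the variance uses n in the denominator. Both sample lists are made up for illustration.

```python
def kurtosis(values):
    """Non-excess kurtosis: fourth central moment divided by variance squared."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    fourth_moment = sum((x - mean) ** 4 for x in values) / n
    return fourth_moment / variance ** 2

flat = [1, 2, 3, 4, 5]         # evenly spread, no extreme values
with_outlier = [5] * 9 + [50]  # nine identical values and one extreme value

print(kurtosis(flat))          # below 3 (platykurtic)
print(kurtosis(with_outlier))  # above 3 (leptokurtic)
```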
(Pearson) Correlation
• Correlation is a technique for investigating the relationship between
two variables: it measures the strength of
the association between them
• Pearson's correlation coefficient (r) is one type of correlation coefficient; it measures linear association
• The correlation coefficient takes a value between -1 and 1
• -1 denotes the strongest negative correlation
• 0 denotes no (linear) correlation
• 1 denotes the strongest positive correlation
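The three reference values can be reproduced with a few lines of code. The following is a minimal sketch of Pearson's r computed from its definition (covariance divided by the product of the standard deviations); the function name and the sample data are illustrative only.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # perfectly increasing line
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # perfectly decreasing line
r_hill = pearson_r(x, [1, 2, 3, 2, 1])   # symmetric "hill": related, but not linearly

print(r_up, r_down, r_hill)
```

The last case is a reminder that r close to 0 only rules out a linear relationship; the two variables can still be strongly related in a non-linear way.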
(Pearson) Correlation
What is the correlation value (Pearson's r) for each of these plots?
(Pearson) Correlation
Data Visualization
More Examples
• The famous Gapminder video, Hans Rosling: 200 Countries, 200 Years, 4 Minutes:
https://www.youtube.com/watch?feature=player_embedded&v=jbkSRLYSojo
• NY Times Interactive Visualizations (e.g., 2013 Federal Budget):
http://www.nytimes.com/interactive/2012/02/13/us/politics/2013-budget-proposal-graphic.html
Why Data Visualization?
Build visuals to:
01 Enable exploratory data analysis
02 Communicate data clearly
03 Share an unbiased representation of data
04 Support recommendations to different stakeholders
Three Key Points of Building Visuals
by DARKHORSE ANALYTICS
LESS IS: more effective
Any feature or design you incorporate in your plot to make it more attractive or pleasing should support the message that the plot is meant to get across and not distract from it.
Look at this figure
• Features such as the blue background or the 3D orientation do not appear to convey any information.
• In fact, these unnecessary features distract from the main message and can confuse the audience.
• The message here is that people
are most likely to choose bacon
over other types of pig meat, so
let's get rid of everything that can
be distracting from this core
message.
• The result is simpler, cleaner, less distracting, and much easier to read.
• The proportions of the pie slices are wrong.
• The sky background is unnecessary.
• Are you sure internet users make up only one third of the total population?
BASIC VISUALIZATION TOOLS
LINE PLOTS
AREA PLOTS
HISTOGRAMS
BAR CHARTS
LINE PLOTS
• Line plot is a plot in the form of a series of data points connected by
straight line segments.
• The best use case for a line plot is when you have a continuous dataset
and you're interested in visualizing the data over a period of time.
LINE PLOTS
• For example, say we're interested in the trend
of immigrants from Haiti to Canada.
• We can generate a line plot and the resulting
figure will depict the trend of Haitian
immigrants to Canada from 1980 to 2013.
• Based on the line plot, we can then look for explanations of obvious anomalies or changes.
• From the plot, we see a spike in immigration from Haiti to Canada in 2010.
• A quick Google search for major events in Haiti in 2010 returns the tragic earthquake that took place that year, so this influx of immigration to Canada was mainly due to that earthquake.
AREA PLOTS
• An area plot (also known as an area chart or area graph) depicts accumulated totals, as numbers or percentages, over time.
• It is based on the line plot and is commonly used to compare two or more quantities.
AREA PLOTS
HISTOGRAMS
• A histogram is a way of representing the frequency distribution of a numeric
dataset.
• It takes as input one numerical variable.
• The variable is cut into several bins, and the number of observations per bin is
represented by the height of the bar.
• To construct a histogram, the first step is to “bin” the range of values — that is, divide
the entire range of values into a series of intervals — and then count how many
values fall into each interval. The bins are usually specified as consecutive, non-
overlapping intervals of a variable.
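The binning step described above can be sketched in a few lines. This is a minimal illustration with equal-width bins and a text rendering instead of a real plot; the `heights` values and the function name are hypothetical. Plotting libraries perform the same counting internally.

```python
def histogram_counts(values, num_bins):
    """Split the value range into equal-width bins and count values per bin."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it into the last bin.
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(num_bins + 1)]
    return edges, counts

# Hypothetical measurements, used only for illustration.
heights = [61, 64, 65, 66, 68, 69, 70, 70, 71, 72, 74, 75, 76, 78, 80, 81]
edges, counts = histogram_counts(heights, 4)

# Text rendering: one row per bin, bar length = number of observations.
for left, count in zip(edges, counts):
    print(f"{left:5.1f} | " + "#" * count)
```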
HISTOGRAMS
https://id.wikipedia.org/wiki/Histogram#/media/Berkas:Black_cherry_tree_histogram.svg
HISTOGRAMS
Histograms with different bin sizes
https://chartio.com/learn/charts/histogram-complete-guide/
HISTOGRAMS
The number of bins needs to be:
• large enough to reveal interesting features;
• small enough not to be too noisy.
The bin size and the number of bins are inversely related.
• The larger the bin size, the fewer bins are needed to cover the whole range
of data.
• The smaller the bin size, the more bins are needed.
• It is worth taking some time to test out different bin sizes to see how the
distribution looks in each one, then choose the plot that represents the data
best.
https://chartio.com/learn/charts/histogram-complete-guide/
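The inverse relationship between bin size and number of bins is simple arithmetic: the number of bins is the data range divided by the bin width, rounded up. A minimal sketch with hypothetical values:

```python
import math

# Hypothetical measurements, used only for illustration.
values = [61, 64, 65, 66, 68, 69, 70, 70, 71, 72, 74, 75, 76, 78, 80, 81]
data_range = max(values) - min(values)  # 81 - 61 = 20

def bins_needed(bin_width):
    """Number of equal-width bins required to cover the whole data range."""
    return math.ceil(data_range / bin_width)

bin_counts = {width: bins_needed(width) for width in (1, 2, 5, 10)}
print(bin_counts)  # {1: 20, 2: 10, 5: 4, 10: 2}
```

With only 16 observations, 20 bins would leave most bins empty or nearly empty (too noisy), while 2 bins would hide the shape entirely.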
HISTOGRAMS
Use a zero-valued base line
https://chartio.com/learn/charts/histogram-complete-guide/
HISTOGRAMS
Updating a Histogram with Colors
https://matplotlib.org/3.3.4/gallery/statistics/hist.html
BAR CHARTS
• A bar chart, also known as a bar graph, is a type of plot where the length of each bar is proportional to the value of the item that it represents. Unlike a histogram, which bins a single numeric variable, it compares values across categories.
• It is commonly used to compare the values of a variable at a given point in time.
BAR CHARTS
Single Bar Chart
https://www.open.edu/openlearn/mod/oucontent/view.php?id=90853&extra=thumbnailfigure_idm333
BAR CHARTS
Dual Bar Chart
https://www.open.edu/openlearn/mod/oucontent/view.php?id=90853&extra=thumbnailfigure_idm333
BAR CHARTS
Stacked Bar Chart
https://www.open.edu/openlearn/mod/oucontent/view.php?id=90853&extra=thumbnailfigure_idm333
BAR CHARTS
Horizontal Bar Chart
What are its advantages?
https://matplotlib.org/3.4.0/gallery/statistics/barchart_demo.html#sphx-glr-gallery-statistics-barchart-demo-py
BAR CHARTS
Percentiles as
Horizontal Bar Chart
https://matplotlib.org/3.4.0/gallery/statistics/barchart_demo.html#sphx-glr-gallery-statistics-barchart-demo-py
Histogram vs Bar Chart
SPECIALIZED VISUALIZATION TOOLS
PIE CHARTS
BOX PLOTS
SCATTER PLOTS
HEAT MAPS
PIE CHARTS
• A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportion.
• The input data you must provide is an array of numbers, where each number is mapped to one slice of the pie.
Source: https://www.python-graph-gallery.com/pie-plot-matplotlib-basic
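The mapping from an array of numbers to pie slices is a proportion calculation. This minimal sketch shows the share and the angle of each slice; the labels and values are hypothetical, and a plotting library would do this conversion for you.

```python
# Hypothetical category totals, used only for illustration.
labels = ["A", "B", "C", "D"]
values = [8, 3, 4, 5]

total = sum(values)
percentages = [100 * v / total for v in values]  # share of the whole, in percent
angles = [360 * v / total for v in values]       # slice angle, in degrees

for label, pct, angle in zip(labels, percentages, angles):
    print(f"{label}: {pct:.0f}% of the pie, {angle:.0f} degrees")
```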
PIE CHARTS
Multiple pie charts to show
changes in parts-to-whole
relationship
Source: https://www.jmp.com/en_us/statistics-knowledge-portal/exploratory-data-analysis/pie-chart.html
PIE CHARTS
• Some people suggest not using pie charts.
• Graphs of data should tell us about the quantities involved and help us to make
accurate comparisons between these quantities. The quantities in each category
should be easy to estimate and the category labels should be clear.
• Pies and doughnuts fail because:
• Quantity is represented by slices; humans aren’t particularly good at estimating
quantity from angles, which is the skill needed.
• Matching the labels and the slices can be hard work.
• Small percentages (which might be important) are tricky to show.
Source: https://www.jmp.com/en_us/statistics-knowledge-portal/exploratory-data-analysis/pie-chart.html
PIE CHARTS
• You need to add the percentage
to every slice.
• You need to directly label every
slice.
• You have run out of colors for the
slices.
• You decide to explode the chart
to solve your first three problems.
https://scc.ms.unimelb.edu.au/resources/data-visualisation-and-exploration/no_pie-charts
PIE CHARTS
https://scc.ms.unimelb.edu.au/resources/data-visualisation-and-exploration/no_pie-charts
BOX PLOTS
• A box plot is a way of statistically representing the distribution
of given data through five main dimensions:
• The first dimension is minimum of the data.
• The second dimension is first quartile.
• The third dimension is median.
• The fourth dimension is third quartile.
• And the final dimension is maximum of the data.
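The five values a box plot draws can be computed directly. This is a minimal sketch assuming median-of-halves quartiles; the function name is illustrative. Note that many plotting tools use a different quartile convention and draw whiskers only up to 1.5 * IQR, showing points beyond that as outliers.

```python
import statistics

def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, and maximum."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    return {
        "min": ordered[0],
        "q1": statistics.median(ordered[:half]),
        "median": statistics.median(ordered),
        "q3": statistics.median(ordered[half + n % 2:]),  # skip middle value if n is odd
        "max": ordered[-1],
    }

summary = five_number_summary([5, 7, 4, 4, 6, 2, 8])
print(summary)  # {'min': 2, 'q1': 4, 'median': 5, 'q3': 7, 'max': 8}
```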
BOX PLOTS
SCATTER PLOTS
• A scatter plot is a type of plot that displays the values of (typically) two variables against each other.
• Usually a dependent variable is plotted against an independent variable in order to determine whether any correlation exists between the two.
https://www.data-to-viz.com/graph/scatter.html
SCATTER PLOTS
https://www.data-to-viz.com/graph/scatter.html
HEAT MAPS
• Heatmaps visualise data through variations in colouring.
• When applied to a tabular format, Heatmaps are useful for cross-
examining multivariate data, through placing variables in the rows
and columns and colouring the cells within the table.
• Heatmaps are good for showing variance across multiple variables, revealing any patterns, displaying whether any variables are similar to each other, and detecting whether any correlations exist between them.
HEAT MAPS
https://datavizcatalogue.com/methods/heatmap.html
ADVANCED VISUALIZATION TOOLS
WAFFLE CHARTS
WORD CLOUDS
BUBBLE PLOTS
WAFFLE CHARTS
• A waffle chart is an interesting visualization that is normally created to display progress towards goals.
• As its name suggests, it usually consists of small squares arranged in an M-by-N layout.
• The squares are colored according to the proportions you are aiming to
visualize, similarly to how you would color different slices of a pie chart.
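The core of a waffle chart is deciding how many squares each category gets. The sketch below does that for a 10-by-10 grid and renders it as text; the category names and values are hypothetical, and the text rendering assumes single-character category names. With less convenient proportions, rounding can make the tile counts add up to slightly more or fewer than the grid size, which a real implementation would need to correct.

```python
def waffle_tiles(category_values, rows, cols):
    """Number of squares per category, proportional to each category's value."""
    total_tiles = rows * cols
    total_value = sum(category_values.values())
    return {name: round(value / total_value * total_tiles)
            for name, value in category_values.items()}

# Hypothetical proportions, used only for illustration.
tiles = waffle_tiles({"A": 50, "B": 30, "C": 20}, rows=10, cols=10)

# Text rendering: one character per square, 10 squares per row.
symbols = "".join(name * count for name, count in tiles.items())
grid = [symbols[i:i + 10] for i in range(0, len(symbols), 10)]
print("\n".join(grid))
```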
WAFFLE CHARTS
https://datascience.stackexchange.com/questions/57603/how-this-visualisation-was-made
WORD CLOUDS
• A word cloud is simply a depiction of the importance of different words in a body of text.
• A word cloud works in a simple way: the more often a specific word appears in a source of textual data, the bigger and bolder it appears in the word cloud.
• If we know nothing about the content of a set of documents, a word cloud can be very useful for assigning a topic to that unknown textual data.
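Behind every word cloud is a word-frequency count. The following minimal sketch counts words with the standard library and maps counts to font sizes; the sample text, the stopword list, and the size range of 10 to 40 points are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical sample text, used only for illustration.
text = ("Data visualization helps explain data. "
        "Good visualization makes data easier to understand.")

stopwords = {"the", "a", "of", "and", "to", "in", "is"}
words = re.findall(r"[a-z']+", text.lower())
frequencies = Counter(w for w in words if w not in stopwords)

# Scale font size linearly with frequency: the most frequent word gets 40 pt.
max_count = max(frequencies.values())
font_sizes = {w: 10 + 30 * c / max_count for w, c in frequencies.items()}

print(frequencies.most_common(2))  # [('data', 3), ('visualization', 2)]
```

The most frequent words ("data", "visualization") already hint at the topic of the text, which is exactly how a word cloud is read.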
WORD CLOUDS
BUBBLE PLOTS
A bubble plot is a scatterplot where a third dimension is added: the value of an additional variable is
represented through the size of the dots. You need 3 numerical variables as input: one is represented by
the X axis, one by the Y axis, and one by the size.
BUBBLE PLOTS
Bubble plots over Maps
https://upload.wikimedia.org/wikipedia/commons/thumb/8/8e/Map_of_earthquakes_in_Indonesia_1900-2019.svg/1280px-Map_of_earthquakes_in_Indonesia_1900-2019.svg.png
References & Credits
• Chirag Shah, Hands on Introduction to Data Science, Cambridge
University Press, 2020
• Data Visualization from IBM Data Science Training Materials and
cognitiveclass.ai
• Siti Aminah & Dhimas Arief Darmawan, Data Visualization, lecture slides for the Data Science course, Even Semester 2020/2021, Faculty of Computer Science, Universitas Indonesia
• Fariz Darari, EDA & Visualization, lecture slides for the Data Science course, Odd Semester 2020/2021, Faculty of Computer Science, Universitas Indonesia
• Images and screenshots are used for explanatory purposes only.
• Copyright remains with the original owners.
Wish You Success