Data Visualization in
Data Science
Maloy Manna
biguru.wordpress.com
linkedin.com/in/maloy
twitter.com/itsmaloy
Synopsis
Having data is not enough. Adding context to data is essential to understand the
data, find patterns and engage audiences. Data visualization is a key element of data
science, the interdisciplinary field which deals with finding insights from data.
In this webinar, we explore the roles of data visualization at different stages of
the data science process, and why it is essential.
We also look at how data is encoded visually with shape, size, color and other
variables, and how the basic principles of visual encoding can be applied to build
better visualizations.
We cover narratives, types of bias and maps.
Finally, we look at the various tools, both open source and off-the-shelf
software, that are used in data science to build effective data visualizations.
Speaker profile
Maloy Manna
Project Manager - Engineering
AXA Data Innovation Lab
Over 14 years of experience building data-driven products and services
Previous organizations: Thomson Reuters, Saama, Infosys, TCS
biguru.wordpress.com
linkedin.com/in/maloy
twitter.com/itsmaloy
Contents
Defining Data visualization
Data science process
Data visualization
Visual encoding of data
Narrative structures
Dataviz Technology & Tools
Defining Data visualization
Visual display of quantitative information
Mapping data to visual elements
Encoding data with size, shape, color...
Storytelling / narrative elements
Defining Data Visualization
Exploratory
Find insights
Conversation between data and you
Explanatory
Present insights
Data science project life-cycle
Acquire data
Prepare data
Analysis & Modeling
Evaluation & Interpretation
Deployment
Operations & Optimization
Data science process
[Figure labels: EDA (Exploratory Data Analysis); Data Wrangling; Data Visualization: Exploratory, Explanatory]
Source: Computational Information Design | Ben Fry
Exploratory data visualization
Data analysis approaches:
Classical:
Problem > Data > Model > Analysis > Conclusions
EDA: [Exploratory Data Analysis]
Problem > Data > Analysis > Model > Conclusions
Bayesian:
Problem > Data > Model > Prior distribution > Analysis > Conclusions
EDA = approach, not a set of techniques
Exploratory data visualization
Statistical approaches:
Quantitative
Hypothesis testing
Analysis of variance (ANOVA)
Point estimates and confidence intervals
Least squares regression
Graphical
Scatter plots
Histograms
Probability plots
Residual plots
Box plots
Block plots
Exploratory data visualization
Graphical analysis procedures:
Testing assumptions
Model selection
Model validation
Estimator selection
Relationship identification
Factor effect determination
Outlier detection
Graphical analysis is a must for deriving insights from data
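Outlier detection with a box plot can be sketched in a few lines of Python. This is a minimal sketch, not a prescribed method: it assumes the common convention of flagging points more than 1.5 times the interquartile range beyond the quartiles, and the helper name `box_plot_summary` and the sample values are hypothetical.

```python
from statistics import quantiles


def box_plot_summary(values, whisker=1.5):
    """Return the quartiles, fences and flagged outliers a box plot would show.

    Assumption: outliers are points beyond `whisker` * IQR from the quartiles.
    """
    data = sorted(values)
    # "inclusive" treats the data as the whole population of interest
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Hypothetical sample with one suspicious value
sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]
summary = box_plot_summary(sample)
print(summary)
```

The numbers only say which points are unusual; the plot is still needed to judge why.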
Exploratory data analysis
Anscombe's quartet
N=11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816
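All four datasets of the quartet share the summary statistics above, which is why plotting matters. The following is a minimal Python sketch that recomputes them with ordinary least squares. The data values are the standard published quartet (they are not listed on the slide), and the helper name `summarize` is hypothetical.

```python
from math import sqrt

# Anscombe's quartet: sets I-III share the same x values
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Summary statistics and least-squares line for one dataset."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual standard deviation uses n - 2 degrees of freedom
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "intercept": intercept,
        "slope": slope,
        "residual_sd": sqrt(sse / (n - 2)),
        "correlation": sxy / sqrt(sxx * syy),
    }


for name, (x, y) in QUARTET.items():
    stats = summarize(x, y)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four rows print nearly identical numbers, yet scatter plots of the four sets look completely different.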
Explanatory data visualization
Design
Engineering
Journalism
Explanatory data visualization
Visualization is both an art and a science
Example: Harry Beck's map of the London Underground
Visual encoding of data
Data Types
Quantitative
Continuous, Discrete
Categorical
Nominal, Ordered, Interval
Visual encoding of data
Categorical scales and graph design
Visual encoding of data
Bandwidth of our senses: [Tor Norretranders]
Visual encoding of data
Data visual display elements
Position x
Position y
Retinal variables
Size, Orientation (ordered data)
Color Hue, Shape (nominal data)
Animation
Visual encoding of data
Ranking visual display elements (framework):
1. Position along a common scale, e.g. scatter plots
2. Position on identical but non-aligned scales, e.g. multiple scatter plots
3. Length, e.g. bar chart
4. Angle & slope, e.g. pie chart
5. Area, e.g. bubbles
6. Volume, density & color saturation, e.g. heat map
7. Color hue, e.g. highlights
Ref. Graphical Perception and Graphical Methods for Analyzing Scientific Data, William Cleveland & Robert McGill (1985)
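The ranking can be read as an ordered lookup: when several encodings are possible, prefer the one highest in the list. This is a minimal Python sketch of that idea. The channel names paraphrase the list above, and the names `CHANNEL_RANKING` and `best_channel` are hypothetical.

```python
# Encodings from the ranking above, most accurately perceived first
CHANNEL_RANKING = [
    "position on a common scale",
    "position on non-aligned scales",
    "length",
    "angle or slope",
    "area",
    "volume, density or color saturation",
    "color hue",
]


def best_channel(available):
    """Return the highest-ranked encoding among those available."""
    for channel in CHANNEL_RANKING:
        if channel in available:
            return channel
    raise ValueError("no known encoding channel available")


# Example: a design that could use bar length, bubble area or color hue
choice = best_channel({"area", "length", "color hue"})
print(choice)  # length
```

In practice the ranking is a guide for quantitative data, not a hard rule; color hue remains a good choice for nominal categories.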
Design principles
Choose the right type of chart:
Trends / change over time: line charts
Distributions: histograms
Summary information: table
Relationships: scatter plots
Get it right in black & white (before adding color)
Prefer 2D to 3D for statistical charts
Use color to highlight
Avoid rainbow palette
Avoid chartjunk: less is more
Try to have a high data-ink ratio
Design principles
Choose the right type of chart
Ranking
Time-series
Correlation
Nominal comparison
Deviation
Narrative structures
Data Journalism
Traditional journalism | Data journalism
Data around narrative  | Narrative around data
Linear flow            | Complex, often non-linear flow
Physical static media  | Online interactive media
Narrative structures
Bias (and ethics: don't lie with data)
Bar charts must have a zero baseline
Present data in its context
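The zero-baseline rule can be checked with simple arithmetic: a reader compares bar heights, so the ratio of drawn heights should equal the ratio of the values. This is a minimal Python sketch with hypothetical figures; the helper name `bar_height_ratio` is an assumption made for illustration.

```python
def bar_height_ratio(a, b, baseline=0.0):
    """Ratio of the drawn heights of bars for values a and b.

    Bars are drawn from `baseline` up to the value, so the height is
    (value - baseline).
    """
    if baseline >= min(a, b):
        raise ValueError("baseline must lie below both values")
    return (a - baseline) / (b - baseline)


# Hypothetical figures: a 10% increase from 50 to 55
last_year, this_year = 50.0, 55.0

true_ratio = bar_height_ratio(this_year, last_year)
# Axis truncated to start at 45: heights are 10 and 5
truncated_ratio = bar_height_ratio(this_year, last_year, baseline=45.0)

print(f"zero baseline:      {true_ratio:.2f}x")
print(f"truncated baseline: {truncated_ratio:.2f}x")
```

With the axis starting at 45, a 10% increase is drawn as a bar twice as tall, which is the distortion the rule guards against.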
Narrative structures
Bias: Misleading with data
Selective presentation with line-charts
Author Bias
Data Bias
Reader Bias
Narrative structures
Bias and Errors (statistics):
Selection bias e.g. in sampling
Omitted-variable bias
Errors:
Hypothesis testing
Null Hypothesis = default/no-effect state
Null hypothesis H0 | Valid                             | Invalid
Reject             | Type I error (false positive)     | Correct inference (true positive)
Accept             | Correct inference (true negative) | Type II error (false negative)
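The four outcomes of a hypothesis test can be written as a small lookup. This is a minimal Python sketch; the function name `classify_decision` and its parameters are hypothetical.

```python
def classify_decision(h0_is_valid, h0_rejected):
    """Name the outcome of a test given the truth of H0 and the decision made."""
    if h0_rejected:
        if h0_is_valid:
            return "Type I error (false positive)"
        return "Correct inference (true positive)"
    if h0_is_valid:
        return "Correct inference (true negative)"
    return "Type II error (false negative)"


# Walk through all four combinations
for h0_is_valid in (True, False):
    for h0_rejected in (True, False):
        outcome = classify_decision(h0_is_valid, h0_rejected)
        print(f"H0 valid={h0_is_valid}, rejected={h0_rejected}: {outcome}")
```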
Narrative structures
Storytelling:
Visual narratives have moved from author-driven to viewer-driven with the use of highly interactive media for data visualization
Author-driven              | Viewer-driven
Strong ordering            | Exploratory
Heavy messaging            | Ability to ask questions
Need for clarity and speed | Build own story
DataViz Technologies & Tools
Off-the-shelf:
Tableau, QlikView
Tools:
Predefined charts: Raw, Chartio, Plotly
Google Fusion Tables, Excel, Gephi
Code & JavaScript libraries:
R ggplot2, ggvis, rCharts + shiny (interactive apps)
Python matplotlib,
D3.js, Dimple.js, Leaflet, Rickshaw (use JSON data)
Linux gnuplot
DataViz Technologies & Tools
[Figure: Tableau data viz]
[Figure: chart in R ggplot2]
References
The Visual Display of Quantitative Information: Edward Tufte, http://goo.gl/qb5ej
Exploratory Data Analysis: John Tukey, http://goo.gl/tV57HP
Data Science Life Cycle: Maloy Manna, http://www.datasciencecentral.com/profiles/blogs/the-data-science-project-lifecycle
Selecting the Right Graph for Your Message: Stephen Few, www.perceptualedge.com/articles/ie/the_right_graph.pdf
Practical Rules for Using Color in Charts: Stephen Few, www.perceptualedge.com/articles/visual.../rules_for_using_color.pdf
OpenIntro Statistics: https://www.openintro.org/stat/
Misleading with Statistics: Eric Portelance, https://medium.com/i-data/misleading-with-statistics-c63780efa928
Computational Information Design: Ben Fry, http://benfry.com/phd/dissertation-050312b-acrobat.pdf