Concepts and
Techniques
— Chapter 2 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign
Simon Fraser University
©2011 Han, Kamber, and Pei. All rights
reserved.
Chapter 2: Getting to Know Your Data
Data Objects and Attribute Types
Data Visualization
Types of Data Sets
Record
Relational records
Data matrix, e.g., numerical matrix, crosstabs
Document data: text documents: term-frequency vector
Transaction data
Graph and network
World Wide Web
Social or information networks
Molecular Structures
Ordered
Video data: sequence of images
Temporal data: time-series
Sequential Data: transaction sequences
Genetic sequence data
Spatial, image and multimedia:
Spatial data: maps
Image data:
Video data:
Example document data (term-frequency vectors):
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
Example transaction data:
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
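As a minimal sketch of how document data can be turned into a term-frequency vector, the following counts vocabulary terms in a text. The vocabulary, the sample sentence, and the function name term_frequency_vector are hypothetical and chosen only for illustration; tokenization is a plain whitespace split, which a real system would likely refine.

```python
from collections import Counter


def term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["team", "coach", "play", "ball", "score"]
document = "The coach said the team will play and the team will score"

vector = term_frequency_vector(document, vocabulary)
print(vector)  # [2, 1, 1, 0, 1]
```

Each document becomes one row of a data matrix whose columns are the vocabulary terms.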
Important Characteristics of Structured Data
Dimensionality
Curse of dimensionality
Sparsity
Only presence counts
Resolution
Patterns depend on the scale
Distribution
Centrality and dispersion
Data Objects
Data sets are made up of data objects.
A data object represents an entity.
Examples:
sales database: customers, store items, sales
medical database: patients, treatments
university database: students, professors, courses
Also called samples, examples, instances, data points, objects, tuples.
Data objects are described by attributes.
Database rows -> data objects; columns -> attributes.
Attributes
Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
E.g., customer_ID, name, address
Types:
Nominal
Binary
Ordinal
Numeric: quantitative
Interval-scaled
Ratio-scaled
Attribute Types
Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red,
white}
marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome
(e.g., HIV positive)
Ordinal
Values have a meaningful order (ranking), but the magnitude of the difference between successive values is not known.
Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in °C or °F, calendar dates
No true zero-point
Ratio
Inherent zero-point
We can speak of one value as being a multiple (or ratio) of another (10 K is twice as high as 5 K).
E.g., temperature in Kelvin, length, counts, monetary quantities
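The difference between interval-scaled and ratio-scaled attributes can be illustrated with a short sketch. It assumes the standard conversion K = °C + 273.15; the two sample temperatures are arbitrary.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin, a scale with a true zero."""
    return celsius + 273.15


a_c, b_c = 10.0, 5.0
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences are meaningful on both scales and agree.
print(a_c - b_c)            # 5.0
print(round(a_k - b_k, 6))  # 5.0

# The Celsius ratio depends on where the arbitrary zero sits ...
print(a_c / b_c)            # 2.0
# ... while in Kelvin, 10 °C is only about 1.8% warmer than 5 °C.
print(round(a_k / b_k, 4))
```

The ratio computed on the Celsius values is an artifact of the scale's zero point, which is why ratios are only interpreted for ratio-scaled attributes.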
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., zip codes, profession, or the set of words in
a collection of documents
Sometimes, represented as integer variables
Note: Binary attributes are a special case of
discrete attributes
Continuous Attribute
Has real numbers as attribute values
E.g., temperature, height, or weight
Practically, real values can only be measured and
represented using a finite number of digits
Continuous attributes are typically represented as
floating-point variables
Graphic Displays of Basic Statistical Descriptions
Boxplot: graphic display of five-number summary
Histogram: x-axis shows values, y-axis represents frequencies
Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i % of the data are ≤ x_i
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Scatter plot: each pair of values is a pair of
coordinates and plotted as points in the plane
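A minimal sketch of the numbers behind a boxplot and a quantile plot follows. Quartile conventions differ between tools; this sketch assumes the "inclusive" method of Python's statistics.quantiles, and it assumes the common convention f_i = (i - 0.5) / N for the quantile plot. The function names are illustrative.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a boxplot displays."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N (assumed convention)."""
    data = sorted(values)
    n = len(data)
    return [((i - 0.5) / n, x) for i, x in enumerate(data, start=1)]


print(five_number_summary(range(1, 10)))       # (1, 3.0, 5.0, 7.0, 9)
print(quantile_plot_points([40, 10, 30, 20]))  # [(0.125, 10), (0.375, 20), (0.625, 30), (0.875, 40)]
```

Plotting the (f_i, x_i) pairs gives the quantile plot; plotting the quantiles of one data set against those of another gives the q-q plot.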
Histogram Analysis
Histogram: graphical display of tabulated frequencies, shown as bars
It shows what proportion of cases fall into each of several categories
Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
[Figure: histogram; x-axis from 10000 to 90000, y-axis (frequency) from 0 to 40]
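The area-versus-height distinction can be sketched as follows, assuming half-open bins and bar heights chosen so that height times width equals the count. The values, bin edges, and function name are hypothetical.

```python
def histogram_bar_heights(values, edges):
    """Return bar heights such that bar area (height * width) equals the count."""
    heights = []
    for lo, hi in zip(edges, edges[1:]):
        # Half-open bins [lo, hi); the final edge is exclusive in this sketch.
        count = sum(1 for v in values if lo <= v < hi)
        heights.append(count / (hi - lo))
    return heights


# Hypothetical values and non-uniform bin edges, for illustration only.
values = [1, 2, 3, 4, 12, 18, 25, 33]
edges = [0, 5, 10, 40]

heights = histogram_bar_heights(values, edges)
print(heights)
```

The first and last bins both hold four values, yet the narrow first bar is six times taller than the wide last one, so only the areas (not the heights) are comparable.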
Histograms Often Tell More than Boxplots
The two histograms shown on the left may have the same boxplot representation
The same values for: min, Q1, median, Q3, max
But they have rather different data distributions
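To make this concrete, the sketch below builds two small data sets that share a five-number summary but fall into different histogram bins. Both data sets are constructed for illustration only, and the quartiles assume the "inclusive" method of statistics.quantiles.

```python
import statistics
from collections import Counter


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


def bin_counts(values, width):
    """Count values per bin of the given width, keyed by the bin's lower edge."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical data sets, constructed for illustration only.
first = [0, 2, 25, 27, 50, 73, 75, 98, 100]
second = [0, 23, 25, 48, 50, 52, 75, 77, 100]

print(five_number_summary(first) == five_number_summary(second))  # True
print(bin_counts(first, 10))
print(bin_counts(second, 10))
```

A boxplot drawn from either data set would look the same, while the binned counts show where the values actually cluster.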
Data Visualization
Why data visualization?
Gain insight into an information space by mapping data onto
graphical primitives
Provide qualitative overview of large data sets
Search for patterns, trends, structure, irregularities, relationships
among data
Help find interesting regions and suitable parameters for further
quantitative analysis
Provide visual proof of the computer representations derived
Categorization of visualization methods:
Pixel-oriented visualization techniques
Geometric projection visualization techniques
Icon-based visualization techniques
Hierarchical visualization techniques
Visualizing complex data and relations
Pixel-Oriented Visualization Techniques
For a data set of m dimensions, create m windows on the
screen, one for each dimension
The m dimension values of a record are mapped to m pixels
at the corresponding positions in the windows
The colors of the pixels reflect the corresponding values
(a) Income (b) Credit limit (c) Transaction volume (d) Age
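The mapping from records to windows can be sketched as below. This is an illustrative layout, not a specific tool's algorithm: it assumes a row-major pixel arrangement in the given record order and a gray level from 0 to 255 obtained by min-max scaling per dimension. The records and the function name pixel_windows are hypothetical.

```python
def pixel_windows(records, width):
    """Build one 2-D window per dimension; record i uses the same cell in each."""
    m = len(records[0])
    height = -(-len(records) // width)  # ceiling division
    lows = [min(r[d] for r in records) for d in range(m)]
    highs = [max(r[d] for r in records) for d in range(m)]
    windows = [[[None] * width for _ in range(height)] for _ in range(m)]
    for i, record in enumerate(records):
        row, col = divmod(i, width)  # row-major layout (an assumption)
        for d, value in enumerate(record):
            span = highs[d] - lows[d]
            # Gray level 0..255 via min-max scaling (an assumed color mapping).
            level = 0 if span == 0 else round(255 * (value - lows[d]) / span)
            windows[d][row][col] = level
    return windows


# Hypothetical (income, age) records, for illustration only.
records = [(20000, 55), (60000, 25), (100000, 40)]
windows = pixel_windows(records, width=2)
print(windows[0])  # [[0, 128], [255, None]]
print(windows[1])  # [[255, 0], [128, None]]
```

Because a record sits at the same position in every window, comparing the windows side by side reveals how the dimensions vary together.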
Laying Out Pixels in Circle Segments
To save space and show the connections among multiple dimensions, space filling is often done in a circle segment
(a) Representing a data record in circle segment
(b) Laying out pixels in circle segment
Geometric Projection Visualization Techniques
Visualization of geometric transformations and
projections of the data
Methods
Direct visualization
Scatterplot and scatterplot matrices
Landscapes
Projection pursuit technique: Help users find
meaningful projections of multidimensional data
Prosection views
Hyperslice
Parallel coordinates
Direct Data Visualization
Ribbons with Twists Based on Vorticity
Scatterplot Matrices
Used by permission of M. Ward, Worcester Polytechnic Institute
Matrix of scatterplots (x-y diagrams) of the k-dimensional data [total of (k^2 - k)/2 = k(k - 1)/2 distinct scatterplots]
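As a quick check on the size of a scatterplot matrix, the sketch below enumerates the distinct attribute pairs; for k attributes there are k(k - 1)/2 of them. The attribute names are hypothetical.

```python
from itertools import combinations


def scatterplot_pairs(attributes):
    """List every unordered pair of attributes, one per distinct scatterplot."""
    return list(combinations(attributes, 2))


# Hypothetical attribute names, for illustration only.
attributes = ["age", "income", "credit_limit", "volume"]
pairs = scatterplot_pairs(attributes)
k = len(attributes)

print(len(pairs))         # 6
print(k * (k - 1) // 2)   # 6
```

A full k-by-k matrix shows each pair twice (mirrored across the diagonal), which is why the count of distinct plots grows roughly as half of k squared.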
Landscapes
Used by permission of B. Wright, Visible Decisions Inc.
News articles visualized as a landscape
Visualization of the data as a perspective landscape
The data needs to be transformed into a (possibly artificial) 2D spatial
representation which preserves the characteristics of the data
Parallel Coordinates
n equidistant axes which are parallel to one of the screen
axes and correspond to the attributes
The axes are scaled to the [minimum, maximum] range of the corresponding attribute
Every data item corresponds to a polygonal line which
intersects each of the axes at the point which corresponds to
the value for the attribute
[Figure: parallel axes labeled Attr. 1, Attr. 2, Attr. 3, ..., Attr. k]
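The construction above can be sketched as a function that turns each data item into the vertices of its polygonal line. It assumes axes placed at integer positions 0, 1, 2, ... and heights scaled to [0, 1]; the records and the function name are hypothetical.

```python
def parallel_coordinates(records):
    """Return, per record, its polyline vertices as (axis index, scaled height)."""
    k = len(records[0])
    lows = [min(r[j] for r in records) for j in range(k)]
    highs = [max(r[j] for r in records) for j in range(k)]
    polylines = []
    for record in records:
        vertices = []
        for j, value in enumerate(record):
            span = highs[j] - lows[j]
            # A constant attribute has no range; place it mid-axis (an assumption).
            y = 0.5 if span == 0 else (value - lows[j]) / span
            vertices.append((j, y))
        polylines.append(vertices)
    return polylines


# Hypothetical records with three attributes, for illustration only.
records = [(1.0, 200.0, 7.0), (3.0, 100.0, 7.0), (2.0, 150.0, 7.0)]
for line in parallel_coordinates(records):
    print(line)
```

Drawing straight segments between consecutive vertices gives one polygonal line per data item; crossing lines between two axes hint at a negative relationship between those attributes.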
Parallel Coordinates of a Data Set
Icon-Based Visualization Techniques
Visualization of the data values as features of icons
Typical visualization methods
Chernoff Faces
Stick Figures
General techniques
Shape coding: Use shape to represent certain
information encoding
Color icons: Use color icons to encode more
information
Tile bars: Use small icons to represent the
relevant feature vectors in document retrieval
Chernoff Faces
A way to display variables on a two-dimensional surface, e.g.,
let x be eyebrow slant, y be eye size, z be nose length, etc.
The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening), each assigned one of 10 possible values, generated using Mathematica (S. Dickson)
REFERENCE: Gonick, L. and Smith, W. The Cartoon Guide to Statistics. New York: Harper Perennial, p. 212, 1993
Weisstein, Eric W. "Chernoff Face." From MathWorld--A Wolfram Web Resource. mathworld.wolfram.com/ChernoffFace.html
Stick Figure
A census data figure showing age, income, gender, education, etc.
A 5-piece stick figure (1 body and 4 limbs with different angle/length)
Two attributes mapped to axes, remaining attributes mapped to angle or length of limbs. Look at texture pattern
Used by permission of G. Grinstein, University of Massachusetts at Lowell
Hierarchical Visualization Techniques
Visualization of the data using a
hierarchical partitioning into subspaces
Methods
Dimensional Stacking
Worlds-within-Worlds
Tree-Map
Cone Trees
InfoCube
Dimensional Stacking
[Figure: nested axes labeled attribute 1, attribute 2, attribute 3, attribute 4]
Partitioning of the n-dimensional attribute space into 2-D subspaces, which are ‘stacked’ into each other
Partitioning of the attribute value ranges into classes. The
important attributes should be used on the outer levels.
Adequate for data with ordinal attributes of low cardinality
But, difficult to display more than nine dimensions
Important to map dimensions appropriately
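For the four-attribute case, the stacking can be sketched as an index computation: two binned attributes select an outer cell, and two more select a position inside it. The function name stacked_cell and the bin counts are hypothetical; attribute values are assumed to be already partitioned into integer class indices.

```python
def stacked_cell(outer_xy, inner_xy, inner_shape):
    """Map (outer, inner) bin indices to one cell of the stacked 2-D grid."""
    (outer_x, outer_y), (inner_x, inner_y) = outer_xy, inner_xy
    inner_w, inner_h = inner_shape
    # Each outer cell is subdivided into an inner_w x inner_h grid.
    return outer_x * inner_w + inner_x, outer_y * inner_h + inner_y


# Hypothetical example: inner attributes have 3 and 4 classes respectively.
cell = stacked_cell(outer_xy=(2, 1), inner_xy=(1, 3), inner_shape=(3, 4))
print(cell)  # (7, 7)
```

Adding further attribute pairs repeats the same subdivision at a deeper level, which is why the number of screen cells grows quickly and low-cardinality attributes work best.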
Dimensional Stacking
Used by permission of M. Ward, Worcester Polytechnic Institute
Visualization of oil mining data with longitude and latitude mapped to the outer x- and y-axes and ore grade and depth mapped to the inner x- and y-axes
Worlds-within-Worlds
Assign the function and two most important parameters to the innermost world
Fix all other parameters at constant values, then draw other (1-, 2-, or 3-dimensional) worlds choosing these as the axes
Software that uses this paradigm
N–vision: Dynamic interaction through data glove and stereo displays, including rotation, scaling (inner) and translation (inner/outer)
Auto Visual: Static interaction by means of queries
Tree-Map
Screen-filling method which uses a hierarchical partitioning
of the screen into regions depending on the attribute values
The x- and y-dimension of the screen are partitioned
alternately according to the attribute values (classes)
MSR Netscan Image
Ack.:
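The alternating partitioning of a tree-map can be sketched with a simple recursive "slice-and-dice" layout. This is one basic variant, shown under assumptions: each leaf carries a numeric "size" that determines its share of the screen, and the tree, node names, and function names are hypothetical.

```python
def size(node):
    """Total size of a node: its own size for a leaf, else the sum of its children."""
    children = node.get("children")
    return node["size"] if not children else sum(size(c) for c in children)


def slice_and_dice(node, x, y, w, h, vertical=True, out=None):
    """Assign a rectangle (x, y, w, h) to every leaf, splitting x and y alternately."""
    if out is None:
        out = {}
    children = node.get("children")
    if not children:
        out[node["name"]] = (x, y, w, h)
        return out
    total = sum(size(c) for c in children)
    offset = 0.0
    for child in children:
        share = size(child) / total
        if vertical:  # partition along the x-dimension
            slice_and_dice(child, x + offset * w, y, share * w, h, False, out)
        else:         # partition along the y-dimension
            slice_and_dice(child, x, y + offset * h, w, share * h, True, out)
        offset += share
    return out


# Hypothetical hierarchy, for illustration only.
tree = {"name": "root", "children": [
    {"name": "a", "size": 2},
    {"name": "b", "children": [{"name": "c", "size": 1},
                               {"name": "d", "size": 1}]},
]}
rects = slice_and_dice(tree, 0, 0, 8, 4)
print(rects)
```

Here the root splits the screen left and right, and node b then splits its half top and bottom, so every leaf's area is proportional to its size.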
Tree-Map of a File System (Shneiderman)
InfoCube
A 3-D visualization technique where
hierarchical information is displayed as
nested semi-transparent cubes
The outermost cubes correspond to the top-level data, while the subnodes or lower-level data are represented as smaller cubes inside the outermost cubes, and so on
Three-D Cone Trees
3D cone tree visualization technique works well for up to a thousand nodes or so
First build a 2D circle tree that arranges its nodes in concentric circles centered on the root node
Cannot avoid overlaps when projected to 2D
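The 2D circle tree step can be sketched as follows. This is a deliberately simplified layout, not the cone tree algorithm itself: nodes at depth d are spaced evenly on a circle of radius d, without grouping siblings near their parent as a real layout would. The tree and the function name are hypothetical.

```python
import math


def circle_tree_layout(children, root):
    """Place nodes on concentric circles: radius = depth, root at the center."""
    positions = {root: (0.0, 0.0)}
    level, depth = [root], 0
    while True:
        level = [c for node in level for c in children.get(node, [])]
        if not level:
            return positions
        depth += 1
        for i, node in enumerate(level):
            angle = 2 * math.pi * i / len(level)  # even spacing (a simplification)
            positions[node] = (depth * math.cos(angle), depth * math.sin(angle))


# Hypothetical tree given as a parent -> children mapping.
children = {"root": ["a", "b"], "a": ["c", "d", "e", "f"]}
positions = circle_tree_layout(children, "root")
for node, (x, y) in positions.items():
    print(node, round(x, 3), round(y, 3))
```

Every node ends up at a distance from the center equal to its depth, which is the concentric-circle arrangement the 3D construction starts from.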
G. Robertson, J. Mackinlay, S. Card. “Cone Trees: Animated 3D Visualizations of Hierarchical Information”, ACM SIGCHI'91
Graph from Nadeau Software Consulting website: Visualize a social network data set that models the way an infection spreads from one person to the next
Ack.: http://nadeausoftware.com/articles/visualization
Visualizing Complex Data and Relations
Visualizing non-numerical data: text and social networks
Tag cloud: visualizing user-generated tags
The importance of a tag is represented by font size/color
Besides text data, there are also methods to visualize relationships, such as visualizing social networks
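One simple way to derive tag-cloud font sizes is to scale tag counts linearly between a minimum and a maximum size. The sketch below assumes that importance is measured by usage count; the size limits, the tag counts, and the function name are hypothetical, and real tag clouds often use logarithmic scaling instead.

```python
def tag_font_sizes(tag_counts, min_size=10, max_size=40):
    """Scale each tag's count linearly into the range [min_size, max_size]."""
    low, high = min(tag_counts.values()), max(tag_counts.values())
    span = high - low
    sizes = {}
    for tag, count in tag_counts.items():
        # If all counts are equal, use the middle size (an assumption).
        scale = 0.5 if span == 0 else (count - low) / span
        sizes[tag] = round(min_size + scale * (max_size - min_size))
    return sizes


# Hypothetical tag counts, for illustration only.
tag_counts = {"python": 50, "data": 30, "mining": 10}
print(tag_font_sizes(tag_counts))  # {'python': 40, 'data': 25, 'mining': 10}
```

The same scaled value could drive color intensity instead of, or in addition to, font size.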
Newsmap: Google News Stories in