ITS665dm Topic2-DataUnderstanding
Topic 2
Understanding your Data
Data
Record:
Relational records
Data matrix, e.g., numerical matrix, crosstabs
Document data: text documents, term-frequency vector
Transaction data
Graph and network:
World Wide Web
Social or information networks
Molecular structures
Ordered:
Video data: sequence of images
Temporal data: time-series
Sequential data: transaction sequences
Genetic sequence data
Spatial, image and multimedia:
Record Data
Projection of x load   Projection of y load   Distance   Load   Thickness
10.23                  5.27                   15.22      2.7    1.2
12.65                  6.25                   16.22      2.2    1.1
Document Data
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers </a>
Other Types of Data
Ordered Data
Sequences of transactions
[Figure: sequences of transactions, labeled "Items/Events" and "An element of the sequence"]
Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Data Objects
Attributes
Ratio-scaled
Attribute Types
Ordinal
Values have a meaningful order (ranking) but
magnitude between successive values is not
known.
Size = {small, medium, large}, grades, army
rankings
Grade (e.g., A+, A, A-, B+, B, B-, C+, C, C-,
D+, D, E, F)
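As a rough illustration of what "meaningful order but unknown magnitude" allows, the sketch below ranks the Size values from the slide and takes a median by rank. The helper names `rank` and `ordinal_median` and the sample observations are hypothetical, introduced only for this example.

```python
# Ordered levels for the Size attribute from the slide (lowest to highest).
SIZE_LEVELS = ["small", "medium", "large"]

def rank(value, levels=SIZE_LEVELS):
    """Return the 0-based position of an ordinal value within its ordered levels."""
    return levels.index(value)

def ordinal_median(values, levels=SIZE_LEVELS):
    """Median by rank (lower median for even-length input).

    Sorting and picking the middle element only needs the order of the levels,
    so it is valid for ordinal data. An arithmetic mean would not be, because
    the distance between successive levels is unknown.
    """
    ordered = sorted(values, key=lambda v: rank(v, levels))
    return ordered[(len(ordered) - 1) // 2]

# Hypothetical observations, made up for illustration.
observations = ["large", "small", "medium", "small", "large"]

print(sorted(observations, key=rank))   # values in rank order
print(ordinal_median(observations))     # middle value by rank
print(rank("small") < rank("large"))    # comparisons are meaningful
```

Comparisons and rank-based statistics are safe here; subtracting or averaging the ranks would silently assume equal spacing between levels.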
Numeric Attribute Types
Ratio
Inherent zero-point
We can speak of values as being an order of
magnitude larger than the unit of
measurement (10 K is twice as high as 5
K).
e.g., temperature in Kelvin, length,
counts,
monetary quantities (e.g., you are 100 times
richer with $100 than with $1).
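A small sketch of why the inherent zero-point matters. It compares a ratio taken on the Kelvin scale with the same two readings treated as Celsius values; Celsius is used here only as a contrasting scale without a true zero and is an assumption of this example, as is the helper `celsius_to_kelvin`.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin (0 K is the true zero-point)."""
    return celsius + 273.15

# Ratio scale: 10 K really is twice 5 K.
ratio_kelvin = 10.0 / 5.0

# The same numbers read as Celsius: dividing them directly gives 2.0,
# but the physically meaningful ratio must be taken on the Kelvin scale.
ratio_celsius_naive = 10.0 / 5.0
ratio_true = celsius_to_kelvin(10.0) / celsius_to_kelvin(5.0)

print(ratio_kelvin)         # ratio of two Kelvin readings
print(ratio_celsius_naive)  # misleading ratio of two Celsius readings
print(ratio_true)           # ratio after converting to a ratio scale
```

The converted ratio is only slightly above 1, which shows that "twice as hot" is not a valid reading of 10 °C versus 5 °C.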
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of
values
E.g., zip codes, profession, or the set of
words in a collection of documents
Note that discrete attributes may have numeric
values, such as 0 and 1 for binary attributes, or
the values 0 to 110 for the attribute Age.
Continuous Attribute
Has real numbers as attribute values
Properties of Attribute Values
Ratio
Description: For ratio variables, both differences and ratios are meaningful. (*, /)
Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
Operations: geometric mean, harmonic mean, percent variation
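The operations listed for ratio attributes can be tried with Python's standard `statistics` module. This is a minimal sketch: the sample lengths are made up, and "percent variation" is interpreted here as the coefficient of variation (standard deviation divided by mean, times 100), which is an assumption rather than something the slide defines.

```python
import statistics

# Hypothetical ratio-scaled measurements (lengths in cm).
lengths_cm = [2.0, 4.0, 8.0]

# Geometric mean: n-th root of the product of the values.
geo = statistics.geometric_mean(lengths_cm)

# Harmonic mean: n divided by the sum of reciprocals.
har = statistics.harmonic_mean(lengths_cm)

# Coefficient of variation in percent (assumed meaning of "percent variation").
cv_percent = statistics.stdev(lengths_cm) / statistics.mean(lengths_cm) * 100

print(geo, har, cv_percent)
```

All three statistics rely on ratios being meaningful, which is why they are listed for ratio attributes.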
Examples
Source: http://www.perceptualedge.com/articles/dmreview/quant_vs_cat_data.pdf
BASIC STATISTICAL DESCRIPTIONS OF DATA
Measuring the Central Tendency
Mean (algebraic measure) (sample vs. population):
Note: n is sample size and N is population size
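$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (sample) $\qquad \mu = \frac{\sum x}{N}$ (population)
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
[Figure: positively skewed and negatively skewed distributions]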
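The mean and the weighted arithmetic mean named on this slide can be sketched in a few lines. The function names, scores and weights are hypothetical, and the last lines show the usual sign of positive skew (mean pulled above the median) on made-up data.

```python
import statistics

def sample_mean(xs):
    """Arithmetic mean: sum of the values divided by the sample size n."""
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(xs) != len(ws):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

# Hypothetical scores and weights.
scores = [70, 80, 90]
weights = [1, 1, 2]

plain = sample_mean(scores)
weighted = weighted_mean(scores, weights)

# Hypothetical positively skewed data: one large value drags the mean upward.
skewed = [1, 2, 2, 3, 12]
skew_mean = sample_mean(skewed)
skew_median = statistics.median(skewed)

print(plain, weighted, skew_mean, skew_median)
```

With equal weights the weighted mean reduces to the ordinary mean, which is a quick sanity check for any implementation.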
Measuring the Dispersion of Data
Properties of Normal Distribution Curve
Boxplot Analysis
Visualization of Data Dispersion:
Boxplot Analysis
Histogram Analysis
Histograms Often Tell More than Boxplots
Quantile Plot
Displays all of the data (allowing the user to assess
both the overall behavior and unusual occurrences)
Plots quantile information
For data x_i sorted in increasing order, f_i indicates
that approximately 100 f_i % of the data are below or
equal to the value x_i
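The points of a quantile plot can be computed without any plotting library. The sketch below assumes the common convention f_i = (i - 0.5) / n for the i-th smallest value; the slide does not state which formula it uses, and the function name and data are hypothetical.

```python
def quantile_plot_points(data):
    """Return (f_i, x_i) pairs for a quantile plot.

    The data are sorted in increasing order and the i-th value (1-based)
    is paired with f_i = (i - 0.5) / n, an assumed convention.
    """
    xs = sorted(data)
    n = len(xs)
    return [((i - 0.5) / n, x) for i, x in enumerate(xs, start=1)]

# Hypothetical unit prices.
points = quantile_plot_points([30, 10, 20, 40])
for f, x in points:
    print(f"{100 * f:.1f}% of the data are at or below about {x}")
```

Plotting f_i on the horizontal axis against x_i on the vertical axis gives the quantile plot; every observation appears as one point.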
Quantile-Quantile (Q-Q) Plot
Graphs the quantiles of one univariate distribution
against the corresponding quantiles of another
View: Is there a shift in going from one distribution to
another?
Example shows unit price of items sold at Branch 1 vs.
Branch 2 for each quantile. Unit prices of items sold at
Branch 1 tend to be lower than those at Branch 2.
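A minimal sketch of how the points of a Q-Q plot might be produced: compute the same quantiles for both samples and pair them up. It uses `statistics.quantiles` from the standard library; the branch prices are made up for illustration (Branch 2 is simply Branch 1 shifted by 10), and `qq_pairs` is a hypothetical helper.

```python
import statistics

def qq_pairs(sample_a, sample_b, n=4):
    """Pair the n-quantile cut points of two samples (n - 1 pairs)."""
    qa = statistics.quantiles(sample_a, n=n, method="inclusive")
    qb = statistics.quantiles(sample_b, n=n, method="inclusive")
    return list(zip(qa, qb))

# Hypothetical unit prices; Branch 2 is Branch 1 shifted up by 10.
branch1 = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90]
branch2 = [price + 10 for price in branch1]

pairs = qq_pairs(branch1, branch2, n=4)
for q1, q2 in pairs:
    # Points above the line y = x mean Branch 2 is higher at that quantile.
    print(q1, q2, "above y = x" if q2 > q1 else "on or below y = x")
```

If every pair lies above the line y = x, as in this made-up data, the second distribution is shifted upward relative to the first.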
Scatter plot
Provides a first look at bivariate data to see
clusters of points, outliers, etc.
Each pair of values is treated as a pair of
coordinates and plotted as points in the plane
Positively and Negatively Correlated Data
If the pattern of plotted points slopes from lower left to upper right, this means that the values of X increase as the values of Y increase, which suggests a positive correlation.
If the pattern of plotted points slopes from upper left to lower right, then the values of X increase as the values of Y decrease, suggesting a negative correlation.
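The visual slope described above can be checked numerically. The sketch below uses the Pearson correlation coefficient, which the slide does not name; it is brought in here as one common way to quantify the direction of a linear trend. The data and the function name `pearson_r` are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

xs = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]      # slopes from lower left to upper right
falling = [10, 8, 6, 4, 2]    # slopes from upper left to lower right

r_rising = pearson_r(xs, rising)
r_falling = pearson_r(xs, falling)
print(r_rising, r_falling)
```

A positive coefficient matches the lower-left to upper-right pattern, a negative one the upper-left to lower-right pattern, and a value near zero suggests no linear relationship.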
Uncorrelated Data
Exercise 1
Geometric Techniques
Scatterplot Matrices
[Figure: news articles visualized as a landscape. Used by permission of B. Wright, Visible Decisions Inc.]
Stick Figures
General techniques
Shape Coding: Use shape to represent certain
information encoding
Color Icons: Using color icons to encode more
information
TileBars: The use of small icons representing the
Chernoff Faces
A way to display variables on a two-dimensional surface,
e.g., let x be eyebrow slant, y be eye size, z be nose length,
etc.
The figure shows faces produced using 10 characteristics
(head eccentricity, eye size, eye spacing, eye eccentricity,
pupil size, eyebrow slant, nose size, mouth shape, mouth
size, and mouth opening), each assigned one of 10 possible
values, generated using Mathematica (S. Dickson)
Dimensional Stacking
[Figure: dimensional stacking, with axes labeled attribute 1, attribute 2, attribute 3 and attribute 4]
Partitioning of the n-dimensional attribute space in 2-D
subspaces, which are stacked into each other
Partitioning of the attribute value ranges into classes.
The important attributes should be used on the outer
levels.
Adequate for data with ordinal attributes of low cardinality
But, difficult to display more than nine dimensions
Important to map dimensions appropriately
Dimensional Stacking
Visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes
and ore grade and depth mapped to the inner x-, y-axes
Tree-Map
Screen-filling method which uses a hierarchical
partitioning of the screen into regions depending on the
attribute values
The x- and y-dimension of the screen are partitioned
alternately according to the attribute values (classes)
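The alternating partitioning described above can be sketched as a small recursive layout routine (often called "slice and dice"). This is an illustrative sketch, not a specific tool's implementation: the tree structure, the use of a numeric `size` per leaf to decide region widths, and the example hierarchy are all assumptions.

```python
def total_size(node):
    """Size of a node: its own size for a leaf, else the sum over its children."""
    children = node.get("children", [])
    if not children:
        return node["size"]
    return sum(total_size(child) for child in children)

def treemap(node, x, y, w, h, depth=0):
    """Return (name, x, y, width, height) rectangles for all leaves.

    The region is split along x at even depths and along y at odd depths,
    with each child receiving a share proportional to its size.
    """
    children = node.get("children", [])
    if not children:
        return [(node["name"], x, y, w, h)]
    rects = []
    total = total_size(node)
    offset = 0.0  # fraction of the region already handed out
    for child in children:
        frac = total_size(child) / total
        if depth % 2 == 0:
            rects += treemap(child, x + offset * w, y, frac * w, h, depth + 1)
        else:
            rects += treemap(child, x, y + offset * h, w, frac * h, depth + 1)
        offset += frac
    return rects

# Hypothetical hierarchy, e.g. folders and files with sizes.
tree = {"name": "root", "children": [
    {"name": "docs", "children": [
        {"name": "a", "size": 2},
        {"name": "b", "size": 2},
    ]},
    {"name": "src", "size": 4},
]}

rects = treemap(tree, 0, 0, 8, 4)
for rect in rects:
    print(rect)
```

The leaf rectangles tile the whole 8 by 4 region, which is what makes the method screen-filling.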
Tree-Map of a File System
(Shneiderman)
Three-D Cone Trees
InfoCube
A 3-D visualization technique where hierarchical
information is displayed as nested semi-transparent
cubes
The outermost cubes correspond to the top level
data, while the subnodes or the lower level data
are represented as smaller cubes inside the
outermost cubes, and so on
Source of Public Datasets