Outline
• What are the types of features that make the data?
• What kind of values does each feature have?
• Which attributes are discrete, continuous?
• What do the data look like?
• How are the values distributed?
• How to visualize the data?
• Can we spot any outliers?
• Can we measure similarity of different data objects?
Types of Data Sets
• Record
  – Relational records
  – Data matrix, e.g., numerical matrix, crosstabs
  – Document data: text documents represented as term-frequency vectors
  – Transaction data
• Graph and network
  – World Wide Web
  – Social or information networks
  – Molecular structures
• Ordered
  – Video data: sequence of images
  – Temporal data: time-series
  – Sequential data: transaction sequences
  – Genetic sequence data
• Spatial, image and multimedia
  – Spatial data: maps
  – Image data
  – Video data

Example of document data (term-frequency vectors):

              team  coach  play  ball  score  game  win  lost  timeout  season
  Document 1     3      0     5     0      2     6    0     2        0       2
  Document 2     0      7     0     2      1     0    0     3        0       0
  Document 3     0      1     0     0      1     2    2     0        3       0

Example of transaction data:

  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk
Important Characteristics of Structured Data
• Dimensionality
– Curse of dimensionality
• Sparsity
– Only presence counts
• Resolution
– Patterns depend on the scale
• Distribution
– Centrality and dispersion
Data Objects
• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
– sales database: customers, store items, sales
– medical database: patients, treatments
– university database: students, professors, courses
• Also called samples, examples, instances, data points, objects, or tuples.
• Data objects are described by features.
• Database rows → data objects; database columns → features.
Features
• Features (or dimensions, attributes, variables): a data field representing a characteristic of a data object.
  – E.g., customer_ID, name, address
• Types:
  – Nominal
  – Binary
  – Ordinal
  – Numeric (quantitative)
    • Interval-scaled
    • Ratio-scaled
Feature Types
• Nominal: categories, states, or “names of things”
– Hair_color = {auburn, black, blond, brown, grey, red, white}
– marital status, occupation, ID numbers, zip codes
• Binary
– Nominal attribute with only 2 states (0 and 1)
– Symmetric binary: both outcomes equally important
• e.g., gender
– Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV positive)
• Ordinal
– Values have a meaningful order (ranking) but magnitude between successive values is not
known.
– Size = {small, medium, large}, grades, army rankings
Numeric Feature Types
• Quantity (integer or real-valued)
• Interval-scaled
  – Measured on a scale of equal-sized units
  – Values have order
    • E.g., temperature in C° or F°, calendar dates
  – No true zero-point
• Ratio-scaled
  – Inherent zero-point
  – E.g., count features, years of experience, number of words (when the objects are documents)
Discrete vs. Continuous Features
• Discrete Feature
– Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of
documents
– Note: Binary attributes are a special case of discrete attributes
• Continuous Feature
– Has real numbers as feature values
• E.g., temperature, height, or weight
– Continuous attributes are typically represented as floating-point variables
Basic Statistical Descriptions of Data
• Motivation
– To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
– median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
– Data dispersion: analyzed with multiple granularities of precision
– Boxplot or quantile analysis on sorted intervals
Measuring the Central Tendency
• Mean (algebraic measure), sample vs. population, where n is the sample size and N is the population size:
  $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$
  – Weighted arithmetic mean:
    $$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
  – Trimmed mean: mean after chopping off extreme values
• Median
  – Middle value if odd number of values; average of the middle two values otherwise
  – Estimated by interpolation (for grouped data), where $L_1$ is the lower boundary of the median interval:
    $$\text{median} = L_1 + \left(\frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$
• Mode
  – Value that occurs most frequently in the data
  – Unimodal, bimodal, trimodal
  – Empirical formula: $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
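The measures above can be computed with Python's standard library; a minimal sketch on a small hypothetical sample (the values are illustrative only):

```python
import statistics

# Hypothetical sample of 9 observations (illustrative values only)
data = [30, 36, 47, 50, 52, 52, 56, 60, 70]

mean = statistics.mean(data)      # (1/n) * sum of x_i
median = statistics.median(data)  # middle value (n is odd here)
mode = statistics.mode(data)      # most frequent value

# Trimmed mean: drop the extreme values before averaging
trimmed = statistics.mean(sorted(data)[1:-1])

print(mean, median, mode, trimmed)
```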
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, symmetric
positively and negatively skewed data
positively skewed negatively skewed
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)
– Inter-quartile range: IQR = Q3 – Q1
– Five number summary: min, Q1, median, Q3, max
– Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers
individually
  – Outlier: usually, a value higher than Q3 + 1.5 × IQR or lower than Q1 − 1.5 × IQR
• Variance and standard deviation (sample: s, population: σ)
  – Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
  – Population variance (algebraic, scalable computation):
    $$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$
  – Standard deviation s (or σ) is the square root of variance s² (or σ²)
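A sketch of the five-number summary, the 1.5 × IQR outlier rule, and sample variance using only the standard library (the data values, including the planted outlier, are hypothetical):

```python
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 70, 110]  # 110 is a planted outlier

# Quartiles via the default 'exclusive' method; q2 is the median
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Common rule of thumb: flag values beyond 1.5 * IQR from the quartiles
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low or x > high]

s2 = statistics.variance(data)  # sample variance (divides by n - 1)
s = statistics.stdev(data)      # sample standard deviation

print(outliers, round(s, 2))
```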
Graphic Displays of Basic Statistical
Descriptions
• Boxplot: graphic display of five-number summary
• Histogram: x-axis shows values, y-axis represents frequencies
• Scatter plot: each pair of values is a pair of coordinates and plotted as points in
the plane
Boxplot Analysis
• Five-number summary of a distribution
– Minimum, Q1, Median, Q3, Maximum
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extended to
Minimum and Maximum
– Outliers: points beyond a specified outlier
threshold, plotted individually
Histogram Analysis
• Histogram: graph display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts; this distinction is crucial when the categories are not of uniform width
• The categories are usually specified as non-overlapping intervals of some variable; the categories (bars) must be adjacent
• [Figure: example histogram over intervals from 10000 to 90000]
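Equal-width binning behind a histogram can be sketched with a Counter; with uniform bin widths, bar height is proportional to bar area (the price values below are hypothetical):

```python
from collections import Counter

prices = [42, 45, 48, 51, 53, 57, 58, 61, 64, 64, 67, 71, 74, 78, 82, 88, 95]
width = 10  # equal-width, adjacent, non-overlapping bins

# Map each value to the left edge of its bin and tally frequencies
counts = Counter((p // width) * width for p in prices)

for left in sorted(counts):
    print(f"[{left}, {left + width}): {'#' * counts[left]} ({counts[left]})")
```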
Histograms Often Tell More than Boxplots
• The two histograms shown on the left may have the same boxplot representation
– The same values for: min, Q1,
median, Q3, max
• But they have rather different data
distributions
Scatter plot
• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as points in
the plane
Positively and Negatively Correlated Data
• The left half of the figure is positively correlated
• The right half is negatively correlated
Uncorrelated Data
• [Figure: scatter plots of uncorrelated data]
Similarity and Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are
– Value is higher when objects are more alike
– Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
• Proximity refers to a similarity or dissimilarity
Data Matrix and Dissimilarity Matrix
• Data matrix
  – n data points with p dimensions
  – Two modes (rows are objects, columns are attributes)
  $$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$
• Dissimilarity matrix
  – n data points, but registers only the distance
  – A triangular matrix
  – Single mode
  $$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
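A minimal sketch of deriving the lower-triangular dissimilarity matrix from a data matrix, using Euclidean distance and the four 2-D points from the example slides later in this section:

```python
import math

# Data matrix: n = 4 points, p = 2 dimensions
X = [(1, 2), (3, 5), (2, 0), (4, 5)]

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Dissimilarity matrix: row i stores d(i, j) for j < i, then the 0 diagonal
D = [[round(euclidean(X[i], X[j]), 2) for j in range(i)] + [0.0]
     for i in range(len(X))]

for row in D:
    print(row)
```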
Proximity Measure for Nominal Attributes
• Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
• Method 1: Simple matching
  – m: # of matches, p: total # of variables
  $$d(i, j) = \frac{p - m}{p}$$
• Method 2: Use a large number of binary attributes
  – Create a new binary attribute for each of the M nominal states
  – E.g., mapping colors: set the binary attribute for yellow to 1 and the rest to 0
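Method 1 (simple matching) in code; the three nominal attributes and their values below are hypothetical:

```python
# Hypothetical objects described by (hair_color, marital_status, occupation)
obj_i = ("black", "single", "teacher")
obj_j = ("black", "married", "teacher")

p = len(obj_i)                                 # total number of variables
m = sum(a == b for a, b in zip(obj_i, obj_j))  # number of matches
d = (p - m) / p                                # d(i, j) = (p - m) / p

print(d)  # 2 of 3 attributes match, so d = 1/3
```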
Proximity Measure for Binary Attributes
• A contingency table for binary data, where q, r, s, t count the attribute-value combinations over objects i and j:

                     Object j
                     1        0        sum
  Object i    1      q        r        q + r
              0      s        t        s + t
              sum    q + s    r + t    p

• Distance measure for symmetric binary variables:
  $$d(i, j) = \frac{r + s}{q + r + s + t}$$
• Distance measure for asymmetric binary variables (negative matches t are ignored):
  $$d(i, j) = \frac{r + s}{q + r + s}$$
• Jaccard coefficient (similarity measure for asymmetric binary variables):
  $$\text{sim}_{\text{Jaccard}}(i, j) = \frac{q}{q + r + s}$$
Dissimilarity between Binary Variables
• Example

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack  M       Y      N      P       N       N       N
  Mary  F       Y      N      P       N       P       N
  Jim   M       Y      P      N       N       N       N

  – Gender is a symmetric attribute
  – The remaining attributes are asymmetric binary
  – Let the values Y and P be 1, and the value N be 0
  – Using the asymmetric-binary distance over the six asymmetric attributes:
  $$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
  $$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
  $$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
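The asymmetric-binary distance used in this example can be sketched as follows, reproducing the three values above:

```python
def asym_binary_dist(x, y):
    """d(i, j) = (r + s) / (q + r + s); negative matches (t) are ignored."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

# Asymmetric attributes (Fever, Cough, Test-1..Test-4), with Y/P -> 1 and N -> 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asym_binary_dist(jack, mary), 2))  # 0.33
print(round(asym_binary_dist(jack, jim), 2))   # 0.67
print(round(asym_binary_dist(jim, mary), 2))   # 0.75
```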
Example: Data Matrix and Dissimilarity Matrix
Data Matrix

  point  attribute1  attribute2
  x1     1           2
  x2     3           5
  x3     2           0
  x4     4           5

Dissimilarity Matrix (with Euclidean distance)

        x1     x2     x3     x4
  x1    0
  x2    3.61   0
  x3    2.24   5.1    0
  x4    4.24   1      5.39   0

$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
Distance on Numeric Data: Minkowski Distance
• Minkowski distance: a popular distance measure
  $$d(i, j) = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^h \right)^{1/h}$$
  where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
• Properties
  – d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
  – d(i, j) = d(j, i) (Symmetry)
  – d(i, j) ≤ d(i, k) + d(k, j) (Triangle inequality)
Special Cases of Minkowski Distance
• h = 1: Manhattan (city block, L1 norm) distance
  – E.g., the Hamming distance: the number of bits that are different between two binary vectors
  $$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
• h = 2: Euclidean (L2 norm) distance
  $$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
Example: Minkowski Distance
Dissimilarity Matrices

  point  attribute1  attribute2
  x1     1           2
  x2     3           5
  x3     2           0
  x4     4           5

Manhattan (L1)

  L1    x1     x2     x3     x4
  x1    0
  x2    5      0
  x3    3      6      0
  x4    6      1      7      0

Euclidean (L2)

  L2    x1     x2     x3     x4
  x1    0
  x2    3.61   0
  x3    2.24   5.1    0
  x4    4.24   1      5.39   0
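A general Minkowski distance covering both special cases; the L1 and L2 entries above follow from h = 1 and h = 2:

```python
def minkowski(a, b, h):
    """L_h norm distance: (sum over k of |x_ik - x_jk| ** h) ** (1 / h)."""
    return sum(abs(ai - bi) ** h for ai, bi in zip(a, b)) ** (1 / h)

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))            # Manhattan: |1-3| + |2-5| = 5.0
print(round(minkowski(x1, x2, 2), 2))  # Euclidean: sqrt(4 + 9) = 3.61
```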
Cosine Similarity
• A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
• Other vector objects: gene features in micro-arrays, …
• Applications: information retrieval, biologic taxonomy, gene feature mapping, …
• Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then
  cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||),
  where · indicates the vector dot product and ||d|| is the length (Euclidean norm) of vector d
Example: Cosine Similarity
• cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d
• Ex: find the similarity between documents 1 and 2.
  d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
  d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
  d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
  ||d1|| = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 ≈ 6.481
  ||d2|| = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = 17^0.5 ≈ 4.123
  cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
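The worked example can be checked with a short cosine function:

```python
import math

def cosine(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||) for numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(round(cosine(d1, d2), 2))  # 0.94
```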
Summary
• Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
• Many types of data sets, e.g., numerical, text, graph, Web, image.
• Gain insight into the data by:
– Basic statistical data description: central tendency, dispersion, graphical displays
– Data visualization: map data onto graphical primitives
– Measure data similarity
• Above steps are the beginning of data preprocessing.
• Many methods have been developed, but this is still an active area of research.