Unit 2 Data Preprocessing
Objects
– A collection of attributes describes an object
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
Examples:
– Same person with multiple email addresses
Data cleaning
– Process of dealing with duplicate data issues
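One possible way to handle the "same person, multiple email addresses" case is to group records on an identity key and collect the emails per person. The sketch below is only illustrative: the records, the field names, and the choice of key (normalized name plus birth date) are assumptions, not something the slides prescribe.

```python
from collections import defaultdict

# Hypothetical records: the same person appears under several email addresses.
records = [
    {"name": "Ann Lee", "birth_date": "1990-04-12", "email": "ann@example.com"},
    {"name": "ann lee ", "birth_date": "1990-04-12", "email": "a.lee@example.org"},
    {"name": "Bob Ray", "birth_date": "1985-01-30", "email": "bob@example.com"},
]


def person_key(record):
    """Assumed identity key: lowercased, whitespace-normalized name plus birth date."""
    normalized_name = " ".join(record["name"].lower().split())
    return (normalized_name, record["birth_date"])


def merge_duplicates(rows):
    """Group rows by person_key and collect every email seen for that person."""
    emails_by_person = defaultdict(list)
    for row in rows:
        emails_by_person[person_key(row)].append(row["email"])
    return dict(emails_by_person)


merged = merge_duplicates(records)
# Three input rows collapse to two people; Ann keeps both of her addresses.
```

In practice the identity key is the hard part; real data usually needs fuzzier matching than exact equality on a cleaned name.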
1. Data cleaning
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
2. Data integration
– Integration of multiple databases, data cubes, or files
3. Data transformation
– Normalization and aggregation
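As a small illustration of the normalization step, here is a sketch of min-max normalization, one common normalization technique (the slides do not say which one is meant). It is applied to the income figures, in thousands, from the object table earlier in these notes. The function name and its parameters are assumptions made for this example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map everything to new_min to avoid dividing by zero.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


incomes_k = [95, 60, 220, 85]  # taxable incomes in thousands, from the table rows
normalized = min_max_normalize(incomes_k)
# 60K maps to 0.0 and 220K maps to 1.0; the others fall in between.
```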
Methods
Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
Regression
– smooth by fitting the data to regression functions
Clustering
– detect and remove outliers
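The binning method above can be sketched as follows: sort the values, split them into equal-frequency bins, then replace each value by its bin mean or by the nearer bin boundary. The price list is illustrative data, and the function names are assumptions for this example. The sketch also assumes the number of bins does not exceed the number of values.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value when sizes are uneven.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin by that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's min or max (ties go to the min)."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # illustrative data
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)            # bin means: 9.0, 22.75, 29.25
by_boundaries = smooth_by_boundaries(bins)  # first bin becomes [4, 4, 4, 15]
```

Smoothing by bin median works the same way, with the median in place of the mean.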
• Also, it is possible to predict one variable using the other variable.
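A minimal sketch of predicting one variable from the other, assuming the relationship is roughly linear and using ordinary least squares on a single predictor. The sample points and the function name are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the xs are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4, 5]               # illustrative data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)
predicted = slope * 6 + intercept  # predict y for an unseen x = 6
```

The same fitted line can also be used for smoothing: replacing each observed y with the value on the line removes the noise around the trend.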
Data integration:
– Combines data from multiple sources into a coherent store
Attribute/feature construction
– New attributes constructed from the given ones
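For instance, an area attribute can be constructed from width and height. The sketch below assumes hypothetical records and field names; it is only meant to show the idea of deriving a new attribute from existing ones.

```python
# Hypothetical records with two given attributes.
rooms = [
    {"width_m": 4.0, "height_m": 2.5},
    {"width_m": 3.0, "height_m": 3.0},
]


def add_area(rows):
    """Return copies of the rows with a constructed 'area_m2' attribute."""
    return [{**row, "area_m2": row["width_m"] * row["height_m"]} for row in rows]


rooms_with_area = add_area(rooms)
```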
Data reduction
– Obtain a reduced representation of the data set that is much smaller in
volume yet produces the same (or almost the same) analytical
results
Purpose
– Data reduction
    Reduce the number of attributes or objects
– Change of scale
    Cities aggregated into regions, states, countries, etc.
– More “stable” data
    Aggregated data tends to have less variability
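The points above can be illustrated with a small sketch that aggregates city-level values into region-level averages and compares the variability before and after. The city names, regions, and sales figures are invented for the example, and averaging is just one possible aggregation function.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Hypothetical (region, city, sales) rows.
city_sales = [
    ("North", "A", 10),
    ("North", "B", 30),
    ("South", "C", 20),
    ("South", "D", 40),
]


def aggregate_by_region(rows):
    """Change of scale: collapse city rows into one average value per region."""
    grouped = defaultdict(list)
    for region, _city, sales in rows:
        grouped[region].append(sales)
    return {region: mean(values) for region, values in grouped.items()}


region_means = aggregate_by_region(city_sales)

# Fewer objects (2 regions instead of 4 cities) and less spread in the values.
city_variance = pvariance([sales for _, _, sales in city_sales])
region_variance = pvariance(list(region_means.values()))
```

Here the four city values have a population variance of 125, while the two region averages have a variance of 25, which matches the claim that aggregated data tends to be more stable.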