Data Preprocessing
PES Modern college of Engineering, MCA Department Prepared by : Dr.Prakash Kene
Why Data Preprocessing?
Data in the real world is dirty.
– incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate
data
e.g., occupation=“ ”
– noisy: containing errors or outliers
e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes or
names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
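The three kinds of problems above can be checked for programmatically. A minimal sketch, assuming hypothetical records with `occupation`, `salary`, `age` and `birth_year` fields and a hypothetical reference year (none of these names or values come from the slides):

```python
# Hypothetical records illustrating the three problem types.
records = [
    {"occupation": "", "salary": 52000, "age": 30, "birth_year": 1994},
    {"occupation": "clerk", "salary": -10, "age": 41, "birth_year": 1983},
    {"occupation": "nurse", "salary": 61000, "age": 42, "birth_year": 1997},
]

REFERENCE_YEAR = 2024  # assumption: the year the data was recorded


def find_problems(record, reference_year=REFERENCE_YEAR):
    """Return a list of problem labels for one record."""
    problems = []
    # Incomplete: an attribute value is missing or blank.
    if any(value in ("", None) for value in record.values()):
        problems.append("incomplete")
    # Noisy: a value that cannot be correct, such as a negative salary.
    if record["salary"] is not None and record["salary"] < 0:
        problems.append("noisy")
    # Inconsistent: age disagrees with the birth year by more than a year.
    if abs((reference_year - record["birth_year"]) - record["age"]) > 1:
        problems.append("inconsistent")
    return problems


report = [find_problems(r) for r in records]
```

Each rule here is deliberately simple; real checks depend on the domain rules of the data set.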
What is Data?
Collection of data objects and their attributes
An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, or feature
A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance
Example data set (columns are attributes, rows are objects):
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
Data Quality
What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Examples of data quality problems:
– Noise and outliers
– missing values
– duplicate data
Noise
Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor
phone and “snow” on television screen
[Figure: two panels, “Two Sine Waves” and “Two Sine Waves + Noise”]
Outliers
Outliers are data objects with characteristics that are
considerably different than most of the other data objects
in the data set
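One common way to flag such objects for a single numeric attribute is the interquartile-range (IQR) rule. The slides do not name a detection method, so the rule and the factor 1.5 are assumptions made for this sketch; the values are the Taxable Income column of the example data set, in thousands.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]  # in thousands
outliers = iqr_outliers(incomes)
```

With these values only the 220K income falls outside the fences. Whether a flagged object is an error or a genuine but rare case still has to be judged separately.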
Missing Values
Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Handling missing values
– Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
– Replace with all possible values (weighted by their probabilities)
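Three of these strategies can be sketched on a small categorical attribute. The records are hypothetical, and the probabilities used for weighting are simply the observed frequencies, which is an assumption rather than something the slides prescribe.

```python
from collections import Counter

# Hypothetical objects; None marks a missing marital status.
people = [
    {"id": 1, "status": "Single"},
    {"id": 2, "status": "Married"},
    {"id": 3, "status": None},
    {"id": 4, "status": "Married"},
]

# Eliminate data objects: keep only objects with no missing value.
complete = [p for p in people if p["status"] is not None]

# Ignore the missing value during analysis: count only observed statuses.
counts = Counter(p["status"] for p in people if p["status"] is not None)

# Replace with all possible values, weighted by their observed probabilities.
total = sum(counts.values())
weighted = []
for p in people:
    if p["status"] is None:
        for status, n in counts.items():
            weighted.append({"id": p["id"], "status": status, "weight": n / total})
    else:
        weighted.append({**p, "weight": 1.0})
```

The weighted version keeps every object: the object with the missing status is split into one copy per possible value, and the weights of its copies sum to 1.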
Duplicate Data
Data set may include data objects that are duplicates, or
almost duplicates of one another
– Major issue when merging data from heterogeneous sources
Examples:
– Same person with multiple email addresses
Data cleaning
– Process of dealing with duplicate data issues
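A minimal sketch of merging near-duplicates, assuming hypothetical records and assuming that a normalized name is enough to identify a person (in real data it usually is not, since different people can share a name):

```python
def normalize(name):
    """Lower-case a name and collapse whitespace so trivial variants match."""
    return " ".join(name.lower().split())


# Hypothetical records merged from two sources.
records = [
    {"name": "Asha Rao", "email": "asha@example.com"},
    {"name": "asha  rao", "email": "a.rao@example.org"},
    {"name": "Vikram Shah", "email": "vikram@example.com"},
]

# Group records by normalized name; each group becomes one merged object
# that keeps every email address seen for that person.
merged = {}
for r in records:
    key = normalize(r["name"])
    merged.setdefault(key, {"name": key, "emails": []})
    if r["email"] not in merged[key]["emails"]:
        merged[key]["emails"].append(r["email"])

deduplicated = list(merged.values())
```

The merged object keeps both email addresses instead of discarding one, so no information is lost when the duplicates are collapsed.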
Why Preprocess the Data?
Reasons for data cleaning
– Incomplete data (missing data)
– Noisy data (contains errors)
– Inconsistent data (containing discrepancies)
Reasons for data integration
– Data comes from multiple sources
Reasons for data transformation
– Some data must be transformed to be used for mining
Reasons for data reduction
– Performance
No quality data, no quality mining results!
Major Tasks in Data Preprocessing
1.Data cleaning
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
2.Data integration
– Integration of multiple databases, data cubes, or files
3.Data transformation
– Normalization and aggregation
4.Data reduction (Sampling, dimensionality reduction,
feature subset selection)
– Obtains a representation that is reduced in volume but produces the
same or similar analytical results
5.Data discretization
– Classification algorithms sometimes require the data to be in the
form of categorical attributes
– Algorithms that find association patterns require the data to be in the
form of binary attributes
– Thus a continuous attribute often has to be transformed into a
categorical attribute (discretization)
– Part of data reduction but with particular importance, especially for
numerical data
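A minimal sketch of both transformations, assuming equal-width intervals and three hypothetical labels (the slides do not fix a discretization method). The values are the Taxable Income column of the example data set, in thousands.

```python
def equal_width_bins(values, labels):
    """Map each numeric value to one of len(labels) equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / len(labels)
    result = []
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it.
        index = min(int((v - low) / width), len(labels) - 1)
        result.append(labels[index])
    return result


def binarize(categories, labels):
    """One binary (0/1) attribute per category, as association mining needs."""
    return [{label: int(c == label) for label in labels} for c in categories]


incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]  # in thousands
labels = ["low", "medium", "high"]
income_levels = equal_width_bins(incomes, labels)
income_flags = binarize(income_levels, labels)
```

Equal-width intervals are sensitive to extreme values: here the single 220K income stretches the range so that most objects land in "low". The sketch also assumes the values are not all identical, since the interval width would then be zero.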
Forms of Data Preprocessing
1.Data Cleaning
Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
– Resolve redundancy caused by data integration
1.Data Cleaning : How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing
(assuming the task is classification); not effective
unless the tuple contains several attributes with
missing values
Fill in the missing value manually: time-consuming and not
feasible for large datasets
Fill it in automatically with
– a global constant : e.g., “unknown”, a new class?!
– the attribute mean
– the most probable value: inference-based such as Bayesian
formula or regression or decision tree induction
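The automatic options can be sketched as follows. The tuples are hypothetical, and the most frequent value (mode) is used only as a simple stand-in for "the most probable value"; it is not the Bayesian, regression or decision-tree inference the slide refers to.

```python
import statistics

UNKNOWN = "unknown"  # global constant; in effect a new category

# Hypothetical tuples; None marks a missing value.
rows = [
    {"occupation": "clerk", "income": 40},
    {"occupation": None, "income": 55},
    {"occupation": "clerk", "income": None},
    {"occupation": "nurse", "income": 70},
]

observed_incomes = [r["income"] for r in rows if r["income"] is not None]
observed_jobs = [r["occupation"] for r in rows if r["occupation"] is not None]

income_mean = statistics.mean(observed_incomes)  # the attribute mean
likely_job = statistics.mode(observed_jobs)      # stand-in for "most probable"

filled_constant = [{**r, "occupation": r["occupation"] or UNKNOWN} for r in rows]
filled_mean = [
    {**r, "income": income_mean if r["income"] is None else r["income"]}
    for r in rows
]
filled_mode = [{**r, "occupation": r["occupation"] or likely_job} for r in rows]
```

Filling with a constant risks the mining algorithm treating "unknown" as a meaningful class, which is the concern behind the "a new class?!" remark above.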
1.Data Cleaning : How to Handle Noisy Data?
Noise: a random error or variance in a measured variable.
Incorrect attribute values may be due to
– faulty data collection
– data entry problems
– data transmission problems
– data conversion errors
– data decay problems
– technology limitations, e.g. buffer overflow or field size limits
1.Data Cleaning : How to Handle Noisy Data?
Methods
Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
Regression
– smooth by fitting the data into regression functions
Clustering
– detect and remove outliers
Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal with
possible outliers)
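The binning method can be sketched as follows, assuming a hypothetical list of nine prices and bins of three values each. Ties in the boundary rule go to the lower boundary, which is a choice made for this sketch.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into bins of bin_size items each."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical prices
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

For these prices the bins are [4, 8, 15], [21, 21, 24] and [25, 28, 34]; smoothing by means turns them into 9, 22 and 29, while smoothing by boundaries gives [4, 4, 15], [21, 21, 24] and [25, 25, 34].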
1.Data Cleaning : Regression
•Data can be smoothed by fitting the data to a function, such as with
regression.
•Linear regression involves finding the ‘best’ line to fit 2 variables.
•Also, it is possible to predict one variable using the other variable.
[Figure: line y = x + 1 on x and y axes, with the points Y1 and Y1’ marked at X1]
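A minimal sketch of smoothing with simple linear regression, using the ordinary least-squares formulas. The observations are hypothetical values scattered around the line y = x + 1.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical noisy observations scattered around the line y = x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.2, 2.8, 4.1, 4.9, 6.0]

slope, intercept = fit_line(xs, ys)
smoothed = [slope * x + intercept for x in xs]   # fitted values replace ys
predicted = slope * 6 + intercept                # predict y for an unseen x
```

The fitted line for these points has a slope of about 0.97 and an intercept of about 1.09, close to the underlying y = x + 1. Replacing each observed y with its fitted value is the smoothing step; evaluating the line at a new x is the prediction step.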
1.Data Cleaning : Cluster Analysis
2. Data Integration
Data integration:
– Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
– Integrate metadata from different sources
Entity identification problem:
– Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different
sources are different
– Possible reasons: different representations, different scales
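These steps can be sketched for two tiny sources. The field names (`cust_id`, `cust_no`, `weight_kg`, `weight_lb`) and the values are hypothetical, and matching on a shared customer key is an assumption; real entity identification is usually harder.

```python
# Hypothetical source A: key "cust_id", weight in kilograms.
source_a = [{"cust_id": 1, "name": "Bill Clinton", "weight_kg": 80.0}]
# Hypothetical source B: key "cust_no", weight in pounds.
source_b = [{"cust_no": 1, "name": "William Clinton", "weight_lb": 176.0}]

LB_PER_KG = 2.20462


def from_b(row):
    """Map a source-B row onto source A's schema and units."""
    return {
        "cust_id": row["cust_no"],  # schema integration: rename the key
        "name": row["name"],
        "weight_kg": round(row["weight_lb"] / LB_PER_KG, 1),  # common scale
    }


# Entity identification: here the shared customer key decides that two rows
# describe the same real-world entity, even though the names differ.
integrated = {row["cust_id"]: dict(row) for row in source_a}
for row in map(from_b, source_b):
    match = integrated.setdefault(row["cust_id"], row)
    # Record a conflict instead of silently overwriting a differing value.
    match["conflicts"] = [k for k in row if match.get(k) != row[k]]
```

After conversion to a common unit the two weights still differ slightly, so both `name` and `weight_kg` are reported as conflicts for a person or a rule to resolve.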
2. Data Integration : Handling Redundancy in Data Integration
Redundant data often occur when integrating multiple
databases
– Object identification: The same attribute or object may have
different names in different databases
– Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue
Redundant attributes can often be detected by
correlation analysis
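For numeric attributes, correlation analysis is commonly done with the Pearson correlation coefficient. A minimal sketch with hypothetical attributes, where annual revenue is derived directly from monthly revenue:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical attributes: annual revenue is derivable from monthly revenue.
monthly_revenue = [10, 12, 9, 15, 11]
annual_revenue = [12 * m for m in monthly_revenue]
store_age_years = [4, 1, 3, 3, 4]

r_derived = pearson(monthly_revenue, annual_revenue)
r_other = pearson(monthly_revenue, store_age_years)
```

A coefficient close to +1 or -1, as for the derived attribute here, suggests that one of the two attributes is redundant. The cut-off for "close" is a judgment call, and a high correlation shows redundancy for mining purposes, not causation.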
Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified
range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
Attribute/feature construction
– New attributes constructed from the given ones
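The three normalization methods can be sketched as follows. The input values are hypothetical, and the use of the population standard deviation in the z-score is an assumption (the sample standard deviation is also used in practice).

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    low, high = min(values), max(values)
    scale = (new_max - new_min) / (high - low)
    return [(v - low) * scale + new_min for v in values]


def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |v| below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


incomes = [12000, 54000, 73600, 98000]  # hypothetical incomes
scaled = min_max(incomes)
standardized = z_score(incomes)
shifted = decimal_scaling([-986, 917])
```

Min-max maps 73600 to about 0.716 on the range [0, 1], and decimal scaling divides -986 and 917 by 1000. Min-max and z-score both assume the attribute is not constant, since they divide by the range and the standard deviation respectively.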
Data Reduction Strategies
Why data reduction?
– A database/data warehouse may store terabytes of data
– Complex data analysis/mining may take a very long time to run on the
complete data set
Data reduction
– Obtain a reduced representation of the data set that is much smaller in
volume yet produces the same (or almost the same) analytical
results
Data reduction strategies
– Aggregation
– Sampling
– Dimensionality Reduction
– Feature subset selection
– Feature creation
– Discretization and Binarization
– Attribute Transformation
Data Reduction : Aggregation
Combining two or more attributes (or objects) into a single
attribute (or object)
Purpose
– Data reduction
Reduce the number of attributes or objects
– Change of scale
Cities aggregated into regions, states, countries, etc
– More “stable” data
Aggregated data tends to have less variability
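A minimal sketch of aggregation as a change of scale, assuming hypothetical monthly sales figures for four cities in two regions:

```python
from collections import defaultdict

# Hypothetical monthly sales per city: (region, city, month, sales).
sales = [
    ("West", "Pune", 1, 10), ("West", "Pune", 2, 18),
    ("West", "Mumbai", 1, 22), ("West", "Mumbai", 2, 14),
    ("South", "Chennai", 1, 9), ("South", "Chennai", 2, 15),
    ("South", "Kochi", 1, 13), ("South", "Kochi", 2, 7),
]

# Change of scale: sum city-level objects into one object per region and month.
by_region = defaultdict(int)
for region, _city, month, amount in sales:
    by_region[(region, month)] += amount
```

Eight city-level objects become four region-level objects. In this constructed example the regional totals are identical from month to month (32 for West, 22 for South) even though each city's sales swing noticeably, which illustrates the "more stable" point; real data will rarely cancel out this neatly.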
Data Reduction : Sampling
Sampling is the main technique employed for data selection.
– It is often used for both the preliminary investigation of the data
and the final data analysis.
Statisticians sample because obtaining the entire set of data
of interest is too expensive or time consuming.
Sampling is used in data mining because processing the
entire set of data of interest is too expensive or time
consuming.
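Simple random sampling can be sketched with Python's standard `random` module. The population of object ids, the sample size and the seed are all hypothetical; the slides do not specify a sampling scheme, so both sampling without and with replacement are shown.

```python
import random

rng = random.Random(42)            # fixed seed so the sketch is repeatable
population = list(range(1, 1001))  # hypothetical object ids

without_replacement = rng.sample(population, k=50)  # each object at most once
with_replacement = rng.choices(population, k=50)    # objects may repeat

# A rough check of representativeness: compare the sample mean with the
# population mean; the two should be close if the sample is large enough.
population_mean = sum(population) / len(population)
sample_mean = sum(without_replacement) / len(without_replacement)
```

A sample is useful only if it is representative, that is, if it has approximately the same property of interest as the full data set; comparing a summary statistic as above is one quick sanity check.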