M 2.3 Data Preprocessing
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
• Data Discretization
Data pre-processing methods
Data Cleaning
• Real-world data tends to be incomplete, noisy, and
inconsistent.
• Data cleaning fills in missing values, smooths
out noise while identifying outliers, and corrects
inconsistencies in the data.
Data cleaning methods:
1) Missing Values
• Ignore the tuple
This is usually done when the class label is missing. It
is not very effective unless the tuple contains several
attributes with missing values.
• Fill in the missing value manually
• Time-consuming, and often not feasible, for a large data set
with many missing values
• Use a global constant to fill in the missing values
For example: −∞, “unknown”, etc.
However, a mining program may misinterpret “unknown”
as a genuine value shared by all the tuples that carry it.
• Use the attribute mean to fill in the missing value. For
example, if the average customer income is 25000, use
this value to replace each missing value for income.
• Use the attribute mean of all samples belonging to the
same class as the given tuple
• Use the most probable value to fill in the missing
value. This value can be determined by regression,
inference-based tools, or decision tree induction.
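The missing-value strategies above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy records, the attribute names (age, income, label) and all function names are hypothetical, missing values are marked with None, and the "most probable value" is approximated here by a simple one-variable linear regression of income on age.

```python
# Hypothetical toy data set; None marks a missing value.
records = [
    {"age": 20, "income": 10000, "label": None},    # class label missing
    {"age": 30, "income": 20000, "label": "low"},
    {"age": 40, "income": 30000, "label": "high"},
    {"age": 50, "income": 40000, "label": "high"},
    {"age": 25, "income": None,  "label": "low"},   # income missing
    {"age": 45, "income": None,  "label": "high"},  # income missing
]

def ignore_tuples(records, label="label"):
    """Ignore the tuple: drop records whose class label is missing."""
    return [dict(r) for r in records if r[label] is not None]

def fill_constant(records, attr, constant="unknown"):
    """Replace every missing value of attr with one global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]}
            for r in records]

def fill_mean(records, attr):
    """Replace missing values of attr with the mean of the known values."""
    values = [r[attr] for r in records if r[attr] is not None]
    return fill_constant(records, attr, sum(values) / len(values))

def fill_class_mean(records, attr, label="label"):
    """Replace missing values with the mean of attr within the same class.

    Assumes every class that has a missing value also has at least
    one known value to average.
    """
    by_class = {}
    for r in records:
        if r[attr] is not None:
            by_class.setdefault(r[label], []).append(r[attr])
    means = {c: sum(v) / len(v) for c, v in by_class.items()}
    return [{**r, attr: means[r[label]] if r[attr] is None else r[attr]}
            for r in records]

def fill_regression(records, attr, predictor):
    """Predict missing values of attr from predictor by least squares."""
    pairs = [(r[predictor], r[attr]) for r in records if r[attr] is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pairs)
             / sum((x - mean_x) ** 2 for x, _ in pairs))
    intercept = mean_y - slope * mean_x
    return [{**r, attr: slope * r[predictor] + intercept
             if r[attr] is None else r[attr]}
            for r in records]

if __name__ == "__main__":
    for name, filled in [
        ("constant", fill_constant(records, "income")),
        ("mean", fill_mean(records, "income")),
        ("class mean", fill_class_mean(records, "income")),
        ("regression", fill_regression(records, "income", "age")),
    ]:
        print(name, [r["income"] for r in filled])
```

Each function returns a new list and leaves the input untouched, so the strategies can be compared side by side on the same data. With these toy records the overall mean income is 25000, matching the example in the notes, while the class means and the regression estimates differ per tuple.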
2) Noisy Data
Data: 11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21, 22, 23, 24, 30,
40, 45, 45, 45, 71, 72, 73, 75