Chapter 2: Data Preprocessing

https://machinelearningmastery.com
Why Data Preprocessing?

Measures of Data Quality:
• Consistency: Does information stored in one place match relevant data stored elsewhere?
• Uniqueness: Is this the only instance in which this information appears in the dataset?
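A minimal sketch of checking these two measures with pandas; the DataFrame and column names are hypothetical:

```python
import pandas as pd

# Hypothetical records; note the conflicting spelling "Kenia".
df = pd.DataFrame({
    "customer_id":  [1, 2, 2, 3],
    "country":      ["ET", "KE", "KE", "ET"],
    "country_name": ["Ethiopia", "Kenya", "Kenia", "Ethiopia"],
})

# Uniqueness: is each customer_id the only instance of that information?
duplicates = df[df.duplicated(subset=["customer_id"], keep=False)]
print(duplicates)

# Consistency: does the same country code always map to the same name?
name_counts = df.groupby("country")["country_name"].nunique()
print(name_counts[name_counts > 1])  # codes with conflicting names
```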
Why Data Preprocessing?
• Data in the real world is dirty:
– incomplete: lacking attribute values
– noisy: containing errors or outliers that deviate from the expected
– inconsistent: lacking compatibility (e.g., attributes representing the same concept may have different names in different databases)
• To minimize such problems, employ data cleaning routines.
• Before starting data preprocessing, it is advisable to have an overall, high-level summary of the data, such as:
– the general properties of the data
– which data values should be considered noise or outliers
• This can be done with the help of descriptive data summarization.
Descriptive data summarization
• A descriptive summary of the data can be generated with measures of its central tendency and of its dispersion.
• Measures of central tendency (computing a typical score on the variable) include:
– Mean
– Median
– Mode
– Mid-range
• Measures of dispersion (computing the degree to which the data is spread around this central tendency) include:
– Range
– Standard deviation
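A minimal sketch computing these summary measures with pandas and NumPy; the data values are made up for illustration:

```python
import pandas as pd

x = pd.Series([13, 15, 16, 16, 19, 20, 21, 22, 25, 30])

# Measures of central tendency
mean      = x.mean()
median    = x.median()
mode      = x.mode().iloc[0]           # most frequent value (16 here)
mid_range = (x.min() + x.max()) / 2    # midpoint of smallest and largest

# Measures of dispersion
rng = x.max() - x.min()
std = x.std()                          # sample standard deviation (ddof=1)

print(mean, median, mode, mid_range, rng, std)
```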
Graphic display of basic descriptive summaries
How to Handle Missing Data
• Ignore the tuple: usually done when the class label is missing.
• Fill in the missing value manually: tedious and often infeasible.
• Use a global constant to fill in the missing value (e.g., “unknown”, a new class?!): simple, but not recommended, since the constant may form a spurious pattern and mislead the decision process.
• Use the attribute mean: fill in the missing value with the mean of the attribute, computed over all samples belonging to the same class.
• Use the most probable value: fill in the missing value by predicting it from the correlations among the available values.
• Except for the first two approaches, the filled-in values may be incorrect; the last two approaches are the most common.
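A minimal sketch of three of these strategies in pandas; the column names and class attribute are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "income": [3000, None, 4500, None, 5200],
    "city":   [None, "Addis", "Addis", "Nairobi", None],
    "class":  ["yes", "yes", "no", "no", "yes"],
})

# Ignore the tuple: drop rows whose class label is missing.
df = df.dropna(subset=["class"])

# Global constant: simple but not recommended.
df["city"] = df["city"].fillna("unknown")

# Attribute mean per class: fill income with the mean of its own class.
df["income"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)
```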
Dataset preparation for Classification
• Proper procedure in classification system development involves three sets of data: a training set (to build the model), a validation set (to tune parameters), and a test set (for the final evaluation), as sketched below.
• Generally, the larger the training data, the better the classifier.
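A minimal sketch of such a three-way split using scikit-learn; the 60/20/20 proportions and the iris dataset are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off the test set, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest,
    random_state=0)  # 0.25 of the remaining 80% = 20% overall

print(len(X_train), len(X_val), len(X_test))  # 90, 30, 30
```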
Unbalanced data
• Sometimes, classes have very unequal frequency:
– medical diagnosis: 90% healthy, 10% diseased
– eCommerce: 99% don’t buy, 1% buy
• A majority-class classifier can be 97% correct, yet useless (see the sketch below).
• If the two classes are very unbalanced, evaluation of the classifier will be biased.
• With two or more classes, a good approach to balancing the class instances is to build BALANCED train and test sets.
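A minimal sketch of the “correct but useless” majority-class baseline, using scikit-learn’s DummyClassifier on hypothetical 97%/3% labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 97% negative ("don't buy"), 3% positive ("buy").
y = np.array([0] * 970 + [1] * 30)
X = np.zeros((1000, 1))  # features are irrelevant to this baseline

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)

print(accuracy_score(y, pred))  # 0.97 -- looks good
print(recall_score(y, pred))    # 0.0  -- never finds the minority class
```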
Balancing unbalanced data
• Approach (a minimal sketch follows this list):
– randomly select the desired number of minority-class instances
– add an equal number of randomly selected majority-class instances
– stratified sampling: a more advanced way of balancing the data
• Make sure that each class is represented with approximately equal proportions in both subsets.
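A minimal sketch of this approach with NumPy and scikit-learn: random undersampling of the majority class followed by a stratified train/test split (the 900/100 class counts are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical unbalanced data: 900 majority (0), 100 minority (1) samples.
X = rng.normal(size=(1000, 4))
y = np.array([0] * 900 + [1] * 100)

minority_idx = np.flatnonzero(y == 1)
majority_idx = rng.choice(np.flatnonzero(y == 0),
                          size=len(minority_idx), replace=False)
keep = np.concatenate([minority_idx, majority_idx])
X_bal, y_bal = X[keep], y[keep]

# Stratified split keeps the (now equal) class proportions in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.3, stratify=y_bal, random_state=0)
print(np.bincount(y_train), np.bincount(y_test))  # roughly equal counts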
Building Classification Model
Figure: workflow for building a classification model. Data with known results (+/−) is split into a training set fed to the model builder, a validation set on which the model’s predictions are evaluated (Y/N) to tune parameters, and a held-out test set used for the final evaluation of the final model.
Building Classification Model: Parameter tuning
Tips: Dataset size
• Before we start building a classification model, we should check whether the size of the dataset we have is adequate.
• Given a balanced dataset, the next most important aspect of goodness is the size of the dataset.
• The model should be able to converge while learning its parameters from the dataset.
• If not, appropriate measures should be taken, and care must be given when reporting performance.
• Learning curve analysis, sketched below, is well suited to assessing whether the training dataset is large enough.
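A minimal sketch of learning curve analysis with scikit-learn’s learning_curve; the digits dataset and logistic regression model are illustrative assumptions. Scores that flatten out suggest the model has converged; a still-rising validation curve suggests more data would help:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes,
                     train_scores.mean(axis=1),
                     val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  validation={va:.3f}")
```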
Tips: Dataset Size
What to do with small data?
• A small but balanced dataset can be approached in different ways so that the reported performance can still be relied on.
• Note that the total dataset we have will be divided into three parts: training, validation, and testing.
• The following are techniques to minimize the effect of the dataset size:
1. k-fold cross-validation: randomly divide the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a test set, and the method is fit on the remaining k − 1 folds.
2. Data augmentation: techniques used to increase the amount of data by adding slightly modified copies of already existing data (see the sketch below).
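A minimal sketch of data augmentation for numeric data, enlarging a small dataset with slightly noise-jittered copies of existing rows; the 5% noise scale is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))        # hypothetical small dataset
y = rng.integers(0, 2, size=50)

# Jitter each feature by noise at 5% of its standard deviation.
noise = rng.normal(scale=0.05 * X.std(axis=0), size=X.shape)
X_aug = np.vstack([X, X + noise])   # original rows plus jittered copies
y_aug = np.concatenate([y, y])      # labels are unchanged by the jitter

print(X_aug.shape)                  # (100, 4)
```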
Tips: Dataset Size
What to do with small data: using k-fold cross-validation (10-fold is the recommended choice)
Example:
— Break the data up into k groups of the same size
— Hold aside one group for testing and use the rest to build the model
— Test the model on the held-aside group
— Repeat so that each group is held aside once, and average the results
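A minimal sketch of 10-fold cross-validation with scikit-learn’s cross_val_score; the iris dataset and decision tree are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# cv=10: each fold is held aside once for testing while the other
# nine folds train the model; report the average score.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean(), scores.std())
```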
Feature Selection
• Why do we need Feature Selection (FS)?
– to improve performance (in terms of speed, predictive power, and simplicity of the model)
– to visualize the data for model selection
– to reduce dimensionality and remove noise
• FS can be done by sequential search over a candidate subset X of size k: in forward selection, X is initialized to the null set and k = 0 (where k is the size of the subset); in backward selection, X is initialized to the full feature set and k = d (see the sketch after this section).
• Feature Selection:
– When classifying novel patterns, only a small number of features need to be computed (i.e., faster classification).
– The measurement units (length, weight, etc.) of the features are preserved.
• Dimensionality Reduction:
– When classifying novel patterns, all features need to be computed.
– The measurement units (length, weight, etc.) of the features are lost.
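A minimal sketch of sequential forward selection, starting from the null set (k = 0) as described above, using scikit-learn’s SequentialFeatureSelector; the dataset and estimator are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Greedily add the feature that most improves cross-validated score,
# until the requested subset size is reached.
sfs = SequentialFeatureSelector(KNeighborsClassifier(),
                                n_features_to_select=2,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features
```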