DMPA-2 Powerpoint Slides - Modified Audio
Chapter 2
Data Preprocessing
Prepared by Andrew Hendrickson, Graduate Assistant
Data Mining and Predictive Analytics, By Daniel Larose and Chantal Larose
John Wiley & Sons, Inc, Hoboken, NJ, 2015.
Why Do We Preprocess Data?
Data Cleaning
Data Cleaning (cont’d)
• Age Field?
– Age-type fields may become obsolete as time passes
– Store date of birth instead, then derive Age when needed
• Marital Status Field Contains “S”?
– What does this symbol mean?
– Does “S” imply single or separated?
– Discuss the anomaly with the database administrator
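Deriving Age from a stored date of birth can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name `derive_age` and the sample dates are assumptions made for the example.

```python
from datetime import date

def derive_age(date_of_birth: date, as_of: date) -> int:
    """Return the age in whole years on the as_of date."""
    years = as_of.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years

# Hypothetical record: born 15 June 1990.
age_day_before_birthday = derive_age(date(1990, 6, 15), date(2015, 6, 14))
age_on_birthday = derive_age(date(1990, 6, 15), date(2015, 6, 15))
print(age_day_before_birthday, age_on_birthday)
```

Because the date of birth never changes, Age can be recomputed for any analysis date without the stored value going stale.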
Handling Missing Data
Identifying Misclassifications
Graphical Methods for Identifying Outliers
• Outliers are values that lie near the extreme limits of the data range
• Outliers may represent errors in data entry
• Certain statistical methods are very sensitive to outliers and may produce unstable results
• Neural Networks and k-Means benefit from normalized data
Graphical Methods for Identifying Outliers (cont’d)
• For extremely skewed data sets, the mean becomes less representative of the variable center
• Also, the mean is sensitive to the presence of outliers
• An alternative measure is the median, defined as the field value in the middle when the field values are sorted into ascending order
• The median is resistant to the presence of outliers
• Another measure is the mode, which represents the field value occurring with the greatest frequency
• The mode may be used with either numerical or categorical data, but is not always associated with the variable center
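The contrast between the mean and the median can be seen in a small Python sketch. The income figures below are hypothetical and chosen only to make the effect of a single outlier visible.

```python
from statistics import mean, median, mode

# Hypothetical incomes, in thousands of dollars.
incomes = [30, 32, 32, 35, 38, 40, 41]
incomes_with_outlier = incomes + [900]  # one extreme value added

for label, data in [("original", incomes), ("with outlier", incomes_with_outlier)]:
    print(f"{label}: mean={mean(data):.2f}, median={median(data)}, mode={mode(data)}")
```

The single extreme value pulls the mean far from the bulk of the data, while the median moves only slightly and the mode does not move at all.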
MEASURES OF CENTER AND SPREAD - cont
• The standard deviation can be interpreted as the “typical” distance between a field
value and the mean, and most field values lie within two standard deviations of the
mean
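The "within two standard deviations" rule of thumb can be checked directly. The sketch below uses hypothetical values and the population standard deviation (`statistics.pstdev`); by Chebyshev's inequality, at least 75% of the values in any data set fall within two population standard deviations of the mean.

```python
from statistics import mean, pstdev

# Hypothetical field values; 30 is deliberately extreme.
values = [2, 4, 4, 4, 5, 5, 7, 9, 30]

center = mean(values)
spread = pstdev(values)  # population standard deviation

# Keep the values whose distance from the mean is less than two standard deviations.
within_two = [v for v in values if abs(v - center) < 2 * spread]
share_within_two = len(within_two) / len(values)

print(f"mean={center:.2f}, sd={spread:.2f}, share within 2 sd={share_within_two:.0%}")
```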
Data Transformation
• Variables tend to have ranges that differ from one another
• In baseball, two fields may have ranges:
– Batting average: [ 0.0, 0.400 ]
– Number of home runs: [ 0, 70 ]
• Some data mining algorithms are adversely affected by differences in variable ranges
• Variables with greater ranges tend to have a larger influence on the data model’s results
• Therefore, numeric field values should be normalized
• Normalization standardizes the scale of effect each variable has on the results
MIN-MAX NORMALIZATION
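• Min-max normalization works by seeing how much greater the field value is than the minimum value min(X), and scaling this difference by the range:
  $X^{*}_{mm} = \frac{X - \min(X)}{\mathrm{range}(X)} = \frac{X - \min(X)}{\max(X) - \min(X)}$
• For example, for an ultra-light vehicle weighing only 1613 pounds (the field minimum), the min–max normalization is:
  $X^{*}_{mm} = \frac{1613 - 1613}{\max(X) - 1613} = 0$
• The midrange equals the average of the maximum and minimum values in a data set:
  $\mathrm{midrange}(X) = \frac{\max(X) + \min(X)}{2}$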
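A minimal Python sketch of min-max normalization follows. Apart from the 1613-pound minimum mentioned above, the vehicle weights are hypothetical, and the function name `min_max_normalize` is an assumption made for the example.

```python
def min_max_normalize(values):
    """Scale each value to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("range is zero; min-max normalization is undefined")
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical vehicle weights in pounds; only the minimum (1613) comes from the slide.
weights = [1613, 2100, 2900, 3600, 4200]
normalized = min_max_normalize(weights)

# A vehicle weighing exactly the midrange lands halfway along the scale.
midrange = (max(weights) + min(weights)) / 2
midrange_normalized = (midrange - min(weights)) / (max(weights) - min(weights))

print(normalized[0], normalized[-1], midrange_normalized)
```

The minimum always maps to 0, the maximum to 1, and the midrange to 0.5, whatever the original units.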
Z-SCORE STANDARDIZATION
• Z-score standardization works by taking the difference between the field value and the field mean value, and scaling this difference by the standard deviation of the field values:
  $Z = \frac{X - \mathrm{mean}(X)}{\mathrm{SD}(X)}$
• For example, for a vehicle weighing only 1613 pounds, the Z-score standardization is:
  $Z = \frac{1613 - \mathrm{mean}(X)}{\mathrm{SD}(X)}$
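Z-score standardization can be sketched the same way. The weights are hypothetical apart from the 1613-pound vehicle, the function name `z_scores` is an assumption, and the sample standard deviation (`statistics.stdev`) is used here as one reasonable choice.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize values as (x - mean) / standard deviation."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation
    return [(v - center) / spread for v in values]

# Hypothetical vehicle weights in pounds; 1613 is the lightest vehicle.
weights = [1613, 2100, 2900, 3600, 4200]
standardized = z_scores(weights)

print([round(z, 2) for z in standardized])
```

Values below the mean receive negative Z-scores, so the lightest vehicle has the most negative score; the standardized field has mean 0 and standard deviation 1.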
DECIMAL SCALING
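Decimal scaling is commonly defined as dividing each value by 10^d, where d is the number of digits in the field value with the largest absolute value, so that every scaled value lies strictly between -1 and 1. The sketch below follows that assumed definition; the function name `decimal_scale` and the weights are illustrative only.

```python
def decimal_scale(values):
    """Divide every value by 10**d, where d is the digit count of the largest absolute value."""
    largest = max(abs(v) for v in values)
    d = len(str(int(largest)))  # number of digits in the integer part
    return [v / 10 ** d for v in values]

# Hypothetical vehicle weights in pounds; the largest has four digits, so d = 4.
weights = [1613, 2100, 2900, 3600, 4200]
scaled = decimal_scale(weights)

print(scaled)
```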
TRANSFORMATIONS TO ACHIEVE NORMALITY
TRANSFORMATIONS TO ACHIEVE NORMALITY - cont
• Depending on the variable, any one of the three transformations may produce a more nearly normal distribution than the others
• To check for normality, we construct a normal probability plot, which plots the quantiles of a particular distribution against the quantiles of the standard normal distribution
• In a normal probability plot, if the distribution is normal, the bulk of the points in the plot should fall on a straight line; systematic deviations from linearity in this plot indicate nonnormality
• Finally, when the algorithm is done with its analysis, don’t forget to “de-transform” the data
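The effect of a transformation on skewness, and the final de-transform step, can be sketched as follows. Several assumptions are made here: the three candidate transformations are taken to be the natural log, the square root, and the inverse square root (a common set for right-skewed data); skewness is measured with the simple statistic 3 × (mean − median) / standard deviation; and the data values are hypothetical.

```python
import math
from statistics import mean, median, stdev

def skewness(values):
    """Simple skewness statistic: 3 * (mean - median) / standard deviation."""
    return 3 * (mean(values) - median(values)) / stdev(values)

# Hypothetical right-skewed, strictly positive field values.
right_skewed = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]

# Assumed candidate transformations (each requires positive inputs).
transforms = {
    "natural log": math.log,
    "square root": math.sqrt,
    "inverse square root": lambda x: 1 / math.sqrt(x),
}

print("untransformed skewness:", round(skewness(right_skewed), 3))
for name, func in transforms.items():
    print(name, "skewness:", round(skewness([func(v) for v in right_skewed]), 3))

# De-transform: undo the natural log with the exponential to recover the original scale.
logged = [math.log(v) for v in right_skewed]
restored = [math.exp(v) for v in logged]
```

Whichever transformation brings the skewness closest to zero is a candidate; a normal probability plot is still needed to judge normality itself, since zero skewness only indicates symmetry.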
NUMERICAL METHODS FOR IDENTIFYING OUTLIERS
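One widely used numerical rule is the interquartile range (IQR) rule: a value is flagged as an outlier if it lies more than 1.5 × IQR below the first quartile or above the third quartile. The sketch below assumes that rule; the function name `iqr_outliers` and the weights are hypothetical. A Z-score rule (flagging values with |Z| > 3) is a common alternative, but it is itself sensitive to outliers because it depends on the mean and standard deviation.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # first, second, and third quartiles
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical vehicle weights in pounds; 9000 is deliberately extreme.
weights = [1613, 2100, 2400, 2900, 3100, 3600, 4200, 9000]
flagged = iqr_outliers(weights)

print(flagged)
```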
FLAG VARIABLES
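A flag variable (also called a dummy or indicator variable) takes only the values 0 and 1. A categorical field with k values can be represented by k − 1 flags, with the omitted value serving as the reference category. A minimal Python sketch, in which the field values, the `flag_` prefix, and the function name `make_flags` are all hypothetical:

```python
def make_flags(values, reference):
    """Encode a categorical field as k - 1 flag variables, omitting the reference category."""
    categories = sorted(set(values) - {reference})
    return [{f"flag_{c}": int(v == c) for c in categories} for v in values]

# Hypothetical categorical field with k = 4 values.
regions = ["north", "south", "east", "west", "south"]
flags = make_flags(regions, reference="west")

for region, row in zip(regions, flags):
    print(region, row)
```

A record in the reference category is identified by all of its flags being 0, which is why a k-th flag would be redundant.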
TRANSFORMING CATEGORICAL VARIABLES INTO NUMERICAL VARIABLES
• In most instances, the data analyst should avoid transforming categorical variables to numerical variables
• The exception is for categorical variables that are clearly ordered, such as the variable survey_response, taking the values “always”, “usually”, “sometimes”, “never”
• Even then, the numeric coding is a judgment call: should “never” be coded “0” rather than “1”? Is “always” closer to “usually” than “usually” is to “sometimes”?
ADDING AN INDEX FIELD
• It is recommended that the data analyst create an index field, which tracks the sort order of the records in the database
• In data mining, the data is partitioned at least once (and sometimes several times)
• An index field makes it possible to recreate the original sort order afterward
• For example, in IBM / SPSS Modeler, you can use the @Index function in the Derive node to create an index field
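Outside of Modeler, the same idea can be sketched in plain Python: number the records once, before any partitioning or shuffling, and sort on that number to restore the original order. The records and the field name `index` are hypothetical.

```python
import random

# Hypothetical records in their original database order.
records = [{"name": "Ann"}, {"name": "Bob"}, {"name": "Cho"}, {"name": "Dee"}]

# Add an index field that records each row's original position (1-based).
indexed = [{"index": i, **rec} for i, rec in enumerate(records, start=1)]

# Simulate partitioning or resampling that disturbs the order.
shuffled = indexed[:]
random.Random(0).shuffle(shuffled)

# Recreate the original sort order from the index field.
restored = sorted(shuffled, key=lambda rec: rec["index"])
```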
REMOVING VARIABLES THAT ARE NOT USEFUL
• The data analyst may wish to remove variables that will not help the analysis,
regardless of the proposed data mining task or algorithm
– Unary variables
– Variables which are very nearly unary
• Unary variables take on only a single value, so a unary variable is not so much a
variable as a constant
• Sometimes a variable can be very nearly unary
• For example, suppose that 99.95% of the players in a field hockey league are female,
with the remaining 0.05% male
• While it may be useful to investigate the male players, some algorithms will tend to
treat the variable as essentially unary
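Unary and nearly unary variables can be detected by looking at the share of records taken by the most common value. In the sketch below, the 0.999 threshold for "nearly unary" and the function names are assumptions; the sample data mirrors the 99.95% / 0.05% split described above.

```python
from collections import Counter

def dominant_share(values):
    """Return the fraction of records taking the most common value."""
    most_common_count = Counter(values).most_common(1)[0][1]
    return most_common_count / len(values)

def classify_variable(values, threshold=0.999):
    """Label a variable as unary, nearly unary, or worth keeping (threshold is an assumption)."""
    share = dominant_share(values)
    if share == 1.0:
        return "unary"
    if share >= threshold:
        return "nearly unary"
    return "keep"

# 99.95% female, 0.05% male, as in the field hockey example.
sex = ["F"] * 9995 + ["M"] * 5
print(dominant_share(sex), classify_variable(sex))
```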
VARIABLES THAT SHOULD PROBABLY NOT BE REMOVED
REMOVAL OF DUPLICATE RECORDS
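Exact duplicate records can be dropped while keeping the first occurrence and the original order. The sketch below treats two records as duplicates only if every field matches; the records and the function name `drop_duplicates` are hypothetical. Matching records are not always errors, so it is worth checking whether apparent duplicates are genuine before removing them.

```python
def drop_duplicates(records):
    """Keep the first occurrence of each record, preserving the original order."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))  # hashable fingerprint of all fields
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Hypothetical records; the third repeats the first exactly.
records = [
    {"name": "Ann", "zip": "06050"},
    {"name": "Bob", "zip": "06117"},
    {"name": "Ann", "zip": "06050"},
]
deduplicated = drop_duplicates(records)
print(deduplicated)
```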
A WORD ABOUT ID FIELDS
• Because ID fields have a different value for each record, they will not be helpful for
your downstream data mining algorithms
• They may even be hurtful, with the algorithm finding some spurious relationship
between ID field and your target
• Thus it is recommended that ID fields should be filtered out from the data mining
algorithms, but should not be removed from the data, so that the data analyst can
differentiate between similar records
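Filtering an ID field out of the modeling inputs without deleting it from the data can be sketched as follows. The field name `customer_id` and the records are hypothetical.

```python
ID_FIELD = "customer_id"  # hypothetical ID field name

# Hypothetical records; two customers share identical predictor values.
records = [
    {"customer_id": 101, "age": 34, "income": 52000},
    {"customer_id": 102, "age": 34, "income": 52000},
    {"customer_id": 103, "age": 51, "income": 78000},
]

# Inputs passed to the algorithm: every field except the ID.
features = [{k: v for k, v in rec.items() if k != ID_FIELD} for rec in records]

# The ID stays in the data, aligned by position, so similar records remain distinguishable.
ids = [rec[ID_FIELD] for rec in records]
```

Here the first two records are identical as far as the algorithm is concerned, but the retained IDs still tell the analyst that they are two different customers.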