DMPA-2 Powerpoint Slides - Modified Audio


Data Mining and Predictive Analytics

Daniel Larose, Ph.D. and Chantal Larose

Chapter 2
Data Preprocessing
Prepared by Andrew Hendrickson, Graduate Assistant

Data Mining and Predictive Analytics, by Daniel Larose and Chantal Larose.
John Wiley & Sons, Inc., Hoboken, NJ, 2015.
Why Do We Preprocess Data?

• Raw data often incomplete, noisy


• May contain:
– Obsolete fields
– Missing values
– Outliers
– Data in form not suitable for data mining
– Erroneous values

Why Do We Preprocess Data?
(cont’d)

• For data mining purposes, database values must undergo data cleaning and data transformation
• Data often from legacy databases where values:
– Not looked at in years
– Expired
– No longer relevant
– Missing
• Minimize GIGO (Garbage In, Garbage Out)
– IF garbage input is minimized, THEN garbage in the results is minimized
• Data preparation is 60% of effort for data mining
process (Pyle)

Data Cleaning

Data errors in Table 2.1 examined:


• Five-numeral U.S. Zip Code?
– Not all countries use same zip code format, 90210 (U.S.) vs.
J2S7K7 (Canada)
– Should expect unusual values in some fields
– For example, global commerce
• Four-Digit Zip Code?
– Leading zero truncated, 6269 vs. 06269 (New England states)
– Database field was numeric, so the leading zero was chopped off

Data Cleaning (cont’d)

• Income Field Contains $10,000,000?


– Assumed to measure gross annual income
– Possibly valid
– Still considered outlier (extreme data value)
– Some statistical and data mining methods affected by outliers
• Income Field Contains -$40,000?
– Income less than $0?
– Value beyond bounds for expected income, therefore an error
– Caused by data entry error?
– Discuss anomaly with database administrator

Data Cleaning (cont’d)

• Income Field Contains $99,999?


– Other values appear rounded to nearest $5,000
– Value may be completely valid
– Value represents database code used to denote missing value?
– Confirm values in expected unit of measure, such as U.S. dollars
– Which unit of measure for income?
– Customer with zip code J2S7K7 in Canadian dollars?
– Discuss anomaly with database administrator

Data Cleaning (cont’d)

• Age Field Contains “C”?


– Other records have numeric values for field
– Record categorized into group labeled “C”
– Value must be resolved
– Data mining software expects numeric values for field
• Age Field Contains 0?
– Zero-value used to indicate missing/unknown value?
– Customer refused to provide their age?

Data Cleaning (cont’d)

• Age Field?
– Age values become obsolete as time passes, while date-type fields do not
– Store date of birth, then derive Age when needed
• Marital Status Field Contains “S”?
– What does this symbol mean?
– Does “S” imply single or separated?
– Discuss anomaly with database administrator

Handling Missing Data

• Missing values pose problems to data analysis methods


• More common in databases containing large number of
fields
• Absence of information rarely beneficial to task of
analysis
• In contrast, having more data almost always better
• Careful analysis required to handle issue

Handling Missing Data (cont’d)

• Examine cars dataset containing records for 261 automobiles manufactured in the 1970s and 1980s
• Delete Records Containing Missing Values?
– Not necessarily best approach
– Pattern of missing values may be systematic
– Deleting records creates biased subset
– Valuable information in other fields lost
• Three Alternate Methods Available
– Insightful Miner (http://www.insightful.com) specifies method to
replace values

Handling Missing Data (cont’d)

• (1) Replace Missing Values with User-defined Constant


– Missing numeric values replaced with 0.0
– Missing categorical values replaced with “Missing”
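
A minimal pandas sketch of method (1); the book demonstrates this in Insightful Miner, so the Python code, the toy DataFrame, and its column names (modeled on the cars data) are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Toy slice of the cars data with missing values in one numeric
# and one categorical field (values are made up for illustration)
cars = pd.DataFrame({
    "cubicinches": [350.0, np.nan, 97.0],
    "brand": ["US", None, "Japan"],
})

# (1) Replace missing values with user-defined constants
cars["cubicinches"] = cars["cubicinches"].fillna(0.0)    # numeric -> 0.0
cars["brand"] = cars["brand"].fillna("Missing")          # categorical -> "Missing"
print(cars)
```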

Handling Missing Data (cont’d)

• (2) Replace Missing Values with Mode or Mean


– Mode of categorical field cylinders = 4
– Missing values replaced with this value
– Mean for non-missing values in numeric field cubicinches =
200.65
– Missing values replaced with 200.65
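
The same step as a hedged pandas sketch; the toy values are invented, but the logic (mode for the categorical field, mean for the numeric field) follows the slide:

```python
import numpy as np
import pandas as pd

cars = pd.DataFrame({
    "cylinders": [4, 4, 8, np.nan, 6, np.nan],
    "cubicinches": [97.0, 121.0, 350.0, np.nan, 232.0, 140.0],
})

# (2) Mode for the categorical field, mean for the numeric field
cars["cylinders"] = cars["cylinders"].fillna(cars["cylinders"].mode()[0])
cars["cubicinches"] = cars["cubicinches"].fillna(cars["cubicinches"].mean())
print(cars)
```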

Handling Missing Data (cont’d)

– Substituting mode or mean for missing values sometimes works well
– Mean not always best choice for “typical” value
– Resulting confidence levels for statistical inference become
overoptimistic (Larose)
– Domain experts should be consulted regarding approach to
replace missing values
– Benefits and drawbacks resulting from the replacement of
missing values must be carefully evaluated

Handling Missing Data (cont’d)

• (3) Replace Missing Values with Random Values


– Values randomly taken from underlying distribution
– Value for cylinders, cubicinches, and hp randomly drawn
proportionately from each field’s distribution
– Method superior compared to mean substitution
– Measures of location and spread remain closer to original
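
One way to implement method (3) is to resample the field's non-missing values, which draws each value in proportion to its observed frequency; this pandas/NumPy sketch is an assumption about the mechanics, not the book's code:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the example is reproducible
cars = pd.DataFrame({"hp": [90.0, 150.0, np.nan, 110.0, np.nan, 200.0]})

# (3) Draw replacements from the field's empirical distribution
observed = cars["hp"].dropna().to_numpy()
missing = cars["hp"].isna()
cars.loc[missing, "hp"] = rng.choice(observed, size=missing.sum(), replace=True)
print(cars)
```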

Handling Missing Data (cont’d)

– No guarantee resulting records make sense


– Suppose randomly-generated values cylinders = 8 and
cubicinches = 82
– A strange engine size?
– Alternate methods strive to replace values more precisely
– What is likely value, given record’s other attribute values?
– For example, American car has 300 cubic inches and 150
horsepower
– Japanese car has 100 cubic inches and 90 horsepower
– American car expected to have more cylinders

Identifying Misclassifications

– Verify values valid and consistent


– Frequency distribution shows five classes: USA, France, US, Europe, and Japan
– Count for USA = 1 and France = 1?
– Two records classified inconsistently with respect to origin of the manufacturer
– Maintain consistency by relabeling USA → US and France → Europe
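
A small pandas sketch of the same check and fix (the six-record series is invented; the relabeling map comes from the slide):

```python
import pandas as pd

origin = pd.Series(["US", "US", "USA", "Europe", "France", "Japan"])

# The frequency distribution exposes the two inconsistent labels
print(origin.value_counts())

# Relabel for consistency: USA -> US, France -> Europe
origin = origin.replace({"USA": "US", "France": "Europe"})
print(origin.value_counts())
```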

Graphical Methods for Identifying
Outliers
• Outliers are values that lie near extreme limits of data
range
• Outliers may represent errors in data entry
• Certain statistical methods very sensitive to outliers and
may produce unstable results
• Neural Networks and k-Means benefit from normalized
data

Graphical Methods for Identifying
Outliers (cont’d)

– A histogram examines values of numeric fields


– This histogram shows vehicle weights for the cars data set
– The extreme left tail contains one outlier: a vehicle weighing only 192.5 pounds
– Should we doubt validity of this value?
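
A matplotlib sketch of the histogram check; the weights are invented stand-ins, with 192.5 playing the suspicious outlier:

```python
import matplotlib.pyplot as plt
import pandas as pd

weights = pd.Series([192.5, 1835, 2130, 2515, 3102, 3449, 3850, 4425])

weights.plot(kind="hist", bins=10, edgecolor="black")
plt.xlabel("weightlbs")
plt.title("Histogram of vehicle weight: note the isolated left-tail bar")
plt.show()
```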
Graphical Methods for Identifying
Outliers (cont’d)
– Analysis of weightlbs field shows remaining records contain
whole-numbered (no decimal) values
– Perhaps value of 192.5 is an error
– Should it be 1925?
– Cannot know for sure and requires further investigation
– Discuss the meaning of the value with someone familiar with
database content

Graphical Methods for Identifying
Outliers (cont’d)

– Two-dimensional scatter plots help determine outliers between variable pairs
– Scatter plot of mpg against weightlbs shows two possible
outliers
– Most data points cluster together along x-axis
– However, one car weighs only 192.5 pounds and another gets over 500 miles per gallon?
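
The corresponding scatter plot sketch (again with invented values standing in for the cars data):

```python
import matplotlib.pyplot as plt
import pandas as pd

cars = pd.DataFrame({
    "weightlbs": [192.5, 2130, 2515, 3102, 3449, 3850, 4425, 3020],
    "mpg":       [33.0,  31.0, 27.0, 20.0, 18.0, 15.0, 13.0, 532.0],
})

cars.plot(kind="scatter", x="weightlbs", y="mpg")
plt.title("mpg vs. weightlbs: two points fall far from the main cluster")
plt.show()
```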
MEASURES OF CENTER AND SPREAD

• Measures of center are a special case of measures of location, numerical summaries that indicate where on a number line a certain characteristic of the variable lies
• Percentiles
• Quantiles
• The mean of a variable is simply the average of the valid values taken by the variable

• For extremely skewed data sets, the mean becomes less representative of the
variable center
• Also, the mean is sensitive to the presence of outliers
• An alternative measure is the median, defined as the field value in the middle when
the field values are sorted into ascending order
• The median is resistant to the presence of outliers
• Another measure is the mode, which represents the field value occurring with the
greatest frequency
• The mode may be used with either numerical or categorical data, but is not always
associated with the variable center
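
A quick numeric illustration of the three measures of center; the series is invented, with 99999 standing in for an extreme value:

```python
import pandas as pd

x = pd.Series([4, 4, 4, 6, 8, 8, 99999])

print("mean:  ", x.mean())     # dragged far upward by the outlier
print("median:", x.median())   # resistant: still 6
print("mode:  ", x.mode()[0])  # most frequent value: 4
```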

MEASURES OF CENTER AND SPREAD - cont

• Measures of spread (variability) include the range (maximum − minimum), the standard deviation, the mean absolute deviation, and the interquartile range
• The sample standard deviation is perhaps the most widespread measure of variability and is defined by

$$s = \sqrt{\frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^{2}}{n - 1}}$$

• The standard deviation can be interpreted as the “typical” distance between a field
value and the mean, and most field values lie within two standard deviations of the
mean
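
A pandas sketch computing the four measures of spread on invented weight values (note that pandas' std() uses the sample n − 1 denominator by default):

```python
import pandas as pd

x = pd.Series([1613.0, 2130.0, 2515.0, 3102.0, 3449.0, 3850.0, 4997.0])

print("range:", x.max() - x.min())
print("sample standard deviation:", x.std())           # ddof=1 by default
print("mean absolute deviation:", (x - x.mean()).abs().mean())
print("IQR:", x.quantile(0.75) - x.quantile(0.25))
```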

Data Transformation
• Variables tend to have ranges different from each other
• In baseball, two fields may have ranges:
– Batting average: [ 0.0, 0.400 ]
– Number of home runs: [ 0, 70 ]
• Some data mining algorithms adversely affected by
differences in variable ranges
• Variables with greater ranges tend to have larger
influence on data model’s results
• Therefore, numeric field values should be normalized
• Standardizes scale of effect each variable has on results

MIN-MAX NORMALIZATION
• Min-max normalization works by seeing how much greater the field value is than the minimum value min(X), and scaling this difference by the range:

$$X^{*}_{mm} = \frac{X - \min(X)}{\text{range}(X)} = \frac{X - \min(X)}{\max(X) - \min(X)}$$

• For example, for an ultra-light vehicle weighing only 1613 pounds (the field minimum), the min-max normalization is (1613 − 1613)/range(X) = 0, the lowest possible value
• The midrange equals the average of the maximum and minimum values in a data set:

$$\text{midrange}(X) = \frac{\max(X) + \min(X)}{2}$$

• The heaviest vehicle has a min-max normalization value of (max(X) − min(X))/range(X) = 1, the highest possible value
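
A one-line pandas sketch of min-max normalization (illustrative weights, with 1613 as the assumed field minimum):

```python
import pandas as pd

weight = pd.Series([1613.0, 2130.0, 3102.0, 4997.0])

# X* = (X - min(X)) / (max(X) - min(X))
mm = (weight - weight.min()) / (weight.max() - weight.min())
print(mm.tolist())  # lightest vehicle -> 0.0, heaviest -> 1.0
```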

Z-SCORE STANDARDIZATION

• Z-score standardization works by taking the difference between the field value and the field mean value, and scaling this difference by the standard deviation of the field values:

$$Z = \frac{X - \text{mean}(X)}{\text{SD}(X)}$$

• For example, a vehicle weighing only 1613 pounds lies below the field mean, so its Z-score standardization is negative
• The heaviest car lies above the field mean, so its Z-score standardization is positive
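
The matching pandas sketch for Z-score standardization (same invented weights):

```python
import pandas as pd

weight = pd.Series([1613.0, 2130.0, 3102.0, 3850.0, 4997.0])

# Z = (X - mean(X)) / SD(X)
z = (weight - weight.mean()) / weight.std()
print(z.tolist())  # negative below the mean, positive above it
```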

DECIMAL SCALING

• Decimal scaling ensures that every normalized value lies between −1 and 1:

$$X^{*}_{\text{decimal}} = \frac{X}{10^{d}}$$

where d is the number of digits in the data value with the largest absolute value
• For example, if vehicle weights run to at most four digits, then d = 4 and a 1613-pound vehicle scales to 0.1613
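
A sketch of decimal scaling; the digit count d is derived from the data, and the values are invented:

```python
import numpy as np
import pandas as pd

weight = pd.Series([192.5, 1613.0, 3102.0, 4997.0])

# d = number of digits in the value with the largest absolute value
d = int(np.ceil(np.log10(weight.abs().max())))
print(d, (weight / 10**d).tolist())  # d = 4, so all values land in (-1, 1)
```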

TRANSFORMATIONS TO ACHIEVE NORMALITY

• Some data mining algorithms and statistical methods perform better on normally distributed data, yet real-world variables are often skewed
• Common transformations for reducing right skewness include the natural log transformation ln(x), the square root transformation √x, and the inverse square root transformation 1/√x
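
A sketch comparing the three candidate transformations on an invented right-skewed series, scoring each with the skewness statistic from the next slide:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 2.0, 3.0, 4.0, 8.0, 30.0])  # right-skewed toy data

for name, t in {"sqrt": np.sqrt(x), "ln": np.log(x),
                "inv_sqrt": 1 / np.sqrt(x)}.items():
    skew = 3 * (t.mean() - t.median()) / t.std()
    print(f"{name}: skewness = {skew:.3f}")
```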

TRANSFORMATIONS TO ACHIEVE NORMALITY - cont

• A common statistic for measuring the skewness of a distribution is

$$\text{skewness} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$$

• Right-skewed data have mean greater than median, giving positive skewness; left-skewed data give negative skewness

TRANSFORMATIONS TO ACHIEVE NORMALITY - cont
• Depending on the variable, one of the three transformations may produce a more normal distribution than the others
• To check for normality, we construct a normal probability plot, which plots the
quantiles of a particular distribution against the quantiles of the standard normal
distribution
• In a normal probability plot, if the distribution is normal, the bulk of the points in the
plot should fall on a straight line; systematic deviations from linearity in this plot
indicate nonnormality

• Finally, when the algorithm is done with its analysis, don’t forget to “de-transform”
the data
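
A SciPy sketch of the normal probability plot, before and after a log transformation, plus the corresponding de-transform; the lognormal sample is synthetic:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.8, size=200)  # right-skewed sample

fig, axes = plt.subplots(1, 2)
stats.probplot(x, dist="norm", plot=axes[0])
axes[0].set_title("raw: curved tail")
stats.probplot(np.log(x), dist="norm", plot=axes[1])
axes[1].set_title("ln(x): roughly linear")
plt.show()

# De-transform: results on the log scale go back through exp()
print(np.exp(np.log(x).mean()))
```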

NUMERICAL METHODS FOR IDENTIFYING OUTLIERS

• A rough rule of thumb flags a data value as an outlier if its Z-score is less than −3 or greater than 3
• A more robust method uses the interquartile range (IQR = Q3 − Q1): a value is an outlier if it lies below Q1 − 1.5(IQR) or above Q3 + 1.5(IQR)
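
Both rules in a short pandas sketch (invented weights; note the IQR rule catches the 192.5 value even when, in such a small sample, its Z-score stays below 3):

```python
import pandas as pd

x = pd.Series([192.5, 2130.0, 2515.0, 3102.0, 3449.0, 3850.0, 4425.0])

# Z-score rule: flag |Z| > 3
z = (x - x.mean()) / x.std()
print("Z-score outliers:", x[z.abs() > 3].tolist())

# IQR rule: flag values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
print("IQR outliers:", x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)].tolist())
```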

FLAG VARIABLES

• A flag variable (also called a dummy or indicator variable) is a categorical variable taking only two values, 0 and 1
• When a categorical predictor takes k ≥ 3 possible values, define k − 1 flag variables and leave the remaining category as the reference category
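
A pandas sketch: get_dummies with drop_first=True yields the k − 1 flags, leaving the dropped category (here Europe, the alphabetical first) as the reference:

```python
import pandas as pd

brand = pd.Series(["US", "Europe", "Japan", "US"])

flags = pd.get_dummies(brand, prefix="brand", drop_first=True).astype(int)
print(flags)  # columns brand_Japan and brand_US; Europe is the reference
```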

TRANSFORMING CATEGORICAL VARIABLES INTO NUMERICAL
VARIABLES
• In most instances, the data analyst should avoid transforming categorical variables to
numerical variables
• The exception is for categorical variables that are clearly ordered, such as the
variable survey_response, taking values always, usually, sometimes, never

• For example, we might code always = 4, usually = 3, sometimes = 2, never = 1
• But should never be coded “0” rather than “1”? Is always closer to usually than usually is to sometimes? Any numeric coding imposes distances between the categories that the data may not support
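
A sketch of one possible ordinal coding; the 4/3/2/1 mapping is exactly the kind of assumption the questions above are probing:

```python
import pandas as pd

survey_response = pd.Series(["always", "usually", "sometimes", "never", "usually"])

coding = {"never": 1, "sometimes": 2, "usually": 3, "always": 4}
print(survey_response.map(coding).tolist())
# The equal spacing between codes is an assumption, not a fact about the data
```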

ADDING AN INDEX FIELD

• It is recommended that the data analyst create an index field, which tracks the sort
order of the records in the database
• Data mining data gets partitioned at least once (and sometimes several times)
• It is helpful to have an index field so that the original sort order may be recreated
• For example, using IBM SPSS Modeler, you can use the @Index function in the Derive node to create an index field
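
The book's example uses IBM SPSS Modeler's @Index; a pandas analogue (an assumption, not the book's code) just records the original order before partitioning:

```python
import pandas as pd

cars = pd.DataFrame({"mpg": [14.0, 31.0, 17.0, 15.0, 30.5, 23.0]})

cars["index_field"] = range(1, len(cars) + 1)   # original sort order

shuffled = cars.sample(frac=1, random_state=0)  # e.g., a partitioning step
restored = shuffled.sort_values("index_field")  # original order recovered
print(restored)
```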

REMOVING VARIABLES THAT ARE NOT USEFUL

• The data analyst may wish to remove variables that will not help the analysis,
regardless of the proposed data mining task or algorithm
– Unary variables
– Variables which are very nearly unary
• Unary variables take on only a single value, so a unary variable is not so much a
variable as a constant
• Sometimes a variable can be very nearly unary
• For example, suppose that 99.95% of the players in a field hockey league are female,
with the remaining 0.05% male
• While it may be useful to investigate the male players, some algorithms will tend to
treat the variable as essentially unary
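
A sketch of a screening pass for unary and nearly unary fields; the 99% threshold is a judgment call, not a rule from the book:

```python
import pandas as pd

players = pd.DataFrame({"gender": ["F"] * 1999 + ["M"]})  # 99.95% / 0.05%

for col in players.columns:
    shares = players[col].value_counts(normalize=True)
    if players[col].nunique() == 1:
        print(col, "is unary")
    elif shares.iloc[0] > 0.99:   # assumed near-unary threshold
        print(col, f"is nearly unary ({shares.iloc[0]:.2%} one value)")
```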

VARIABLES THAT SHOULD PROBABLY NOT BE REMOVED

• It is (unfortunately) a common – though questionable – practice to remove from analysis the following types of variables
– Variables for which 90% or more of the values are missing
– Variables which are strongly correlated
• Variables which contain 90% missing values present a challenge to any strategy for imputation of missing data (see Chapter 27)
• For example, suppose a survey variable recording donation amounts is 90% missing; conceivably, those who donate a lot would be inclined to report their donations, while those who do not donate much may be inclined to skip this survey question
• Thus, the 10% who report are not representative of the whole
• In this case, it may be preferable to construct a flag variable, donation_flag, since
there is a pattern in the missingness which may turn out to have predictive power
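
A sketch of the donation_flag construction (invented values; the flag simply records whether the field was reported):

```python
import numpy as np
import pandas as pd

donors = pd.DataFrame({"donation": [250.0, np.nan, np.nan, 1200.0, np.nan]})

# Preserve the pattern in the missingness as a 0/1 predictor
donors["donation_flag"] = donors["donation"].notna().astype(int)
print(donors)
```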
• Inclusion of correlated variables may at best double-count a particular aspect of the
analysis, and at worst lead to instability of the model results
• Instead, it is suggested that principal components analysis be applied, where the
common variability in correlated predictors may be translated into a set of
uncorrelated principal components
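
A minimal scikit-learn sketch of that idea on two synthetic, strongly correlated predictors; the resulting components are uncorrelated by construction:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)   # strongly correlated with x1

components = PCA().fit_transform(np.column_stack([x1, x2]))
print(np.corrcoef(components[:, 0], components[:, 1])[0, 1])  # ~ 0
```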

REMOVAL OF DUPLICATE RECORDS

• Records may be inadvertently duplicated during data entry or when databases are merged
• Duplicate records effectively double-count the entities they represent, so they should be removed; first confirm that apparent duplicates are not distinct records that merely share the same value on every field
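
A pandas sketch (toy records; drop_duplicates keeps the first occurrence):

```python
import pandas as pd

cars = pd.DataFrame({
    "mpg":   [14.0, 14.0, 31.0],
    "brand": ["US", "US", "Japan"],
})

print(cars.duplicated().sum())  # number of exact duplicate records
print(cars.drop_duplicates())   # duplicates removed, first occurrence kept
```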

A WORD ABOUT ID FIELDS

• Because ID fields have a different value for each record, they will not be helpful for
your downstream data mining algorithms
• They may even be hurtful, with the algorithm finding some spurious relationship between the ID field and your target
• Thus it is recommended that ID fields should be filtered out from the data mining
algorithms, but should not be removed from the data, so that the data analyst can
differentiate between similar records
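
A sketch of filtering the ID out of the modeling matrix while keeping it in the data (column names are invented):

```python
import pandas as pd

cars = pd.DataFrame({
    "customer_id": [1001, 1002, 1003],   # ID: keep in the data...
    "mpg": [14.0, 31.0, 17.0],
    "weightlbs": [4425.0, 1835.0, 3850.0],
})

X = cars.drop(columns=["customer_id"])   # ...but exclude from the algorithm
print(X.columns.tolist())
```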

