Chapter 1: Data Preprocessing
Data Mining
Heger Arfaoui - ENIT - 2023
Outline
• Data Preprocessing: An Overview
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
Why Data Preprocessing?
Real-world data
• Data in the real world is dirty:
incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes or names
• GIGO! Garbage In Garbage Out
A multi-dimensional measure of data quality
• A well-accepted multi-dimensional view:
Accuracy: correct or wrong, accurate or not
Completeness: not recorded, unavailable, …
Consistency: some modified but some not, dangling, …
Timeliness: timely update?
Believability: how trustworthy is the data?
Interpretability: how easily the data can be understood?
• Two different users may have two different assessments of the quality of the data
Major Tasks in Data Preprocessing
Data Cleaning
Data Cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
Missing Data
• Data is not always available
• Missing data may be due to:
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• intentional
• Missing data may need to be inferred
How to handle missing data?
• Ignore records with missing values in the training data
• works best with a large dataset and only a few missing records
• Replace missing value with...
• default or special value (e.g., 0, “missing”)
• average/median value for numerics
• most frequent value for nominals
• Try to predict missing values:
• handle missing values as learning problem
• target: attribute which has missing values
• training data: instances where the attribute is present
• test data: instances where the attribute is missing
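A minimal sketch of these strategies using pandas; the DataFrame, column names, and fill choices below are hypothetical examples, not part of the lecture:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values
df = pd.DataFrame({
    "age":  [25, np.nan, 47, 31, np.nan],
    "city": ["Tunis", "Sfax", None, "Tunis", "Tunis"],
})

# Option 1: ignore (drop) records with missing values
df_dropped = df.dropna()

# Option 2: replace with a default or a statistic
df_filled = df.copy()
df_filled["age"]  = df_filled["age"].fillna(df["age"].median())      # numeric -> median
df_filled["city"] = df_filled["city"].fillna(df["city"].mode()[0])   # nominal -> most frequent value
```

Predicting missing values as a learning problem would instead train a model on the rows where the attribute is present and apply it to the rows where it is missing.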
Missing data: caveats
Note: values may be missing for various reasons
...and, more importantly: at random vs. not at random
• Examples of not-at-random missingness:
– Non-mandatory questions in questionnaires
• e.g., “how often do you drink alcohol?”
– Values only valid for certain data sub-populations
• e.g., “are you currently pregnant?”
– Sensors failing under certain conditions
• e.g., at high temperatures
• In those cases, averaging and imputation cause information loss
• In other words: “missing” can itself be information!
Noisy data
• Noise: Random error in a measured variable.
• Incorrect attribute values may be due to:
faulty data collection instruments
data entry problems
data transmission problems
technology limitations
inconsistency in naming convention
How to handle noisy data?
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values automatically and have a human verify them (e.g., possible outliers)
Binning method for data smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
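A small NumPy sketch that reproduces the example above (equal-frequency bins of four values; rounding the bin means is an assumption made to match the slide's integer output):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # already sorted
bins = prices.reshape(3, 4)                                        # three equi-depth bins

# Smoothing by bin means: every value is replaced by its bin's (rounded) mean
means = np.rint(bins.mean(axis=1)).astype(int)                     # [9, 23, 29]
by_means = np.repeat(means, 4).reshape(3, 4)

# Smoothing by bin boundaries: every value snaps to the closer bin boundary (min or max)
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_boundaries = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)        # rows: 9 9 9 9 / 23 23 23 23 / 29 29 29 29
print(by_boundaries)   # rows: 4 4 4 15 / 21 21 25 25 / 26 26 26 34
```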
Detect and remove outliers
Data Integration
Data integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Entity identification problem:
• Identify real-world entities from multiple data sources, e.g., Bill Clinton = William
Clinton
• Detecting and resolving data value conflicts
• For the same real-world entity, attribute values from different sources are different
• Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling redundant data in data integration
• Redundant data occur often when integrating multiple DBs
• The same attribute may have different names in different databases
• One attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant data may be able to be detected by correlational analysis
$$ r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} $$
• Careful integration can help reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
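A minimal sketch of such a correlational analysis with pandas; the integrated table and its attribute names are hypothetical:

```python
import pandas as pd

# Hypothetical integrated table in which one attribute is derived from another
df = pd.DataFrame({
    "monthly_revenue": [10, 12, 9, 15, 11],
    "annual_revenue":  [120, 144, 108, 180, 132],   # derived: 12 * monthly_revenue
    "num_employees":   [3, 5, 2, 8, 4],
})

# Pearson correlation r_{A,B} for every pair of numeric attributes
corr = df.corr(method="pearson")
print(corr.loc["monthly_revenue", "annual_revenue"])   # ~1.0 -> redundant, keep only one
```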
Data Reduction
Data reduction
• A database/data warehouse may store terabytes of data. Complex data
analysis may take a very long time to run on the complete data set.
• Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical
results
Data reduction strategies
• Numerosity reduction (some simply call it: Data Reduction)
• Histograms, clustering
• Sampling
• Dimensionality reduction: e.g., remove unimportant
attributes
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• Data compression
Sampling
• Sampling: obtaining a small sample s to represent the whole data set N
• Allow a mining algorithm to run in complexity that is potentially sub-linear to
the size of the data
• Key principle: Choose a representative subset of the data
• Simple random sampling may have very poor performance in the
presence of skew
• Develop adaptive sampling methods, e.g., stratified sampling
Types of sampling
• Simple random sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• Once an object is selected, it is removed from the population
• Sampling with replacement
• A selected object is not removed from the population
• Stratified sampling:
• Partition the data set, and draw samples from each partition (proportionally, i.e.,
approximately the same percentage of the data)
• Used in conjunction with skewed data
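A short sketch contrasting these sampling types with pandas and scikit-learn; the class-imbalanced toy data is a hypothetical example:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical skewed data: 90 "normal" records and only 10 "rare" ones
df = pd.DataFrame({"label": ["normal"] * 90 + ["rare"] * 10,
                   "amount": range(100)})

# Simple random sampling without replacement: may under-represent the rare class
srs = df.sample(n=20, replace=False, random_state=0)

# Sampling with replacement: the same record may be drawn more than once
swr = df.sample(n=20, replace=True, random_state=0)

# Stratified sampling: draw ~20% from each class, preserving the class proportions
strat, _ = train_test_split(df, train_size=0.2, stratify=df["label"], random_state=0)
print(strat["label"].value_counts())   # ~18 normal, ~2 rare
```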
Dimensionality reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
Feature selection
Basic Heuristics
• Remove nominal attributes...
• which have more than p% identical values
• example: millionaire=false
• which have more than p% different values
• example: names, IDs
• Remove numerical attributes
• which have little variation, i.e., standard deviation < s
• Compute pairwise correlations between attributes and remove highly correlated attributes:
• e.g., Naive Bayes assumes independent attributes and benefits from removing correlated ones
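A hedged sketch of these heuristics as a single filter function; the thresholds p, s, and corr_max and the helper name are illustrative choices, not prescribed by the slides:

```python
import pandas as pd

def basic_feature_filter(df: pd.DataFrame, p=0.95, s=0.01, corr_max=0.95) -> pd.DataFrame:
    """Drop near-constant or ID-like nominal attributes, low-variance numeric
    attributes, and one attribute of each highly correlated pair."""
    drop = set()
    for col in df.columns:
        if df[col].dtype == object:
            if (df[col].value_counts(normalize=True).iloc[0] > p     # mostly identical values
                    or df[col].nunique() / len(df) > p):             # mostly distinct (IDs, names)
                drop.add(col)
        elif pd.api.types.is_numeric_dtype(df[col]) and df[col].std() < s:
            drop.add(col)                                            # little variation
    corr = df.drop(columns=list(drop)).select_dtypes("number").corr().abs()
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if corr.loc[a, b] > corr_max:
                drop.add(b)                                          # keep one of the pair
    return df.drop(columns=list(drop))
```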
PCA: Principal Component Analysis
• feature selection methods select a subset of attributes: no new attributes are created
• PCA creates a (smaller set of) new attributes
• artificial linear combinations of existing attributes
• as expressive as possible
• Dates back to the pre-computer age
• invented by Karl Pearson (1857-1936)
• also known for Pearson's correlation coefficient
PCA (ctd)
• Idea: transform the coordinate system so that each new coordinate (principal component) is as expressive as possible
• expressivity: variance of the variable
• the 1st, 2nd, 3rd, ... PCs should account for as much variance as possible
• further PCs can be neglected
Source: https://knowledge.dataiku.com/latest/ml-analytics/statistics/concept-principal-component-analysis-pca.html
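A minimal PCA sketch with scikit-learn; the random 5-dimensional data and the choice of 3 components are illustrative, and the data is standardized first because PCA is sensitive to scale:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                          # hypothetical 5-dimensional data
X[:, 4] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)     # make one dimension nearly redundant

X_std = StandardScaler().fit_transform(X)              # zero mean, unit variance per attribute
pca = PCA(n_components=3)                              # keep the 3 most expressive components
X_reduced = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)                   # variance accounted for by each PC
```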
Data Transformation
Data transformation
• Transformation: A function that maps the entire set of values of a given attribute to a new
set of replacement values s.t. each old value can be identified with one of the new values
• Methods:
• Conversion
• Discretization
• Smoothing: remove noise from data (binning, clustering, regression)
• Normalization: scaled to fall within a small, specified range
• Attribute/feature construction: new attributes constructed from the given ones
Conversion
• Binary to numeric:
e.g., student = {yes, no} converted to {0, 1}
• Order to numeric: Ordered attributes (e.g. grade) can be converted to
numbers preserving order
– A → 4.0
– A- → 3.7
– B+ → 3.3
– B → 3.0
• Nominal to numeric
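A short sketch of these conversions with pandas; the mappings follow the grade example above, while the data values themselves are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"student": ["yes", "no", "yes"],
                   "grade":   ["A", "B+", "A-"],
                   "city":    ["Tunis", "Sfax", "Tunis"]})

# Binary -> numeric
df["student_num"] = df["student"].map({"no": 0, "yes": 1})

# Ordinal -> numeric, preserving the order
df["grade_num"] = df["grade"].map({"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0})

# Nominal -> numeric: one-hot encoding, since there is no meaningful order
df = pd.get_dummies(df, columns=["city"])
```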
Normalization
• Variables tend to have ranges that vary greatly from each other
• The measurement unit used can affect the data analysis
• For some data mining algorithms, differences in ranges will lead to a tendency for the
variable with a greater range to have undue influence on the results
• Data miners should normalize their numeric variables in order to standardize the
scale of effect each variable has on the results
• Algorithms that make use of distance measures (such as k-Nearest Neighbors)
benefit from normalization
• The terms normalization and scaling are often used interchangeably
Min-max normalization
• Performs a linear transformation on the original data
• Preserves the relationships among the original data values
• Normalized values range between 0 and 1
• Will encounter an “out-of-bounds” error if a future input case falls outside the original data range of X
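The usual formula is X' = (X − min_X) / (max_X − min_X); a minimal NumPy sketch on hypothetical salary values:

```python
import numpy as np

X = np.array([20_000., 30_000., 50_000., 100_000.])    # hypothetical salaries
X_minmax = (X - X.min()) / (X.max() - X.min())          # linear rescaling to [0, 1]
print(X_minmax)                                          # [0.0, 0.125, 0.375, 1.0]
```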
Z-score normalization
• Also called zero-mean normalization
• Z-score standardization works by taking the difference between the field value and the field mean, and scaling this difference by the standard deviation of the field values
• Useful when the actual minimum and maximum of an attribute X are unknown, or when there are outliers that dominate the min-max normalization
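The formula is z = (x − mean) / σ; a minimal sketch on the same hypothetical values:

```python
import numpy as np

X = np.array([20_000., 30_000., 50_000., 100_000.])
X_z = (X - X.mean()) / X.std(ddof=1)    # zero mean, unit (sample) standard deviation
```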
Decimal scaling
• Decimal scaling ensures that every normalized value lies between −1 and 1
• d = number of digits in the data value with the largest absolute value
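The transformation is v' = v / 10^d; a minimal sketch on hypothetical values (here the largest absolute value, 991, has d = 3 digits):

```python
import numpy as np

X = np.array([-991., 23., 450.])
d = int(np.floor(np.log10(np.abs(X).max()))) + 1   # digits of the largest absolute value -> 3
X_dec = X / 10 ** d                                # every value now lies between -1 and 1
print(X_dec)                                       # [-0.991, 0.023, 0.45]
```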
Normalization - remarks
• Normalization can change the original data quite a bit, especially when using
the z-score normalization or decimal scaling
• It is necessary to save the normalization parameters (e.g., the mean and
standard deviation if using z-score normalization) so that future data can be
normalized in a uniform manner
• The normalization parameters now become model parameters and the same
value should be used when the model is used on new data (e.g. testing data)
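A minimal sketch of this practice with scikit-learn; the arrays are hypothetical, and StandardScaler simply stores the mean and standard deviation learned from the training data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[20_000.], [30_000.], [50_000.], [100_000.]])   # training data
X_test  = np.array([[40_000.], [120_000.]])                          # new (test) data

scaler = StandardScaler().fit(X_train)     # mean and std become model parameters
X_train_z = scaler.transform(X_train)
X_test_z  = scaler.transform(X_test)       # reuse the SAME parameters; never refit on new data
```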
Transformations to achieve normality
• Some data mining algorithms and statistical methods require that the
variables be normally distributed
• z-score transformation does not achieve normality
Transformations to achieve normality
• The skewness of a distribution can be measured, for example, by Pearson's coefficient: skewness = 3 (mean − median) / σ
• Most real-world data is right-skewed, especially most financial data.
Transformations to achieve normality
• Common transformations to achieve normality:
• ln(x)
• sqrt(x)
• 1/x, …
• log transformation is suitable for strongly right-skewed data, sqrt transformation is
suitable for slightly right-skewed data
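A small sketch applying these transformations to right-skewed data; the lognormal sample is a hypothetical example:

```python
import numpy as np
from scipy.stats import skew

x = np.random.default_rng(0).lognormal(mean=0.0, sigma=1.0, size=1000)   # right-skewed data

x_log  = np.log(x)     # for strongly right-skewed data
x_sqrt = np.sqrt(x)    # for slightly right-skewed data
x_inv  = 1.0 / x       # inverse transformation, another common option

print(skew(x), skew(x_log), skew(x_sqrt))   # skewness moves toward 0 after transforming
```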
Recap
• Raw data has many problems:
• missing values
• errors
• high dimensionality
• Good preprocessing is essential for good data mining
• one of the first steps in the pipeline
• often the most time-consuming step of the pipeline
• requires lots of experimentation and fine-tuning
• Data preparation includes:
• Data cleaning, data integration, data reduction, feature selection, normalization,…
• Many methods have been developed, but data preprocessing is still an active area of research