Data Preprocessing
Content
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Data Preprocessing for Machine learning
Why data preprocessing?
Data in the real world is dirty
o incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
o noisy: containing errors or outliers
o inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
o Quality decisions must be based on quality data
o Data warehouse needs consistent integration of quality data
A multi-dimensional measure of data quality
o A well-accepted multi-dimensional view:
accuracy, completeness, consistency, timeliness, believability, value added,
interpretability, accessibility
Broad categories
intrinsic, contextual, representational, and accessibility
Major Tasks of Data Preprocessing
• Data cleaning
o Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
o Integration of multiple databases, data cubes, files, or notes
• Data transformation
o Normalization (scaling to a specific range)
o Aggregation
• Data reduction
o Obtains reduced representation in volume but produces the same or similar
analytical results
o Data discretization: a form of data reduction that is particularly important for numerical data
o Data aggregation, dimensionality reduction, data compression, generalization
Data cleaning
Importance
• “Data cleaning is one of the three biggest problems
in data warehousing”—Ralph Kimball
• “Data cleaning is the number one problem in data
warehousing”—DCI survey
Tasks of Data Cleaning
• Fill in missing values
• Identify outliers and smooth noisy data
• Correct inconsistent data
Manage Missing Data
• Ignore the tuple: usually done when class label is missing
(assuming the task is classification—not effective in certain
cases)
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., “unknown”,
a new class?!
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples of the same class to fill in
the missing value: smarter
• Use the most probable value to fill in the missing value: inference
based such as regression, Bayesian formula, decision tree
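Three of the strategies above (global constant, attribute mean, per-class attribute mean) can be sketched with pandas. This is a minimal illustration, not a prescribed method: the column names `label` and `income`, the toy values, and the sentinel constant `-1.0` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'income' has missing values, 'label' is the class.
df = pd.DataFrame({
    "label":  ["a", "a", "a", "b", "b", "b"],
    "income": [10.0, np.nan, 30.0, 100.0, 200.0, np.nan],
})

# 1) Global constant: every gap gets the same sentinel value (assumed: -1.0).
filled_constant = df["income"].fillna(-1.0)

# 2) Attribute mean over all samples; mean() skips NaN by default.
filled_mean = df["income"].fillna(df["income"].mean())

# 3) Attribute mean per class: each gap gets the mean of its own class.
class_means = df.groupby("label")["income"].transform("mean")
filled_class_mean = df["income"].fillna(class_means)

print(filled_mean.tolist())
print(filled_class_mean.tolist())
```

The per-class variant fills the gap in class "a" with 20.0 and the gap in class "b" with 150.0, whereas the global mean puts 85.0 in both places, which is why the slide calls the per-class mean "smarter".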
Manage Noisy Data
Binning Method:
• first sort data and partition into (equal-depth) bins
• then one can smooth by bin means, smooth by bin median, smooth by
bin boundaries, etc…
Clustering:
• detect and remove outliers
Semi-automated:
• combine computer detection of suspicious values with manual inspection
Regression
• smooth by fitting the data into regression functions
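The binning method can be sketched in plain Python. The helper names (`equal_depth_bins`, `smooth_by_means`, `smooth_by_boundaries`) and the price list are assumptions for illustration. The sketch also assumes there are at least as many values as bins, and that a value equally close to both boundaries goes to the lower one.

```python
def equal_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max (ties: min)."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # hypothetical data
bins = equal_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)

print(bins)       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```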
Inconsistent Data
Manual correction using external references
Semi-automatic using various tools
• To detect violation of known functional dependencies and
data constraints
• To correct redundant data
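A tool-assisted check of this kind might look like the sketch below. The rule being tested (that `zip_code` determines `city`) and the records themselves are assumptions invented for the example. Real data would come with its own known dependencies.

```python
import pandas as pd

# Hypothetical records; the assumed functional dependency is zip_code -> city.
records = pd.DataFrame({
    "zip_code": ["10001", "10001", "94105", "94105", "60601"],
    "city":     ["New York", "NYC", "San Francisco", "San Francisco", "Chicago"],
})

# Count distinct city values per zip code; more than one violates the rule.
cities_per_zip = records.groupby("zip_code")["city"].nunique()
violating_zips = sorted(cities_per_zip[cities_per_zip > 1].index)

# Exact duplicate rows are one simple form of redundant data.
deduplicated = records.drop_duplicates()

print(violating_zips)     # zip codes to hand over for manual correction
print(len(deduplicated))  # rows left after dropping exact duplicates
```

The check only flags the conflicting zip code. Deciding whether "New York" or "NYC" is the correct spelling still needs an external reference or a person.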
Data integration and transformation
Tasks of Data Integration and transformation
• Data integration:
• combines data from multiple sources into a coherent store
• Schema integration
• integrate metadata from different sources
• Entity identification problem:
• identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
• Detecting and resolving data value conflicts
• for the same real world entity, attribute values from different
sources are different
• possible reasons: different representations, different scales,
e.g., metric vs. British units, different currency
Manage Data Integration
• Redundant data often occur when multiple databases are
integrated
• Object identification: The same attribute or object may have
different names in different databases
• Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue
• Redundant attributes can often be detected by
correlation analysis
• Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
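Correlation analysis for spotting redundant attributes can be sketched as follows. The table, the column names, and the 0.95 cut-off are assumptions chosen for illustration. A suitable threshold depends on the data.

```python
import pandas as pd

# Hypothetical table: annual_revenue is derived from monthly_revenue (x 12).
df = pd.DataFrame({
    "monthly_revenue": [10.0, 20.0, 15.0, 30.0, 25.0],
    "annual_revenue":  [120.0, 240.0, 180.0, 360.0, 300.0],
    "employees":       [5, 3, 8, 4, 9],
})

corr = df.corr()  # pairwise Pearson correlation between numeric columns

threshold = 0.95  # assumed cut-off for "strongly correlated"
cols = list(corr.columns)
redundant_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]          # visit each unordered pair once
    if abs(corr.loc[a, b]) >= threshold
]

print(redundant_pairs)
```

A high correlation only suggests redundancy. Whether one of the two attributes can safely be dropped is still a judgment about the data.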
Manage Data Transformation
• Smoothing: remove noise from data (binning, clustering,
regression)
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified
range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Attribute/feature construction
• New attributes constructed from the given ones
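The three normalization methods listed above can be sketched with NumPy. The function names and sample values are assumptions for illustration. The z-score here uses the population standard deviation (NumPy's default), and both `min_max` and `z_score` assume the values are not all identical.

```python
import numpy as np


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map [min, max] of the data onto [new_min, new_max]."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min


def z_score(values):
    """Center on the mean and scale by the (population) standard deviation."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / v.std()


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max(|v|) < 1."""
    v = np.asarray(values, dtype=float)
    j = 0
    while np.abs(v).max() / 10 ** j >= 1:
        j += 1
    return v / 10 ** j


values = [200, 300, 400, 600, 1000]  # hypothetical attribute values
print(min_max(values))          # [0, 0.125, 0.25, 0.5, 1]
print(z_score(values))
print(decimal_scaling(values))  # [0.02, 0.03, 0.04, 0.06, 0.1]
```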
Data reduction
Why data reduction?
• A database/data warehouse may store terabytes of
data
• Complex data analysis/mining may take a very long
time to run on the complete data set
Data reduction strategies
• Data cube aggregation
• Dimensionality reduction e.g., remove unimportant
attributes
• Attribute subset selection
• Numerosity reduction e.g., fit data into models
• Discretization and concept hierarchy generation
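As one example of these strategies, aggregation can be sketched with a pandas group-by that rolls quarterly figures up to annual totals. The table and its column names are invented for the example.

```python
import pandas as pd

# Hypothetical quarterly sales figures.
quarterly = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [224, 408, 350, 586, 300, 410, 390, 600],
})

# Aggregate away the quarter dimension: one row per year instead of four.
annual = quarterly.groupby("year", as_index=False)["sales"].sum()

print(annual)
```

Eight rows shrink to two. An analysis that only needs yearly totals gets the same answer from the smaller table.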
Data Preprocessing for ML
• Get Dataset
• Importing the Libraries
• Importing the Dataset
• Missing data
• Categorical data
• Splitting the Dataset into the Training set and Test set
• Feature scaling
• Data Preprocessing template
Get Dataset
Download one from the Internet
Or create your own dataset
Importing the Libraries
A library is a tool that you can use to do a specific job
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
…
Importing Dataset
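dataset = pd.read_csv('Data.csv')  # load the CSV file into a DataFrame
X = dataset.iloc[:, :-1].values    # features: every column except the last
y = dataset.iloc[:, 3].values      # target: column index 3 (the last column of a four-column file)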
…
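A self-contained version of the import step, runnable without a file on disk, might look like this. The CSV text is a hypothetical stand-in for Data.csv: its column names and rows are assumptions made for the example, not the contents of the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for Data.csv (column names and rows are invented).
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
"""

dataset = pd.read_csv(io.StringIO(csv_text))

X = dataset.iloc[:, :-1].values  # features: every column except the last
y = dataset.iloc[:, -1].values   # target: the last column

print(X.shape)  # (3, 3)
print(y)
```

Using `-1` for the target picks the last column whatever the file's width. With four columns it selects the same column as index `3`.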
Missing Data
Missing data occurs often in real-world datasets, so we need
a strategy for handling it.
Ideas:
- Remove the rows with missing values: risky, because useful data is lost (not recommended).
- Replace the missing data by the median or mean of the feature
column
- Replace the missing data by the most frequent value of the
feature column
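These ideas can be compared on a small table with pandas. The columns `Age` and `Salary` and their values are assumptions for illustration. Which replacement suits a feature depends on its distribution.

```python
import numpy as np
import pandas as pd

# Hypothetical feature columns with one missing value each.
df = pd.DataFrame({
    "Age":    [44.0, 27.0, np.nan, 38.0, 40.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan, 48000.0],
})

# Removing incomplete rows discards two of the five observations here.
rows_left_after_drop = len(df.dropna())

# Replace by the mean or the median of the feature column.
age_mean_filled = df["Age"].fillna(df["Age"].mean())
age_median_filled = df["Age"].fillna(df["Age"].median())

# Replace by the most frequent value (mode) of the feature column.
salary_mode_filled = df["Salary"].fillna(df["Salary"].mode().iloc[0])

print(rows_left_after_drop)
print(age_mean_filled.tolist())
print(salary_mode_filled.tolist())
```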
Data Preprocessing
France Spain Germany