Data Preprocessing Part 1
Data Preprocessing Part 1
– Data Quality
• Data Cleaning
• Data Integration
• Data Reduction
3
3
Data Quality: Why Preprocess the Data?
4
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
5
Chapter 3: Data Preprocessing
– Data Quality
• Data Cleaning
• Data Integration
• Data Reduction
• Summary
6
6
Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
7
Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time
of entry
– not register history or changes of the data
• Missing data may need to be inferred
8
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant : e.g., “unknown”, a new class?!
• Use the attribute mean
• Use the attribute mean for all samples belonging to the same
class
• Use the most probable value: inference-based such as Bayesian
formula or decision tree
9
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
10
How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
11
How to Handle Noisy Data?
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
12
Data Cleaning as a Process
• Data discrepancy detection
– Use metadata (e.g., domain, range, dependency, distribution)
– Check field overloading
– Check uniqueness rule, consecutive rule and null rule
– Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code,
spell-check) to detect errors and make corrections
• Data auditing: by analyzing data to discover rules and relationship to
detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
– Data migration tools: allow transformations to be specified
– ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
• Integration of the two processes
– Iterative and interactive (e.g., Potter’s Wheels)
13
Outcome
• Data preprocessing is a data mining technique
which is used to transform the raw data in a
useful and efficient format.