Data Preprocessing Part 1

This document discusses data preprocessing techniques used in data warehousing and mining. It covers data quality dimensions such as accuracy, completeness, consistency, and timeliness. The major tasks in data preprocessing are data cleaning, integration, reduction, and transformation. Data cleaning handles incomplete, noisy, and inconsistent data through techniques such as filling in missing values, identifying outliers, and resolving inconsistencies. Data integration combines multiple data sources. Data reduction reduces dimensionality and numerosity, while data transformation applies techniques such as normalization and discretization.


Data Warehousing and Mining

Mrs. Pinki Vishwakarma


Associate Professor
Objective
• To understand data preprocessing techniques
Chapter 3: Data Preprocessing

• Data Preprocessing: An Overview

– Data Quality

– Major Tasks in Data Preprocessing

• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation and Data Discretization

Data Quality: Why Preprocess the Data?

• Measures for data quality: A multidimensional view


– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how much the data are trusted to be correct
– Interpretability: how easily the data can be understood?

Major Tasks in Data Preprocessing

• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
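Normalization, listed above as a data transformation task, can be sketched as min-max scaling, one common form. The function name and the salary values below are illustrative, not from the slides:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: rescale values linearly to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]

# Hypothetical salary attribute rescaled to [0, 1]
salaries = [30000, 45000, 60000, 90000]
print(min_max_normalize(salaries))  # [0.0, 0.25, 0.5, 1.0]
```

Other normalization schemes (z-score, decimal scaling) follow the same pattern of mapping each value through a fixed formula derived from the attribute's statistics.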
Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records

Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time
of entry
– history or changes of the data were not registered
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant : e.g., “unknown”, a new class?!
• Use the attribute mean
• Use the attribute mean for all samples belonging to the same
class
• Use the most probable value: inference-based such as Bayesian
formula or decision tree
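Two of the strategies above, filling with the attribute mean and filling with the per-class mean, can be sketched as follows. The helper names and the income values are hypothetical:

```python
def fill_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def fill_with_class_mean(rows):
    """rows: list of (class_label, value); fill None with the mean of its class."""
    by_class = {}
    for label, v in rows:
        if v is not None:
            by_class.setdefault(label, []).append(v)
    class_mean = {c: sum(vs) / len(vs) for c, vs in by_class.items()}
    return [(c, class_mean[c] if v is None else v) for c, v in rows]

incomes = [50, None, 70, None, 60]
print(fill_with_mean(incomes))  # [50, 60.0, 70, 60.0, 60]
```

The class-conditional version generally distorts the data less, since it borrows values only from tuples of the same class.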
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data

How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
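The binning steps above (sort, partition into equal-frequency bins, then smooth by means or by boundaries) can be sketched as below. The price values are an illustrative sample, not data from the slides:

```python
def equal_frequency_bins(values, n_bins):
    """Sort the data and split it into n_bins bins of (roughly) equal size."""
    data = sorted(values)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min/max boundary."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```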

How to Handle Noisy Data?
• Regression
– smooth by fitting the data into regression functions

• Clustering
– detect and remove outliers

• Combined computer and human inspection


– detect suspicious values and check by human (e.g., deal
with possible outliers)
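As one concrete way to "detect suspicious values and check by human", here is the interquartile-range rule, a simple technique not named on the slide (the age values are hypothetical):

```python
def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] for human review."""
    data = sorted(values)

    def quartile(q):
        # Linear interpolation between the two nearest order statistics
        pos = q * (len(data) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(data) - 1)
        return data[lo] + (data[hi] - data[lo]) * (pos - lo)

    q1, q3 = quartile(0.25), quartile(0.75)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

ages = [22, 25, 27, 30, 31, 33, 35, 150]
print(iqr_outliers(ages))  # [150]
```

A value flagged this way is only a candidate; as the slide notes, a human should confirm whether it is a genuine error before it is removed or corrected.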

Data Cleaning as a Process
• Data discrepancy detection
– Use metadata (e.g., domain, range, dependency, distribution)
– Check field overloading
– Check uniqueness rule, consecutive rule and null rule
– Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code,
spell-check) to detect errors and make corrections
• Data auditing: by analyzing data to discover rules and relationship to
detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
– Data migration tools: allow transformations to be specified
– ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
• Integration of the two processes
– Iterative and interactive (e.g., Potter’s Wheel)
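The uniqueness rule and null rule mentioned under discrepancy detection can be sketched as simple record-level checks; the field names and customer records below are hypothetical:

```python
def check_uniqueness(records, field):
    """Uniqueness rule: each value of `field` may appear at most once."""
    seen, violations = set(), []
    for r in records:
        if r[field] in seen:
            violations.append(r)
        seen.add(r[field])
    return violations

def check_null_rule(records, field, blanks=("", None)):
    """Null rule: flag records whose `field` is blank or missing."""
    return [r for r in records if r.get(field) in blanks]

customers = [
    {"id": 1, "occupation": "engineer"},
    {"id": 2, "occupation": ""},         # violates the null rule
    {"id": 1, "occupation": "teacher"},  # duplicate id violates uniqueness
]
print(check_uniqueness(customers, "id"))         # [{'id': 1, 'occupation': 'teacher'}]
print(check_null_rule(customers, "occupation"))  # [{'id': 2, 'occupation': ''}]
```

A consecutive rule (e.g., check numbers with no gaps) can be checked the same way by sorting the field values and scanning for missing successors.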

Outcome
• Data preprocessing is a data mining technique used to
transform raw data into a useful and efficient format.
