Why Data Preprocessing

The document discusses why data preprocessing is important. Real-world data is often dirty: incomplete, noisy, and inconsistent, which can negatively impact analysis. Data preprocessing aims to clean data by handling missing values, identifying outliers, resolving inconsistencies, integrating multiple sources, and reducing data volume while maintaining quality. The major tasks are data cleaning, integration, and reduction.

Uploaded by

kusamee0

Why Data Preprocessing?

► Data in the real world is dirty

   o incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
        e.g., occupation=“ ”
   o noisy: containing errors or outliers
        e.g., Salary=“-10”
   o inconsistent: containing discrepancies in codes or names
        e.g., Age=“42”, Birthday=“03/07/1997”
        e.g., was rating “1, 2, 3”, now rating “A, B, C”
        e.g., discrepancy between duplicate records
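
The three kinds of dirtiness listed above can be checked mechanically. The following is a minimal sketch, not a definitive validator: the function name `find_quality_issues`, the record layout, and the reference date are assumptions made for illustration, and the birthday “03/07/1997” is assumed to mean 3 July 1997.

```python
from datetime import date


def find_quality_issues(record, today):
    """Return (kind, message) tuples describing problems in one record."""
    issues = []

    # Incomplete: a required attribute is absent or blank.
    if not str(record.get("occupation", "")).strip():
        issues.append(("incomplete", "occupation is empty"))

    # Noisy: a value outside the plausible domain (a salary cannot be negative).
    salary = record.get("salary")
    if salary is not None and salary < 0:
        issues.append(("noisy", f"salary={salary} is negative"))

    # Inconsistent: the stored age disagrees with the age implied by the birthday.
    birthday, age = record.get("birthday"), record.get("age")
    if birthday is not None and age is not None:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        implied = today.year - birthday.year - (0 if had_birthday else 1)
        if implied != age:
            issues.append(("inconsistent", f"age={age} but birthday implies {implied}"))

    return issues


# Hypothetical record built from the slide's examples.
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(1997, 7, 3)}
issues = find_quality_issues(record, today=date(2024, 1, 1))
for kind, message in issues:
    print(kind, "->", message)
```

Passing the reference date explicitly keeps the age check reproducible; a production check would more likely use the date on which the record was collected.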

Why Is Data Dirty?

► Incomplete data may come from
   o “Not applicable” data value when collected
   o Different considerations between the time when the data was collected and when it is analyzed
   o Human/hardware/software problems
► Noisy data (incorrect values) may come from
   o Faulty data collection instruments
   o Human or computer error at data entry
   o Errors in data transmission
► Inconsistent data may come from
   o Different data sources
   o Functional dependency violation (e.g., modifying some linked data but not the rest)
► Duplicate records also need data cleaning

Why Is Data Preprocessing Important?

■ No quality data, no quality mining results!

   o Quality decisions must be based on quality data
        e.g., duplicate or missing data may cause incorrect or even misleading statistics
   o A data warehouse needs consistent integration of quality data

■ Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse.
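
To make the point about misleading statistics concrete, here is a small sketch with made-up order data (the field names `order_id` and `amount` are assumptions). One order loaded twice is enough to shift the average noticeably.

```python
from statistics import mean

# Hypothetical order records; order 2 was loaded twice by mistake.
orders = [
    {"order_id": 1, "amount": 100.0},
    {"order_id": 2, "amount": 900.0},
    {"order_id": 2, "amount": 900.0},
    {"order_id": 3, "amount": 200.0},
]

# Mean computed on the raw data, duplicate included.
raw_mean = mean(o["amount"] for o in orders)

# Keep only the first record seen for each order_id, then recompute.
unique = {}
for o in orders:
    unique.setdefault(o["order_id"], o)
clean_mean = mean(o["amount"] for o in unique.values())

print(raw_mean, clean_mean)  # 525.0 400.0
```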

Multi-Dimensional Measure of Data Quality

► A well-accepted multidimensional view:
   o Accuracy
   o Completeness
   o Consistency
   o Timeliness
   o Believability
   o Value added
   o Interpretability
   o Accessibility
► Broad categories:
   o Intrinsic, contextual, representational, and accessibility

Major Tasks in Data Preprocessing
► Data cleaning
   o Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
► Data integration
   o Integration of multiple databases, data cubes, or files
► Data reduction
   o Obtain a representation that is reduced in volume but produces the same or similar analytical results

[Figure: Forms of Data Preprocessing]

Data Cleaning
1. Importance
   o “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
   o “Data cleaning is the number one problem in data warehousing” (DCI survey)
2. Data cleaning tasks
   o Fill in missing values
   o Identify outliers and smooth out noisy data
   o Correct inconsistent data
   o Resolve redundancy caused by data integration

How to Handle Missing Data?
► Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
► Fill in the missing value manually: tedious and often infeasible
► Fill it in automatically with
   o a global constant, e.g., “unknown” (which may effectively form a new class)
   o the attribute mean
   o the attribute mean for all samples belonging to the same class (smarter)
   o the most probable value: inference-based, such as a Bayesian formula or a decision tree
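
The first three automatic strategies can be sketched in a few lines. This is an illustrative sketch on made-up data, not a reference implementation: the helper names (`fill_constant`, `fill_mean`, `fill_class_mean`) and the `cls`/`income` fields are assumptions, and `None` stands for a missing value. The inference-based strategy is left out because it needs a trained model.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: None marks a missing income value.
rows = [
    {"cls": "low", "income": 20.0},
    {"cls": "low", "income": None},
    {"cls": "low", "income": 30.0},
    {"cls": "high", "income": 90.0},
    {"cls": "high", "income": None},
    {"cls": "high", "income": 110.0},
]


def fill_constant(rows, attr, constant="unknown"):
    """Replace every missing value with one global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]


def fill_mean(rows, attr):
    """Replace missing values with the mean of the observed values."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: overall if r[attr] is None else r[attr]} for r in rows]


def fill_class_mean(rows, attr, label):
    """Replace missing values with the mean observed within the same class."""
    by_class = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            by_class[r[label]].append(r[attr])
    means = {c: mean(values) for c, values in by_class.items()}
    return [{**r, attr: means[r[label]] if r[attr] is None else r[attr]} for r in rows]


print([r["income"] for r in fill_constant(rows, "income")])
print([r["income"] for r in fill_mean(rows, "income")])
print([r["income"] for r in fill_class_mean(rows, "income", "cls")])
```

On this data the overall mean fills both gaps with 62.5, while the class-conditional mean fills them with 25.0 and 100.0, which is why the slide calls the latter smarter. Note that `fill_class_mean` as written raises `KeyError` for a class with no observed values at all; a fuller version would need a fallback.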

Noisy Data
► Noise: random error or variance in a measured variable
► Incorrect attribute values may be due to
   o faulty data collection instruments
   o data entry problems
   o data transmission problems
   o technology limitations
   o inconsistency in naming conventions
► Other data problems that require data cleaning
   o duplicate records
   o incomplete data
   o inconsistent data
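
The slides do not name a specific technique for spotting noisy values, so the following sketch swaps in one common choice, the interquartile-range (IQR) rule, purely as an illustration. The function name `iqr_outliers`, the multiplier `k=1.5`, and the sample readings are all assumptions.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _median, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical sensor readings with two suspicious entries.
readings = [48, 50, 51, 52, 49, 50, 53, -10, 51, 250]
outliers = iqr_outliers(readings)
print(outliers)  # [-10, 250]
```

Flagging is deliberately separated from fixing: whether a flagged value is removed, smoothed, or kept is a judgment that depends on the data.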

Data Integration
► Data integration:
   o Combines data from multiple sources into a coherent store
► Schema integration: e.g., A.cust-id ≡ B.cust-#
   o Integrate metadata from different sources
► Entity identification problem:
   o Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
► Detecting and resolving data value conflicts
   o For the same real-world entity, attribute values from different sources are different
   o Possible reasons: different representations, different scales, e.g., metric vs. British units
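
As a rough sketch of these steps, the example below maps two hypothetical sources onto one schema (`cust-id` and `cust-#` both become `cust_id`), converts pounds to kilograms, and reports value conflicts for the same entity. The record layouts, the helper names, and the 1 kg tolerance are assumptions for illustration; real entity identification is usually far less clean than matching on a shared key.

```python
LB_TO_KG = 0.45359237  # definition of the international pound in kilograms

# Hypothetical source A: key is "cust-id", weight stored in kilograms.
source_a = [{"cust-id": 17, "name": "William Clinton", "weight_kg": 80.0}]
# Hypothetical source B: key is "cust-#", weight stored in pounds.
source_b = [{"cust-#": 17, "name": "Bill Clinton", "weight_lb": 176.0}]


def normalize_a(r):
    """Map a source A record onto the shared schema."""
    return {"cust_id": r["cust-id"], "name": r["name"], "weight_kg": r["weight_kg"]}


def normalize_b(r):
    """Map a source B record onto the shared schema, converting units."""
    return {
        "cust_id": r["cust-#"],
        "name": r["name"],
        "weight_kg": round(r["weight_lb"] * LB_TO_KG, 2),
    }


def integrate(a_rows, b_rows, tolerance_kg=1.0):
    """Merge on cust_id; source A wins, disagreements are reported."""
    merged, conflicts = {}, []
    for r in map(normalize_a, a_rows):
        merged[r["cust_id"]] = r
    for r in map(normalize_b, b_rows):
        existing = merged.get(r["cust_id"])
        if existing is None:
            merged[r["cust_id"]] = r
            continue
        if existing["name"] != r["name"]:
            conflicts.append((r["cust_id"], "name", existing["name"], r["name"]))
        if abs(existing["weight_kg"] - r["weight_kg"]) > tolerance_kg:
            conflicts.append(
                (r["cust_id"], "weight_kg", existing["weight_kg"], r["weight_kg"])
            )
    return list(merged.values()), conflicts


customers, conflicts = integrate(source_a, source_b)
print(customers)
print(conflicts)
```

After unit conversion the two weights agree within the tolerance, so only the name difference is reported; deciding that “Bill Clinton” and “William Clinton” are the same person is left to a human or a dedicated matching step.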

Handling Redundancy in Data Integration

► Redundant data often occur when multiple databases are integrated
   o Object identification: the same attribute or object may have different names in different databases
   o Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
► Redundant attributes may be detected by correlation analysis
► Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality
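
One way to sketch correlation analysis for numeric attributes is the Pearson coefficient: attribute pairs whose absolute correlation is close to 1 are candidates for redundancy. The column names, the sample values, and the 0.95 threshold below are assumptions chosen for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def redundant_pairs(columns, threshold=0.95):
    """Return attribute pairs whose absolute correlation exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) > threshold
    ]


# Hypothetical table: annual_revenue is derivable from monthly_revenue.
monthly = [10.0, 12.0, 9.0, 15.0, 11.0]
columns = {
    "monthly_revenue": monthly,
    "annual_revenue": [12 * m for m in monthly],
    "headcount": [3, 9, 4, 5, 8],
}
redundant = redundant_pairs(columns)
print(redundant)  # [('monthly_revenue', 'annual_revenue')]
```

A high correlation only suggests redundancy; it does not prove that one attribute is derived from the other, so flagged pairs still need a look before anything is dropped. `pearson` also divides by zero for a constant column, which a fuller version would guard against.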

Data Reduction Strategies
◇ Why data reduction?
   ⇰ A database/data warehouse may store terabytes of data
   ⇰ Complex data analysis/mining may take a very long time to run on the complete data set
◇ Data reduction
   ⇰ Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
◇ Data reduction strategies
   ⇰ Data cube aggregation
   ⇰ Dimensionality reduction, e.g., remove unimportant attributes
   ⇰ Data compression
   ⇰ Numerosity reduction, e.g., fit data into models
   ⇰ Discretization and concept hierarchy generation
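
Aggregation is the easiest of these strategies to show in code. The sketch below rolls hypothetical quarterly sales up to annual totals: the data set shrinks from eight rows to two, while any analysis that only needs yearly figures gets the same answer. The data and the function name `aggregate_by_year` are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, sales amount).
quarterly_sales = [
    (2022, "Q1", 224.0), (2022, "Q2", 408.0), (2022, "Q3", 350.0), (2022, "Q4", 586.0),
    (2023, "Q1", 300.0), (2023, "Q2", 416.0), (2023, "Q3", 380.0), (2023, "Q4", 620.0),
]


def aggregate_by_year(rows):
    """Sum the amounts over the quarter dimension, keeping one total per year."""
    totals = defaultdict(float)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)


annual_sales = aggregate_by_year(quarterly_sales)
print(annual_sales)  # {2022: 1568.0, 2023: 1716.0}
```

The trade-off is that quarter-level questions can no longer be answered from the reduced data, so the aggregation level should match the analysis that will actually be run.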
