Data Preprocessing Part 1

This document discusses data preprocessing techniques used in data warehousing and mining. It covers data quality measures such as accuracy, completeness, consistency and timeliness. The major tasks in data preprocessing are data cleaning, integration, reduction, and transformation. Data cleaning handles incomplete, noisy and inconsistent data by filling in missing values, identifying outliers, and resolving inconsistencies. Data integration combines multiple data sources. Data reduction covers dimensionality reduction, numerosity reduction and data compression. Data transformation and discretization cover techniques like normalization and concept hierarchy generation.

Data Warehousing and Mining

Mrs. Pinki Vishwakarma

Associate Professor
Objective
• To understand data preprocessing techniques
Chapter 3: Data Preprocessing

• Data Preprocessing: An Overview

– Data Quality

– Major Tasks in Data Preprocessing

• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation and Data Discretization

Data Quality: Why Preprocess the Data?

• Measures for data quality: A multidimensional view

– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how far can the data be trusted to be correct?
– Interpretability: how easily the data can be understood?

Major Tasks in Data Preprocessing

• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data, e.g.,
faulty instruments, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
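
The three kinds of dirty data listed above can be flagged with a few simple checks. The following is a minimal sketch in plain Python, not a complete cleaning routine. The record layout, the field names (occupation, salary, age, birthday), the MM/DD/YYYY date format and the reference date passed as `today` are all assumptions made for illustration.

```python
from datetime import date, datetime

def find_problems(record, today):
    """Return (kind, message) pairs for one record (hypothetical layout)."""
    problems = []
    # incomplete: an attribute value is missing or blank
    if not (record.get("occupation") or "").strip():
        problems.append(("incomplete", "occupation is missing"))
    # noisy: a value that cannot be right, such as a negative salary
    salary = record.get("salary")
    if salary is not None and salary < 0:
        problems.append(("noisy", f"salary {salary} is negative"))
    # inconsistent: stored age disagrees with the age implied by the birthday
    age, birthday = record.get("age"), record.get("birthday")
    if age is not None and birthday:
        born = datetime.strptime(birthday, "%m/%d/%Y").date()  # assumed MM/DD/YYYY
        had_birthday = (today.month, today.day) >= (born.month, born.day)
        implied = today.year - born.year - (0 if had_birthday else 1)
        if implied != age:
            problems.append(("inconsistent", f"age {age} but birthday implies {implied}"))
    return problems

# The example values from the slide, combined into one record.
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": "03/07/2010"}
problems = find_problems(record, today=date(2024, 1, 1))
for kind, message in problems:
    print(kind, "-", message)
```

Passing the reference date explicitly keeps the age check reproducible; a real pipeline would more likely use the date on which the record was captured.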

Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– data inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data not considered important at the time
of entry
– history or changes of the data not registered
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (when
doing classification); not effective when the % of missing values
per attribute varies considerably
• Fill in the missing value manually: tedious and often infeasible
• Use a global constant: e.g., “unknown” (which may be mistaken for a new class)
• Use the attribute mean
• Use the attribute mean for all samples belonging to the same
class
• Use the most probable value: inference-based such as Bayesian
formula or decision tree
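
Several of the strategies above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the use of `None` as the missing-value marker, the function names and the sample income data are assumptions, and the inference-based option (Bayesian formula or decision tree) is not shown.

```python
from collections import defaultdict
from statistics import mean

MISSING = None  # assumed marker for a missing value

def ignore_unlabelled(values, labels):
    """Ignore the tuple: drop pairs whose class label is missing."""
    return [(v, c) for v, c in zip(values, labels) if c is not MISSING]

def fill_constant(values, constant="unknown"):
    """Use a global constant in place of every missing value."""
    return [constant if v is MISSING else v for v in values]

def fill_mean(values):
    """Use the attribute mean, computed over the known values only."""
    m = mean(v for v in values if v is not MISSING)
    return [m if v is MISSING else v for v in values]

def fill_class_mean(values, labels):
    """Use the attribute mean of the samples in the same class.
    Assumes every class has at least one known value."""
    by_class = defaultdict(list)
    for v, c in zip(values, labels):
        if v is not MISSING:
            by_class[c].append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [class_means[c] if v is MISSING else v for v, c in zip(values, labels)]

# Hypothetical customer income (in thousands) with a class label per customer.
income = [30, None, 50, 100, None, 120]
label = ["low", "low", "low", "high", "high", "high"]

print(fill_constant(income))
print(fill_mean(income))
print(fill_class_mean(income, label))
```

On this sample the overall mean fills both gaps with the same value, while the class-wise mean fills the "low" gap and the "high" gap with different values, which is why the class-wise variant is often preferred when class labels are available.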
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data

How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
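
The binning steps above can be sketched as follows. This is a minimal illustration in plain Python; the function names and the sample price list are assumptions, and the sketch assumes there are at least as many values as bins.

```python
from statistics import median

def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (near-)equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin by the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin minimum and maximum
    (ties go to the minimum)."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sample data
bins = equal_frequency_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

Smoothing by means or medians pulls every value in a bin to one representative, whereas smoothing by boundaries keeps the bin's extremes and moves only the interior values.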

How to Handle Noisy Data?
• Regression
– smooth by fitting the data to regression functions

• Clustering
– detect and remove outliers

• Combined computer and human inspection

– detect suspicious values and have a human check them (e.g., deal
with possible outliers)
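
As a rough illustration of regression smoothing combined with human inspection, the sketch below fits a straight line by ordinary least squares, uses the fitted values as the smoothed data, and flags points whose residual exceeds a threshold so that a person can review them. The choice of a simple linear model, the threshold value and the sample data are assumptions; clustering-based outlier detection is not shown.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def smooth_and_flag(xs, ys, threshold):
    """Return the fitted (smoothed) values and the indices of points whose
    absolute residual exceeds the threshold."""
    a, b = fit_line(xs, ys)
    fitted = [a + b * x for x in xs]
    suspicious = [i for i, (y, f) in enumerate(zip(ys, fitted))
                  if abs(y - f) > threshold]
    return fitted, suspicious

# Hypothetical measurements; the fourth reading looks out of line.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 20.0, 10.1, 11.8]

fitted, suspicious = smooth_and_flag(xs, ys, threshold=5.0)
print("smoothed:", [round(f, 2) for f in fitted])
print("indices to check by hand:", suspicious)
```

Note that the outlier itself pulls the fitted line towards it, so in practice one might refit after the flagged points have been reviewed.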

Data Cleaning as a Process
• Data discrepancy detection
– Use metadata (e.g., domain, range, dependency, distribution)
– Check field overloading
– Check uniqueness rule, consecutive rule and null rule
– Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code,
spell-check) to detect errors and make corrections
• Data auditing: analyze the data to discover rules and relationships and
detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
– Data migration tools: allow transformations to be specified
– ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
• Integration of the two processes
– Iterative and interactive (e.g., Potter’s Wheel)
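
The discrepancy-detection checks named above can be sketched as small rule functions. The sketch assumes the usual reading of the rules: a uniqueness rule requires each value of an attribute to be distinct, a consecutive rule allows no gaps between the lowest and highest value, a null rule states which markers denote a missing value, and a range check comes from metadata. The row layout, field names, null markers and age range are all hypothetical.

```python
NULL_MARKERS = {None, "", "?", "N/A"}  # assumed null rule for this data set

def check_unique(rows, key):
    """Uniqueness rule: return values of `key` that occur more than once."""
    seen, dupes = set(), []
    for row in rows:
        value = row[key]
        if value in seen and value not in dupes:
            dupes.append(value)
        seen.add(value)
    return dupes

def check_consecutive(rows, key):
    """Consecutive rule: return integers missing between the min and max of `key`."""
    present = {row[key] for row in rows}
    return [n for n in range(min(present), max(present) + 1) if n not in present]

def check_nulls(rows, key):
    """Null rule: return indices of rows where `key` holds a null marker."""
    return [i for i, row in enumerate(rows) if row.get(key) in NULL_MARKERS]

def check_range(rows, key, low, high):
    """Metadata range check: return indices of rows with `key` outside [low, high]."""
    return [i for i, row in enumerate(rows)
            if isinstance(row.get(key), (int, float)) and not low <= row[key] <= high]

rows = [
    {"invoice": 101, "customer": "A17", "age": 34},
    {"invoice": 102, "customer": "?", "age": 29},
    {"invoice": 102, "customer": "B02", "age": 210},
    {"invoice": 105, "customer": "", "age": 41},
]

print("duplicate invoices:", check_unique(rows, "invoice"))
print("missing invoice numbers:", check_consecutive(rows, "invoice"))
print("rows with null customer:", check_nulls(rows, "customer"))
print("rows with age out of range:", check_range(rows, "age", 0, 120))
```

Each function only reports violations. Deciding how to correct them is left to the scrubbing and auditing steps, which fits the iterative and interactive process described above.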

Outcome
• Data preprocessing is a data mining technique
used to transform raw data into a
useful and efficient format.
