M 2.3 Data Preprocessing

The document discusses the challenges of real-world data, highlighting issues such as incompleteness, noise, and inconsistency, and emphasizes the necessity of quality data for effective data mining. It outlines major tasks in data pre-processing, including data cleaning, integration, transformation, reduction, and discretization, with a focus on methods for handling missing and noisy data. Various techniques for data cleaning, such as regression, clustering, and the use of data auditing tools, are also described to ensure data quality.


❖ Data in the real world is dirty
✔ incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
✔ noisy: containing errors or outliers
✔ inconsistent: containing discrepancies in codes or names
❖ No quality data, no quality mining results!
✔ Quality decisions must be based on quality data
✔ A data warehouse needs consistent integration of quality data
Major Tasks in Data Pre-processing

• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
• Data Discretization
Data Pre-processing Methods
Data Cleaning
• Real-world data is incomplete, noisy, and inconsistent.
• Data cleaning fills in missing values, smooths out noise while identifying outliers, and corrects inconsistencies in the data.
Data cleaning methods:
1) Missing Values
• Ignore the tuple
This is usually done when the class label is missing. It is not effective unless the tuple contains several attributes with missing values.
• Fill in the missing value manually
This is time-consuming and infeasible for a large data set with many missing values.
• Use a global constant to fill in the missing value
For example, −∞ or "unknown". However, there is a chance the mining program will misinterpret "unknown" as a meaningful value.
• Use the attribute mean to fill in the missing value. For example, if the customers' average income is 25,000, you can use this value to replace a missing value for income.
• Use the attribute mean for all samples belonging to the same class as the given tuple.
• Use the most probable value to fill in the missing value. This value may be determined by regression, inference-based tools, or decision tree induction.
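The mean-based fill-in strategies above can be sketched in plain Python. This is a minimal illustration over made-up customer records; the attribute names (`income`, `risk`) and the data are hypothetical, not part of any library API:

```python
from statistics import mean

def fill_missing_with_mean(rows, attr):
    # Replace None values of `attr` with the mean of the observed values.
    observed = [r[attr] for r in rows if r[attr] is not None]
    fill = mean(observed)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]

def fill_missing_with_class_mean(rows, attr, label):
    # Replace None values of `attr` with the mean of the tuple's own class.
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[label], []).append(r[attr])
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [{**r, attr: class_mean[r[label]] if r[attr] is None else r[attr]}
            for r in rows]

customers = [
    {"income": 20000, "risk": "low"},
    {"income": 30000, "risk": "low"},
    {"income": None,  "risk": "low"},
    {"income": 50000, "risk": "high"},
]
```

The class-conditional version typically gives a more plausible fill value, since it only averages over tuples of the same class.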
2) Noisy Data

• Noise is a random error or variance in a measured variable. Noisy data may be due to faulty data collection instruments, data entry problems, and technology limitations.
Handling Noisy Data
1. Binning:
Binning methods smooth a sorted data value by consulting its "neighbourhood," that is, the values around it. The sorted values are distributed into a number of "buckets," or bins.
a) Smoothing by bin means
b) Smoothing by bin medians
c) Smoothing by bin boundaries
Example: Data for price (in dollars):
15, 4, 8, 21, 21, 24, 28, 25, 34
Sorted data for price (in dollars):
4, 8, 15, 21, 21, 24, 25, 28, 34
• Partition into equal-frequency bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
a) Smoothing by bin means
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
Bin 1: 9, 9, 9 🡪 [(4+8+15)/3 = 9]
Bin 2: 22, 22, 22 🡪 [(21+21+24)/3 = 22]
Bin 3: 29, 29, 29 🡪 [(25+28+34)/3 = 29]
b) Smoothing by bin medians
Each value in a bin is replaced by the median of all the values belonging to the same bin.
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28

c) Smoothing by bin boundaries
In smoothing by bin boundaries, the minimum and maximum values of a bin are taken as the bin boundaries, and each bin value is replaced by the closest boundary value.
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
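The three smoothing methods above can be sketched in Python. This is a minimal illustration, assuming the data divides evenly into bins and that a value equidistant from both boundaries goes to the lower one:

```python
import statistics

def equal_frequency_bins(values, n_bins):
    # Sort the data, then split it into bins of equal size.
    data = sorted(values)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # Replace every value in a bin with the bin mean.
    return [[round(sum(b) / len(b)) for _ in b] for b in bins]

def smooth_by_medians(bins):
    # Replace every value in a bin with the bin median.
    return [[statistics.median(b) for _ in b] for b in bins]

def smooth_by_boundaries(bins):
    # Replace every value with the closer of the bin's min/max
    # (ties go to the lower boundary).
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if x - lo <= hi - x else hi for x in b])
    return out

prices = [15, 4, 8, 21, 21, 24, 28, 25, 34]
bins = equal_frequency_bins(prices, 3)
```

Running this on the price data reproduces the bins and smoothed values worked out above.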
Example 2: Partition the given data into 4 bins using the equal-depth (equal-frequency) binning method and perform smoothing according to the following methods:

Data: 11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21, 22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75

a) Smoothing by bin means
b) Smoothing by bin medians
c) Smoothing by bin boundaries

Divide the data into 4 equal-depth bins:
Bin 1: 11, 13, 13, 15, 15, 16
Bin 2: 19, 20, 20, 20, 21, 21
Bin 3: 22, 23, 24, 30, 40, 45
Bin 4: 45, 45, 71, 72, 73, 75
Smoothing by means:
Bin 1: 13.83, 13.83, 13.83, 13.83, 13.83, 13.83
Bin 2: 20.17, 20.17, 20.17, 20.17, 20.17, 20.17
Bin 3: 30.67, 30.67, 30.67, 30.67, 30.67, 30.67
Bin 4: 63.5, 63.5, 63.5, 63.5, 63.5, 63.5
Smoothing by boundaries:
Bin 1: 11, 11, 11, 16, 16, 16
Bin 2: 19, 19, 19, 19, 21, 21 (the value 20 is equidistant; ties go to the lower boundary)
Bin 3: 22, 22, 22, 22, 45, 45
Bin 4: 45, 45, 75, 75, 75, 75
Smoothing by medians:
Bin 1: 14, 14, 14, 14, 14, 14 🡪 [(13+15)/2 = 14]
Bin 2: 20, 20, 20, 20, 20, 20 🡪 [(20+20)/2 = 20]
Bin 3: 27, 27, 27, 27, 27, 27 🡪 [(24+30)/2 = 27]
Bin 4: 71.5, 71.5, 71.5, 71.5, 71.5, 71.5 🡪 [(71+72)/2 = 71.5]
2. Regression
• Data can be smoothed by fitting the data to a function, such as with regression.
• Linear regression involves finding the "best" line to fit two attributes, so that one attribute can be used to predict the other.
• Multiple linear regression is an extension where more than two attributes are involved and the data are fit to a multidimensional surface.
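A minimal sketch of smoothing by linear regression, assuming simple ordinary least squares on two attributes (the names `xs` and `ys` are placeholders for the predictor and the attribute being smoothed):

```python
def linear_fit(xs, ys):
    # Ordinary least squares for the line y = a*x + b.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

def smooth_with_line(xs, ys):
    # Replace each observed y with the value predicted by the fitted line,
    # which smooths out random noise in the measurements.
    a, b = linear_fit(xs, ys)
    return [a * x + b for x in xs]
```

For data that already lies on a line, smoothing leaves it unchanged; noisy measurements are pulled onto the fitted line.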
3. Clustering
• Outliers may be detected by clustering.
• Similar values are organized into groups, or "clusters."
• Values that fall outside of the clusters may be considered outliers.
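The idea can be sketched as follows, assuming cluster centroids have already been computed (e.g., by some clustering algorithm) and that `max_dist` is a hypothetical distance threshold chosen by the analyst:

```python
def flag_outliers(values, centroids, max_dist):
    # A value is flagged as an outlier when it lies farther than
    # max_dist from every cluster centroid.
    return [v for v in values
            if min(abs(v - c) for c in centroids) > max_dist]
```

For example, with two clusters centred near 2 and 10, a lone value of 50 falls outside both groups and is flagged.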
Data Cleaning as a Process
• The first step in data cleaning is discrepancy detection.
✔ Discrepancies are caused by poorly designed data entry forms, human error in data entry, deliberate errors, data decay, inconsistent data representation, and inconsistent use of codes.
✔ Field overloading may also cause discrepancies.
• Use metadata for discrepancy detection.
Data Cleaning as a Process
• Data should be examined using unique rules, consecutive rules, and null rules.
• A unique rule says that each value of the given attribute must be different from all other values for that attribute.
• A consecutive rule says that there can be no missing values between the lowest and highest values for the attribute, and that all values must be unique.
• A null rule specifies the use of question marks, special characters, or other strings that may indicate the null condition, and how such values should be handled.
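The three kinds of rules can be sketched as simple checks in Python; the particular null markers shown are an assumption for illustration, not a standard:

```python
from collections import Counter

def unique_rule_violations(values):
    # Unique rule: report any value that occurs more than once.
    return sorted(v for v, n in Counter(values).items() if n > 1)

def consecutive_rule_holds(values):
    # Consecutive rule: all values unique, with no gaps between
    # the lowest and highest values.
    return sorted(values) == list(range(min(values), max(values) + 1))

def apply_null_rule(values, null_markers=("?", "NA", "")):
    # Null rule: map agreed-upon null markers to None for later handling.
    return [None if v in null_markers else v for v in values]
```

Checks like these are what data auditing tools automate at scale when searching for rule violations.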
Data Cleaning as a Process
• Data scrubbing tools use simple domain knowledge to detect errors and correct the data; they often employ parsing and fuzzy matching techniques.
• Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and by detecting data that violate such conditions.
Data Cleaning as a Process
• Data transformation defines and applies a series of transformations to correct discrepancies.
• Extraction/Transformation/Loading (ETL) tools allow users to specify transformations through a graphical user interface.
