M 2.3 Data Preprocessing

The document discusses the challenges of real-world data, highlighting issues such as incompleteness, noise, and inconsistency, and emphasizes the necessity of quality data for effective data mining. It outlines major tasks in data pre-processing, including data cleaning, integration, transformation, reduction, and discretization, with a focus on methods for handling missing and noisy data. Various techniques for data cleaning, such as regression, clustering, and the use of data auditing tools, are also described to ensure data quality.


❖ Data in the real world is dirty
✔ incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
✔ noisy: containing errors or outliers
✔ inconsistent: containing discrepancies in codes or names
❖ No quality data, no quality mining results!
✔ Quality decisions must be based on quality data
✔ A data warehouse needs consistent integration of quality data
Major Tasks in Data Pre-processing

• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
• Data Discretization
Data Pre-processing Methods
Data Cleaning
• Real-world data is incomplete, noisy, and inconsistent.
• Data cleaning fills in missing values, smooths out noise while identifying outliers, and corrects inconsistencies in the data.
Data cleaning methods:
1) Missing Values
• Ignore the tuple
This is usually done when the class label is missing. It is not effective unless the tuple contains several attributes with missing values.
• Fill in the missing value manually
This is time-consuming and infeasible for a large data set with many missing values.
• Use a global constant to fill in the missing value
For example, −∞ or "unknown". However, there is a chance the mining program will misinterpret "unknown" as a meaningful value.
• Use the attribute mean to fill in the missing value. For example, if the customers' average income is 25,000, you can use this value to replace a missing value for income.
• Use the attribute mean for all samples belonging to the same class as the given tuple.
• Use the most probable value to fill in the missing value. This value may be determined by regression, inference-based tools, or decision tree induction.
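The mean-based fill-in strategies above can be sketched in plain Python. This is a minimal illustration over made-up customer records; the attribute names (`income`, `risk`) and the data are hypothetical, not part of any library API:

```python
from statistics import mean

def fill_missing_with_mean(rows, attr):
    # Replace None values of `attr` with the mean of the observed values.
    observed = [r[attr] for r in rows if r[attr] is not None]
    fill = mean(observed)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]

def fill_missing_with_class_mean(rows, attr, label):
    # Replace None values of `attr` with the mean of the tuple's own class.
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[label], []).append(r[attr])
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [{**r, attr: class_mean[r[label]] if r[attr] is None else r[attr]}
            for r in rows]

customers = [
    {"income": 20000, "risk": "low"},
    {"income": 30000, "risk": "low"},
    {"income": None,  "risk": "low"},
    {"income": 50000, "risk": "high"},
]
```

The class-conditional version typically gives a more plausible fill value, since it only averages over tuples of the same class.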
2) Noisy Data

• Noise is a random error or variance in a measured variable. Noisy data may be due to faulty data collection instruments, data entry problems, and technology limitations.
Handling Noisy Data
1. Binning:
Binning methods smooth a sorted data value by consulting its "neighbourhood," that is, the values around it. The sorted values are distributed into a number of "buckets," or bins.
a) Smoothing by bin means
b) Smoothing by bin medians
c) Smoothing by bin boundaries
Example: Data for price (in dollars):
15, 4, 8, 21, 21, 24, 28, 25, 34
Sorted data for price (in dollars):
4, 8, 15, 21, 21, 24, 25, 28, 34
• Partition into equal-frequency bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
a) Smoothing by bin means
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
Bin 1: 9, 9, 9 🡪 [(4+8+15)/3 = 9]
Bin 2: 22, 22, 22 🡪 [(21+21+24)/3 = 22]
Bin 3: 29, 29, 29 🡪 [(25+28+34)/3 = 29]
b) Smoothing by bin medians
Each value in a bin is replaced by the median of all the values belonging to the same bin.
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28

c) Smoothing by bin boundaries
In smoothing by bin boundaries, the minimum and maximum values of a bin are taken as the bin boundaries, and each bin value is replaced by the closest boundary value.
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
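The three smoothing methods above can be sketched in Python. This is a minimal illustration, assuming the data divides evenly into bins and that a value equidistant from both boundaries goes to the lower one:

```python
import statistics

def equal_frequency_bins(values, n_bins):
    # Sort the data, then split it into bins of equal size.
    data = sorted(values)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # Replace every value in a bin with the bin mean.
    return [[round(sum(b) / len(b)) for _ in b] for b in bins]

def smooth_by_medians(bins):
    # Replace every value in a bin with the bin median.
    return [[statistics.median(b) for _ in b] for b in bins]

def smooth_by_boundaries(bins):
    # Replace every value with the closer of the bin's min/max
    # (ties go to the lower boundary).
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if x - lo <= hi - x else hi for x in b])
    return out

prices = [15, 4, 8, 21, 21, 24, 28, 25, 34]
bins = equal_frequency_bins(prices, 3)
```

Running this on the price data reproduces the bins and smoothed values worked out above.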
Example 2: Partition the given data into 4 bins using the equal-depth (equal-frequency) binning method and perform smoothing according to the following methods:

Data: 11, 13, 13, 15, 15, 16, 19, 20, 20, 20, 21, 21, 22, 23, 24, 30, 40, 45, 45, 45, 71, 72, 73, 75

a) Smoothing by bin means
b) Smoothing by bin medians
c) Smoothing by bin boundaries

Divide the data into 4 equal-depth bins:
Bin 1: 11, 13, 13, 15, 15, 16
Bin 2: 19, 20, 20, 20, 21, 21
Bin 3: 22, 23, 24, 30, 40, 45
Bin 4: 45, 45, 71, 72, 73, 75
Smoothing by means:
Bin 1: 13.83, 13.83, 13.83, 13.83, 13.83, 13.83
Bin 2: 20.17, 20.17, 20.17, 20.17, 20.17, 20.17
Bin 3: 30.67, 30.67, 30.67, 30.67, 30.67, 30.67
Bin 4: 63.5, 63.5, 63.5, 63.5, 63.5, 63.5
Smoothing by boundaries:
Bin 1: 11, 11, 11, 16, 16, 16
Bin 2: 19, 19, 19, 19, 21, 21 (the value 20 is equidistant; ties go to the lower boundary)
Bin 3: 22, 22, 22, 22, 45, 45
Bin 4: 45, 45, 75, 75, 75, 75
Smoothing by medians:
Bin 1: 14, 14, 14, 14, 14, 14 🡪 [(13+15)/2 = 14]
Bin 2: 20, 20, 20, 20, 20, 20 🡪 [(20+20)/2 = 20]
Bin 3: 27, 27, 27, 27, 27, 27 🡪 [(24+30)/2 = 27]
Bin 4: 71.5, 71.5, 71.5, 71.5, 71.5, 71.5 🡪 [(71+72)/2 = 71.5]
2. Regression
• Data can be smoothed by fitting the data to a function, such as with regression.
• Linear regression involves finding the "best" line to fit two attributes, so that one attribute can be used to predict the other.
• Multiple linear regression is an extension where more than two attributes are involved and the data are fit to a multidimensional surface.
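A minimal sketch of smoothing by linear regression, assuming simple ordinary least squares on two attributes (the names `xs` and `ys` are placeholders for the predictor and the attribute being smoothed):

```python
def linear_fit(xs, ys):
    # Ordinary least squares for the line y = a*x + b.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

def smooth_with_line(xs, ys):
    # Replace each observed y with the value predicted by the fitted line,
    # which smooths out random noise in the measurements.
    a, b = linear_fit(xs, ys)
    return [a * x + b for x in xs]
```

For data that already lies on a line, smoothing leaves it unchanged; noisy measurements are pulled onto the fitted line.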
3. Clustering
• Outliers may be detected by clustering.
• Similar values are organized into groups, or "clusters."
• Values that fall outside of the clusters may be considered outliers.
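The idea can be sketched as follows, assuming cluster centroids have already been computed (e.g., by some clustering algorithm) and that `max_dist` is a hypothetical distance threshold chosen by the analyst:

```python
def flag_outliers(values, centroids, max_dist):
    # A value is flagged as an outlier when it lies farther than
    # max_dist from every cluster centroid.
    return [v for v in values
            if min(abs(v - c) for c in centroids) > max_dist]
```

For example, with two clusters centred near 2 and 10, a lone value of 50 falls outside both groups and is flagged.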
Data Cleaning as a Process
• The first step in data cleaning is discrepancy detection.
✔ Discrepancies are caused by poorly designed data entry forms, human error in data entry, deliberate errors, data decay, inconsistent data representation, and inconsistent use of codes.
✔ Field overloading may also cause discrepancies.
• Use metadata for discrepancy detection.
Data Cleaning as a Process
• Data should be examined using unique rules, consecutive rules, and null rules.
• A unique rule says that each value of the given attribute must be different from all other values for that attribute.
• A consecutive rule says that there can be no missing values between the lowest and highest values for the attribute, and that all values must be unique.
• A null rule specifies the use of question marks, special characters, or other strings that may indicate the null condition, and how such values should be handled.
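The three kinds of rules can be sketched as simple checks in Python; the particular null markers shown are an assumption for illustration, not a standard:

```python
from collections import Counter

def unique_rule_violations(values):
    # Unique rule: report any value that occurs more than once.
    return sorted(v for v, n in Counter(values).items() if n > 1)

def consecutive_rule_holds(values):
    # Consecutive rule: all values unique, with no gaps between
    # the lowest and highest values.
    return sorted(values) == list(range(min(values), max(values) + 1))

def apply_null_rule(values, null_markers=("?", "NA", "")):
    # Null rule: map agreed-upon null markers to None for later handling.
    return [None if v in null_markers else v for v in values]
```

Checks like these are what data auditing tools automate at scale when searching for rule violations.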
Data Cleaning as a Process
• Data scrubbing tools use simple domain knowledge to detect errors and correct the data; they often employ parsing and fuzzy matching techniques.
• Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and by detecting data that violate such conditions.
Data Cleaning as a Process
• Data transformation defines and applies a series of transformations to correct discrepancies.
• Extraction/Transformation/Loading (ETL) tools allow users to specify transformations through a graphical user interface.
