CST322_Module2_Extra
Part 2
Data Pre-processing – Data Cleaning
• Real-world data tend to be incomplete, noisy, and inconsistent.
• Data cleaning (or data cleansing) routines attempt to do the following:
• Fill in missing values
• Smooth out noise
1. Missing Values
• Many tuples have no recorded value for several attributes.
• The following methods are used to handle missing values:
• A) Ignore the tuple
B) Fill in the missing value manually
1. Use a global constant to fill in the missing value
2. Use a measure of central tendency for the attribute (e.g., mean or median) to fill in the missing value
C) Use the attribute mean or median for all the samples belonging to the same
class as the given tuple
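The automatic strategies listed above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the toy records, the attribute name `income`, the constant `-1.0`, and all function names are assumptions made for the example.

```python
from statistics import mean, median

# Hypothetical toy records: (class_label, income). None marks a missing value.
records = [
    ("low", 20.0),
    ("low", None),
    ("low", 30.0),
    ("high", 80.0),
    ("high", None),
    ("high", 100.0),
]

def ignore_tuples(rows):
    """Ignore the tuple: drop every record whose value is missing."""
    return [(c, v) for c, v in rows if v is not None]

def fill_constant(rows, constant=-1.0):
    """Replace each missing value with one global constant (assumed here: -1.0)."""
    return [(c, constant if v is None else v) for c, v in rows]

def fill_central(rows, measure=mean):
    """Replace each missing value with the attribute mean (or median)."""
    known = [v for _, v in rows if v is not None]
    fill = measure(known)
    return [(c, fill if v is None else v) for c, v in rows]

def fill_class_central(rows, measure=mean):
    """Replace each missing value with the mean (or median) of its own class."""
    by_class = {}
    for c, v in rows:
        if v is not None:
            by_class.setdefault(c, []).append(v)
    fills = {c: measure(vs) for c, vs in by_class.items()}
    return [(c, fills[c] if v is None else v) for c, v in rows]
```

For the toy data, `fill_central(records)` fills both gaps with the overall mean 57.5, while `fill_class_central(records)` fills the "low" gap with 25.0 and the "high" gap with 90.0, which shows why the class-aware variant usually distorts the data less.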
2. Noisy Data
• Noise is a random error or variance in a measured variable
• The following methods are used to handle noisy data:
❖ Binning
❖ Regression
❖ Outlier analysis
• Noisy Data - Binning
• Binning methods smooth a sorted data value by consulting its “neighbourhood,” that is, the values around it.
• In smoothing by bin medians, each bin value is replaced by the bin median.
• In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is replaced by the closest boundary value.
• In the example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3.
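The binning steps can be sketched as follows. The price values are illustrative and not taken from the slide, the function names are hypothetical, and breaking ties toward the lower boundary is an assumption of this sketch.

```python
from statistics import median

def equal_frequency_bins(values, size):
    """Sort the values, then cut them into consecutive bins of `size` items."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def smooth_by_medians(bins):
    """Replace every value in a bin with that bin's median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are already sorted
        # Assumption: a value equally far from both boundaries goes to the lower one.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

# Illustrative price data (an assumption for this example).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_median = smooth_by_medians(bins)
by_boundary = smooth_by_boundaries(bins)
```

With these prices the bins are [4, 8, 15], [21, 21, 24], and [25, 28, 34]. Smoothing by medians turns the first bin into [8, 8, 8], and smoothing by boundaries turns it into [4, 4, 15].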
Data Reduction – Dimensionality Reduction
• Data reduction techniques can be applied to obtain a reduced representation of the
data set that is much smaller in volume, yet closely maintains the integrity of the
original data
• Working with the reduced data set should be more efficient yet produce the same
(or almost the same) analytical results
Dimensionality Reduction
Wavelet Transforms
Principal Components Analysis
Attribute Subset Selection
Sampling
Data Transformation
Data Transformation by Normalization
Min-Max Normalization
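A minimal sketch of min-max normalization, assuming the standard formula v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A, which maps attribute A linearly onto a new range. The function name, the default target range [0.0, 1.0], and the sample values are assumptions for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # The formula divides by (max - min), so a constant attribute is undefined.
        raise ValueError("min-max normalization needs at least two distinct values")
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]

# Illustrative values only.
scaled = min_max_normalize([10, 20, 30])
```

Here 10 maps to 0.0, 20 to 0.5, and 30 to 1.0. A later value outside the original [min, max] range would fall outside the target range, which is a known limitation of this method.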
Z-Score Normalization
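A minimal sketch of z-score normalization, assuming the standard formula v' = (v - mean_A) / std_A. Using the population standard deviation (`statistics.pstdev`) is a choice made for this sketch; the function name and sample values are also assumptions.

```python
from statistics import mean, pstdev

def z_score_normalize(values):
    """Center values on the mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so there is no spread to scale by.
        raise ValueError("z-score normalization needs a non-zero standard deviation")
    return [(v - mu) / sigma for v in values]

# Illustrative values only: mean 5, population standard deviation 2.
z = z_score_normalize([2, 4, 4, 4, 5, 5, 7, 9])
```

For these values the result is -1.5 for 2, 0.0 for 5, and 2.0 for 9. Unlike min-max normalization, this method does not need the true minimum and maximum of the attribute.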
Decimal Scaling Normalization
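A minimal sketch of decimal scaling normalization, assuming the standard formula v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1. The function name and sample values are assumptions for illustration.

```python
def decimal_scaling_normalize(values):
    """Divide by the smallest power of ten that brings every |value| below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

# Illustrative values only.
scaled, j = decimal_scaling_normalize([-986, 917])
```

The largest absolute value here is 986, so j is 3 and the values become -0.986 and 0.917.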