CST322_Module2_Extra


CST322 MODULE 2

Part 2

Data Pre-processing – Data Cleaning
• Real-world data tend to be incomplete, noisy, and inconsistent.
• Data cleaning (or data cleansing) routines attempt to fill in missing values and smooth out noise.

1. Missing Values
• Many tuples have no recorded value for several attributes.
• The following methods are used to handle missing values:
A) Ignore the tuple
B) Fill in the missing value manually
C) Use a global constant to fill in the missing value
D) Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value
E) Use the attribute mean or median for all samples belonging to the same class as the given tuple
F) Use the most probable value to fill in the missing value

2. Noisy Data
• Noise is a random error or variance in a measured variable
• The following methods are used to handle noisy data:
❖ Binning

❖ Regression

❖ Outlier analysis
• Noisy Data - Binning
• Binning methods smooth a sorted data value by consulting its “neighbourhood,” that is, the values around it.

• The sorted values are distributed into a number of “buckets,” or bins.

• In smoothing by bin medians, each value in a bin is replaced by the bin median.

• In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is replaced by the closest boundary value.

• In the example, the data for price are first sorted and then partitioned into equal-frequency bins of size 3.

• Sorted data for price (in dollars):

• 4, 8, 15, 21, 21, 24, 25, 28, 34
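
The binning steps can be sketched in Python on the price data listed above. This is a minimal illustration, not a reference implementation: the function names are hypothetical, smoothing by bin means is included as a further common variant, and a value equally close to both boundaries is assumed to go to the lower one.

```python
from statistics import mean, median

def equal_frequency_bins(values, size):
    """Sort the values and split them into consecutive bins of `size` items."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max.

    Assumption: ties go to the lower boundary.
    """
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # each bin is already sorted
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)

print("bins:      ", bins)
print("means:     ", smooth_by_means(bins))
print("medians:   ", smooth_by_medians(bins))
print("boundaries:", smooth_by_boundaries(bins))
```

With bins of size 3, the nine prices fall into three bins, and each smoothing variant keeps the bin sizes unchanged while replacing the individual values.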


• Noisy Data – Regression
• Data smoothing can also be done by regression, a technique that conforms data
values to a function
• Linear regression involves finding the “best” line to fit two attributes (or
variables) so that one attribute can be used to predict the other.
• Multiple linear regression is an extension of linear regression, where more than
two attributes are involved and the data are fit to a multidimensional surface.
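
Smoothing by simple linear regression can be sketched as follows. This is a minimal least-squares illustration in plain Python; the function names and the sample values are hypothetical and chosen only to show the idea of replacing noisy values with points on the fitted line.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are equal; the line is undefined")
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def smooth_by_regression(xs, ys):
    """Replace each y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical noisy measurements that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
smoothed = smooth_by_regression(xs, ys)
print(slope, intercept)
print(smoothed)
```

Multiple linear regression follows the same idea with more than one predictor attribute, which in practice is usually delegated to a numerical library.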

• Noisy Data – Outlier Analysis


• Outliers may be detected by clustering, for example, where similar values are
organized into groups, or “clusters”
• Values that fall outside the set of clusters may be considered outliers
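
A very simplified sketch of cluster-based outlier detection for one-dimensional data is shown below. The clustering rule used here (start a new cluster whenever the gap to the previous sorted value exceeds `max_gap`) and the rule that clusters smaller than `min_cluster_size` are outliers are both assumptions made for illustration; real applications would normally use an established clustering algorithm.

```python
def cluster_1d(values, max_gap):
    """Group sorted values; a gap larger than `max_gap` starts a new cluster."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def find_outliers(values, max_gap, min_cluster_size=2):
    """Return values that end up in clusters smaller than `min_cluster_size`."""
    return [
        v
        for cluster in cluster_1d(values, max_gap)
        if len(cluster) < min_cluster_size
        for v in cluster
    ]

# Hypothetical readings: two dense groups and one isolated value.
readings = [10, 11, 12, 13, 50, 51, 52, 95]
clusters = cluster_1d(readings, max_gap=3)
outliers = find_outliers(readings, max_gap=3)
print(clusters)
print(outliers)
```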

Data Reduction – Dimensionality Reduction
• Data reduction techniques can be applied to obtain a reduced representation of the
data set that is much smaller in volume, yet closely maintains the integrity of the
original data
• Working with the reduced data set should be more efficient yet produce the same
(or almost the same) analytical results

• Data reduction strategies include


1. Dimensionality reduction
2. Numerosity reduction
3. Data compression

Dimensionality reduction

• Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration. Methods include:
1. Wavelet Transforms

2. Principal Components Analysis

3. Attribute Subset Selection

Wavelet Transforms

Principal Components Analysis

Attribute Subset Selection

Sampling

Data Transformation

Data Transformation by Normalization

Min-Max Normalization
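
As a hedged illustration of the technique named in this heading: min-max normalization linearly maps a value v of an attribute with observed range [min, max] to v' = (v - min) / (max - min) * (new_max - new_min) + new_min. The sketch below assumes a default target range of [0, 1]; the function name and the income figures are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min(values), max(values)] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        raise ValueError("all values are equal; min-max normalization is undefined")
    scale = (new_max - new_min) / (old_max - old_min)
    return [(v - old_min) * scale + new_min for v in values]

# Hypothetical income values.
incomes = [12000, 73600, 98000]
normalized = min_max_normalize(incomes)
print(normalized)
```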

Z-Score Normalization
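
As a hedged illustration of the technique named in this heading: z-score normalization rescales a value v using the attribute's mean and standard deviation, v' = (v - mean) / std. The sketch below assumes the population standard deviation (`statistics.pstdev`); the function name and sample values are hypothetical.

```python
from statistics import mean, pstdev

def z_score_normalize(values):
    """Centre values on the mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population, not sample, standard deviation
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(v - mu) / sigma for v in values]

# Hypothetical values with mean 5 and population standard deviation 2.
values = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = z_score_normalize(values)
print(z_scores)
```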

Decimal Scaling Normalization
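
As a hedged illustration of the technique named in this heading: decimal scaling normalizes by moving the decimal point, v' = v / 10^j, where j is the smallest integer such that the largest absolute normalized value is below 1. The function name and sample values below are hypothetical.

```python
def decimal_scaling_normalize(values):
    """Divide by the smallest power of ten that brings every |value| below 1.

    Returns the scaled values together with the exponent j that was used.
    """
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values], j

# Hypothetical attribute values ranging from -986 to 917.
values = [-986, 917]
scaled, j = decimal_scaling_normalize(values)
print(j, scaled)
```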
