Data Preprocessing in Data Mining

Preprocessing in Data Mining:


Data preprocessing is a data mining technique used to transform raw data into a useful
and efficient format.

Steps Involved in Data Preprocessing:


1. Data Cleaning:
Raw data can have many irrelevant and missing parts. Data cleaning is done to
handle these issues. It involves handling missing data, noisy data, etc.

(a). Missing Data:

This situation arises when some values are missing from the dataset. It can be
handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large
and multiple values are missing within a tuple.
2. Fill the Missing values:
There are various ways to do this. You can choose to fill the
missing values manually, with the attribute mean, or with the most
probable value (see the sketch after this list).
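
For illustration, here is a minimal Python sketch of filling missing values, assuming pandas and a small hypothetical dataset with an "age" and a "city" column:

import pandas as pd

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age": [25, None, 31, 40, None],
    "city": ["Delhi", "Mumbai", None, "Delhi", "Mumbai"],
})

# Numeric attribute: fill with the attribute mean
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical attribute: fill with the most probable (most frequent) value
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)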
(b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It
can be generated by faulty data collection, data entry errors, etc. It can be
handled in the following ways:

1. Binning Method:
This method works on sorted data in order to smooth it. The whole
data is divided into segments of equal size, and each segment is
handled separately. One can replace all values in a segment with the
segment mean, or use the segment boundary values (see the sketch
after this list).
2. Regression:
Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent
variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers either
remain undetected or fall outside the clusters.
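
As a sketch of the binning method (smoothing by bin means), assuming NumPy and the small hypothetical value list shown below:

import numpy as np

# Hypothetical sorted attribute values
values = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))

# Divide the data into equal-size segments (here: 3 bins of 4 values each)
bins = np.array_split(values, 3)

# Smoothing by bin means: replace every value in a bin with the bin mean
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)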

2. Data Transformation:
This step is taken in order to transform the data into forms suitable for the
mining process. It involves the following ways:

1. Normalization:
It is done in order to scale the data values into a specified range, such
as -1.0 to 1.0 or 0.0 to 1.0 (normalization and discretization are both
shown in the sketch after this list).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute with
interval levels or conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy.
For example, the attribute “city” can be generalized to “country”.
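
As a sketch of normalization and discretization, assuming pandas and a hypothetical "income" attribute:

import pandas as pd

# Hypothetical numeric attribute
df = pd.DataFrame({"income": [12000, 35000, 48000, 73000, 99000]})

# Min-max normalization: scale the values into the range 0.0 to 1.0
lo, hi = df["income"].min(), df["income"].max()
df["income_norm"] = (df["income"] - lo) / (hi - lo)

# Discretization: replace raw values with interval/conceptual levels
df["income_level"] = pd.cut(df["income"], bins=3, labels=["low", "medium", "high"])

print(df)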

3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis
becomes harder as the volume of data grows. Data reduction is used to deal
with this. It aims to increase storage efficiency and reduce data storage and
analysis costs.
The various steps of data reduction are:
1. Data Cube Aggregation:
An aggregation operation is applied to the data to construct the
data cube.
2. Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be
discarded. For performing attribute selection, one can use the
significance level and the p-value of the attribute: attributes with a
p-value greater than the significance level can be discarded (see the
sketch after this list).
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data,
for example: regression models.
4. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be
lossy or lossless. If the original data can be retrieved after
reconstruction from the compressed data, the reduction is called
lossless; otherwise it is called lossy. Two effective methods of
dimensionality reduction are wavelet transforms and PCA
(Principal Component Analysis).
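
As a sketch of attribute subset selection with a significance level, assuming statsmodels and hypothetical attributes named "relevant" and "noise":

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical attributes and target
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "relevant": rng.normal(size=100),
    "noise": rng.normal(size=100),
})
y = 3 * X["relevant"] + rng.normal(scale=0.1, size=100)

# Fit an ordinary least squares model and inspect the p-value of each attribute
model = sm.OLS(y, sm.add_constant(X)).fit()
pvalues = model.pvalues.drop("const")
print(pvalues)

# Discard attributes whose p-value exceeds the significance level (here 0.05)
X_reduced = X[pvalues[pvalues <= 0.05].index]
print(list(X_reduced.columns))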

The Curse of Dimensionality


This refers to the phenomenon that data analysis tasks generally become
significantly harder as the dimensionality of the data increases. As the
dimensionality increases, the number of planes occupied by the data increases,
adding more and more sparsity to the data, which makes it difficult to model
and visualize.
What dimensionality reduction essentially does is map the dataset to a
lower-dimensional space, which may well have few enough dimensions to be
visualized, say 2D. The basic objective of the techniques used for this purpose
is to reduce the dimensionality of a dataset by creating new features that are
combinations of the old features. In other words, the higher-dimensional
feature space is mapped to a lower-dimensional feature space. Principal
Component Analysis and Singular Value Decomposition are two widely
accepted techniques.
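
For example, a minimal NumPy sketch of Singular Value Decomposition mapping hypothetical 10-dimensional data down to 2D:

import numpy as np

# Hypothetical high-dimensional data: 200 samples with 10 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Center the data, then apply Singular Value Decomposition
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Keep the top 2 right singular vectors to map the data to a 2D feature space
X_2d = X_centered @ Vt[:2].T
print(X_2d.shape)  # (200, 2)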

Feature Encoding

As mentioned before, the whole purpose of data preprocessing is to encode the
data in order to bring it to a state that the machine can understand.

Feature encoding is basically performing transformations on the data such that
it can easily be accepted as input for machine learning algorithms while still
retaining its original meaning.

[Image: one-hot encoding explained with an example]
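
As a sketch of one-hot encoding, assuming pandas and a hypothetical "city" attribute:

import pandas as pd

# Hypothetical categorical attribute
df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Chennai", "Delhi"]})

# One-hot encoding: each category becomes its own binary column
encoded = pd.get_dummies(df, columns=["city"], prefix="city")
print(encoded)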


Principal Component Analysis
Principal Component Analysis is an unsupervised learning algorithm used for
dimensionality reduction in machine learning. It is a statistical process that
converts observations of correlated features into a set of linearly uncorrelated
features with the help of an orthogonal transformation. These new transformed
features are called the Principal Components.

PCA generally tries to find a lower-dimensional surface onto which to project the
high-dimensional data.

PCA works by considering the variance of each attribute, because high variance
indicates a good split between the classes, and this is how it reduces the
dimensionality. Some real-world applications of PCA are image processing,
movie recommendation systems, and optimizing power allocation in various
communication channels. It is a feature extraction technique, so it retains the
important variables and drops the least important ones.
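
As a sketch of PCA for dimensionality reduction, assuming scikit-learn and hypothetical correlated data:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 100 samples with 5 correlated features
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])

# Project onto the first 2 principal components (linearly uncorrelated features)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(X_pca.shape)                    # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance captured by each component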
