Data Preprocessing in Data Mining

Preprocessing in Data Mining:


Data preprocessing is a data mining technique used to transform raw data into a useful
and efficient format.

Steps Involved in Data Preprocessing:


1. Data Cleaning:
Raw data can have many irrelevant and missing parts. Data cleaning is done to
handle these issues. It involves handling missing data, noisy data, etc.

(a). Missing Data:

This situation arises when some values are missing from the dataset. It can be
handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large
and multiple values are missing within a tuple.
2. Fill the Missing values:
There are various ways to do this. You can choose to fill the
missing values manually, with the attribute mean, or with the most
probable value (see the sketch after this list).
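
For illustration, here is a minimal Python sketch of filling missing values, assuming pandas and a small hypothetical dataset with an "age" and a "city" column:

import pandas as pd

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age": [25, None, 31, 40, None],
    "city": ["Delhi", "Mumbai", None, "Delhi", "Mumbai"],
})

# Numeric attribute: fill with the attribute mean
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical attribute: fill with the most probable (most frequent) value
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)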
(b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It
can be generated by faulty data collection, data entry errors, etc. It can be
handled in the following ways:

1. Binning Method:
This method works on sorted data in order to smooth it. The whole
data is divided into segments of equal size, and each segment is
handled separately. One can replace all values in a segment with the
segment mean, or use the segment boundary values (see the sketch
after this list).
2. Regression:
Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent
variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers either
remain undetected or fall outside the clusters.
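
As a sketch of the binning method (smoothing by bin means), assuming NumPy and the small hypothetical value list shown below:

import numpy as np

# Hypothetical sorted attribute values
values = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))

# Divide the data into equal-size segments (here: 3 bins of 4 values each)
bins = np.array_split(values, 3)

# Smoothing by bin means: replace every value in a bin with the bin mean
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)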

2. Data Transformation:
This step is taken in order to transform the data into forms suitable for the
mining process. It involves the following ways:

1. Normalization:
It is done in order to scale the data values into a specified range, such
as -1.0 to 1.0 or 0.0 to 1.0 (normalization and discretization are both
shown in the sketch after this list).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute with
interval levels or conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy.
For example, the attribute “city” can be generalized to “country”.
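
As a sketch of normalization and discretization, assuming pandas and a hypothetical "income" attribute:

import pandas as pd

# Hypothetical numeric attribute
df = pd.DataFrame({"income": [12000, 35000, 48000, 73000, 99000]})

# Min-max normalization: scale the values into the range 0.0 to 1.0
lo, hi = df["income"].min(), df["income"].max()
df["income_norm"] = (df["income"] - lo) / (hi - lo)

# Discretization: replace raw values with interval/conceptual levels
df["income_level"] = pd.cut(df["income"], bins=3, labels=["low", "medium", "high"])

print(df)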

3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis
becomes harder as the volume of data grows. Data reduction is used to deal
with this. It aims to increase storage efficiency and reduce data storage and
analysis costs.
The various steps of data reduction are:
1. Data Cube Aggregation:
An aggregation operation is applied to the data to construct the
data cube.
2. Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be
discarded. For performing attribute selection, one can use the
significance level and the p-value of the attribute: attributes with a
p-value greater than the significance level can be discarded (see the
sketch after this list).
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data,
for example: regression models.
4. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be
lossy or lossless. If the original data can be retrieved after
reconstruction from the compressed data, the reduction is called
lossless; otherwise it is called lossy. Two effective methods of
dimensionality reduction are wavelet transforms and PCA
(Principal Component Analysis).
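
As a sketch of attribute subset selection with a significance level, assuming statsmodels and hypothetical attributes named "relevant" and "noise":

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical attributes and target
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "relevant": rng.normal(size=100),
    "noise": rng.normal(size=100),
})
y = 3 * X["relevant"] + rng.normal(scale=0.1, size=100)

# Fit an ordinary least squares model and inspect the p-value of each attribute
model = sm.OLS(y, sm.add_constant(X)).fit()
pvalues = model.pvalues.drop("const")
print(pvalues)

# Discard attributes whose p-value exceeds the significance level (here 0.05)
X_reduced = X[pvalues[pvalues <= 0.05].index]
print(list(X_reduced.columns))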

The Curse of Dimensionality


This refers to the phenomenon that data analysis tasks generally become
significantly harder as the dimensionality of the data increases. As the
dimensionality increases, the number of planes occupied by the data increases,
adding more and more sparsity to the data, which makes it difficult to model
and visualize.
What dimensionality reduction essentially does is map the dataset to a
lower-dimensional space, which may well have few enough dimensions to be
visualized, say 2D. The basic objective of the techniques used for this purpose
is to reduce the dimensionality of a dataset by creating new features that are
combinations of the old features. In other words, the higher-dimensional
feature space is mapped to a lower-dimensional feature space. Principal
Component Analysis and Singular Value Decomposition are two widely
accepted techniques.
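
For example, a minimal NumPy sketch of Singular Value Decomposition mapping hypothetical 10-dimensional data down to 2D:

import numpy as np

# Hypothetical high-dimensional data: 200 samples with 10 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Center the data, then apply Singular Value Decomposition
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Keep the top 2 right singular vectors to map the data to a 2D feature space
X_2d = X_centered @ Vt[:2].T
print(X_2d.shape)  # (200, 2)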

Feature Encoding

As mentioned before, the whole purpose of data preprocessing is to encode the
data in order to bring it to a state that the machine can understand.

Feature encoding is basically performing transformations on the data such that
it can easily be accepted as input for machine learning algorithms while still
retaining its original meaning.

[Image: one-hot encoding explained with an example]
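
As a sketch of one-hot encoding, assuming pandas and a hypothetical "city" attribute:

import pandas as pd

# Hypothetical categorical attribute
df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Chennai", "Delhi"]})

# One-hot encoding: each category becomes its own binary column
encoded = pd.get_dummies(df, columns=["city"], prefix="city")
print(encoded)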


Principal Component Analysis
Principal Component Analysis is an unsupervised learning algorithm used for
dimensionality reduction in machine learning. It is a statistical process that
converts observations of correlated features into a set of linearly uncorrelated
features with the help of an orthogonal transformation. These new transformed
features are called the Principal Components.

PCA generally tries to find a lower-dimensional surface onto which to project the
high-dimensional data.

PCA works by considering the variance of each attribute, because high variance
indicates a good split between the classes, and this is how it reduces the
dimensionality. Some real-world applications of PCA are image processing,
movie recommendation systems, and optimizing power allocation in various
communication channels. It is a feature extraction technique, so it retains the
important variables and drops the least important ones.
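
As a sketch of PCA for dimensionality reduction, assuming scikit-learn and hypothetical correlated data:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 100 samples with 5 correlated features
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])

# Project onto the first 2 principal components (linearly uncorrelated features)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(X_pca.shape)                    # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance captured by each component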
