
e-ISSN: 2582-5208

International Research Journal of Modernization in Engineering Technology and Science


Volume: 03 / Issue: 03 / March 2021    Impact Factor: 5.354    www.irjmets.com

OVERVIEW OF DATA PREPROCESSING


Gurram Bhaskar *1, Motati Dinesh Reddy*2
*1,2Student, Department of Computer Science and Engineering, SRM Institute of Science and
Technology, Chennai, Tamilnadu, India.
ABSTRACT
Data preprocessing is a vital initial step for anyone working with data. It is the part of the data mining process that prepares and transforms a dataset so that insights can be extracted more efficiently. Data preprocessing makes data clean and usable, providing better datasets for any business trying to derive valuable insights from its data. Cleaning, integration, transformation, and reduction are among the methods used in preprocessing. This paper explains data preprocessing, its importance, and the various steps involved.
Keywords: Data Science, Data Preprocessing, Data Cleaning, Data Reduction, Data Mining.
I. INTRODUCTION
The conversion of raw data into a clean format is known as data preprocessing. In other words, it is a preliminary step that takes all the available information and organizes, sorts, and blends it. Data science techniques attempt to retrieve information from large collections of data. These databases can become incredibly massive and typically contain data of all sorts, from comments left on social channels to numbers coming from analytics consoles. That vast amount of data is heterogeneous by nature, which means the records do not share an equivalent structure, if they have a structure at all. Data preprocessing is important in any data mining process because it directly affects the project's success rate. Data in the real world is often unclean, which decreases the quality of the data under study. Data is said to be unclean if it contains missing attributes or attribute values, noise or outliers, or redundant or incorrect records. If these problems are present, the quality of the results will decrease.
II. STEPS IN DATA PREPROCESSING
Data preprocessing is divided into four steps.

Fig. 1: Data preprocessing steps

Data Cleaning
Data cleaning is a major step in machine learning and plays a vital role in building a good model. It is also one of the most neglected steps. Proper data cleaning gives us the best possible output, and most data scientists spend the bulk of their time on it. If we clean our data set well, we can often get the desired result with just a simple algorithm, and the subsequent steps become easier. Different data sets call for different cleaning approaches.
The steps involved in data cleaning are shown below.

Fig. 2: Data cleaning types
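As a minimal sketch of these cleaning steps, the snippet below (using pandas on an entirely hypothetical dataset; the column names, imputation choices, and outlier range are illustrative assumptions) removes duplicate rows, imputes missing values, and drops an implausible outlier:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset exhibiting the problems described above:
# a duplicate row, missing values, and an implausible outlier.
df = pd.DataFrame({
    "age":    [25, 25, np.nan, 40, 200],          # 200 is an outlier
    "salary": [50000, 50000, 62000, np.nan, 58000],
})

df = df.drop_duplicates()                               # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())        # impute missing ages
df["salary"] = df["salary"].fillna(df["salary"].mean()) # impute missing salaries
df = df[df["age"].between(0, 120)]                      # drop impossible ages

print(df)
```

After cleaning, the duplicate and the outlier row are gone and no missing values remain, so even a simple downstream algorithm sees a consistent table.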


Data Integration
Combining data collected from different sources into a single unit is known as data integration. Ingestion is the initial step of data integration. Data integration enables analytical tools to produce the best business intelligence. Steps that come under data integration are cleansing, ETL mapping, and transformation.
The common ways to integrate data are mentioned below.
1. Data consolidation: The process of collecting data from different sources, combining it, and storing it in one place. It is an especially important step in data integration and data management.
2. Data propagation: Copying data from one location to another using different methods is called data propagation. It can be synchronous or asynchronous.
3. Data virtualization: An approach to data management that allows an application to retrieve and manipulate data without requiring technical details, such as how it is formatted at the source or where it is physically located, and that provides a single unified view of all the data available to us.

Fig. 3: Data integration
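A minimal data consolidation sketch, assuming pandas and two hypothetical sources (a CRM export and a billing system; all names and values are invented for illustration):

```python
import pandas as pd

# Two hypothetical sources describing the same customers.
crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "name": ["Ana", "Ben", "Caro"]})
billing = pd.DataFrame({"customer_id": [2, 3, 4],
                        "total_spent": [120.0, 75.5, 30.0]})

# Data consolidation: combine both sources into a single unit,
# keeping every customer seen in either system (outer join).
combined = crm.merge(billing, on="customer_id", how="outer")
print(combined)
```

The outer join makes the gaps visible: customers known to only one system appear with missing values in the other system's columns, which the cleaning step can then handle.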

Data Transformation
In this step, data is normalized and generalized. Normalization is a process that ensures no data is redundant, that it is all stored in a single place, and that all dependencies are logical.
Data transformation involves these steps:
1. Smoothing: Removing noise from the data set using an algorithm is known as smoothing. Smoothing also helps in identifying the most important features and finding patterns, since the collected data can be manipulated to reduce variance or noise.
2. Aggregation: Presenting stored data in a summary format is known as aggregation. The data can be collected from multiple sources and integrated for data analysis. Aggregation is a crucial step, since the accuracy of the analysis depends on the quantity and quality of the data used. Therefore, gathering accurate data is especially important.
3. Discretization: Conversion of continuous data into a set of small intervals is known as discretization. Discretization is one of the most required data mining activities in the real world; it can significantly improve efficiency by replacing continuous attributes with discrete values.
4. Attribute construction: The main goal of attribute construction is to create a new attribute from existing or original ones. Attributes represent different features of the data; for example, a body mass index attribute can be constructed from existing height and weight attributes.
5. Generalization: Conversion of low-level attributes to high-level attributes using a concept hierarchy. For example, age values initially in numerical form (18, 25) are converted into categorical values (young, old).
6. Normalization: Conversion of data variables into a given range. Normalization is necessary when attributes are on different scales.

Fig. 4: Data transformation
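The normalization, discretization, and generalization steps above can be sketched with pandas on a hypothetical age column (the bin edges and category labels are illustrative assumptions, not prescribed values):

```python
import pandas as pd

df = pd.DataFrame({"age": [18, 25, 47, 63]})

# Normalization: min-max scale age into the [0, 1] range.
df["age_scaled"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())

# Discretization + generalization: map numeric age onto a small
# concept hierarchy of categorical labels.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                         labels=["young", "middle", "old"])
print(df)
```

The youngest age maps to 0.0 and the oldest to 1.0 under min-max scaling, while `pd.cut` converts the continuous values into the categorical groups the Generalization step describes.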


Data Reduction
Data reduction produces a reduced data set that is smaller in volume yet yields (nearly) the same analytical results.
Methods used to reduce the volume of data are given below:
1. Missing value ratio: Attributes with a higher proportion of missing values than a threshold are removed.
2. Low variance filter: Calculate the variance of each column and remove columns whose variance falls below a given threshold. Variance can only be calculated for numerical columns.
3. High correlation filter: Attributes whose correlation coefficients exceed a threshold are removed, since similar trends mean similar information is carried.
4. Principal component analysis: A method of reducing the total number of variables in the data by extracting the important components from a large pool. It reduces the dimensions of the data while aiming to preserve as much information as possible.
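A minimal sketch of the first three reduction filters, assuming pandas and NumPy; the column names, data, and thresholds (50% missing, variance 1e-6, correlation 0.95) are all illustrative:

```python
import pandas as pd
import numpy as np

# Hypothetical data: "b" is mostly missing, "c" is constant,
# and "d" is perfectly correlated with "a".
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, np.nan, np.nan],
    "c": [5.0, 5.0, 5.0, 5.0],
    "d": [2.0, 4.0, 6.0, 8.0],
})

# 1. Missing value ratio: drop columns with more than 50% missing values.
df = df.loc[:, df.isna().mean() <= 0.5]

# 2. Low variance filter: drop columns with (near-)zero variance.
df = df.loc[:, df.var() > 1e-6]

# 3. High correlation filter: for each highly correlated pair,
#    drop one of the two columns.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)

print(df.columns.tolist())
```

Each filter removes one column here: "b" by missing ratio, "c" by low variance, and "d" by high correlation with "a", leaving a single informative attribute.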

III. IMPORTANCE OF DATA PREPROCESSING
In data mining, the success rate of a project is directly related to data preprocessing. Preprocessing reduces the complexity of the data under analysis, because real-world data is often dirty. To get a more accurate outcome we need to repair all the mistakes, redundancies, missing values, and inconsistencies. Thus, before using the data, we need to organize and clean it. There are several ways to do so, and the process depends on the data you are handling. In general, you would use the techniques above to obtain a better data set.
IV. CONCLUSION
In data processing, there is a vast number of preprocessing techniques that help to remove unwanted data and produce organized data. These techniques are often neglected, which ends up causing problems. So, we need to apply them to get the best results.

