
e-ISSN: 2582-5208

International Research Journal of Modernization in Engineering Technology and Science


Volume: 03 / Issue: 03 / March 2021    Impact Factor: 5.354    www.irjmets.com

OVERVIEW OF DATA PREPROCESSING


Gurram Bhaskar *1, Motati Dinesh Reddy*2
*1,2Student, Department of Computer Science and Engineering, SRM Institute of Science and
Technology, Chennai, Tamilnadu, India.
ABSTRACT
Data preprocessing is a vital initial step for anyone working with data. It is the part of the data mining process that prepares and transforms a dataset so that insights can be extracted more efficiently. Data preprocessing makes data clean and usable, providing better datasets for any business trying to derive valuable insights from its data. Cleaning, integration, transformation, and reduction are among the methods used in preprocessing. This paper explains data preprocessing, its importance, and the various steps involved.
Keywords: Data Science, Data Preprocessing, Data Cleaning, Data Reduction, Data Mining.
I. INTRODUCTION
The conversion of raw data into a clean format is known as data preprocessing. In other words, it is a preliminary step that takes all the available information and organizes, sorts, and blends it. Data science techniques attempt to retrieve information from large collections of data. These databases can become incredibly massive and typically contain data of all sorts, from comments left on social channels to numbers coming from analytics consoles. That vast amount of data is heterogeneous by nature, which means the records do not share an equivalent structure, if they have a structure at all. Data preprocessing is important in any data mining process because it directly affects the project's success rate. Data in the real world is often unclean, which decreases the quality of the data under study. Data is said to be unclean if it contains missing attributes or attribute values, noise or outliers, or redundant or incorrect records. If these problems are present, the quality of the results will decrease.
II. STEPS IN DATA PREPROCESSING
Data preprocessing is divided into four steps.

Fig. 1: Data preprocessing steps

Data Cleaning
Data cleaning is a major step in machine learning and plays a vital role in building a good model. It is also one of the most neglected steps. Proper data cleaning gives us the best possible output, and most data scientists spend the bulk of their time on it. If we clean our data set well, we can often get the desired result with just a simple algorithm, and the subsequent steps become easier. Different data sets call for different cleaning approaches.
The steps involved in data cleaning are shown below.

Fig. 2: Data cleaning types
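As a minimal sketch of these cleaning steps, the snippet below (using pandas on an entirely hypothetical dataset; the column names, imputation choices, and outlier range are illustrative assumptions) removes duplicate rows, imputes missing values, and drops an implausible outlier:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset exhibiting the problems described above:
# a duplicate row, missing values, and an implausible outlier.
df = pd.DataFrame({
    "age":    [25, 25, np.nan, 40, 200],          # 200 is an outlier
    "salary": [50000, 50000, 62000, np.nan, 58000],
})

df = df.drop_duplicates()                               # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())        # impute missing ages
df["salary"] = df["salary"].fillna(df["salary"].mean()) # impute missing salaries
df = df[df["age"].between(0, 120)]                      # drop impossible ages

print(df)
```

After cleaning, the duplicate and the outlier row are gone and no missing values remain, so even a simple downstream algorithm sees a consistent table.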


Data Integration
Combining data collected from different sources into a single unit is known as data integration. Ingestion is the initial step of data integration. Data integration enables analytical tools to produce the best business intelligence. Steps that come under data integration are cleansing, ETL mapping, and transformation.
The common ways to integrate data are mentioned below.
1. Data consolidation: The process of collecting data from different sources, combining it, and storing it in one place. It is an especially important step in data integration and data management.
2. Data propagation: Copying data from one location to another using different methods is called data propagation. It can be synchronous or asynchronous.
3. Data virtualization: An approach to data management that allows an application to retrieve and manipulate data without requiring technical details, such as how it is formatted at the source or where it is physically located, and that provides a single unified view of all the data available to us.

Fig. 3: Data integration
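A minimal data consolidation sketch, assuming pandas and two hypothetical sources (a CRM export and a billing system; all names and values are invented for illustration):

```python
import pandas as pd

# Two hypothetical sources describing the same customers.
crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "name": ["Ana", "Ben", "Caro"]})
billing = pd.DataFrame({"customer_id": [2, 3, 4],
                        "total_spent": [120.0, 75.5, 30.0]})

# Data consolidation: combine both sources into a single unit,
# keeping every customer seen in either system (outer join).
combined = crm.merge(billing, on="customer_id", how="outer")
print(combined)
```

The outer join makes the gaps visible: customers known to only one system appear with missing values in the other system's columns, which the cleaning step can then handle.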

Data Transformation
In this step, data is normalized and generalized. Normalization is a process that ensures no data is redundant, that it is all stored in a single place, and that all dependencies are logical.
Data transformation involves these steps:
1. Smoothing: Removing noise from the data set using an algorithm is known as smoothing. Smoothing also helps in identifying the most important features and finding patterns, since the collected data can be manipulated to reduce variance or noise.
2. Aggregation: Presenting stored data in a summary format is known as aggregation. The data can be collected from multiple sources and integrated for data analysis. Aggregation is a crucial step, since the accuracy of the analysis depends on the quantity and quality of the data used. Therefore, gathering accurate data is especially important.
3. Discretization: Conversion of continuous data into a set of small intervals is known as discretization. Discretization is one of the most required data mining activities in the real world; it can significantly improve efficiency by replacing continuous attributes with discrete values.
4. Attribute construction: The main goal of attribute construction is to create a new attribute from existing or original ones. Attributes represent different features of the data; for example, a body mass index attribute can be constructed from existing height and weight attributes.
5. Generalization: Conversion of low-level attributes to high-level attributes using a concept hierarchy. For example, age values initially in numerical form (18, 25) are converted into categorical values (young, old).
6. Normalization: Conversion of data variables into a given range. Normalization is necessary when attributes are on different scales.

Fig. 4: Data transformation
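The normalization, discretization, and generalization steps above can be sketched with pandas on a hypothetical age column (the bin edges and category labels are illustrative assumptions, not prescribed values):

```python
import pandas as pd

df = pd.DataFrame({"age": [18, 25, 47, 63]})

# Normalization: min-max scale age into the [0, 1] range.
df["age_scaled"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())

# Discretization + generalization: map numeric age onto a small
# concept hierarchy of categorical labels.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                         labels=["young", "middle", "old"])
print(df)
```

The youngest age maps to 0.0 and the oldest to 1.0 under min-max scaling, while `pd.cut` converts the continuous values into the categorical groups the Generalization step describes.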


Data Reduction
Data reduction produces a reduced data set that is smaller in volume yet yields (nearly) the same analytical results.
Methods used to reduce the volume of data are given below:
1. Missing value ratio: Attributes with a higher proportion of missing values than a threshold are removed.
2. Low variance filter: Calculate the variance of each column and remove columns whose variance falls below a given threshold. Variance can only be calculated for numerical columns.
3. High correlation filter: Attributes whose correlation coefficients exceed a threshold are removed, since similar trends mean similar information is carried.
4. Principal component analysis: A method of reducing the total number of variables in the data by extracting the important components from a large pool. It reduces the dimensions of the data while aiming to preserve as much information as possible.
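A minimal sketch of the first three reduction filters, assuming pandas and NumPy; the column names, data, and thresholds (50% missing, variance 1e-6, correlation 0.95) are all illustrative:

```python
import pandas as pd
import numpy as np

# Hypothetical data: "b" is mostly missing, "c" is constant,
# and "d" is perfectly correlated with "a".
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, np.nan, np.nan],
    "c": [5.0, 5.0, 5.0, 5.0],
    "d": [2.0, 4.0, 6.0, 8.0],
})

# 1. Missing value ratio: drop columns with more than 50% missing values.
df = df.loc[:, df.isna().mean() <= 0.5]

# 2. Low variance filter: drop columns with (near-)zero variance.
df = df.loc[:, df.var() > 1e-6]

# 3. High correlation filter: for each highly correlated pair,
#    drop one of the two columns.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)

print(df.columns.tolist())
```

Each filter removes one column here: "b" by missing ratio, "c" by low variance, and "d" by high correlation with "a", leaving a single informative attribute.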

III. IMPORTANCE OF DATA PREPROCESSING
In data mining, the success rate of a project is directly related to data preprocessing. Preprocessing reduces the complexity of the data under analysis, because real-world data is often dirty. To get a more accurate outcome we need to repair all the mistakes, redundancies, missing values, and inconsistencies. Thus, before using the data, we need to organize and clean it. There are several ways to do so, and the process depends on the data you are handling. In general, you would use the techniques above to obtain a better data set.
IV. CONCLUSION
In data processing, there is a vast number of preprocessing techniques that help to remove unwanted data and produce organized data. These techniques are often neglected, which ends up causing problems. So, we need to apply them to get the best results.

