Understanding Data Preprocessing
Harshita Singh
Published in Towards Data Science · 6 min read · May 13, 2020
Photo by Franki Chamaki on Unsplash
https://towardsdatascience.com/data-preprocessing-e2b0bed4e7f0
Data preprocessing is an important task. It is a data mining technique that
transforms raw data into a more understandable, useful and efficient
format.
Data has a better idea. This idea becomes clearer and more understandable after preprocessing.
Real-world data is generally:

Incomplete: Certain attributes or values (or both) are missing, or only aggregate data is available.

Noisy: Data contains errors or outliers.

Inconsistent: Data contains discrepancies in codes, names, etc.
Tasks in data preprocessing
1. Data Cleaning: Also known as scrubbing. This task involves filling in missing values, smoothing or removing noisy data and outliers, and resolving inconsistencies.
2. Data Integration: This task involves integrating data from multiple
sources such as databases (relational and non-relational), data cubes,
files, etc. The data sources can be homogeneous or heterogeneous. The
data obtained from the sources can be structured, unstructured or semi-
structured in format.
3. Data Transformation: This involves normalisation and aggregation of
data according to the needs of the data set.
4. Data Reduction: During this step, the data is reduced: the number of records, attributes or dimensions can be decreased. Reduction is performed keeping in mind that the reduced data should produce the same results as the original data.
5. Data Discretization: This is considered a part of data reduction, in which numerical attributes are replaced with nominal (categorical) ones.
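Min-max normalisation, one of the transformations mentioned in step 3, can be sketched with nothing but the standard library. This is a minimal illustration, and the ages column is hypothetical:

```python
def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1] (min-max normalisation)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: nothing to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [18, 30, 45, 60]  # hypothetical attribute values
print(min_max_scale(ages))  # smallest maps to 0.0, largest to 1.0
```

In practice a library routine (e.g. a scaler from a machine learning toolkit) would be used, but the arithmetic is exactly this.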
Data Cleaning
The data cleaning process detects and removes the errors and
inconsistencies present in the data and improves its quality. Data quality
problems occur due to misspellings during data entry, missing values or any
other invalid data. Basically, “dirty” data is transformed into clean data.
“Dirty” data does not produce accurate or good results: garbage in gives garbage out. So it becomes very important to handle such data. Professionals spend a large share of their time on this step.
Reasons for “dirty” or “unclean” data
1. Dummy values
2. Absence of data
3. Violation of business rules
4. Data integration problems
5. Contradicting data
6. Inappropriate use of address line
7. Reused primary keys
8. Non-unique identifiers
What to do to clean data?
1. Handle Missing Values
2. Handle Noise and Outliers
3. Remove Unwanted data
Handle Missing Values
Missing values cannot be overlooked in a data set; they must be handled. Also, many models do not accept missing values. There are several techniques to handle missing data, and choosing the right one is of utmost importance: the choice depends on the problem domain and the goal of the data mining process. The different ways to handle missing data are:
1. Ignore the data row: This method is suggested for records where most of the data is missing, rendering the record meaningless. It is usually avoided when only a few attribute values are missing. If all rows with missing values are ignored, i.e. removed, the result is poor performance.
2. Fill the missing values manually: This is a very time-consuming method
and hence infeasible for almost all scenarios.
3. Use a global constant to fill in for missing values: A global constant like
“NA” or 0 can be used to fill all the missing data. This method is used
when missing values are difficult to be predicted.
4. Use attribute mean or median: The mean or median of the attribute is used to
fill the missing value.
5. Use forward fill or backward fill: Either the previous value or the next value is used to fill the missing value. A mean of the previous and succeeding values may also be used.
6. Use a data-mining algorithm to predict the most probable value.
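A few of these techniques (attribute-mean imputation from method 4 and forward fill from method 5) can be sketched in plain Python, using None as the missing-value marker. The data is hypothetical:

```python
def fill_mean(values):
    """Replace None with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def forward_fill(values):
    """Replace None with the most recent non-missing value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

data = [10, None, 30, None, 50]  # hypothetical column with gaps
print(fill_mean(data))      # [10, 30.0, 30, 30.0, 50]
print(forward_fill(data))   # [10, 10, 30, 30, 50]
```

Note that forward fill leaves a leading None unfilled, which is why it is often paired with backward fill.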
Handle Noise and Outliers
Noise in data may be introduced by faults in data collection, errors during data entry, data transmission errors, etc. Different types of noise and outliers include unknown encodings (example: Marital Status = Q), out-of-range values (example: Age = -10), inconsistent data (example: DoB = 4th Oct 1999 but Age = 50) and inconsistent formats (example: DoJ = 13th Jan 2000 vs DoJ = 10/10/2016).
Noise can be handled using binning. In this technique, sorted data is placed
into bins or buckets. Bins can be created by equal-width (distance) or equal-
depth (frequency) partitioning. On these bins, smoothing can be applied.
Smoothing can be done using bin means, bin medians or bin boundaries.
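Equal-depth binning with bin-mean smoothing might look like the following stdlib-only sketch. The prices list is hypothetical, and the helper assumes the list length divides evenly into the bins:

```python
def smooth_by_bin_means(values, n_bins):
    """Equal-depth (frequency) binning followed by bin-mean smoothing:
    sort the data, split it into bins of equal size, then replace each
    value with its bin's mean."""
    data = sorted(values)
    size = len(data) // n_bins  # assumes len(values) is divisible by n_bins
    smoothed = []
    for i in range(0, len(data), size):
        bin_ = data[i:i + size]
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sorted prices
print(smooth_by_bin_means(prices, 3))
# bins [4, 8, 15], [21, 21, 24], [25, 28, 34] -> means 9.0, 22.0, 29.0
```

Equal-width binning would instead split the value range (max minus min) into intervals of equal length, so bins could hold different numbers of points.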
Outliers can be handled by binning the data and then smoothing the bins. They can be detected using visual analysis or boxplots, and clustering can be used to identify groups of outlier data. The detected outliers may then be smoothed or removed.
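The boxplot rule for detecting outliers (flag values more than 1.5 × IQR beyond the quartiles) can be sketched with the standard library's statistics module. The ages data is hypothetical:

```python
import statistics

def iqr_outliers(values):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (the boxplot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

ages = [20, 22, 23, 25, 26, 27, 29, 120]  # 120 is an obvious outlier
print(iqr_outliers(ages))  # [120]
```

Whether a flagged value is removed, smoothed, or kept is a judgment call that depends on the domain, as the text notes.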
Remove Unwanted Data
Unwanted data is duplicate or irrelevant data. Scraping data from different sources and then integrating it may introduce duplicates if not done carefully. This redundant data should be removed, as it is of no use and will
only increase the amount of data and the time to train the model. Also, due
to redundant records, the model may not provide accurate results as the
duplicate data interferes with analysis process, giving more importance to
the repeated values.
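A minimal order-preserving deduplication sketch, assuming records are hashable tuples (the rows are hypothetical):

```python
def drop_duplicates(records):
    """Remove exact duplicate records while preserving first-seen order."""
    seen, unique = set(), []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    return unique

rows = [("alice", 30), ("bob", 25), ("alice", 30)]  # hypothetical records
print(drop_duplicates(rows))  # [('alice', 30), ('bob', 25)]
```

Real deduplication is often fuzzier than exact matching ("Bob" vs "bob", trailing whitespace), so records are usually normalised before comparison.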
Data Integration
In this step, a coherent data source is prepared. This is done by collecting
and integrating data from multiple sources like databases, legacy systems,
flat files, data cubes etc.
Data is like garbage. You'd better know what you are going to do with it before you
collect it. — Mark Twain
Issues in Data Integration
1. Schema Integration: Metadata (i.e. the schema) from different sources may not be compatible, which leads to the entity identification problem. Example: Consider two data sources R and S. The customer id is represented as cust_id in R and as c_id in S. They represent the same thing but have different names, which leads to integration problems. Detecting and resolving such conflicts is very important for building a coherent data source.
2. Data value conflicts: The values, metrics or representations of the same real-world entity may differ across data sources, leading to different representations of the same data, different scales, etc. Example: Weight in data source R is represented in kilograms and in source S in grams. To resolve this, data
representations should be made consistent and conversions should be
performed accordingly.
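The kilograms-versus-grams conflict above might be resolved as follows; the source rows, field names, and the integrate helper are all hypothetical:

```python
# Hypothetical sources: R stores weight in kilograms, S in grams.
source_r = [{"id": 1, "weight_g" if False else "weight_kg": 70.0}] if False else [{"id": 1, "weight_kg": 70.0}]
source_r = [{"id": 1, "weight_kg": 70.0}]
source_s = [{"id": 2, "weight_g": 65500.0}]

def integrate(r_rows, s_rows):
    """Bring both sources to a single representation (kilograms)."""
    combined = []
    for row in r_rows:
        combined.append({"id": row["id"], "weight_kg": row["weight_kg"]})
    for row in s_rows:
        # grams -> kilograms so the attribute is consistent after integration
        combined.append({"id": row["id"], "weight_kg": row["weight_g"] / 1000.0})
    return combined

print(integrate(source_r, source_s))
# [{'id': 1, 'weight_kg': 70.0}, {'id': 2, 'weight_kg': 65.5}]
```

The same pattern applies to any unit or encoding conflict: pick one canonical representation and convert every source into it during integration.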
3. Redundant data: Duplicate attributes or tuples may occur as a result of
integrating data from various sources. This may also lead to
inconsistencies. These redundancies or inconsistencies may be reduced
by careful integration of data from multiple sources. This will help in
improving the mining speed and quality. Also, correlation analysis can
be performed to detect redundant data.
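Correlation analysis for redundancy detection can be sketched by computing the Pearson coefficient by hand; attributes whose |r| is close to 1 are candidates for removal. The height columns below are hypothetical (the same attribute recorded in two units):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

height_cm = [150, 160, 170, 180]
height_in = [59.1, 63.0, 66.9, 70.9]  # same attribute in inches
print(round(pearson(height_cm, height_in), 3))  # close to 1.0: redundant
```

An |r| near 1 only suggests redundancy; whether to drop one of the attributes still depends on the mining goal.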
Data Reduction
If the data is very large, data reduction is performed. Sometimes, it is also
performed to find the most suitable subset of attributes from a large number
of attributes. This is known as dimensionality reduction. Data reduction also
involves reducing the number of attribute values and/or the number of
tuples. Various data reduction techniques are:
1. Data cube aggregation: In this technique the data is reduced by applying OLAP operations like slice, dice or roll-up, aggregating the data at the smallest level of abstraction necessary to solve the problem.
2. Dimensionality reduction: The data attributes or dimensions are
reduced. Not all attributes are required for data mining. The most
suitable subset of attributes is selected using techniques like forward
selection, backward elimination, decision tree induction or a
combination of forward selection and backward elimination.
3. Data compression: In this technique, large volumes of data are compressed, i.e. the number of bits used to store the data is reduced. This can be done using lossy or lossless compression. In lossy compression, the quality of data is compromised for more compression. In lossless
compression, the quality of data is not compromised for higher
compression level.
4. Numerosity reduction: This technique reduces the volume of data by
choosing smaller forms for data representation. Numerosity reduction
can be done using histograms, clustering or sampling of data.
Numerosity reduction is necessary as processing the entire data set is
expensive and time consuming.
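Sampling, one of the numerosity-reduction techniques above, can be sketched with the standard library's random module. The rows and sample size are hypothetical, and the seed is fixed only so the sketch is reproducible:

```python
import random

def sample_without_replacement(rows, k, seed=42):
    """Simple random sampling: keep k of the original tuples."""
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    return rng.sample(rows, k)

rows = list(range(1000))          # hypothetical data set of 1000 tuples
subset = sample_without_replacement(rows, 10)
print(len(subset))                # 10
```

Histograms and clustering, the other numerosity-reduction techniques listed, similarly replace the raw tuples with a smaller summary representation.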
Written by Harshita Singh
Full Stack Developer | MS AI for Earth Grantee 2020