
Data Preprocessing

Data is often called the “new oil,” but just like crude oil, raw data is rarely useful in its original
state. It is often noisy, incomplete, inconsistent, or simply too large to be processed directly.
To get accurate and meaningful results from data mining, we must first prepare the data
properly. This preparation process is called data preprocessing.

Data preprocessing plays a critical role in ensuring that the knowledge discovered from data
is reliable and valid. If we try to apply data mining techniques on raw data without cleaning
and transforming it, we may end up with wrong conclusions. For example, imagine a retail
database where some customers have missing addresses, some transactions have been
recorded twice, and some prices are wrongly entered. Mining such data would give
misleading results about customer behavior or sales patterns. Hence, preprocessing is the
foundation for successful data mining.

1. Importance of Data Preprocessing

Before we go into the steps, it is important to understand why preprocessing is required:

● Data quality issues – Real-world data may have errors, duplicate entries, missing values, and inconsistencies.

● Large volume – Data warehouses and big databases often contain terabytes or petabytes of records, which are too big for direct mining.

● Heterogeneity – Data may come from different sources such as spreadsheets, databases, sensors, or the web, each with different formats.

● Improving accuracy – Well-prepared data improves the performance of mining algorithms, leading to better classification, clustering, or prediction results.

● Efficiency – Preprocessing reduces data size and complexity, so mining becomes faster and more scalable.

Thus, preprocessing acts like the cleaning and organizing stage before analysis, making
sure the data is consistent, accurate, and ready for knowledge discovery.
2. Data Cleaning

The first major task in preprocessing is data cleaning, which deals with missing values, noisy
values, and inconsistencies.

(a) Missing Values

Data often has blank or unknown fields. For example, a customer’s age might not be
recorded, or income data may be absent. Methods to handle missing values include:

●​ Ignoring the record (only if the dataset is large enough).

●​ Filling with a global constant (like “Unknown”).

●​ Replacing with the mean, median, or mode of the attribute.

● Using predictive models (regression, decision trees, or k-nearest neighbor) to estimate the missing value.
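
As a minimal sketch of these options (assuming a small pandas DataFrame with hypothetical age, income, and city columns), the following shows dropping records, filling with a constant, and filling with the mean or median:

```python
import pandas as pd

# Hypothetical customer data with missing values (illustrative only).
df = pd.DataFrame({
    "age":    [25, None, 47, 31, None],
    "income": [30000, 42000, None, 52000, 41000],
    "city":   ["Pune", "Mumbai", None, "Pune", "Delhi"],
})

# Option 1: ignore records with missing fields (only if the dataset is large enough).
dropped = df.dropna()

# Option 2: fill with a global constant such as "Unknown".
df["city"] = df["city"].fillna("Unknown")

# Option 3: replace numeric attributes with the mean (median or mode work the same way).
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())

print(df)
```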

(b) Noisy Data

Noise refers to random errors or variations in data. Example: typing mistakes, wrong
measurements, or outliers. Methods to reduce noise include:

●​ Binning: Sorting data into bins and replacing values by bin mean, median, or
boundary.

●​ Regression: Fitting data into a regression function and smoothing values.

●​ Clustering: Grouping similar records and identifying outliers as noise.
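
The binning idea can be sketched in a few lines of pandas (the values below are hypothetical): records are sorted into equal-frequency bins and each value is replaced by its bin mean.

```python
import pandas as pd

# Hypothetical noisy price measurements.
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-frequency binning into 4 bins, then smoothing each value by its bin mean.
bins = pd.qcut(prices, q=4)
smoothed = prices.groupby(bins, observed=True).transform("mean")

print(pd.DataFrame({"original": prices, "bin": bins, "smoothed": smoothed}))
```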

(c) Inconsistencies

Data from different sources may have different formats or spellings. For example,
“Male/Female” vs. “M/F,” or different currency notations. Cleaning detects and corrects such
conflicts to maintain consistency.
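
A minimal sketch of such a correction, assuming a hypothetical gender column coded differently by two sources, is simply to map all variants onto one convention:

```python
import pandas as pd

# Records from different sources using inconsistent coding (hypothetical values).
df = pd.DataFrame({"gender": ["Male", "F", "M", "female", "FEMALE"]})

# Map every variant onto a single consistent code.
mapping = {"male": "M", "m": "M", "female": "F", "f": "F"}
df["gender"] = df["gender"].str.lower().map(mapping)

print(df)
```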

3. Data Integration

The second task is data integration, which combines data from multiple sources like
databases, flat files, or data cubes.

Key problems in integration include:


●​ Entity identification – Matching records that refer to the same entity but have different
names (e.g., “cust_ID” vs. “customer_no”).

●​ Schema integration – Combining attributes with different names but same meaning.

● Redundancy removal – If the same information is stored in two sources, it must be detected and merged. Correlation analysis is often used to detect redundancy (see the sketch after this list).

●​ Value conflict resolution – If two sources record different values for the same
attribute, a strategy is needed to resolve conflicts.
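
The merge-and-check idea behind entity identification and redundancy removal can be sketched with pandas; the source tables, key columns (cust_ID vs. customer_no), and income attributes below are hypothetical:

```python
import pandas as pd

# Two hypothetical sources describing the same customers.
src_a = pd.DataFrame({"cust_ID": [1, 2, 3], "annual_income": [30000, 42000, 55000]})
src_b = pd.DataFrame({"customer_no": [1, 2, 3], "monthly_income": [2500, 3500, 4600]})

# Entity identification: align the differently named key columns, then merge.
merged = src_a.merge(src_b, left_on="cust_ID", right_on="customer_no")

# Redundancy detection: a very high correlation suggests the two attributes
# carry the same information, so one of them can be dropped.
corr = merged["annual_income"].corr(merged["monthly_income"])
print(f"correlation = {corr:.2f}")
if corr > 0.9:
    merged = merged.drop(columns=["monthly_income", "customer_no"])

print(merged)
```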

Integration provides a unified view of data, which is essential for data mining in large
organizations where data comes from many departments and systems.

4. Data Reduction

Since real-world data is often huge, data reduction techniques are applied to make the
dataset smaller but still representative of the original. The goal is to reduce the volume while
preserving essential patterns.

Methods of Data Reduction:

1. Data Cube Aggregation – Summarizing data at higher abstraction levels. For example,
instead of storing daily sales, we can keep monthly or yearly totals.

2. Dimensionality Reduction – Reducing the number of attributes using techniques such as Principal Component Analysis (PCA) or Singular Value Decomposition (SVD) (a brief sketch follows this list).

3. Attribute Subset Selection – Choosing only the most relevant attributes while discarding
irrelevant or redundant ones. Techniques like decision tree induction and stepwise
regression help in this.

4. Numerosity Reduction – Replacing large datasets with models. Example: using regression
equations or clusters to represent the data.

5. Sampling – Selecting a representative sample of the data for analysis instead of the entire dataset.

6. Data Compression – Using encoding schemes like wavelet transforms to store data in a compact format.
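
As an illustration of dimensionality reduction and sampling (using scikit-learn on randomly generated stand-in data, not a real dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 1,000 records with 10 numeric attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))

# Dimensionality reduction: keep enough principal components
# to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print("attributes kept:", X_reduced.shape[1])

# Sampling: keep a 10% random sample of the records.
sample_idx = rng.choice(len(X), size=len(X) // 10, replace=False)
X_sample = X[sample_idx]
print("sampled records:", X_sample.shape[0])
```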

Reduction improves efficiency, saves storage, and speeds up mining algorithms without
losing much accuracy.

5. Data Transformation

Sometimes, even after cleaning, the data may not be in the right format for mining. Data
transformation modifies data into suitable forms.

● Normalization – Scaling data into a specific range (illustrated after this list).

  ● Min-Max Normalization: scales values between 0 and 1.

  ● Z-score Normalization: rescales using mean and standard deviation.

  ● Decimal Scaling: moves decimal point based on maximum absolute value.

● Smoothing – Removing noise using techniques like regression or binning.

● Aggregation – Summarizing data (e.g., converting transaction-level data into customer-level totals).

● Generalization – Replacing low-level data with higher-level concepts. For example, “23 years” can be generalized to “young.”
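
The three normalization methods can be written out directly with NumPy; the attribute values below are hypothetical:

```python
import numpy as np

# Hypothetical attribute values to be normalized.
x = np.array([200.0, 300.0, 450.0, 600.0, 986.0])

# Min-Max normalization: scales values into [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: rescales using the mean and standard deviation.
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^j, where j is the smallest integer
# that brings every absolute value below 1.
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal_scaled = x / (10 ** j)

print(min_max, z_score, decimal_scaled, sep="\n")
```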

Transformation ensures that attributes are comparable and suitable for mining tasks such as
clustering or classification.

6. Data Discretization

Some data mining algorithms work better on categorical data than on continuous values.
Data discretization converts continuous attributes into discrete intervals.

Methods include:
●​ Binning – Dividing values into equal-width or equal-frequency intervals.

●​ Histogram analysis – Using data distribution to define bins.

●​ Clustering – Grouping values and treating each cluster as a category.

●​ Decision tree analysis – Splitting data into intervals based on class labels.

● Concept hierarchy generation – Organizing data into multiple levels of abstraction (e.g., city → state → country).
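
A minimal sketch of equal-width and equal-frequency binning with pandas (the ages and interval labels below are hypothetical, illustrative choices):

```python
import pandas as pd

# Hypothetical continuous attribute: customer ages.
ages = pd.Series([22, 25, 31, 38, 45, 52, 58, 63, 70])

# Equal-width binning: three intervals of the same width.
equal_width = pd.cut(ages, bins=3, labels=["young", "middle-aged", "senior"])

# Equal-frequency binning: three intervals with roughly the same number of records.
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))
```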

Discretization helps reduce data size, improve interpretability, and enhance mining accuracy.

7. Conclusion

To summarize, data preprocessing is the foundation of data mining. The five major
tasks—cleaning, integration, reduction, transformation, and discretization—together ensure
that raw data is converted into a high-quality dataset ready for analysis.

●​ Data cleaning removes noise and handles missing values.

●​ Data integration combines multiple sources into a unified view.

●​ Data reduction simplifies data while preserving patterns.

●​ Data transformation prepares data into suitable formats.

●​ Data discretization converts continuous data into meaningful categories.

Without preprocessing, data mining results may be misleading or inaccurate. With proper
preprocessing, we can ensure that mining techniques work efficiently and produce reliable
knowledge.

In short, preprocessing is not just the first step, but also the most important step in the entire
data mining process. It transforms raw data into a treasure of meaningful insights.
