
Unit 2 Data Preprocessing


Data Preprocessing

PES Modern college of Engineering, MCA Department Prepared by : Dr.Prakash Kene

Why Data Preprocessing?

• Data in the real world is dirty:
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  • e.g., occupation=“ ”
– noisy: containing errors or outliers
  • e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes or names
  • e.g., Age=“42” Birthday=“03/07/1997”
  • e.g., Was rating “1,2,3”, now rating “A, B, C”
  • e.g., discrepancy between duplicate records
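The three kinds of dirty data above can be checked for mechanically. The following is a minimal sketch, not a definitive implementation: the field names, the function `find_problems`, and the reference date are all assumptions made for illustration, and the birthday is read as 3 July 1997 (the slide's date format is ambiguous).

```python
from datetime import date

def find_problems(record, today):
    """Return a list of data quality problems found in one record (a dict)."""
    problems = []

    # Incomplete: a required attribute is missing or blank.
    occupation = record.get("occupation")
    if occupation is None or not occupation.strip():
        problems.append("incomplete: occupation is blank")

    # Noisy: a value outside its plausible range (a salary cannot be negative).
    salary = record.get("salary")
    if salary is not None and salary < 0:
        problems.append("noisy: negative salary")

    # Inconsistent: the stated age disagrees with the age implied by the birthday.
    age, birthday = record.get("age"), record.get("birthday")
    if age is not None and birthday is not None:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        implied_age = today.year - birthday.year - (0 if had_birthday else 1)
        if implied_age != age:
            problems.append("inconsistent: age does not match birthday")

    return problems

# Hypothetical record combining the three slide examples.
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(1997, 7, 3)}
problems = find_problems(record, today=date(2024, 1, 1))
```

Each check is independent, so one record can be flagged for several problems at once.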


What is Data?

• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance

Example table (columns are attributes, rows are objects):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes


Data Quality

• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
– Noise and outliers
– Missing values
– Duplicate data


Noise

• Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor phone connection, and “snow” on a television screen

[Figure: two panels, “Two Sine Waves” and “Two Sine Waves + Noise”]


Outliers

• Outliers are data objects with characteristics that are considerably different from those of most other data objects in the data set
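The slide does not name a detection method. As one common option, the sketch below flags values that fall more than 1.5 interquartile ranges outside the quartiles; the 1.5 multiplier, the function name `find_outliers`, and the sample values are assumptions for illustration.

```python
import statistics

def find_outliers(values, k=1.5):
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up measurements: one value is far from the rest.
readings = [10, 12, 11, 13, 12, 11, 95]
outliers = find_outliers(readings)
```

Whether a flagged value is an error to remove or a genuine rare case to keep is a judgment the rule cannot make on its own.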


Missing Values

• Reasons for missing values
– Information is not collected (e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
• Handling missing values
– Eliminate data objects
– Estimate missing values
– Ignore the missing value during analysis
– Replace with all possible values (weighted by their probabilities)
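Two of the handling strategies above, eliminating data objects and ignoring the missing value during analysis, can be sketched as follows. The records and the helper `mean_ignoring_missing` are hypothetical, and `None` is assumed to mark a missing value.

```python
# Made-up records; None marks a value that was not collected.
records = [
    {"age": 25, "weight": 70.0},
    {"age": None, "weight": 82.5},
    {"age": 40, "weight": None},
    {"age": 31, "weight": 64.0},
]

# Strategy 1: eliminate data objects that have any missing attribute.
complete = [r for r in records if all(v is not None for v in r.values())]

# Strategy 2: keep every object, but skip missing values in each computation.
def mean_ignoring_missing(rows, attribute):
    values = [r[attribute] for r in rows if r[attribute] is not None]
    return sum(values) / len(values) if values else None

mean_age = mean_ignoring_missing(records, "age")  # uses 3 of the 4 records
```

Elimination is simpler but here discards half of the objects, while the second strategy keeps every value that was actually observed.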


Duplicate Data

• A data set may include data objects that are duplicates, or almost duplicates, of one another
– Major issue when merging data from heterogeneous sources
• Examples:
– Same person with multiple email addresses
• Data cleaning
– Process of dealing with duplicate data issues
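The “same person with multiple email addresses” case might be handled as in the sketch below. It assumes, purely for illustration, that a normalized name plus date of birth identifies a person; the field names, the function `merge_duplicates`, and the records are made up.

```python
def merge_duplicates(records):
    """Merge records that share a normalized (name, date of birth) key."""
    merged = {}
    for r in records:
        # Normalize case and whitespace so near-identical names compare equal.
        key = (" ".join(r["name"].lower().split()), r["born"])
        entry = merged.setdefault(
            key, {"name": r["name"].strip(), "born": r["born"], "emails": []}
        )
        email = r["email"].strip().lower()
        if email not in entry["emails"]:
            entry["emails"].append(email)
    return list(merged.values())

records = [
    {"name": "Asha Rao", "born": "1990-04-12", "email": "asha@example.com"},
    {"name": "asha  rao ", "born": "1990-04-12", "email": "A.Rao@example.org"},
    {"name": "Vikram Shah", "born": "1985-11-02", "email": "vikram@example.com"},
]
people = merge_duplicates(records)
```

Real data usually needs a fuzzier matching rule than an exact key, which is why near-duplicates are harder than exact duplicates.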


Why Preprocess the Data?

• Reasons for data cleaning
– Incomplete data (missing data)
– Noisy data (contains errors)
– Inconsistent data (containing discrepancies)
• Reasons for data integration
– Data comes from multiple sources
• Reasons for data transformation
– Some data must be transformed to be used for mining
• Reasons for data reduction
– Performance: mining the complete data set may take too long
• No quality data → no quality mining results!


Major Tasks in Data Preprocessing

• 1. Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• 2. Data integration
– Integration of multiple databases, data cubes, or files
• 3. Data transformation
– Normalization and aggregation
• 4. Data reduction (sampling, dimensionality reduction, feature subset selection)
– Obtains a representation that is reduced in volume but produces the same or similar analytical results


• 5. Data discretization
– Classification algorithms sometimes require the data to be in the form of categorical attributes.
– Algorithms that find association patterns require the data to be in the form of binary attributes.
– It is therefore often necessary to transform a continuous attribute into a categorical attribute (discretization).
– Part of data reduction, but of particular importance, especially for numerical data
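A small sketch of both transformations, using the Taxable Income values (in thousands) from the example table earlier in the deck. Equal-width binning is only one possible discretization scheme; the three labels and the function names are assumptions chosen for illustration.

```python
def discretize_equal_width(values, labels):
    """Map each continuous value to one of len(labels) equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    result = []
    for v in values:
        index = int((v - lo) / width) if width > 0 else 0
        index = min(index, len(labels) - 1)  # the maximum falls in the last bin
        result.append(labels[index])
    return result

def binarize(categories, labels):
    """Turn each categorical value into one binary attribute per label."""
    return [[1 if c == label else 0 for label in labels] for c in categories]

labels = ["low", "medium", "high"]
taxable_income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
categories = discretize_equal_width(taxable_income, labels)
binary_rows = binarize(categories, labels)
```

With these values the single 220K income stretches the range, so most objects land in the “low” bin; equal-frequency bins would spread them more evenly.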


Forms of Data Preprocessing


1. Data Cleaning

• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
– Resolve redundancy caused by data integration


1. Data Cleaning: How to Handle Missing Data?

• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective unless the tuple contains several attributes with missing values
• Fill in the missing value manually: time-consuming and not feasible for large data sets
• Fill it in automatically with
– a global constant: e.g., “unknown”, a new class?!
– the attribute mean
– the most probable value: inference-based, such as a Bayesian formula, regression, or decision tree induction
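The first two automatic fills can be sketched in a few lines. This is a minimal illustration with made-up values, assuming `None` marks a missing value; the function names are hypothetical.

```python
def fill_with_constant(values, constant="unknown"):
    """Replace every missing value with one global constant."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Replace every missing value with the mean of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to estimate the mean from
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

occupations = fill_with_constant(["clerk", None, "teacher"])
incomes = fill_with_mean([30.0, None, 50.0, 40.0])
```

The constant fill effectively creates a new category (the “new class?!” concern on the slide), while the mean fill leaves the attribute mean unchanged but understates its variability.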


1. Data Cleaning: How to Handle Noisy Data?

• Noise: a random error or variance in a measured variable.
• Incorrect attribute values may be due to
– faulty data collection
– data entry problems
– data transmission problems
– data conversion errors
– data decay problems
– technology limitations, e.g., buffer overflow or field size limits


1. Data Cleaning: How to Handle Noisy Data? (Methods)

• Binning
– first sort the data and partition it into (equal-frequency) bins
– then smooth by bin means, bin medians, bin boundaries, etc.
• Regression
– smooth by fitting the data to regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values automatically and have a human check them (e.g., to deal with possible outliers)
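The binning method above can be sketched as follows, using illustrative values that are not from the slides. The function names are hypothetical, and when a value is equally close to both bin boundaries the sketch picks the lower one, which is an arbitrary choice.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the data and partition it into bins of bin_size values each."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [15, 4, 21, 8, 28, 21, 34, 24, 25]
bins = equal_frequency_bins(prices, bin_size=3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Smoothing by means pulls every value toward its neighbours, while smoothing by boundaries keeps each bin's extremes intact.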


1. Data Cleaning: Regression

• Data can be smoothed by fitting the data to a function, such as with regression.
• Linear regression involves finding the ‘best’ line to fit 2 variables.
• Also, it is possible to predict one variable using the other variable.

[Figure: line y = x + 1 plotted on x and y axes, with point X1 on the x-axis and values Y1 and Y1’ on the y-axis]
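As a sketch of the idea, the code below fits a line by ordinary least squares and replaces each observed y with the value on the fitted line. Least squares is assumed as the meaning of the ‘best’ line, and the data points are made up to scatter around y = x + 1.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]  # noisy observations near y = x + 1

slope, intercept = fit_line(xs, ys)
smoothed = [slope * x + intercept for x in xs]  # fitted values replace noisy ones
predicted = slope * 6 + intercept               # predict y for an unseen x
```

The same fitted line serves both purposes named on the slide: smoothing existing values and predicting one variable from the other.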


1. Data Cleaning: Cluster Analysis


2. Data Integration

• Data integration:
– Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
– Integrate metadata from different sources
• Entity identification problem:
– Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
– For the same real-world entity, attribute values from different sources are different
– Possible reasons: different representations, different scales


2. Data Integration: Handling Redundancy

• Redundant data often occur when integrating multiple databases
– Object identification: the same attribute or object may have different names in different databases
– Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant attributes can often be detected by correlation analysis
• Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies, and improve mining speed and quality
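Correlation analysis for redundancy can be sketched with the Pearson correlation coefficient, which is one common choice; the slide does not specify a measure. The attribute names and values are made up, with annual revenue derived as 12 times monthly revenue.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)  # assumes neither attribute is constant

monthly_revenue = [10.0, 12.5, 9.0, 15.0]
annual_revenue = [12 * m for m in monthly_revenue]  # a derived attribute

r = pearson(monthly_revenue, annual_revenue)
redundant = abs(r) > 0.95  # the 0.95 cut-off is an arbitrary illustration
```

A coefficient near +1 or -1 suggests that one attribute carries almost no information beyond the other, so one of the two may be dropped.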


Data Transformation

• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: values scaled to fall within a small, specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
• Attribute/feature construction
– New attributes constructed from the given ones
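The three normalization methods listed above can be sketched as follows. The formulas are the standard textbook definitions rather than anything stated on the slide, the z-score uses the population standard deviation (an assumption), and the values are made up.

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and scale by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumes the values are not all equal
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

incomes = [200, 300, 400, 600, 1000]
scaled = min_max(incomes)
standardized = z_score(incomes)
decimal = decimal_scaling(incomes)
```

Min-max fixes the output range but is sensitive to extreme values, whereas z-score normalization does not bound the range but is less dominated by a single maximum.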


Data Reduction Strategies

• Why data reduction?
– A database or data warehouse may store terabytes of data
– Complex data analysis or mining may take a very long time to run on the complete data set
• Data reduction
– Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Data reduction strategies
– Aggregation
– Sampling
– Dimensionality reduction
– Feature subset selection
– Feature creation
– Discretization and binarization
– Attribute transformation


Data Reduction: Aggregation

• Combining two or more attributes (or objects) into a single attribute (or object)
• Purpose
– Data reduction
  • Reduce the number of attributes or objects
– Change of scale
  • Cities aggregated into regions, states, countries, etc.
– More “stable” data
  • Aggregated data tends to have less variability
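The change-of-scale case can be sketched by summing city-level objects into one object per state. The sales figures below are made up for illustration, and `aggregate_by_state` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up city-level records: (city, state, sales).
city_sales = [
    ("Pune", "Maharashtra", 120),
    ("Mumbai", "Maharashtra", 300),
    ("Nagpur", "Maharashtra", 80),
    ("Bengaluru", "Karnataka", 250),
    ("Mysuru", "Karnataka", 60),
]

def aggregate_by_state(rows):
    """Combine all city objects of a state into a single object (total sales)."""
    totals = defaultdict(int)
    for _city, state, amount in rows:
        totals[state] += amount
    return dict(totals)

state_sales = aggregate_by_state(city_sales)
```

Five objects become two, which shows both the data reduction and the coarser scale described above.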


Data Reduction: Sampling

• Sampling is the main technique employed for data selection.
– It is often used for both the preliminary investigation of the data and the final data analysis.
• Statisticians sample because obtaining the entire set of data of interest is too expensive or time-consuming.
• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time-consuming.
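The slide does not say which sampling scheme is meant. As one basic option, the sketch below draws a simple random sample without replacement; the population, the sample size, and the seed are arbitrary choices for illustration.

```python
import random

def simple_random_sample(data, k, seed=None):
    """Draw k distinct objects, each subset of size k being equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(data, k)

population = list(range(1, 1001))  # stand-in for a large data set
sample = simple_random_sample(population, k=50, seed=42)
```

Later processing then runs on 50 objects instead of 1,000; whether the sample is representative enough depends on its size and on how varied the data is.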


Thank You
