Unit 2 Data Preprocessing


Data Preprocessing

PES Modern college of Engineering, MCA Department Prepared by : Dr.Prakash Kene


Why Data Preprocessing?

• Data in the real world is dirty:

– incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate
data
   • e.g., occupation=“ ”
– noisy: containing errors or outliers
   • e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes or
names
   • e.g., Age=“42” Birthday=“03/07/1997”
   • e.g., Was rating “1,2,3”, now rating “A, B, C”
   • e.g., discrepancy between duplicate records
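
The three kinds of dirty data above can be checked mechanically. The following is a minimal sketch, not a complete validator: the record layout, the field names, the reference date `TODAY`, and the reading of “03/07/1997” as 3 July 1997 are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical records; the first one mirrors the slide's examples.
records = [
    {"occupation": "", "salary": -10, "age": 42, "birthday": date(1997, 7, 3)},
    {"occupation": "engineer", "salary": 52000, "age": 30, "birthday": date(1995, 1, 15)},
]

TODAY = date(2025, 6, 1)  # assumed reference date for the age check


def age_on(birthday, today):
    """Age in whole years on the given day."""
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    return today.year - birthday.year - (0 if had_birthday else 1)


def quality_problems(record, today):
    """Return a list of human-readable problems found in one record."""
    problems = []
    if not record["occupation"].strip():
        problems.append("incomplete: occupation is empty")
    if record["salary"] < 0:
        problems.append("noisy: salary is negative")
    if age_on(record["birthday"], today) != record["age"]:
        problems.append("inconsistent: age does not match birthday")
    return problems


report = [quality_problems(r, TODAY) for r in records]
for problems in report:
    print(problems)
```

The first record triggers all three checks; the second passes them.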



What is Data?

• Collection of data objects and their attributes

• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic, or feature

• A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance

Example data set (columns are attributes, rows are objects):

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes



Data Quality

• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:
– Noise and outliers
– missing values
– duplicate data



Noise

• Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor
phone connection and “snow” on a television screen

[Figure: two panels, “Two Sine Waves” and “Two Sine Waves + Noise”]



Outliers

• Outliers are data objects with characteristics that are
considerably different from most of the other data objects
in the data set
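
One way to make “considerably different” concrete is the interquartile-range rule. This is a sketch under stated assumptions: the slides do not prescribe a detection method, the 1.5 × IQR fences are only a common convention, and the values are the Taxable Income column (in thousands) from the example table.

```python
from statistics import quantiles

# Taxable Income values (in thousands) from the example data set.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is a convention."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default 'exclusive' method
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


outliers = iqr_outliers(incomes)
print(outliers)
```

Whether a flagged value is an error or a legitimate extreme still has to be decided by someone who knows the data.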



Missing Values

• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

• Handling missing values
– Eliminate Data Objects
– Estimate Missing Values
– Ignore the Missing Value During Analysis
– Replace with all possible values (weighted by their probabilities)
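
The first and third handling strategies differ in how much data they keep. A minimal sketch, assuming hypothetical records in which `None` marks a missing value:

```python
# Hypothetical records; None marks a missing value.
people = [
    {"age": 34, "weight": 70.0},
    {"age": None, "weight": 82.5},
    {"age": 51, "weight": None},
    {"age": 29, "weight": 64.0},
]

# Eliminate data objects: drop every record that has any missing value.
complete = [p for p in people if None not in p.values()]


def mean_ignoring_missing(rows, attribute):
    """Ignore missing values during analysis: average only the present ones."""
    present = [r[attribute] for r in rows if r[attribute] is not None]
    return sum(present) / len(present)


mean_age = mean_ignoring_missing(people, "age")        # uses 3 of the 4 records
mean_weight = mean_ignoring_missing(people, "weight")  # uses 3 of the 4 records
print(len(complete), mean_age, mean_weight)
```

Elimination keeps only two of the four records here, while the per-attribute approach still uses three values for each mean.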



Duplicate Data

• Data set may include data objects that are duplicates, or
almost duplicates, of one another
– Major issue when merging data from heterogeneous sources

• Examples:
– Same person with multiple email addresses

• Data cleaning
– Process of dealing with duplicate data issues
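
The “same person with multiple email addresses” case can be sketched as follows. The records are hypothetical, and the matching key (normalized name plus digits-only phone number) is an assumption; real matching rules depend on the data.

```python
# Hypothetical customer records merged from two sources.
records = [
    {"name": "Ana Silva", "phone": "555-0101", "email": "ana@example.com"},
    {"name": "ana silva ", "phone": "5550101", "email": "a.silva@example.org"},
    {"name": "Ben Ode", "phone": "555-0199", "email": "ben@example.com"},
]


def match_key(record):
    """Assumed matching key: lower-cased, whitespace-normalized name plus phone digits."""
    name = " ".join(record["name"].lower().split())
    phone = "".join(ch for ch in record["phone"] if ch.isdigit())
    return (name, phone)


merged = {}
for record in records:
    entry = merged.setdefault(
        match_key(record), {"name": record["name"].strip(), "emails": []}
    )
    if record["email"] not in entry["emails"]:
        entry["emails"].append(record["email"])

people = list(merged.values())
print(people)
```

The two "Ana Silva" rows collapse into one person with two email addresses.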



Why Preprocess the Data?

• Reasons for data cleaning
– Incomplete data (missing data)
– Noisy data (contains errors)
– Inconsistent data (containing discrepancies)

• Reasons for data integration
– Data comes from multiple sources

• Reasons for data transformation
– Some data must be transformed to be used for mining

• Reasons for data reduction
– Performance

• No quality data, no quality mining results!



Major Tasks in Data Preprocessing

1. Data cleaning
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies

2. Data integration
– Integration of multiple databases, data cubes, or files

3. Data transformation
– Normalization and aggregation

4. Data reduction (sampling, dimensionality reduction,
feature subset selection)
– Obtains a representation that is reduced in volume but produces the same
or similar analytical results

5. Data discretization
– Classification algorithms sometimes require the data to be
in the form of categorical attributes
– Algorithms that find association patterns require the data to be in the
form of binary attributes
– Thus it is often necessary to transform a continuous attribute into a
categorical attribute (discretization)
– Part of data reduction, but of particular importance, especially for
numerical data
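
Discretization and the follow-up binarization can be sketched as below. Equal-width intervals are only one possible scheme and are chosen here as an assumption, as are the three labels; the values are the Taxable Income column (in thousands) from the example table. The function assumes the values are not all equal.

```python
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]  # in thousands
labels = ["low", "medium", "high"]                       # assumed category names


def equal_width_discretize(values, labels):
    """Map each continuous value to the label of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    result = []
    for v in values:
        # The maximum value would land one past the last interval, so clamp it.
        index = min(int((v - lo) / width), len(labels) - 1)
        result.append(labels[index])
    return result


# Continuous attribute -> categorical attribute
categories = equal_width_discretize(incomes, labels)

# Categorical attribute -> binary attributes (one 0/1 flag per category)
binary = [{label: int(c == label) for label in labels} for c in categories]

print(categories)
print(binary[6])
```

With one very large value (220), equal-width intervals put most objects in the lowest category, which is a known weakness of this scheme on skewed data.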



Forms of Data Preprocessing



1. Data Cleaning

• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
– Resolve redundancy caused by data integration



1. Data Cleaning: How to Handle Missing Data?

• Ignore the tuple: usually done when the class label is missing
(assuming the task is classification); not effective
unless the tuple contains several attributes with
missing values
• Fill in the missing value manually: time-consuming and not feasible
for large datasets
• Fill it in automatically with
– a global constant, e.g., “unknown” (a new class?!)
– the attribute mean
– the most probable value: inference-based, such as a Bayesian
formula, regression, or decision tree induction
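
The first two automatic options can be sketched in a few lines. The rows and attribute names are hypothetical, and `None` is assumed to mark a missing value; the inference-based option is not shown.

```python
from statistics import mean

# Hypothetical rows; None marks a missing value.
rows = [
    {"occupation": "clerk", "income": 70},
    {"occupation": None, "income": 100},
    {"occupation": "nurse", "income": None},
    {"occupation": "clerk", "income": 85},
]


def fill_constant(rows, attribute, constant="unknown"):
    """Replace missing values of one attribute with a global constant."""
    return [
        {**r, attribute: constant if r[attribute] is None else r[attribute]}
        for r in rows
    ]


def fill_mean(rows, attribute):
    """Replace missing values of one numeric attribute with the attribute mean."""
    attribute_mean = mean(r[attribute] for r in rows if r[attribute] is not None)
    return [
        {**r, attribute: attribute_mean if r[attribute] is None else r[attribute]}
        for r in rows
    ]


filled = fill_mean(fill_constant(rows, "occupation"), "income")
print(filled)
```

Both functions return new lists and leave `rows` untouched, so the original missing-value pattern stays available for later inspection.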



1. Data Cleaning: How to Handle Noisy Data?

• Noise: a random error or variance in a measured variable.

• Incorrect attribute values may be due to
– faulty data collection
– data entry problems
– data transmission problems
– data conversion errors
– data decay problems
– technology limitations, e.g., buffer overflow or field size limits



1. Data Cleaning: How to Handle Noisy Data?

Methods
• Binning
– first sort the data and partition it into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data to regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and have a human check them (e.g., deal with
possible outliers)
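
The binning steps above can be sketched as follows. The price values are illustrative sample data, not taken from the slides, and resolving ties toward the lower boundary is an assumption.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the data and partition it into bins of bin_size values each."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative sample data
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)

print(bins)           # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Smoothing by means removes more detail than smoothing by boundaries, which keeps the extreme values of each bin.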



1. Data Cleaning: Regression

• Data can be smoothed by fitting the data to a
function, such as with regression.
• Linear regression involves finding the ‘best’ line to fit
2 variables.
• Also, it is possible to predict one variable using
the other variable.

[Figure: regression line y = x + 1 on axes x and y, with labeled points X1, Y1 and Y1’]
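
A least-squares line is one common reading of the ‘best’ line. The sketch below fits such a line and uses it both for smoothing and for prediction; the sample points are invented to lie roughly along y = x + 1 and are an assumption, not data from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [0, 1, 2, 3, 4]
ys = [1.2, 1.8, 3.1, 3.9, 5.0]  # invented noisy values, roughly y = x + 1

slope, intercept = fit_line(xs, ys)

# Smoothing: replace each observed y by the value on the fitted line.
smoothed = [slope * x + intercept for x in xs]

# Prediction: estimate y for an x that was not observed.
predicted = slope * 5 + intercept

print(round(slope, 2), round(intercept, 2))
```

The fitted line comes out close to, but not exactly, y = x + 1, because the fit averages out the noise in the sample.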



1. Data Cleaning: Cluster Analysis



2. Data Integration

• Data integration:
– Combines data from multiple sources into a coherent store

• Schema integration: e.g., A.cust-id ≡ B.cust-#
– Integrate metadata from different sources

• Entity identification problem:
– Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton

• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different
sources are different
– Possible reasons: different representations, different scales
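
Schema integration and a scale conflict can be sketched together. Everything below is hypothetical: the two sources, the weight attribute, the target schema, and the "first source wins" policy for conflicting names. Only the key names `cust-id` and `cust-#` come from the slide.

```python
# Source A keys customers by "cust-id" and stores weight in kilograms;
# source B keys them by "cust-#" and stores weight in pounds (assumed).
source_a = [{"cust-id": 1, "name": "William Clinton", "weight_kg": 80.0}]
source_b = [
    {"cust-#": 1, "name": "Bill Clinton", "weight_lb": 176.4},
    {"cust-#": 2, "name": "Ana Silva", "weight_lb": 140.0},
]

LB_PER_KG = 2.20462  # approximate conversion factor


def from_a(row):
    """Map a source A row onto the common schema."""
    return {"cust_id": row["cust-id"], "name": row["name"], "weight_kg": row["weight_kg"]}


def from_b(row):
    """Map a source B row onto the common schema, converting pounds to kilograms."""
    return {
        "cust_id": row["cust-#"],
        "name": row["name"],
        "weight_kg": round(row["weight_lb"] / LB_PER_KG, 1),
    }


store = {}
for row in [from_a(r) for r in source_a] + [from_b(r) for r in source_b]:
    # Assumed conflict policy: keep the first record seen for each customer.
    store.setdefault(row["cust_id"], row)

print(store)
```

Matching on the shared key sidesteps the "Bill" versus "William" problem here; without a shared key, entity identification needs name matching or other evidence.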



Data Integration: Handling Redundancy in Data Integration

• Redundant data often occur when integrating multiple
databases
– Object identification: The same attribute or object may have
different names in different databases
– Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue
• Redundant attributes can often be detected by
correlation analysis
• Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
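
For numeric attributes, correlation analysis is often done with the Pearson correlation coefficient, which is the measure assumed here:

$$r_{A,B} = \frac{\sum_{i}(a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i}(a_i - \bar{a})^2}\,\sqrt{\sum_{i}(b_i - \bar{b})^2}}$$

A minimal sketch follows. The attribute values and the 0.9 redundancy threshold are assumptions for illustration; annual revenue is derived from monthly revenue to mimic the "derivable data" case.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


monthly_revenue = [10, 12, 9, 15, 11]
annual_revenue = [12 * m for m in monthly_revenue]  # derived attribute
store_count = [3, 1, 4, 1, 5]                        # unrelated attribute (invented)

THRESHOLD = 0.9  # assumed cut-off for calling an attribute redundant

r_derived = pearson(monthly_revenue, annual_revenue)
r_other = pearson(monthly_revenue, store_count)

print(abs(r_derived) > THRESHOLD)  # True: annual revenue is redundant
print(abs(r_other) > THRESHOLD)    # False
```

A high correlation only suggests redundancy; whether one attribute can be dropped is still a modeling decision.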



Data Transformation

• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified
range
– min-max normalization
– z-score normalization
– normalization by decimal scaling

• Attribute/feature construction
– New attributes constructed from the given ones
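
The three normalization methods are usually defined as follows, where $v$ is an original value of attribute $A$ and $v'$ the normalized value:

$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \qquad \text{(min-max)}$$

$$v' = \frac{v - \bar{A}}{\sigma_A} \qquad \text{(z-score)}$$

$$v' = \frac{v}{10^{j}}, \quad j \text{ the smallest integer such that } \max(|v'|) < 1 \qquad \text{(decimal scaling)}$$

A minimal sketch of all three, with invented sample values. Using the population standard deviation for the z-score is an assumption, and the functions assume the values are not all equal.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Center on the mean and scale by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by the smallest power of ten that brings every |value| below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


values = [200, 300, 400, 600, 1000]  # invented sample data

scaled = min_max(values)
standardized = z_score(values)
decimal = decimal_scaling(values)

print(scaled)   # [0.0, 0.125, 0.25, 0.5, 1.0]
print(decimal)
```

Min-max depends entirely on the observed extremes, so a single outlier compresses all other values; the z-score is less sensitive to that.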



Data Reduction Strategies

• Why data reduction?
– A database/data warehouse may store terabytes of data
– Complex data analysis/mining may take a very long time to run on the
complete data set

• Data reduction
– Obtain a reduced representation of the data set that is much smaller in
volume yet produces the same (or almost the same) analytical
results

• Data reduction strategies
– Aggregation
– Sampling
– Dimensionality Reduction
– Feature subset selection
– Feature creation
– Discretization and Binarization
– Attribute Transformation



Data Reduction: Aggregation

• Combining two or more attributes (or objects) into a single
attribute (or object)

• Purpose
– Data reduction
   • Reduce the number of attributes or objects
– Change of scale
   • Cities aggregated into regions, states, countries, etc.
– More “stable” data
   • Aggregated data tends to have less variability
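
The change of scale from cities to regions can be sketched as below. The sales figures and the city-to-region mapping are invented, and the numbers were deliberately chosen so that city-level fluctuations cancel exactly; real data would usually show reduced rather than zero variability.

```python
from collections import defaultdict
from statistics import pstdev

# Hypothetical monthly sales per city and an assumed city-to-region mapping.
sales = {
    "Pune": [10, 30, 20],
    "Mumbai": [40, 20, 30],
    "Delhi": [25, 15, 35],
    "Agra": [15, 25, 5],
}
region_of = {"Pune": "West", "Mumbai": "West", "Delhi": "North", "Agra": "North"}

# Aggregate objects: four city rows become two region rows.
region_sales = defaultdict(lambda: [0, 0, 0])
for city, months in sales.items():
    totals = region_sales[region_of[city]]
    for i, amount in enumerate(months):
        totals[i] += amount

# Month-to-month spread before and after aggregation.
city_spread = {city: pstdev(months) for city, months in sales.items()}
region_spread = {region: pstdev(months) for region, months in region_sales.items()}

print(dict(region_sales))
print(region_spread)
```

The aggregated table has half as many objects, and each region's monthly series is flatter than any of its cities' series.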



Data Reduction: Sampling

• Sampling is the main technique employed for data selection.
– It is often used for both the preliminary investigation of the data
and the final data analysis.

• Statisticians sample because obtaining the entire set of data
of interest is too expensive or time consuming.

• Sampling is used in data mining because processing the
entire set of data of interest is too expensive or time
consuming.
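
Simple random sampling can be sketched with the standard library. The distinction between sampling without and with replacement is a common one but is not made on the slide; the population, the sample size, and the fixed seed are assumptions for illustration.

```python
import random

rng = random.Random(42)           # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical object ids 1..100

# Without replacement: each object can be selected at most once.
without_replacement = rng.sample(population, 10)

# With replacement: the same object may be selected more than once.
with_replacement = rng.choices(population, k=10)

print(without_replacement)
print(with_replacement)
```

Either sample is one tenth the size of the population, so any later analysis touches far fewer objects than a full pass.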



Thank You
