Unit 2 Data Preprocessing

Data Preprocessing

PES Modern college of Engineering, MCA Department Prepared by : Dr.Prakash Kene


Why Data Preprocessing?

• Data in the real world is dirty:
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    ▪ e.g., occupation=“ ”
  – noisy: containing errors or outliers
    ▪ e.g., Salary=“-10”
  – inconsistent: containing discrepancies in codes or names
    ▪ e.g., Age=“42” Birthday=“03/07/1997”
    ▪ e.g., Was rating “1,2,3”, now rating “A, B, C”
    ▪ e.g., discrepancy between duplicate records



What is Data?

• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describes an object
  – Object is also known as record, point, case, sample, entity, or instance

Example table (each column is an attribute; each row is an object):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes



Data Quality

• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
  – Noise and outliers
  – Missing values
  – Duplicate data



Noise

• Noise refers to modification of original values
  – Examples: distortion of a person’s voice when talking on a poor phone and “snow” on a television screen

[Figure: two panels, “Two Sine Waves” and “Two Sine Waves + Noise”]



Outliers

• Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set
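
The slide does not name a detection method. As one possible illustration, the sketch below flags outliers with the interquartile-range (IQR) rule, which is a common convention and an assumption here, not the deck's prescribed technique. The function name `iqr_outliers` and the multiplier `k=1.5` are hypothetical choices; the values are the Taxable Income figures (in thousands) from the deck's example table.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    # statistics.quantiles sorts the data internally; n=4 gives Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Taxable Income column of the example table, in thousands.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
flagged = iqr_outliers(incomes)
print(flagged)
```

Whether a flagged value is an error or a genuine but unusual object still has to be judged in context.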



Missing Values

• Reasons for missing values
  – Information is not collected (e.g., people decline to give their age and weight)
  – Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
• Handling missing values
  – Eliminate data objects
  – Estimate missing values
  – Ignore the missing value during analysis
  – Replace with all possible values (weighted by their probabilities)
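
Two of the handling options above, eliminating data objects and ignoring missing values during analysis, can be sketched as follows. This is a minimal illustration under stated assumptions: `None` marks a missing value, the records are made up, and the helper names are hypothetical.

```python
# None marks a missing value; the records are invented for illustration.
people = [
    {"age": 34, "weight": 70.0},
    {"age": None, "weight": 82.5},
    {"age": 51, "weight": None},
    {"age": 27, "weight": 64.0},
]

def drop_incomplete(records):
    """Eliminate data objects that have any missing attribute."""
    return [r for r in records if all(v is not None for v in r.values())]

def mean_ignoring_missing(records, attr):
    """Ignore missing values during analysis: average only the known ones."""
    known = [r[attr] for r in records if r[attr] is not None]
    return sum(known) / len(known)

complete = drop_incomplete(people)               # keeps 2 of the 4 objects
mean_age = mean_ignoring_missing(people, "age")  # uses 3 of the 4 ages
print(len(complete), mean_age)
```

Dropping objects discards the known weight of the second person and the known age of the third, whereas ignoring missing values per attribute keeps them.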



Duplicate Data

• Data set may include data objects that are duplicates, or almost duplicates, of one another
  – Major issue when merging data from heterogeneous sources
• Examples:
  – Same person with multiple email addresses
• Data cleaning
  – Process of dealing with duplicate data issues
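
As a rough sketch of the "same person with multiple email addresses" case, the code below groups records under a normalized name. Matching on a lower-cased, whitespace-collapsed name is an assumption made for illustration (real entity matching is usually harder), and the contact data and the name `merge_duplicates` are hypothetical.

```python
def merge_duplicates(records):
    """Group records by a normalized name and collect their email addresses."""
    merged = {}
    for rec in records:
        # Normalize case and collapse repeated whitespace before comparing.
        key = " ".join(rec["name"].lower().split())
        merged.setdefault(key, set()).add(rec["email"].lower())
    return merged

contacts = [
    {"name": "Asha Rao", "email": "asha@example.com"},
    {"name": "asha  rao", "email": "a.rao@example.org"},
    {"name": "Vikram Shah", "email": "vikram@example.com"},
]
persons = merge_duplicates(contacts)
print(persons)
```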



Why Preprocess the Data?

• Reasons for data cleaning
  – Incomplete data (missing data)
  – Noisy data (contains errors)
  – Inconsistent data (containing discrepancies)
• Reasons for data integration
  – Data comes from multiple sources
• Reasons for data transformation
  – Some data must be transformed to be used for mining
• Reasons for data reduction
  – Performance
• No quality data → no quality mining results!



Major Tasks in Data Preprocessing

• 1. Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• 2. Data integration
  – Integration of multiple databases, data cubes, or files
• 3. Data transformation
  – Normalization and aggregation
• 4. Data reduction (sampling, dimensionality reduction, feature subset selection)
  – Obtains a representation that is reduced in volume but produces the same or similar analytical results
• 5. Data discretization
  – Classification algorithms sometimes require the data to be in the form of categorical attributes
  – Algorithms that find association patterns require the data to be in the form of binary attributes
  – Thus a continuous attribute often has to be transformed into a categorical attribute (discretization)
  – Part of data reduction, but of particular importance, especially for numerical data
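
To make discretization and binarization concrete, here is a minimal sketch. It assumes equal-width intervals, which is only one of several ways to choose the cut points, and a one-binary-attribute-per-category encoding. The ages, the labels, and both function names are made up for illustration.

```python
def equal_width_labels(values, labels):
    """Map each continuous value to one of len(labels) equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    result = []
    for v in values:
        idx = int((v - lo) / width) if width else 0
        # The maximum value would land one past the last interval, so clamp it.
        result.append(labels[min(idx, len(labels) - 1)])
    return result

def binarize(categories, labels):
    """Turn a categorical attribute into one binary attribute per category."""
    return [[int(c == lab) for lab in labels] for c in categories]

ages = [5, 17, 23, 38, 41, 59, 65]
labels = ["young", "middle", "senior"]
cats = equal_width_labels(ages, labels)   # continuous -> categorical
bits = binarize(cats, labels)             # categorical -> binary
print(cats)
print(bits)
```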



Forms of Data Preprocessing



1. Data Cleaning

• Data cleaning tasks
  – Fill in missing values
  – Identify outliers and smooth out noisy data
  – Correct inconsistent data
  – Resolve redundancy caused by data integration



1. Data Cleaning: How to Handle Missing Data?

• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective unless the tuple contains several attributes with missing values
• Fill in the missing value manually: time-consuming and not feasible for large datasets
• Fill it in automatically with
  – a global constant: e.g., “unknown”, a new class?!
  – the attribute mean
  – the most probable value: inference-based, such as a Bayesian formula, regression, or decision tree induction
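
The first two automatic options, a global constant and the attribute mean, might look like the sketch below. It assumes `None` marks a missing value; the data and function names are hypothetical. The inference-based option is not shown because it needs a fitted model.

```python
def fill_with_constant(values, constant="unknown"):
    """Replace every missing value with one global constant."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Replace every missing value with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

occupations = ["clerk", None, "engineer"]
salaries = [30.0, None, 50.0, 40.0]

filled_occupations = fill_with_constant(occupations)
filled_salaries = fill_with_mean(salaries)
print(filled_occupations)
print(filled_salaries)
```

Note that a constant such as "unknown" can be mistaken by a mining algorithm for a meaningful category, which is the concern behind the slide's "a new class?!" remark.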



1. Data Cleaning: How to Handle Noisy Data?

• Noise: a random error or variance in a measured variable
• Incorrect attribute values may be due to
  – faulty data collection
  – data entry problems
  – data transmission problems
  – data conversion errors
  – data decay problems
  – technology limitations, e.g., buffer overflow or field size limits



1. Data Cleaning: How to Handle Noisy Data?

Methods
• Binning
  – first sort data and partition into (equal-frequency) bins
  – then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
• Regression
  – smooth by fitting the data into regression functions
• Clustering
  – detect and remove outliers
• Combined computer and human inspection
  – detect suspicious values and check by human (e.g., deal with possible outliers)
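
The binning method above can be sketched as follows. The price list is illustrative data that is not part of this deck, the function names are hypothetical, and the sketch assumes the number of values divides evenly by the number of bins.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the data and split it into n_bins bins holding equally many values."""
    data = sorted(values)
    size = len(data) // n_bins  # assumes len(values) is a multiple of n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer bin boundary (ties go to the lower one)."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(bins)
print(by_means)
print(by_boundaries)
```

Smoothing by bin median works the same way with `statistics.median(b)` in place of the mean.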



1. Data Cleaning: Regression

• Data can be smoothed by fitting the data to a function, such as with regression.
• Linear regression involves finding the ‘best’ line to fit 2 variables.
• Also, it is possible to predict one variable using the other variable.

[Figure: points plotted on x and y axes with the fitted line y = x + 1; labels X1, Y1, and Y1’]
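
A minimal sketch of smoothing with linear regression is given below. It assumes that 'best' means ordinary least squares, which the slide does not state explicitly. The observations are made up to follow y = x + 1 only roughly, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope a and intercept b for y = a * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    return a, mean_y - a * mean_x

xs = [1, 2, 3, 4, 5]
ys = [2.2, 2.9, 4.1, 4.8, 6.0]  # invented noisy observations

a, b = fit_line(xs, ys)
smoothed = [a * x + b for x in xs]  # each y replaced by its fitted value
predicted = a * 6 + b               # predicting y for an unseen x
print(a, b, smoothed, predicted)
```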



1.Data Cleaning : Cluster Analysis



2. Data Integration

• Data integration:
  – Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
  – Integrate metadata from different sources
• Entity identification problem:
  – Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
  – For the same real world entity, attribute values from different sources are different
  – Possible reasons: different representations, different scales
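
The sketch below illustrates schema integration and a scale conflict together. Only the key names `cust-id` and `cust-#` come from the slide. The weight attribute, its units, the sample rows, and the mapping functions are assumptions added for illustration.

```python
LB_PER_KG = 2.20462  # conversion factor used to reconcile the two scales

def from_source_a(row):
    """Source A already uses the target schema's units; only rename the key."""
    return {"cust_id": row["cust-id"], "weight_kg": row["weight_kg"]}

def from_source_b(row):
    """Source B calls the key 'cust-#' and (by assumption) stores pounds."""
    return {
        "cust_id": row["cust-#"],
        "weight_kg": round(row["weight_lb"] / LB_PER_KG, 1),
    }

source_a = [{"cust-id": 1, "weight_kg": 70.0}]
source_b = [{"cust-#": 2, "weight_lb": 154.3}]

# One coherent store with a single schema and a single unit of measure.
store = [from_source_a(r) for r in source_a] + [from_source_b(r) for r in source_b]
print(store)
```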



Data Integration: Handling Redundancy in Data Integration

• Redundant data often occur when integrating multiple databases
  – Object identification: The same attribute or object may have different names in different databases
  – Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant attributes can often be detected by correlation analysis
• Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
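
As a sketch of correlation analysis for numeric attributes, the code below computes the Pearson correlation coefficient, one common choice that the slide does not name explicitly. The revenue figures are invented, and the annual attribute is deliberately derived from the monthly one.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)

monthly_revenue = [10.0, 12.5, 9.0, 20.0]
annual_revenue = [12 * m for m in monthly_revenue]  # a derivable attribute

r = pearson(monthly_revenue, annual_revenue)
print(r)  # a value very close to 1 suggests one attribute is redundant
```

A coefficient near +1 or -1 only suggests redundancy; whether to drop an attribute remains a judgment call.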



Data Transformation

• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
  – min-max normalization
  – z-score normalization
  – normalization by decimal scaling
• Attribute/feature construction
  – New attributes constructed from the given ones
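
The three normalization methods can be sketched with their usual textbook definitions, which are assumed here because the slide lists only the names. The income values and function names are made up, and the z-score uses the population standard deviation.

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """v' = (v - min) / (max - min) * (new_max - new_min) + new_min."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """v' = (v - mean) / standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """v' = v / 10**j, with j the smallest integer such that max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

incomes = [12000, 73600, 98000]
mm = min_max(incomes)
zs = z_score(incomes)
ds = decimal_scaling(incomes)
print(mm, zs, ds)
```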



Data Reduction Strategies

• Why data reduction?
  – A database/data warehouse may store terabytes of data
  – Complex data analysis/mining may take a very long time to run on the complete data set
• Data reduction
  – Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Data reduction strategies
  – Aggregation
  – Sampling
  – Dimensionality reduction
  – Feature subset selection
  – Feature creation
  – Discretization and binarization
  – Attribute transformation



Data Reduction: Aggregation

• Combining two or more attributes (or objects) into a single attribute (or object)
• Purpose
  – Data reduction
    ▪ Reduce the number of attributes or objects
  – Change of scale
    ▪ Cities aggregated into regions, states, countries, etc.
  – More “stable” data
    ▪ Aggregated data tends to have less variability
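
A change of scale from cities to states might be sketched as below. The sales records and the `aggregate` helper are invented for illustration, and summation is just one possible aggregate function.

```python
from collections import defaultdict

def aggregate(rows, key, value):
    """Sum `value` over all rows that share the same `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)

sales = [
    {"city": "Pune", "state": "Maharashtra", "amount": 120.0},
    {"city": "Mumbai", "state": "Maharashtra", "amount": 300.0},
    {"city": "Nagpur", "state": "Maharashtra", "amount": 80.0},
    {"city": "Bengaluru", "state": "Karnataka", "amount": 250.0},
    {"city": "Mysuru", "state": "Karnataka", "amount": 50.0},
]

by_state = aggregate(sales, "state", "amount")  # 5 objects reduced to 2
print(by_state)
```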



Data Reduction: Sampling

• Sampling is the main technique employed for data selection.
  – It is often used for both the preliminary investigation of the data and the final data analysis.
• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
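
The slide does not say which sampling scheme to use. The sketch below assumes simple random sampling without replacement, the most basic scheme, with a fixed seed so the result is repeatable. The population and the function name are hypothetical.

```python
import random

def simple_random_sample(records, n, seed=None):
    """Draw n distinct records, each subset of size n being equally likely."""
    rng = random.Random(seed)      # a private generator keeps runs repeatable
    return rng.sample(records, n)  # sampling without replacement

population = list(range(1, 1001))  # stand-in for 1000 data objects
sample = simple_random_sample(population, 50, seed=42)
print(len(sample), sample[:5])
```

The sample is useful only if it is representative, meaning it has approximately the same property of interest as the full data set.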



Thank You
