Lecture 6 Data Preprocessing


Data Preprocessing

Lecture 6

Conducted by
Ms. Akila Brahmana
Department of ICT
Faculty of Technology
University of Ruhuna
Outline
❑ Why preprocess the data?
❑ Data cleaning
❑ Data integration and transformation
❑ Data reduction
❑ Discretization and concept hierarchy generation
❑ Summary
Why Data Preprocessing?
❑ Data in the real world is dirty
1. Incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
❑ e.g., occupation=“ ”
2. Noisy: containing errors or outliers
❑ e.g., Salary=“-10”
3. Inconsistent: containing discrepancies in codes or names
❑ e.g., Age=“42” Birthday=“03/07/1997”
❑ e.g., Was rating “1,2,3”, now rating “A, B, C”
❑ e.g., discrepancy between duplicate records
Why Data Preprocessing?
❑ No quality data, no quality mining results!
❑ Quality decisions must be based on quality data
❑ Data warehouse needs consistent integration of quality data
Why Is Data Dirty?
❑ Incomplete data may come from
❑ “Not applicable” data value when collected
❑ Different considerations between the time when the data was collected and when
it is analyzed.
❑ Human/hardware/software problems
❑ Noisy data (incorrect values) may come from
❑ Faulty data collection instruments
❑ Human or computer error at data entry
❑ Errors in data transmission
❑ Inconsistent data may come from
❑ Different data sources
❑ Functional dependency violation (e.g., modify some linked data)
❑ Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
❑ No quality data, no quality mining results!
❑ Quality decisions must be based on quality data
❑ E.g. duplicate or missing data may cause incorrect or even
misleading statistics
❑ Data warehouse needs consistent integration of quality data
❑ Data extraction, cleaning, and transformation comprise the majority of the
work of building a data warehouse
Knowledge Discovery Process, in practice
Multi-Dimensional Measure of Data Quality
❑ A well-accepted multidimensional view:
❑ Accuracy
❑ Completeness
❑ Consistency
❑ Timeliness
❑ Believability
❑ Value added
❑ Interpretability
❑ Accessibility
Major Tasks in Data Preprocessing
❑ Data cleaning
❑ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
❑ Data integration
❑ Integration of multiple databases, data cubes, or files
❑ Data transformation
❑ Normalization and aggregation
Major Tasks in Data Preprocessing
❑ Data reduction
❑ Obtains reduced representation in volume but produces the
same or similar analytical results
❑ Data discretization
❑ Part of data reduction but with particular importance,
especially for numerical data
Forms of data preprocessing
Data Preprocessing
❑ Why preprocess the data?
❑ Data cleaning
❑ Data integration and transformation
❑ Data reduction
❑ Discretization and concept hierarchy generation
❑ Summary
Data Cleaning Steps
❑ Data acquisition and metadata
❑ Missing values
❑ Unified date format
❑ Converting nominal to numeric
❑ Discretization of numeric data
Data Cleaning: Acquisition
❑ Data can be in DBMS
❑ ODBC, JDBC protocols
❑ Data in a flat file
❑ Fixed-column format
❑ Delimited format: tab, comma “,” , other
❑ E.g. Weka “arff” files use comma-delimited data
❑ Verify the number of fields before and after
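As a rough illustration of the last point, a minimal Python sketch (the file name is a made-up example) that verifies every row of a comma-delimited file has the same number of fields as the header:

```python
import csv

# Hypothetical comma-delimited input file
with open("customers.csv", newline="") as f:
    reader = csv.reader(f)
    expected = len(next(reader))          # number of fields in the header row
    for line_no, row in enumerate(reader, start=2):
        if len(row) != expected:
            print(f"line {line_no}: expected {expected} fields, got {len(row)}")
```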
Types of Attributes
❑ There are different types of attributes
❑ Nominal
❑ Examples: ID numbers, eye color, zip codes
❑ Ordinal
❑ Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
❑ Interval
❑ Examples: calendar dates, temperatures in Celsius or Fahrenheit
❑ Ratio
❑ Examples: temperature, length, time, counts
Data Cleaning
❑ Data cleaning tasks

❑ Fill in missing values

❑ Identify outliers and smooth out noisy data

❑ Correct inconsistent data


Missing Data
❑ Data is not always available
❑ E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
❑ Missing data may be due to
❑ equipment malfunction
❑ inconsistent with other recorded data and thus deleted
❑ data not entered due to misunderstanding
❑ certain data may not be considered important at the time of entry
❑ history or changes of the data were not registered
❑ Missing data may need to be inferred.
How to Handle Missing Data?
❑ Ignore the tuple: usually done when class label is missing (assuming the tasks in
classification—not effective when the percentage of missing values per attribute
varies considerably)
❑ Fill in the missing value manually: tedious + infeasible?
❑ Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
❑ Use the attribute mean to fill in the missing value
❑ Use the attribute mean for all samples belonging to the same class to fill in the
missing value: smarter
❑ Use the most probable value to fill in the missing value: inference-based such as
Bayesian formula or decision tree
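A minimal pandas sketch of two of these strategies (the column names and values are hypothetical): filling with the overall attribute mean, and the smarter class-conditional mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [30000, np.nan, 52000, np.nan, 61000],
    "class":  ["low",  "low", "high", "high", "high"],
})

# Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Fill with the attribute mean of samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(df)
```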
Data Cleaning: Unified Date Format
❑ We want to transform all dates to the same format internally
❑ Some systems accept dates in many formats
❑ e.g. “Sep 24, 2003” , 9/24/03, 24.09.03, etc
❑ dates are transformed internally to a standard value
❑ Frequently, just the year (YYYY) is sufficient
❑ For more details, we may need the month, the day, the hour, etc
❑ Representing date as YYYYMM or YYYYMMDD
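A small Python sketch of such a transformation, assuming the three input formats shown above are the ones we expect to see:

```python
from datetime import datetime

FORMATS = ["%b %d, %Y", "%m/%d/%y", "%d.%m.%y"]   # "Sep 24, 2003", 9/24/03, 24.09.03

def to_yyyymmdd(text):
    """Try each known input format and return the date as YYYYMMDD."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")

for s in ["Sep 24, 2003", "9/24/03", "24.09.03"]:
    print(s, "->", to_yyyymmdd(s))      # all map to 20030924
```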
Noisy Data
❑ Noise: random error or variance in a measured variable
❑ Incorrect attribute values may be due to
❑ faulty data collection instruments
❑ data entry problems
❑ data transmission problems
❑ technology limitation
❑ inconsistency in naming convention
❑ Other data problems which require data cleaning
❑ duplicate records
❑ incomplete data
❑ inconsistent data
Data Cleaning: Example
❑ Original data (fixed column format)
000000000130.06.19971979--10-3080145722 #000310 111000301.01.000100000000004
0000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.0000
00000000000. 000000000000000.000000000000000.0000000...…
000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.00
0000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.0000
00000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000
000000000.000000000000000.000000000000000.000000000000000.00 0000000000300.00 0000000000300.00

❑ Clean data
0000000001,199706,1979.833,8014,5722 , ,#000310 ….
,111,03,000101,0,04,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0300,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0300,0300.00
How to Handle Noisy Data?
❑ Binning method:
❑ first sort data and partition into (equi-depth) bins
❑ then smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
❑ Clustering
❑ detect and remove outliers
❑ Combined computer and human inspection
❑ detect suspicious values and check by human
❑ Regression
❑ smooth by fitting the data into regression functions
Simple Discretization Methods: Binning
❑ Equal-width (distance) partitioning:
❑ It divides the range into N intervals of equal size: uniform grid
❑ if A and B are the lowest and highest values of the attribute, the width of intervals
will be: W = (B-A)/N.
❑ The most straightforward
❑ But outliers may dominate presentation
❑ Skewed data is not handled well.

❑ Equal-depth (frequency) partitioning:
❑ It divides the range into N intervals, each containing approximately the same
number of samples
❑ Good data scaling
❑ Managing categorical attributes can be tricky.
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
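A short Python sketch that reproduces this example, partitioning into equi-depth bins of size 4 and then smoothing by bin means and by bin boundaries:

```python
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value becomes its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value moves to the closer of min/max
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```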
Clustering
❑ Partition data set into clusters, and one can store cluster representation only
❑ Can be very effective if data is clustered but not if data is “smeared”
❑ Can have hierarchical clustering and be stored in multi-dimensional index tree
structures
❑ There are many choices of clustering definitions and clustering algorithms
Cluster Analysis
Regression Analysis
Regression analysis makes it possible to predict one variable on the basis of
one or more other variables.
Regress Analysis and Log-Linear Models
Linear regression: Data are modeled to fit a straight line
Y = α + βX
Two parameters, α and β, specify the line and are to be estimated from the data
at hand, using the least squares criterion on the known values of
(X1, Y1), (X2, Y2), ….
Ex: Does the weekly working time have an influence on the hourly wage of
employees?
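A minimal NumPy sketch of estimating the two parameters by least squares; the working-time and wage numbers are invented purely for illustration:

```python
import numpy as np

hours = np.array([20, 25, 30, 35, 40, 45])              # weekly working time (hypothetical)
wage = np.array([12.0, 13.5, 14.0, 15.5, 16.0, 17.5])   # hourly wage (hypothetical)

# Least-squares fit of wage = alpha + beta * hours
beta, alpha = np.polyfit(hours, wage, deg=1)             # returns [slope, intercept]
print(f"wage ≈ {alpha:.2f} + {beta:.2f} * hours")
```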
Regress Analysis and Log-Linear Models
Multiple regression: allows a response variable Y to be modeled as a linear
function of a multidimensional feature vector
Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above.
Ex: Do the weekly working time and the age of the employee have an
influence on the hourly wage of employees?

❑ Log-linear models: approximate discrete multidimensional probability
distributions.
The multi-way table of joint probabilities is approximated by a product of
lower-order tables.
Probability: p(a, b, c, d) ≈ αab βac γad δbcd, where each factor is a lower-order
table over the indicated subset of attributes
Conversion: Nominal to Numeric
❑ Some tools can deal with nominal values internally
❑ Other methods (neural nets, regression, nearest neighbor) require
only numeric inputs
❑ To use nominal fields in such methods, we need to convert them to
numeric values
❑ Q: Why not ignore nominal fields altogether?
❑ A: They may contain valuable information
❑ Different strategies for binary, ordered, multivalued nominal fields
Conversion
❑ How would you convert binary fields to numeric?
❑ E.g. Gender=M, F
❑ How would you convert ordered attributes to numeric?
❑ E.g. Grades
Conversion: Nominal, Few Values
❑ Multi-valued, unordered attributes with small (rule of thumb <
20) no. of values
❑ e.g. Color=Red, Orange, Yellow, …, Violet
❑ for each value v create a binary “flag” variable C_v , which is 1 if
Color=v, 0 otherwise
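A pandas sketch of creating such flag variables (one-hot encoding); the small dataframe is made up:

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Yellow", "Red", "Violet"]})

# One binary "flag" column C_v per value v: 1 if Color == v, 0 otherwise
flags = pd.get_dummies(df["Color"], prefix="C").astype(int)
print(df.join(flags))
```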
Conversion: Nominal, Many Values
❑ Examples:
❑ US State Code (50 values)
❑ Profession Code (7,000 values, but only few frequent)
❑ Q: How to deal with such fields ?
❑ A: Ignore ID-like fields whose values are unique for each record
❑ For other fields, group values “naturally”:
❑ e.g. 50 US States → 3 or 5 regions
❑ Profession – select most frequent ones, group the rest
❑ Create binary flag-fields for selected values
Data Integration
❑ Data integration:
❑ combines data from multiple sources into a coherent store
❑ Schema integration:
❑ integrate metadata from different sources
❑ Entity identification problem:
❑ identify real world entities from multiple data sources, e.g., A.cust-id ≡
B.cust-#
❑ Detecting and resolving data value conflicts
❑ for the same real world entity, attribute values from different sources are
different
❑ possible reasons: different representations, different scales, e.g., metric vs.
British units
Handling Redundancy in Data Integration
❑ Redundant data often occur when integrating multiple
databases
❑ The same attribute may have different names in different
databases
❑ Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
Data Transformation
❑ Smoothing: remove noise from data
❑ Aggregation: summarization, data cube construction
❑ Generalization: concept hierarchy climbing
❑ Normalization: scaled to fall within a small, specified range
❑ min-max normalization
❑ z-score normalization
❑ normalization by decimal scaling
❑ Attribute/feature construction
❑ New attributes constructed from the given ones
Data Transformation: Normalization
❑ min-max normalization
v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
❑ z-score normalization
v' = (v − meanA) / stand_devA
❑ normalization by decimal scaling
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
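The three formulas as a minimal Python sketch:

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # v' = (v - minA) / (maxA - minA) * (new_max - new_min) + new_min
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    # v' = (v - meanA) / stand_devA
    return (v - mean_a) / std_a

def decimal_scaling(values):
    # v' = v / 10^j, with j the smallest integer such that max(|v'|) < 1
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(min_max(73600, 12000, 98000))   # ≈ 0.716
print(z_score(73600, 54000, 16000))   # 1.225
print(decimal_scaling([-986, 917]))   # [-0.986, 0.917]
```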
Min-max normalization
Suppose that the minimum and maximum values for the attribute
income are $12,000 and $98,000, respectively. We would like to map
income to the range [0.0, 1.0]. By min-max normalization, a value of
$73,600 for income is transformed to?
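Worked answer, using the min-max formula above: v' = (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 ≈ 0.716.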
z-score normalization
z-score normalization Suppose that the mean and standard
deviation of the values for the attribute income are $54,000 and
$16,000, respectively. With z-score normalization, a value of $73,600
for income is transformed to?
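Worked answer, using the z-score formula above: v' = (73,600 − 54,000) / 16,000 = 1.225.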
Data Preprocessing
❑ Why preprocess the data?

❑ Data cleaning

❑ Data integration and transformation

❑ Data reduction

❑ Discretization and concept hierarchy generation

❑ Summary
Data Reduction Strategies
❑ Why data reduction?
❑ A database/data warehouse may store terabytes of data
❑ Complex data analysis/mining may take a very long time to run on the
complete data set
❑ Data reduction
❑ Obtain a reduced representation of the data set that is much smaller in
volume but yet produce the same (or almost the same) analytical results
❑ Data reduction strategies
❑ Data cube aggregation:
❑ Dimensionality reduction — e.g., remove unimportant attributes
❑ Data Compression
❑ Numerosity reduction — e.g., fit data into models
❑ Discretization and concept hierarchy generation
Data Cube Aggregation
❑ The lowest level of a data cube (base cuboid)
❑ the aggregated data for an individual entity of interest
❑ e.g., a customer in a phone calling data warehouse.
❑ Multiple levels of aggregation in data cubes
❑ Further reduce the size of data to deal with
❑ Reference appropriate levels
❑ Use the smallest representation which is enough to solve the task
❑ Queries regarding aggregated information should be answered
using data cube, when possible
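A rough pandas sketch of climbing one aggregation level (the sales table is invented): quarterly sales are rolled up to yearly totals, which is the smaller representation to keep if queries only ever ask for yearly figures.

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 310, 402, 380, 610],
})

# Aggregate from the quarterly level up to the yearly level
yearly = sales.groupby("year", as_index=False)["amount"].sum()
print(yearly)
```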
Dimensionality Reduction
❑ Purpose:
❑ Avoid curse of dimensionality
❑ Reduce amount of time and memory required by data mining
algorithms
❑ Allow data to be more easily visualized
❑ May help to eliminate irrelevant features or reduce noise

❑ Techniques
❑ Principal Component Analysis
❑ Singular Value Decomposition
❑ Others: supervised and non-linear techniques
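A minimal scikit-learn sketch of PCA; the data here is random and only meant to show the API shape:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))          # 100 samples, 6 original attributes

pca = PCA(n_components=2)              # keep the 2 strongest principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component
```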
Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}

[Figure: decision tree with A4 at the root, A1 and A6 at the next level, and
leaves labelled Class 1 / Class 2]

=> Reduced attribute set: {A1, A4, A6}
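A hedged scikit-learn sketch of the same idea: induce a decision tree over all attributes and keep only those the tree actually uses. The data, class rule, and attribute names below are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
names = ["A1", "A2", "A3", "A4", "A5", "A6"]
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 3] - X[:, 5] > 0).astype(int)   # class depends on A1, A4, A6

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Attributes with non-zero importance appear in the tree; the rest can be dropped
reduced = [n for n, imp in zip(names, tree.feature_importances_) if imp > 0]
print(reduced)   # likely something close to ['A1', 'A4', 'A6']
```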


Data Compression
❑ String compression
❑ There are extensive theories and well-tuned algorithms
❑ Typically lossless
❑ But only limited manipulation is possible without expansion
❑ Audio/video compression
❑ Typically lossy compression, with progressive refinement
❑ Sometimes small fragments of signal can be reconstructed
without reconstructing the whole
Numerosity Reduction
❑ Parametric methods
❑ Assume the data fits some model, estimate model parameters,
store only the parameters, and discard the data (except possible
outliers)
❑ Log-linear models: obtain value at a point in m-D space as the
product on appropriate marginal subspaces
❑ Non-parametric methods
❑ Do not assume models
❑ Major families: histograms, clustering, sampling
Histograms
❑ A popular data reduction technique
❑ Divide data into buckets and store average (sum) for each bucket
❑ Can be constructed optimally in one dimension using dynamic programming
❑ Related to quantization problems.
[Figure: example histogram with price buckets from 10000 to 90000 on the x-axis]
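A small NumPy sketch of the idea: bucket the values into equal-width bins and keep only a summary per bucket (the prices are invented):

```python
import numpy as np

prices = np.array([12000, 15000, 31000, 33000, 47000, 52000, 55000, 71000, 88000])
edges = np.array([10000, 30000, 50000, 70000, 90000])    # 4 equal-width buckets

counts, _ = np.histogram(prices, bins=edges)              # how many values per bucket
bucket = np.digitize(prices, edges[1:-1])                 # which bucket each value falls in
means = [prices[bucket == i].mean() for i in range(len(edges) - 1)]

print(counts)   # [2 3 2 2]
print(means)    # one stored average per bucket instead of the raw values
```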
Sampling
❑ Sampling can be used as a data reduction technique because it
allows a large data set to be represented by a much smaller
random sample (or subset) of the data.
❑ Choose a representative subset of the data
❑ Sampling techniques fall into two broad categories:
▪ Probability Sampling - random selection techniques are used to
select the sample.
▪ Non-probability Sampling - non-random selection techniques
based on certain criteria are used to select the sample.
Sampling
▪ Probability Sampling
- Simple Random Sampling
- Systematic Sampling
- Stratified Sampling
- Cluster Sampling
▪ Non-probability Sampling
- Convenience Sampling
- Voluntary Response Sampling
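A pandas sketch of two of these (simple random sampling and stratified sampling); the dataframe and the region strata are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": range(1, 1001),
    "region":      ["north", "south", "east", "west"] * 250,
})

# Simple random sampling without replacement: 10% of all rows
srs = df.sample(frac=0.10, random_state=0)

# Stratified sampling: 10% from each region, preserving the strata proportions
stratified = df.groupby("region").sample(frac=0.10, random_state=0)

print(len(srs), len(stratified))   # 100 100
```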
Data Preprocessing
❑ Why preprocess the data?

❑ Data cleaning

❑ Data integration and transformation

❑ Data reduction

❑ Discretization and concept hierarchy generation

❑ Summary
Discretization
❑ Three types of attributes:
❑ Nominal — values from an unordered set e.g., color, profession
❑ Ordinal — values from an ordered set e.g., military or academic rank
❑ Continuous — real numbers e.g., integer or real numbers

❑ Discretization:
❑ divide the range of a continuous attribute into intervals
❑ Some classification algorithms only accept categorical attributes.
❑ Reduce data size by discretization
❑ Prepare for further analysis
Discretization and Concept hierarchy
❑ Discretization
reduce the number of values for a given continuous attribute by dividing the
range of the attribute into intervals. Interval labels can then be used to
replace actual data values.

❑ Concept hierarchies
reduce the data by collecting and replacing low level concepts (such as
numeric values for the attribute age) by higher level concepts (such as
young, middle-aged, or senior).
Discretization for numeric data

❑ Binning (see sections before)

❑ Histogram analysis (see sections before)

❑ Clustering analysis (see sections before)


Data Preprocessing
❑ Why preprocess the data?

❑ Data cleaning

❑ Data integration and transformation

❑ Data reduction

❑ Discretization and concept hierarchy generation

❑ Summary
Summary
❑ Data preparation is a big issue for both warehousing and mining

❑ Data preparation includes

❑ Data cleaning and data integration

❑ Data reduction and feature selection

❑ Discretization

❑ Many methods have been developed, but this is still an active area of research
Summary:
1. Data Cleaning
Real-world data can have many irrelevant and missing parts. Data cleaning is done to handle this; it involves handling
missing data, noisy data, etc.
a) Missing Data:
This situation arises when some values are missing in the data. It can be handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are missing within a
tuple.
2. Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing values manually, by the attribute mean, or by the
most probable value.
b) Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection,
data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and
various smoothing methods are then applied. Each segment is handled separately: one can replace all values in a
segment by its mean, or the bin boundary values can be used.
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one
independent variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they will fall outside the clusters.
Summary:
2. Data Transformation:
This step is taken in order to transform the data into forms suitable for the mining process. It involves the
following:

a) Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)

b) Attribute Construction:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.

c) Discretization:
This is done to replace the raw values of a numeric attribute by interval labels or conceptual labels.

d) Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can
be generalized to “country”.
Summary:
3. Data Reduction:
Data mining is used to handle huge amounts of data, and analysis becomes harder as the volume grows. Data
reduction techniques address this: they aim to increase storage efficiency and reduce data storage and analysis costs.

The various steps to data reduction are:

a) Data Cube Aggregation:
Aggregation operations are applied to the data to construct the data cube.

b) Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one
can use the significance level and the p-value of the attribute: attributes whose p-value is greater than the significance
level can be discarded.

c) Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example: regression models.

d) Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless: if the original data can be
retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy.
Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
Thank You!
