Lecture 6
Data Preprocessing
Conducted by
Ms. Akila Brahmana
Department of ICT
Faculty of Technology
University of Ruhuna
Outline
❑ Why preprocess the data?
❑ Data cleaning
❑ Data integration and transformation
❑ Data reduction
❑ Discretization and concept hierarchy generation
❑ Summary
Why Data Preprocessing?
❑ Data in the real world is dirty
1. Incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
❑ e.g., occupation=“ ”
2. Noisy: containing errors or outliers
❑ e.g., Salary=“-10”
3. Inconsistent: containing discrepancies in codes or names
❑ e.g., Age=“42” Birthday=“03/07/1997”
❑ e.g., Was rating “1,2,3”, now rating “A, B, C”
❑ e.g., discrepancy between duplicate records
Why Data Preprocessing?
❑ No quality data, no quality mining results!
❑ Quality decisions must be based on quality data
❑ Data warehouse needs consistent integration of quality data
Why Is Data Dirty?
❑ Incomplete data may come from
❑ “Not applicable” data value when collected
❑ Different considerations between the time when the data was collected and when
it is analyzed.
❑ Human/hardware/software problems
❑ Noisy data (incorrect values) may come from
❑ Faulty data collection instruments
❑ Human or computer error at data entry
❑ Errors in data transmission
❑ Inconsistent data may come from
❑ Different data sources
❑ Functional dependency violation (e.g., modify some linked data)
❑ Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
❑ No quality data, no quality mining results!
❑ Quality decisions must be based on quality data
❑ E.g. duplicate or missing data may cause incorrect or even
misleading statistics
❑ Data warehouse needs consistent integration of quality data
❑ Data extraction, cleaning, and transformation comprise the
majority of the work of building a data warehouse
Knowledge Discovery Process, in practice
Multi-Dimensional Measure of Data Quality
❑ A well-accepted multidimensional view:
❑ Accuracy
❑ Completeness
❑ Consistency
❑ Timeliness
❑ Believability
❑ Value added
❑ Interpretability
❑ Accessibility
Major Tasks in Data Preprocessing
❑ Data cleaning
❑ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
❑ Data integration
❑ Integration of multiple databases, data cubes, or files
❑ Data transformation
❑ Normalization and aggregation
Major Tasks in Data Preprocessing
❑ Data reduction
❑ Obtains a reduced representation of the data that is much smaller in
volume yet produces the same or similar analytical results
❑ Data discretization
❑ Part of data reduction but with particular importance,
especially for numerical data
Forms of data preprocessing
Data Preprocessing
❑ Why preprocess the data?
❑ Data cleaning
❑ Data integration and transformation
❑ Data reduction
❑ Discretization and concept hierarchy generation
❑ Summary
Data Cleaning Steps
❑ Data acquisition and metadata
❑ Missing values
❑ Unified date format
❑ Converting nominal to numeric
❑ Discretization of numeric data
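To make these steps concrete, here is a minimal Python/pandas sketch of two of them: unifying a date column and converting a nominal attribute to numeric codes. The column names and values are invented for illustration.

import pandas as pd
from dateutil import parser

# Hypothetical raw records: dates typed in mixed formats, a nominal rating code.
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "join_date":   ["2021-03-07", "03/07/2021", "March 7, 2021"],
    "rating":      ["A", "B", "A"],
})

# Unified date format: parse each string and store a proper datetime.
df["join_date"] = df["join_date"].apply(parser.parse)

# Converting nominal to numeric: map the rating codes to integers.
df["rating_num"] = df["rating"].map({"A": 3, "B": 2, "C": 1})
print(df)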
Data Cleaning: Acquisition
❑ Data can be in DBMS
❑ ODBC, JDBC protocols
❑ Data in a flat file
❑ Fixed-column format
❑ Delimited format: tab, comma “,” , other
❑ E.g., Weka “arff” files use comma-delimited data
❑ Verify the number of fields before and after
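One quick way to verify the number of fields in a delimited file is to profile the field count per row. The sketch below is plain Python with a hypothetical file name; a clean file should show a single count.

import csv
from collections import Counter

def field_count_profile(path, delimiter=","):
    # Count how many fields each row of a delimited file has.
    # Several different counts hint at stray delimiters, unquoted
    # commas, or truncated rows.
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter=delimiter):
            counts[len(row)] += 1
    return counts

# Hypothetical usage: field_count_profile("customers.csv") -> Counter({12: 9998, 11: 2})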
Types of Attributes
❑ There are different types of attributes
❑ Nominal
❑ Examples: ID numbers, eye color, zip codes
❑ Ordinal
❑ Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
❑ Interval
❑ Examples: calendar dates, temperatures in Celsius or Fahrenheit
❑ Ratio
❑ Examples: temperature, length, time, counts
Data Cleaning
❑ Data cleaning tasks
❑ Example of a cleaned, comma-delimited record:
0000000001,199706,1979.833,8014,5722 , ,#000310 ….
,111,03,000101,0,04,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0300,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0300,0300.00
How to Handle Noisy Data?
❑ Binning method:
❑ first sort data and partition into (equi-depth) bins
❑ then smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
❑ Clustering
❑ detect and remove outliers
❑ Combined computer and human inspection
❑ detect suspicious values and check by human
❑ Regression
❑ smooth by fitting the data into regression functions
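As an illustration of the clustering idea, the sketch below uses scikit-learn's DBSCAN (one possible choice, not prescribed by the slides) to flag a value that belongs to no dense cluster; the salary figures and the eps/min_samples settings are invented.

import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 1-D salary values with one obvious outlier (the negative entry).
salaries = np.array([[4200.0], [4500.0], [4700.0], [5100.0], [4800.0], [-10.0]])

# DBSCAN labels points that fall in no dense cluster as -1 (noise);
# eps and min_samples would need tuning on real data.
labels = DBSCAN(eps=500, min_samples=2).fit_predict(salaries)
clean = salaries[labels != -1]      # drop the detected outlier
print(clean.ravel())                # [4200. 4500. 4700. 5100. 4800.]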
Simple Discretization Methods: Binning
❑ Equal-width (distance) partitioning:
❑ It divides the range into N intervals of equal size: uniform grid
❑ if A and B are the lowest and highest values of the attribute, the width of intervals
will be: W = (B-A)/N.
❑ The most straightforward
❑ But outliers may dominate presentation
❑ Skewed data is not handled well.
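Below is a small NumPy sketch of equal-width partitioning with W = (B - A) / N, followed by smoothing by bin means; the price values are illustrative only.

import numpy as np

# Illustrative sorted prices.
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)

# Equal-width partitioning: W = (B - A) / N
N = 3
A, B = prices.min(), prices.max()
W = (B - A) / N
edges = A + W * np.arange(N + 1)                       # [4., 14., 24., 34.]
bin_idx = np.clip(np.digitize(prices, edges[1:-1]), 0, N - 1)

# Smoothing by bin means: replace every value with the mean of its bin.
smoothed = prices.copy()
for b in range(N):
    in_bin = bin_idx == b
    if in_bin.any():
        smoothed[in_bin] = prices[in_bin].mean()
print(smoothed)   # bin means 7.0, 19.0 and ~27.7 repeated per member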
Data Reduction Strategies
❑ Why data reduction?
❑ A database/data warehouse may store terabytes of data
❑ Complex data analysis/mining may take a very long time to run on the
complete data set
❑ Data reduction
❑ Obtain a reduced representation of the data set that is much smaller in
volume but yet produce the same (or almost the same) analytical results
❑ Data reduction strategies
❑ Data cube aggregation:
❑ Dimensionality reduction — e.g., remove unimportant attributes
❑ Data Compression
❑ Numerosity reduction — e.g., fit data into models
❑ Discretization and concept hierarchy generation
Data Cube Aggregation
❑ The lowest level of a data cube (base cuboid)
❑ the aggregated data for an individual entity of interest
❑ e.g., a customer in a phone calling data warehouse.
Dimensionality Reduction
❑ Techniques
❑ Principal Component Analysis
❑ Singular Value Decomposition
❑ Others: supervised and non-linear techniques
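As one concrete example of these techniques, the sketch below applies scikit-learn's PCA to synthetic data and keeps enough components to explain 95% of the variance; the data and the threshold are made up for illustration.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 records with 6 attributes, driven by only 2 latent factors.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(200, 6))

# Keep enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)       # e.g. (200, 6) -> (200, 2)
print(pca.explained_variance_ratio_)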
Example of Decision Tree Induction
❑ Initial attribute set: {A1, A2, A3, A4, A5, A6}
❑ [Figure: decision tree with splits on A4, A1, and A6]
Histograms
❑ A popular data reduction technique
❑ Divide data into buckets and store the average (sum) for each bucket
❑ Related to quantization problems
❑ [Figure: histogram with buckets spanning 10000 to 90000]
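A minimal NumPy sketch of the histogram idea, storing one summary value (the mean) and a count per bucket instead of all raw values; the prices and the 10000-90000 bucket edges are illustrative.

import numpy as np

# Illustrative price values; the bucket edges echo the axis of the original figure.
prices = np.array([12000, 18000, 25000, 31000, 35000, 47000, 52000, 58000, 71000, 88000])
edges = np.arange(10000, 100001, 20000)      # 10000, 30000, 50000, 70000, 90000

# Keep only one number (the mean) and a count per bucket.
bucket = np.digitize(prices, edges[1:-1])
for b in range(len(edges) - 1):
    members = prices[bucket == b]
    if members.size:
        print(f"{edges[b]}-{edges[b+1]}: mean={members.mean():.0f}, count={members.size}")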
Sampling
❑ Sampling can be used as a data reduction technique because it
allows a large data set to be represented by a much smaller
random sample (or subset) of the data.
❑ Choose a representative subset of the data
❑ Sampling methods fall into two broad categories.
▪ Probability Sampling - Random selection techniques are used to select the sample.
▪ Non-probability Sampling - Non-random selection techniques based on certain criteria are used to select the sample.
Sampling
▪ Probability Sampling
- Simple Random Sampling
- Systematic Sampling
- Stratified Sampling
- Cluster Sampling
▪ Non-probability Sampling
- Convenience Sampling
- Voluntary Response Sampling
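The pandas sketch below contrasts simple random sampling with stratified sampling on an invented, imbalanced customer table; the column names and sampling fractions are assumptions for illustration.

import pandas as pd

# Hypothetical customer table with an imbalanced 'segment' attribute.
df = pd.DataFrame({
    "customer_id": range(1, 1001),
    "segment": ["retail"] * 900 + ["corporate"] * 100,
})

# Simple random sampling without replacement: 10% of all rows.
srs = df.sample(frac=0.10, random_state=42)

# Stratified sampling: draw 10% within each segment so small strata stay represented.
stratified = df.groupby("segment").sample(frac=0.10, random_state=42)

print(srs["segment"].value_counts())
print(stratified["segment"].value_counts())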
Data Preprocessing
❑ Why preprocess the data?
❑ Data cleaning
❑ Data reduction
❑ Summary
Discretization
❑ Three types of attributes:
❑ Nominal — values from an unordered set e.g., color, profession
❑ Ordinal — values from an ordered set e.g., military or academic rank
❑ Continuous — numeric values, e.g., integer or real numbers
❑ Discretization:
❑ divide the range of a continuous attribute into intervals
❑ Some classification algorithms only accept categorical attributes.
❑ Reduce data size by discretization
❑ Prepare for further analysis
Discretization and Concept hierarchy
❑ Discretization
reduce the number of values for a given continuous attribute by dividing the
range of the attribute into intervals. Interval labels can then be used to
replace actual data values.
❑ Concept hierarchies
reduce the data by collecting and replacing low level concepts (such as
numeric values for the attribute age) by higher level concepts (such as
young, middle-aged, or senior).
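A small pandas sketch of replacing numeric ages with higher-level concepts; the cut points and labels are illustrative, since a real concept hierarchy would come from domain knowledge.

import pandas as pd

ages = pd.Series([19, 23, 31, 38, 45, 52, 61, 70])

# Map raw ages onto the concepts young / middle-aged / senior.
age_group = pd.cut(ages, bins=[0, 29, 59, 120], labels=["young", "middle-aged", "senior"])
print(pd.DataFrame({"age": ages, "age_group": age_group}))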
Summary
❑ Data preparation is a big issue for both warehousing and mining
❑ Discretization: many methods have been developed, but it is still an active area of research
Summary:
1. Data Cleaning
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling missing data, noisy data, etc.
a) Missing Data:
This situation arises when some values are missing from the data. It can be handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are missing within a
tuple.
2. Fill the missing values:
There are various ways to do this task. You can choose to fill the missing values manually, with the attribute mean, or with the most probable value (see the sketch after this summary).
b) Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection instruments, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data set is divided into segments of equal size, and each segment is handled separately. All values in a segment can be replaced by the segment mean, or the bin boundaries can be used to smooth the values.
2. Regression:
Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers either go undetected or fall outside the clusters.
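As referenced above, here is a minimal pandas sketch of filling missing values with the attribute mean (numeric) and the most probable value (nominal); the table is invented for illustration.

import pandas as pd

# Hypothetical records with missing salary and occupation values.
df = pd.DataFrame({
    "age":        [25, 32, 47, 51, 38],
    "salary":     [38000, None, 52000, 61000, None],
    "occupation": ["engineer", "teacher", None, "engineer", "engineer"],
})

# Numeric attribute: fill with the attribute mean.
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Nominal attribute: fill with the most probable (most frequent) value.
df["occupation"] = df["occupation"].fillna(df["occupation"].mode()[0])
print(df)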
Summary:
2. Data Transformation:
This step is taken in order to transform the data into forms suitable for the mining process. It involves the following:
a) Normalization:
It is done in order to scale the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0 (a small sketch follows this summary).
b) Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
c) Discretization:
This is done to replace the raw values of a numeric attribute with interval labels or conceptual labels.
3. Data Reduction:
This obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
a) Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example, regression models.
b) Dimensionality Reduction:
This reduces the size of the data through encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
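Finally, the normalization step mentioned above can be sketched in a few lines of NumPy; the values are illustrative, and both min-max and z-score variants are shown.

import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to [0.0, 1.0]: v' = (v - min) / (max - min)
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: v' = (v - mean) / std
z_score = (values - values.mean()) / values.std()

print(min_max)   # [0.    0.125 0.25  0.5   1.   ]
print(z_score)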
Thank You!