Chapter 1: Data Preprocessing
Data Mining
Heger Arfaoui - ENIT - 2023
Outline
• Data Preprocessing: An Overview
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
Why Data Preprocessing?
Real-world data
• Data in the real world is dirty:
incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes or names
• GIGO! Garbage In Garbage Out
A multi-dimensional measure of data quality
• A well-accepted multi-dimensional view:
Accuracy: correct or wrong, accurate or not
Completeness: not recorded, unavailable, …
Consistency: some modified but some not, dangling, …
Timeliness: timely update?
Believability: how trustworthy is the data?
Interpretability: how easily the data can be understood?
• Two different users may have two different assessments of the quality of the data
Major Tasks in Data Preprocessing
Data Cleaning
Data Cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
Missing Data
• Data is not always available
• Missing data may be due to:
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• intentional
• Missing data may need to be inferred
How to handle missing data?
• Ignore records with missing values in the training data
• works best with a large dataset and only a few missing records
• Replace missing value with...
• default or special value (e.g., 0, “missing”)
• average/median value for numerics
• most frequent value for nominals
• Try to predict missing values:
• handle missing values as learning problem
• target: attribute which has missing values
• training data: instances where the attribute is present
• test data: instances where the attribute is missing
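A minimal sketch of these strategies using pandas; the DataFrame, column names, and fill choices below are hypothetical examples, not part of the lecture:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values
df = pd.DataFrame({
    "age":  [25, np.nan, 47, 31, np.nan],
    "city": ["Tunis", "Sfax", None, "Tunis", "Tunis"],
})

# Option 1: ignore (drop) records with missing values
df_dropped = df.dropna()

# Option 2: replace with a default or a statistic
df_filled = df.copy()
df_filled["age"]  = df_filled["age"].fillna(df["age"].median())      # numeric -> median
df_filled["city"] = df_filled["city"].fillna(df["city"].mode()[0])   # nominal -> most frequent value
```

Predicting missing values as a learning problem would instead train a model on the rows where the attribute is present and apply it to the rows where it is missing.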
Missing data: caveats
Note: values may be missing for various reasons
...and, more importantly: at random vs. not at random
• Examples of not-at-random missingness:
– Non-mandatory questions in questionnaires
• e.g., “how often do you drink alcohol?”
– Values only valid for certain data sub-populations
• e.g., “are you currently pregnant?”
– Sensors failing under certain conditions
• e.g., at high temperatures
• In those cases, averaging and imputation cause information loss
• In other words: “missing” can itself be information!
Noisy data
• Noise: Random error in a measured variable.
• Incorrect attribute values may be due to:
faulty data collection instruments
data entry problems
data transmission problems
technology limitations
inconsistency in naming convention
How to handle noisy data?
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values automatically and have a human verify them (e.g., possible outliers)
Binning method for data smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
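A small NumPy sketch that reproduces the example above (equal-frequency bins of four values; rounding the bin means is an assumption made to match the slide's integer output):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # already sorted
bins = prices.reshape(3, 4)                                        # three equi-depth bins

# Smoothing by bin means: every value is replaced by its bin's (rounded) mean
means = np.rint(bins.mean(axis=1)).astype(int)                     # [9, 23, 29]
by_means = np.repeat(means, 4).reshape(3, 4)

# Smoothing by bin boundaries: every value snaps to the closer bin boundary (min or max)
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_boundaries = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)        # rows: 9 9 9 9 / 23 23 23 23 / 29 29 29 29
print(by_boundaries)   # rows: 4 4 4 15 / 21 21 25 25 / 26 26 26 34
```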
Detect and remove outliers
Data Integration
Data integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Entity identification problem:
• Identify real-world entities from multiple data sources, e.g., Bill Clinton = William
Clinton
• Detecting and resolving data value conflicts
• For the same real-world entity, attribute values from different sources are different
• Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling redundant data in data integration
• Redundant data occur often when integrating multiple DBs
• The same attribute may have different names in different databases
• One attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant data may be able to be detected by correlational analysis
$$ r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} $$
• Careful integration can help reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
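A minimal sketch of such a correlational analysis with pandas; the integrated table and its attribute names are hypothetical:

```python
import pandas as pd

# Hypothetical integrated table in which one attribute is derived from another
df = pd.DataFrame({
    "monthly_revenue": [10, 12, 9, 15, 11],
    "annual_revenue":  [120, 144, 108, 180, 132],   # derived: 12 * monthly_revenue
    "num_employees":   [3, 5, 2, 8, 4],
})

# Pearson correlation r_{A,B} for every pair of numeric attributes
corr = df.corr(method="pearson")
print(corr.loc["monthly_revenue", "annual_revenue"])   # ~1.0 -> redundant, keep only one
```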
Data Reduction
Data reduction
• A database/data warehouse may store terabytes of data. Complex data
analysis may take a very long time to run on the complete data set.
• Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical
results
Data reduction strategies
• Numerosity reduction (some simply call it: Data Reduction)
• Histograms, clustering
• Sampling
• Dimensionality reduction: e.g., remove unimportant
attributes
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• Data compression
Sampling
• Sampling: obtaining a small sample s to represent the whole data set N
• Allow a mining algorithm to run in complexity that is potentially sub-linear to
the size of the data
• Key principle: Choose a representative subset of the data
• Simple random sampling may have very poor performance in the
presence of skew
• Develop adaptive sampling methods, e.g., stratified sampling
Types of sampling
• Simple random sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• Once an object is selected, it is removed from the population
• Sampling with replacement
• A selected object is not removed from the population
• Stratified sampling:
• Partition the data set, and draw samples from each partition (proportionally, i.e.,
approximately the same percentage of the data)
• Used in conjunction with skewed data
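A short sketch contrasting these sampling types with pandas and scikit-learn; the class-imbalanced toy data is a hypothetical example:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical skewed data: 90 "normal" records and only 10 "rare" ones
df = pd.DataFrame({"label": ["normal"] * 90 + ["rare"] * 10,
                   "amount": range(100)})

# Simple random sampling without replacement: may under-represent the rare class
srs = df.sample(n=20, replace=False, random_state=0)

# Sampling with replacement: the same record may be drawn more than once
swr = df.sample(n=20, replace=True, random_state=0)

# Stratified sampling: draw ~20% from each class, preserving the class proportions
strat, _ = train_test_split(df, train_size=0.2, stratify=df["label"], random_state=0)
print(strat["label"].value_counts())   # ~18 normal, ~2 rare
```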
Dimensionality reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
Feature selection
Basic Heuristics
• Remove nominal attributes...
• which have more than p% identical values
• example: millionaire=false
• which have more than p% different values
• example: names, IDs
• Remove numerical attributes
• which have little variation, i.e., standard deviation < s
• Compute pairwise correlations between attributes and remove highly correlated attributes:
• e.g., Naive Bayes assumes independent attributes and benefits from removing correlated ones
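A hedged sketch of these heuristics as a single filter function; the thresholds p, s, and corr_max and the helper name are illustrative choices, not prescribed by the slides:

```python
import pandas as pd

def basic_feature_filter(df: pd.DataFrame, p=0.95, s=0.01, corr_max=0.95) -> pd.DataFrame:
    """Drop near-constant or ID-like nominal attributes, low-variance numeric
    attributes, and one attribute of each highly correlated pair."""
    drop = set()
    for col in df.columns:
        if df[col].dtype == object:
            if (df[col].value_counts(normalize=True).iloc[0] > p     # mostly identical values
                    or df[col].nunique() / len(df) > p):             # mostly distinct (IDs, names)
                drop.add(col)
        elif pd.api.types.is_numeric_dtype(df[col]) and df[col].std() < s:
            drop.add(col)                                            # little variation
    corr = df.drop(columns=list(drop)).select_dtypes("number").corr().abs()
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if corr.loc[a, b] > corr_max:
                drop.add(b)                                          # keep one of the pair
    return df.drop(columns=list(drop))
```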
PCA: Principal Component Analysis
• feature selection methods select a subset of attributes: no new attributes are created
• PCA creates a (smaller set of) new attributes
• artificial linear combinations of existing attributes
• as expressive as possible
• Dates back to the pre-computer age
• invented by Karl Pearson (1857-1936)
• also known for Pearson's correlation coefficient
PCA (ctd)
• Idea: transform the coordinate system so that each new coordinate (principal component) is as expressive as possible
• expressivity: variance of the variable
• the 1st, 2nd, 3rd, ... PCs should account for as much variance as possible
• further PCs can be neglected
Source: https://knowledge.dataiku.com/latest/ml-analytics/statistics/concept-principal-component-analysis-pca.html
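A minimal PCA sketch with scikit-learn; the random 5-dimensional data and the choice of 3 components are illustrative, and the data is standardized first because PCA is sensitive to scale:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                          # hypothetical 5-dimensional data
X[:, 4] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)     # make one dimension nearly redundant

X_std = StandardScaler().fit_transform(X)              # zero mean, unit variance per attribute
pca = PCA(n_components=3)                              # keep the 3 most expressive components
X_reduced = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)                   # variance accounted for by each PC
```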
Data Transformation
Data transformation
• Transformation: A function that maps the entire set of values of a given attribute to a new
set of replacement values s.t. each old value can be identified with one of the new values
• Methods:
• Conversion
• Discretization
• Smoothing: remove noise from data (binning, clustering, regression)
• Normalization: scaled to fall within a small, specified range
• Attribute/feature construction: new attributes constructed from the given ones
Conversion
• Binary to numeric:
e.g., student = {yes, no} converted to {0, 1}
• Order to numeric: Ordered attributes (e.g. grade) can be converted to
numbers preserving order
– A → 4.0
– A- → 3.7
– B+ → 3.3
– B → 3.0
• Nominal to numeric
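A short sketch of these conversions with pandas; the mappings follow the grade example above, while the data values themselves are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"student": ["yes", "no", "yes"],
                   "grade":   ["A", "B+", "A-"],
                   "city":    ["Tunis", "Sfax", "Tunis"]})

# Binary -> numeric
df["student_num"] = df["student"].map({"no": 0, "yes": 1})

# Ordinal -> numeric, preserving the order
df["grade_num"] = df["grade"].map({"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0})

# Nominal -> numeric: one-hot encoding, since there is no meaningful order
df = pd.get_dummies(df, columns=["city"])
```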
Normalization
• Variables tend to have ranges that vary greatly from each other
• The measurement unit used can affect the data analysis
• For some data mining algorithms, differences in ranges will lead to a tendency for the
variable with a greater range to have undue influence on the results
• Data miners should normalize their numeric variables in order to standardize the
scale of effect each variable has on the results
• Algorithms that make use of distance measures (such as k-Nearest Neighbors)
benefit from normalization
• The terms normalization and scaling are often used interchangeably
Min-max normalization
• Performs a linear transformation on the original data
• Preserves the relationships among the original data values
• Normalized values range between 0 and 1
• Will encounter an “out-of-bounds” error if a future input case falls outside the original data range of X
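The usual formula is X' = (X − min_X) / (max_X − min_X); a minimal NumPy sketch on hypothetical salary values:

```python
import numpy as np

X = np.array([20_000., 30_000., 50_000., 100_000.])    # hypothetical salaries
X_minmax = (X - X.min()) / (X.max() - X.min())          # linear rescaling to [0, 1]
print(X_minmax)                                          # [0.0, 0.125, 0.375, 1.0]
```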
Z-score normalization
• Also called zero-mean normalization
• Z-score standardization works by taking the difference between the field value and the field mean, and scaling this difference by the standard deviation of the field values
• Useful when the actual minimum and maximum of an attribute X are unknown, or when there are outliers that dominate the min-max normalization
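The formula is z = (x − mean) / σ; a minimal sketch on the same hypothetical values:

```python
import numpy as np

X = np.array([20_000., 30_000., 50_000., 100_000.])
X_z = (X - X.mean()) / X.std(ddof=1)    # zero mean, unit (sample) standard deviation
```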
Decimal scaling
• Decimal scaling ensures that every normalized value lies between −1 and 1
• d = number of digits in the data value with the largest absolute value
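The transformation is v' = v / 10^d; a minimal sketch on hypothetical values (here the largest absolute value, 991, has d = 3 digits):

```python
import numpy as np

X = np.array([-991., 23., 450.])
d = int(np.floor(np.log10(np.abs(X).max()))) + 1   # digits of the largest absolute value -> 3
X_dec = X / 10 ** d                                # every value now lies between -1 and 1
print(X_dec)                                       # [-0.991, 0.023, 0.45]
```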
Normalization - remarks
• Normalization can change the original data quite a bit, especially when using
the z-score normalization or decimal scaling
• It is necessary to save the normalization parameters (e.g., the mean and
standard deviation if using z-score normalization) so that future data can be
normalized in a uniform manner
• The normalization parameters now become model parameters and the same
value should be used when the model is used on new data (e.g. testing data)
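A minimal sketch of this practice with scikit-learn; the arrays are hypothetical, and StandardScaler simply stores the mean and standard deviation learned from the training data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[20_000.], [30_000.], [50_000.], [100_000.]])   # training data
X_test  = np.array([[40_000.], [120_000.]])                          # new (test) data

scaler = StandardScaler().fit(X_train)     # mean and std become model parameters
X_train_z = scaler.transform(X_train)
X_test_z  = scaler.transform(X_test)       # reuse the SAME parameters; never refit on new data
```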
Transformations to achieve normality
• Some data mining algorithms and statistical methods require that the
variables be normally distributed
• z-score transformation does not achieve normality
Transformations to achieve normality
• The skewness of a distribution can be measured, for example, by Pearson's coefficient: skewness = 3 (mean − median) / σ
• Most real-world data is right-skewed, especially most financial data.
Transformations to achieve normality
• Common transformations to achieve normality:
• ln(x)
• sqrt(x)
• 1/x, …
• log transformation is suitable for strongly right-skewed data, sqrt transformation is
suitable for slightly right-skewed data
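A small sketch applying these transformations to right-skewed data; the lognormal sample is a hypothetical example:

```python
import numpy as np
from scipy.stats import skew

x = np.random.default_rng(0).lognormal(mean=0.0, sigma=1.0, size=1000)   # right-skewed data

x_log  = np.log(x)     # for strongly right-skewed data
x_sqrt = np.sqrt(x)    # for slightly right-skewed data
x_inv  = 1.0 / x       # inverse transformation, another common option

print(skew(x), skew(x_log), skew(x_sqrt))   # skewness moves toward 0 after transforming
```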
Recap
• Raw data has many problems:
• missing values
• errors
• high dimensionality
• Good preprocessing is essential for good data mining
• one of the first steps in the pipeline
• often the most time-consuming step of the pipeline
• requires lots of experimentation and fine-tuning
• Data preparation includes:
• Data cleaning, data integration, data reduction, feature selection, normalization,…
• Many methods have been developed, but data preprocessing is still an active area of research