Chapter 2: Data Preprocessing

https://machinelearningmastery.com
Why Data Preprocessing?

Measures of Data Quality:
• Consistency: Does information stored in one place match relevant data stored elsewhere?
• Uniqueness: Is this the only instance in which this information appears in the dataset?
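A minimal sketch of checking these two measures with pandas; the DataFrame and column names are hypothetical:

```python
import pandas as pd

# Hypothetical records; note the conflicting spelling "Kenia".
df = pd.DataFrame({
    "customer_id":  [1, 2, 2, 3],
    "country":      ["ET", "KE", "KE", "ET"],
    "country_name": ["Ethiopia", "Kenya", "Kenia", "Ethiopia"],
})

# Uniqueness: is each customer_id the only instance of that information?
duplicates = df[df.duplicated(subset=["customer_id"], keep=False)]
print(duplicates)

# Consistency: does the same country code always map to the same name?
name_counts = df.groupby("country")["country_name"].nunique()
print(name_counts[name_counts > 1])  # codes with conflicting names
```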
Why Data Preprocessing?
• Data in the real world is dirty:
– incomplete: lacking attribute values
– noisy: containing errors or outliers that deviate from the expected
– inconsistent: lacking compatibility (e.g., attributes representing the same concept may have different names in different databases)
• To minimize such problems, employ data cleaning routines.
• Before starting data preprocessing, it is advisable to have an overall, high-level summary of the data, such as:
– the general properties of the data
– which data values should be considered noise or outliers
• This can be done with the help of descriptive data summarization.
Descriptive data summarization
• A descriptive summary of the data can be generated with measures of its central tendency and of its dispersion.
• Measures of central tendency (computing a typical score on the variable) include:
– Mean
– Median
– Mode
– Mid-range
• Measures of dispersion (computing the degree to which the data is spread around this central tendency) include:
– Range
– Standard deviation
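A minimal sketch computing these summary measures with pandas and NumPy; the data values are made up for illustration:

```python
import pandas as pd

x = pd.Series([13, 15, 16, 16, 19, 20, 21, 22, 25, 30])

# Measures of central tendency
mean      = x.mean()
median    = x.median()
mode      = x.mode().iloc[0]           # most frequent value (16 here)
mid_range = (x.min() + x.max()) / 2    # midpoint of smallest and largest

# Measures of dispersion
rng = x.max() - x.min()
std = x.std()                          # sample standard deviation (ddof=1)

print(mean, median, mode, mid_range, rng, std)
```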
Graphic display of basic descriptive summaries
How to Handle Missing Data
• Ignore the tuple: usually done when the class label is missing.
• Fill in the missing value manually: tedious and often infeasible.
• Use a global constant to fill in the missing value (e.g., “unknown”, a new class?!): simple, but not recommended, since the constant may form a spurious pattern and mislead the decision process.
• Use the attribute mean: fill in the missing value with the mean of the attribute, computed over all samples belonging to the same class.
• Use the most probable value: fill in the missing value by predicting it from the correlations among the available values.
• Except for the first two approaches, the filled-in values may be incorrect; the last two approaches are the most common.
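A minimal sketch of three of these strategies in pandas; the column names and class attribute are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "income": [3000, None, 4500, None, 5200],
    "city":   [None, "Addis", "Addis", "Nairobi", None],
    "class":  ["yes", "yes", "no", "no", "yes"],
})

# Ignore the tuple: drop rows whose class label is missing.
df = df.dropna(subset=["class"])

# Global constant: simple but not recommended.
df["city"] = df["city"].fillna("unknown")

# Attribute mean per class: fill income with the mean of its own class.
df["income"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)
```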
Dataset preparation for Classification
• Proper procedure in classification system development involves three sets of data: a training set (to build the model), a validation set (to tune parameters), and a test set (for the final evaluation), as sketched below.
• Generally, the larger the training data, the better the classifier.
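A minimal sketch of such a three-way split using scikit-learn; the 60/20/20 proportions and the iris dataset are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off the test set, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest,
    random_state=0)  # 0.25 of the remaining 80% = 20% overall

print(len(X_train), len(X_val), len(X_test))  # 90, 30, 30
```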
Unbalanced data
• Sometimes, classes have very unequal frequency:
– medical diagnosis: 90% healthy, 10% diseased
– eCommerce: 99% don’t buy, 1% buy
• A majority-class classifier can be 97% correct, yet useless (see the sketch below).
• If the two classes are very unbalanced, evaluation of the classifier will be biased.
• With two or more classes, a good approach to balancing the class instances is to build BALANCED train and test sets.
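A minimal sketch of the “correct but useless” majority-class baseline, using scikit-learn’s DummyClassifier on hypothetical 97%/3% labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 97% negative ("don't buy"), 3% positive ("buy").
y = np.array([0] * 970 + [1] * 30)
X = np.zeros((1000, 1))  # features are irrelevant to this baseline

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)

print(accuracy_score(y, pred))  # 0.97 -- looks good
print(recall_score(y, pred))    # 0.0  -- never finds the minority class
```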
Balancing unbalanced data
• Approach (a minimal sketch follows this list):
– randomly select the desired number of minority-class instances
– add an equal number of randomly selected majority-class instances
– stratified sampling: a more advanced way of balancing the data
• Make sure that each class is represented with approximately equal proportions in both subsets.
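A minimal sketch of this approach with NumPy and scikit-learn: random undersampling of the majority class followed by a stratified train/test split (the 900/100 class counts are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical unbalanced data: 900 majority (0), 100 minority (1) samples.
X = rng.normal(size=(1000, 4))
y = np.array([0] * 900 + [1] * 100)

minority_idx = np.flatnonzero(y == 1)
majority_idx = rng.choice(np.flatnonzero(y == 0),
                          size=len(minority_idx), replace=False)
keep = np.concatenate([minority_idx, majority_idx])
X_bal, y_bal = X[keep], y[keep]

# Stratified split keeps the (now equal) class proportions in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.3, stratify=y_bal, random_state=0)
print(np.bincount(y_train), np.bincount(y_test))  # roughly equal counts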
Building Classification Model
Figure: workflow for building a classification model. Data with known results (+/−) is split into a training set fed to the model builder, a validation set on which the model’s predictions are evaluated (Y/N) to tune parameters, and a held-out test set used for the final evaluation of the final model.
Building Classification Model: Parameter tuning
Tips: Dataset size
• Before we start building a classification model, we should check whether the size of the dataset we have is adequate.
• Given a balanced dataset, the next most important aspect of goodness is the size of the dataset.
• The model should be able to converge while learning its parameters from the dataset.
• If not, appropriate measures should be taken, and care must be given when reporting performance.
• Learning curve analysis, sketched below, is well suited to assessing whether the training dataset is large enough.
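A minimal sketch of learning curve analysis with scikit-learn’s learning_curve; the digits dataset and logistic regression model are illustrative assumptions. Scores that flatten out suggest the model has converged; a still-rising validation curve suggests more data would help:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes,
                     train_scores.mean(axis=1),
                     val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  validation={va:.3f}")
```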
Tips: Dataset Size
What to do with small data?
• A small but balanced dataset can be approached in different ways so that the reported performance can still be relied on.
• Note that the total dataset we have will be divided into three parts: training, validation, and testing.
• The following are techniques to minimize the effect of the dataset size:
1. k-fold cross-validation: randomly divide the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a test set, and the method is fit on the remaining k − 1 folds.
2. Data augmentation: techniques used to increase the amount of data by adding slightly modified copies of already existing data (see the sketch below).
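A minimal sketch of data augmentation for numeric data, enlarging a small dataset with slightly noise-jittered copies of existing rows; the 5% noise scale is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))        # hypothetical small dataset
y = rng.integers(0, 2, size=50)

# Jitter each feature by noise at 5% of its standard deviation.
noise = rng.normal(scale=0.05 * X.std(axis=0), size=X.shape)
X_aug = np.vstack([X, X + noise])   # original rows plus jittered copies
y_aug = np.concatenate([y, y])      # labels are unchanged by the jitter

print(X_aug.shape)                  # (100, 4)
```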
Tips: Dataset Size
What to do with small data: using k-fold cross-validation (10-fold is the recommended choice)
Example:
— Break the data up into k groups of the same size
— Hold aside one group for testing and use the rest to build the model
— Test the model on the held-aside group
— Repeat so that each group is held aside once, and average the results
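A minimal sketch of 10-fold cross-validation with scikit-learn’s cross_val_score; the iris dataset and decision tree are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# cv=10: each fold is held aside once for testing while the other
# nine folds train the model; report the average score.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean(), scores.std())
```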
Feature Selection
• Why do we need Feature Selection (FS)?
– to improve performance (in terms of speed, predictive power, and simplicity of the model)
– to visualize the data for model selection
– to reduce dimensionality and remove noise
• FS can be done by sequential search over a candidate subset X of size k: in forward selection, X is initialized to the null set and k = 0 (where k is the size of the subset); in backward selection, X is initialized to the full feature set and k = d (see the sketch after this section).
• Feature Selection:
– When classifying novel patterns, only a small number of features need to be computed (i.e., faster classification).
– The measurement units (length, weight, etc.) of the features are preserved.
• Dimensionality Reduction:
– When classifying novel patterns, all features need to be computed.
– The measurement units (length, weight, etc.) of the features are lost.
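A minimal sketch of sequential forward selection, starting from the null set (k = 0) as described above, using scikit-learn’s SequentialFeatureSelector; the dataset and estimator are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Greedily add the feature that most improves cross-validated score,
# until the requested subset size is reached.
sfs = SequentialFeatureSelector(KNeighborsClassifier(),
                                n_features_to_select=2,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features
```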