Data Preprocessing and Feature Engineering

Chapter 5 discusses data preprocessing and feature engineering as essential steps in the machine learning pipeline, emphasizing the need to clean and prepare raw data for effective model training. It covers techniques for data cleaning, feature encoding, scaling, transformation, and dimensionality reduction, highlighting their importance in improving model performance and accuracy. The chapter also outlines various methods for feature selection and extraction to enhance model interpretability and reduce overfitting.


Chapter 5
Data Preprocessing and Feature Engineering
Overview of Data Preprocessing and Feature Engineering

❖Data preprocessing and feature engineering are fundamental steps in the machine learning pipeline.
❖Real-world data is often messy, incomplete, inconsistent, and not directly suitable for training models.
❖These processes aim to transform raw data into a clean, usable, and informative set of features that improve the performance and accuracy of machine learning algorithms.
Data Preprocessing
❖Raw data collected from the real world is often messy, incomplete, inconsistent, and not in the right format or scale for machine learning algorithms.
❖Why is this crucial?
Garbage In, Garbage Out (GIGO): Models trained on poor-quality data will produce poor results.
Algorithm Requirements: Many algorithms have specific assumptions about the data (e.g., normality, scale).
Performance: Clean, well-structured data leads to faster training and better model accuracy/generalization.
Data Preprocessing
❖A crucial step in preparing the dataset for machine learning modeling.
❖It ensures that the data is in a suitable format for the machine learning algorithm to learn from.
❖It involves:
Data cleaning
Feature encoding
Feature scaling and transformation
1. Data Cleaning
❖Identifying and rectifying errors, inconsistencies, inaccuracies, and imperfections in a dataset. Its primary goal is to improve the quality of the data, making it reliable for analysis and modeling.
❖Common data problems:
Missing values
Outliers
Inconsistent data
Duplicates
Common Issues and Handling Techniques:

1. Missing values: Values are absent for some observations/features.
❖Handling missing values:
a) Deletion/dropping: Remove rows/columns with excessive missing values.
Row Removal: Remove entire rows with missing values. Use if missing data is minimal (<5-10%) and random. Risk: loss of potentially valuable information.
Column Deletion: Remove an entire feature if it has too many missing values (>50-70%) or is irrelevant. Risk: loss of a potentially useful feature.
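
A minimal sketch of deletion-based handling in pandas (the small DataFrame and its column names are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47, np.nan],
    "income": [50000, 62000, np.nan, 81000, 58000],
    "city":   ["Addis Ababa", "Adama", None, "Hawassa", "Mekelle"],
})

# Row removal: drop every row that contains at least one missing value.
rows_dropped = df.dropna(axis=0)

# Column deletion: keep only columns whose missing-value ratio is at most 50%.
cols_kept = df.loc[:, df.isna().mean() <= 0.5]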
Cont’d
b) Imputation (Filling): Replace missing values with estimated ones (statistical measures).
▪ Mean: Replace with the column mean (for numerical data; sensitive to outliers).
▪ Median: Replace with the column median (for numerical data; robust to outliers).
▪ Mode: Replace with the column mode, i.e., the most frequent value (for categorical data).
▪ Model-Based Imputation: Use algorithms like KNN or regression to predict the missing value from the other features. More complex but often more accurate.
▪ Constant Value: Use a specific placeholder value, such as "unknown", 1, or -1 (use with caution; can introduce bias).
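
A hedged sketch of these imputation options (toy data and illustrative column names):

import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age":    [25, None, 31, 47, None],
    "income": [50000, 62000, None, 81000, 58000],
    "color":  ["red", None, "blue", "red", "green"],
})

# Statistical imputation with pandas: median for numbers, mode for categories.
df["age"]   = df["age"].fillna(df["age"].median())
df["color"] = df["color"].fillna(df["color"].mode()[0])
# Constant-value imputation would be: df["color"].fillna("unknown")  (use with caution)

# Model-based imputation: KNN estimates a missing value from similar rows.
df[["age", "income"]] = KNNImputer(n_neighbors=2).fit_transform(df[["age", "income"]])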
Cont’d
2. Outliers: Data points significantly different from the rest of the data. They can be due to errors (e.g., data entry mistakes) or genuine extreme values.
❖Handling Strategies:
Detection: Use statistical methods (Z-score, IQR) or visualization (box plots, scatter plots) to identify outliers, then decide whether to remove them or transform them (log, cap values).
Binning: Group numerical values into intervals ("bins") to smooth out noise (e.g., age groups instead of exact ages).
Regression: Fit the data to a regression function and use the predicted values.
Clustering: Group data points and flag points falling outside clusters as potential outliers.
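
A small illustration of IQR-based outlier detection, capping, and binning (toy ages and made-up bin edges):

import pandas as pd

df = pd.DataFrame({"age": [22, 25, 27, 29, 31, 33, 35, 120]})

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["age"] < lower) | (df["age"] > upper)]

# Option 1: cap (winsorize) extreme values rather than dropping the rows.
df["age_capped"] = df["age"].clip(lower, upper)

# Option 2: binning smooths noise by replacing exact ages with age groups.
df["age_group"] = pd.cut(df["age"], bins=[0, 25, 35, 60, 130],
                         labels=["<=25", "26-35", "36-60", "60+"])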
Cont’d
3. Inconsistent Data: Discrepancies/variations in data representation (e.g., codes, names, formats, or units).
❖Examples: "New York" vs. "NY", "Male" vs. "M", different date formats, temperature in Celsius and Fahrenheit in the same column.
❖Handling Strategies:
Standardize formats (e.g., dates: YYYY-MM-DD).
Fix typos (e.g., "USA" vs. "U.S.A.") and casing issues.
Data Type Conversion: Convert data to the appropriate type (e.g., string to numeric).
Unit Conversion: Convert all values to a consistent unit of measurement.
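
A brief sketch of these fixes in pandas (the values, aliases, and the Fahrenheit-to-Celsius rule are illustrative assumptions):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city":  ["New York", "NY", "new york", "Boston"],
    "temp":  [68.0, 20.0, 21.5, 70.0],   # mixed Fahrenheit / Celsius readings
    "unit":  ["F", "C", "C", "F"],
    "price": ["10", "12.5", "7", "9"],   # numbers stored as strings
})

# Standardize casing and map known aliases to one canonical form.
df["city"] = df["city"].str.title().replace({"Ny": "New York"})

# Data type conversion: string -> numeric.
df["price"] = df["price"].astype(float)

# Unit conversion: express every temperature in Celsius.
df["temp_c"] = np.where(df["unit"] == "F", (df["temp"] - 32) * 5 / 9, df["temp"])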


Cont’d

4. Remove Duplicate Records:
Duplicates are identical rows in the dataset that can skew analysis and model performance.
Use .duplicated() and .drop_duplicates() in pandas.
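
For example (toy rows with made-up names):

import pandas as pd

df = pd.DataFrame({"name": ["Abebe", "Sara", "Abebe"],
                   "score": [85, 90, 85]})

print(df.duplicated())            # flags the repeated row as a duplicate
df_clean = df.drop_duplicates()   # keeps the first occurrence by default

# Deduplicate on a subset of columns if only some fields define identity.
df_clean = df.drop_duplicates(subset=["name"], keep="first")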


Data Preprocessing: Feature Encoding

❖Feature encoding is a technique used to convert categorical variables (nominal or ordinal data) into numerical values.
❖Machine learning algorithms typically work with numerical data, so categorical variables need to be converted into a numerical format.
Encoding techniques:
❖One-Hot Encoding: Creates a binary column for each category.
E.g., Color: [Red, Blue] → [1, 0], [0, 1]
❖Ordinal Encoding: Assigns integers based on rank/order.
E.g., Education: [High School, Bachelor’s, Master’s] → [0, 1, 2]
❖Label Encoding: Converts categorical data (text labels) into numerical values by assigning a unique integer to each category.
E.g., Red = 0, Green = 1, Blue = 2
❖Binary Encoding: Converts category codes into binary digits and splits the digits across columns.
E.g., 1st category → 000, 2nd category → 001
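
A minimal sketch of these encodings with pandas and scikit-learn (toy data; binary encoding is typically done with the third-party category_encoders package and is omitted here):

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df = pd.DataFrame({
    "color":     ["Red", "Blue", "Green", "Red"],
    "education": ["High School", "Master's", "Bachelor's", "Bachelor's"],
})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding: integers that respect a meaningful order.
order = [["High School", "Bachelor's", "Master's"]]
df["education_ord"] = OrdinalEncoder(categories=order).fit_transform(df[["education"]]).ravel()

# Label encoding: an arbitrary unique integer per category.
df["color_label"] = LabelEncoder().fit_transform(df["color"])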
Data Preprocessing: Feature Scaling and Transformation

❖Feature scaling and transformation techniques are applied to numerical features to standardize their range or change their distribution. This is crucial for many machine learning algorithms that are sensitive to the scale of the input features.
Why are Feature Scaling and Transformation Important?
❖Algorithm Performance: Many algorithms, particularly those based on distance
calculations (like K-Nearest Neighbors, K-Means) or gradient descent (like Linear
Regression, Logistic Regression, Neural Networks), perform better and converge
faster when features are on a similar scale. Features with larger magnitudes can
dominate the learning process if not scaled.
❖Avoiding Bias: Scaling ensures that no single feature unduly influences the model
due to its large values.
❖Meeting Algorithm Assumptions: Some algorithms assume that features are
normally distributed or have a specific range. Transformations can help meet
these assumptions.
Data Preprocessing: Feature Scaling

❖Many machine learning models (e.g., SVM, k-NN) are sensitive to the scale of the data.
❖Feature scaling ensures that features contribute equally to the model.
❖For example, if one feature has a range of 0-1 and another has a range of 0-1000, the second feature may dominate the model's learning process.
❖Feature scaling is important for algorithms that rely on distance calculations (e.g., k-NN, SVM) or gradient descent optimization (e.g., linear regression).
Cont’d
❖Feature Scaling/Normalization Techniques
Normalization (Min-Max Scaling):
• Scales values to a range between 0 and 1.
• Formula: X_scaled = (x − min(x)) / (max(x) − min(x))
• This method is sensitive to outliers.
Standardization (Z-Score Scaling):
• Transforms features to have a mean of 0 and a standard deviation of 1 (mean = 0, std = 1).
• Formula: X_scaled = (x − mean) / std_dev
• This method is less sensitive to outliers than Min-Max scaling and is suitable when the data follows a Gaussian-like distribution.
Cont’d

Robust Scaling:
o Scales features using the median and the interquartile range (IQR). This method is robust to outliers because it uses statistics that are less affected by extreme values.
o Formula: X_scaled = (X − median) / IQR
Absolute Maximum Scaling:
o Scales features by dividing by the maximum absolute value, which maps them into the range [-1, 1].
o Formula: X_scaled = X / max(|X|)
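
The four scalers above map directly onto scikit-learn classes; a small sketch (the toy array is made up):

import numpy as np
from sklearn.preprocessing import (MinMaxScaler, StandardScaler,
                                   RobustScaler, MaxAbsScaler)

X = np.array([[1.0], [5.0], [10.0], [100.0]])   # one feature with a large value

X_minmax = MinMaxScaler().fit_transform(X)      # range [0, 1]; sensitive to the 100
X_std    = StandardScaler().fit_transform(X)    # mean 0, std 1
X_robust = RobustScaler().fit_transform(X)      # (x - median) / IQR
X_maxabs = MaxAbsScaler().fit_transform(X)      # x / max(|x|), range [-1, 1]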
Data Preprocessing: Feature Transformation

❖ Modifying the distribution of a feature. It can help reduce skewness, normalize the distribution, and make the data more interpretable.
❖ It is often used to make the data more suitable for analysis or modeling.
❖ Feature transformation techniques:
Log Transformation: Applies a logarithmic function to reduce skewness and can help make the distribution more symmetric.
Square Root Transformation: Applies the square root to reduce skewness.
Box-Cox Transformation: A family of power transformations that stabilize variance and make the data's distribution more normal-like.
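
A short sketch of these transformations (right-skewed toy values; note that Box-Cox requires strictly positive data):

import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer

x = np.array([1.0, 2.0, 3.0, 10.0, 200.0])   # right-skewed toy data

x_log  = np.log1p(x)      # log transform (log1p also handles zeros safely)
x_sqrt = np.sqrt(x)       # square-root transform

x_boxcox, lam = stats.boxcox(x)              # Box-Cox picks the best power lambda

# scikit-learn equivalent that can sit inside a preprocessing pipeline.
x_pt = PowerTransformer(method="box-cox").fit_transform(x.reshape(-1, 1))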
Methods of Dimensionality Reduction
❖Dimensionality reduction reduces the number of features (dimensions) while preserving important information.
❖This can be done by either feature selection (selecting a subset of the original features) or feature extraction (creating new features as combinations of the original ones).
❖Combating the curse of dimensionality: in high-dimensional spaces, data becomes sparse, making it difficult for models to find patterns and increasing the risk of overfitting.
❖Reducing data to 2 or 3 dimensions allows for plotting and visual exploration of the data's structure.
How does dimensionality reduction improve performance? (Worked mathematical example shown on the slide.)
Methods of Dimensionality Reduction (diagram slides, including Linear Discriminant Analysis).
Dimensionality reduction – Feature Selection

❖Feature selection is the process of choosing a subset of relevant features from the original dataset.
Keep only the most relevant features to reduce overfitting and improve performance.
❖In feature selection, we are interested in finding the k of the n total features that give us the most information, and we discard the other (n − k) dimensions.
Importance of Feature Selection
❖Identifying Important Features: Feature selection helps in understanding which features are most influential in predicting the target variable.
❖Reduces Overfitting: Fewer features mean a simpler model that is less likely to fit noise.
❖Improves Accuracy: Irrelevant or redundant features can confuse models.
❖Reduces Training Time: Fewer dimensions mean faster computations.
❖Enhances Interpretability: Models with fewer features are often easier to understand and interpret.
❖Curse of Dimensionality: As the number of features increases, the amount of data needed to generalize well grows exponentially.
❖Remove Redundant Features: Features that provide similar information (highly correlated features).
❖Remove Irrelevant Features: Features that have little to no impact on the target variable.
Feature Selection Techniques:

❖Feature selection methods can be broadly categorized into three types:
Filter Methods
Wrapper Methods
Embedded Methods
Feature Selection: Filter Methods
❖These methods select/evaluate features based on statistical measures of their relationship with the target variable, independent of the machine learning algorithm used. They are computationally inexpensive.
Correlation Matrix: Analyze the correlation between features and the target variable, as well as the correlation between features themselves (to identify multicollinearity).
Statistical Tests: Use tests such as the Chi-squared test (for categorical features) or ANOVA (for numerical features).
Variance Threshold: Remove features with very low variance, as they have little predictive power.
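
A hedged sketch of filter methods on a built-in dataset (the threshold and k values are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

X, y = load_iris(return_X_y=True, as_frame=True)

# Variance threshold: drop near-constant features.
X_vt = VarianceThreshold(threshold=0.1).fit_transform(X)

# Univariate statistical test (ANOVA F-test) between each feature and the target.
X_best = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Correlation between the features themselves, to spot redundancy/multicollinearity.
feature_correlations = X.corr()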
Feature Selection: Wrapper Methods

❖Use a specific machine learning model to evaluate the usefulness of feature subsets.
❖These methods evaluate subsets of features by training and evaluating a machine learning model on each subset, treating feature selection as a search problem.
❖They are computationally more expensive than filter methods but can capture interactions between features.
Forward Selection: Start with an empty set of features and iteratively add the feature that best improves model performance.
Backward Elimination: Start with all features and iteratively remove the least significant feature based on model performance.
Recursive Feature Elimination (RFE): Recursively build a model and remove the least important feature(s) at each iteration.
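
A sketch of RFE and forward selection with scikit-learn (the dataset, estimator, and number of features to keep are illustrative choices):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Recursive Feature Elimination: repeatedly drop the least important feature.
X_rfe = RFE(estimator=model, n_features_to_select=5).fit_transform(X, y)

# Forward selection: start empty and greedily add the most helpful feature.
sfs = SequentialFeatureSelector(model, n_features_to_select=5, direction="forward")
X_forward = sfs.fit_transform(X, y)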
Feature Selection: Embedded Methods

❖These methods perform feature selection as part of the model training process; the selection is built into the algorithm itself.
Lasso Regression (L1 regularization): Adds a penalty on the absolute values of the coefficients, which can shrink some coefficients to exactly zero, effectively performing feature selection.
Ridge Regression (L2 regularization): Adds a penalty on the squared values of the coefficients. While it shrinks coefficients, it does not typically force them to zero.
Tree-based models (e.g., Random Forests, Gradient Boosting): These models inherently provide measures of feature importance, which can be used for selection.
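
As a rough sketch (the alpha value and dataset are arbitrary; a larger alpha drives more Lasso coefficients to zero):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# L1 regularization can shrink some coefficients exactly to zero.
lasso = Lasso(alpha=1.0).fit(X_scaled, y)
kept = np.flatnonzero(lasso.coef_)           # indices of features with non-zero weights

# Tree ensembles expose built-in feature importances.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]   # most to least important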
Feature Visualization Techniques for Selection:

❖Feature visualization plays a crucial role in understanding the data and supporting the feature selection process.
Histograms: Understand the distribution and skewness of individual features.
Box Plots: Visualize the distribution, identify outliers in numerical features, and compare distributions across categories.
Correlation Matrix Heatmap: Visualize correlations between all pairs of numerical features; helps identify highly correlated features (potential redundancy) and features correlated with the target.
Cont’d
Scatter Plots: Visualize relationships between pairs of numerical features.
Pair Plots (Scatter Plot Matrix): Display scatter plots for all pairs of features, along with histograms of individual features. Useful for exploring relationships in smaller datasets.
Bar Charts: Visualize the frequency distribution of categorical features or the relationship between a categorical feature and a numerical target (e.g., the mean of the target for each category).
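
A minimal sketch of two of these plots with seaborn (the Iris dataset stands in for your own data):

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

data = load_iris(as_frame=True).frame        # features plus the target column

# Correlation heatmap: spot highly correlated (redundant) feature pairs.
sns.heatmap(data.corr(), annot=True, cmap="coolwarm")
plt.show()

# Pair plot: scatter plots for every feature pair plus per-feature histograms.
sns.pairplot(data, hue="target")
plt.show()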
Dimensionality Reduction – Feature Extraction
Feature extraction creates new features as combinations of the original ones. Common techniques:
1. Principal Component Analysis (PCA): An unsupervised linear transformation that projects the data onto the directions of maximum variance (the principal components).

2. Linear Discriminant Analysis (LDA): A supervised linear transformation; LDA maximizes the separability between the known classes.

3. t-Distributed Stochastic Neighbor Embedding (t-SNE): An unsupervised non-linear technique used mainly for visualizing high-dimensional data in two or three dimensions.
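
A compact sketch of all three techniques on the Iris data (the choice of dataset and of 2 components is illustrative):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)     # PCA and t-SNE are scale-sensitive

# PCA: unsupervised, keeps the directions of maximum variance.
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# LDA: supervised, maximizes separation between the known classes.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_scaled, y)

# t-SNE: non-linear, mainly for 2-D/3-D visualization of structure.
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X_scaled)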
