Data Preprocessing and Feature Engineering

Chapter 5 discusses data preprocessing and feature engineering as essential steps in the machine learning pipeline, emphasizing the need to clean and prepare raw data for effective model training. It covers techniques for data cleaning, feature encoding, scaling, transformation, and dimensionality reduction, highlighting their importance in improving model performance and accuracy. The chapter also outlines various methods for feature selection and extraction to enhance model interpretability and reduce overfitting.


Chapter 5
Data Preprocessing and Feature Engineering
Overview of Data Preprocessing and Feature Engineering

❖Data preprocessing and feature engineering are fundamental steps in the machine learning pipeline.
❖Real-world data is often messy, incomplete, inconsistent, and not directly suitable for training models.
❖These processes aim to transform raw data into a clean, usable, and informative set of features that improve the performance and accuracy of machine learning algorithms.
Data Preprocessing
❖Raw data collected from the real world is often messy, incomplete, inconsistent, and not in the right format or scale for machine learning algorithms.
❖Why is this crucial?
Garbage In, Garbage Out (GIGO): Models trained on poor-quality data will produce poor results.
Algorithm Requirements: Many algorithms have specific assumptions about the data (e.g., normality, scale).
Performance: Clean, well-structured data leads to faster training and better model accuracy/generalization.
Data Preprocessing
❖A crucial step in preparing the dataset for machine learning modeling.
❖It ensures that the data is in a suitable format for the machine learning algorithm to learn from.
❖It involves:
Data cleaning
Feature encoding
Feature scaling and transformation
1. Data Cleaning
❖Identifying and rectifying errors, inconsistencies, inaccuracies, and imperfections in a dataset. Its primary goal is to improve the quality of the data, making it reliable for analysis and modeling.
❖Common data problems:
Missing values
Outliers
Inconsistent data
Duplicates
Common Issues and Handling Techniques:

1. Missing values: Values are absent for some observations/features.
❖Handling missing values:
a) Deletion/dropping: Remove rows/columns with excessive missing values.
Row Removal: Remove entire rows with missing values. Use if missing data is minimal (<5-10%) and random. Risk: loss of potentially valuable information.
Column Deletion: Remove an entire feature if it has too many missing values (>50-70%) or is irrelevant. Risk: loss of a potentially useful feature.
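
A minimal sketch of deletion-based handling in pandas (the small DataFrame and its column names are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47, np.nan],
    "income": [50000, 62000, np.nan, 81000, 58000],
    "city":   ["Addis Ababa", "Adama", None, "Hawassa", "Mekelle"],
})

# Row removal: drop every row that contains at least one missing value.
rows_dropped = df.dropna(axis=0)

# Column deletion: keep only columns whose missing-value ratio is at most 50%.
cols_kept = df.loc[:, df.isna().mean() <= 0.5]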
Cont’d
b) Imputation (Filling): Replace missing values with estimated ones (statistical measures).
▪ Mean: Replace with the column mean (for numerical data; sensitive to outliers).
▪ Median: Replace with the column median (for numerical data; robust to outliers).
▪ Mode: Replace with the column mode, i.e., the most frequent value (for categorical data).
▪ Model-Based Imputation: Use algorithms like KNN or regression to predict the missing value from the other features. More complex but often more accurate.
▪ Constant Value: Use a specific placeholder value, such as "unknown", 1, or -1 (use with caution; can introduce bias).
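
A hedged sketch of these imputation options (toy data and illustrative column names):

import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age":    [25, None, 31, 47, None],
    "income": [50000, 62000, None, 81000, 58000],
    "color":  ["red", None, "blue", "red", "green"],
})

# Statistical imputation with pandas: median for numbers, mode for categories.
df["age"]   = df["age"].fillna(df["age"].median())
df["color"] = df["color"].fillna(df["color"].mode()[0])
# Constant-value imputation would be: df["color"].fillna("unknown")  (use with caution)

# Model-based imputation: KNN estimates a missing value from similar rows.
df[["age", "income"]] = KNNImputer(n_neighbors=2).fit_transform(df[["age", "income"]])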
Cont’d
2. Outliers: Data points significantly different from the rest of the data. They can be due to errors (e.g., data entry mistakes) or genuine extreme values.
❖Handling Strategies:
Detection: Use statistical methods (Z-score, IQR) or visualization (box plots, scatter plots) to identify outliers, then decide whether to remove them or transform them (log, cap values).
Binning: Group numerical values into intervals ("bins") to smooth out noise (e.g., age groups instead of exact ages).
Regression: Fit the data to a regression function and use the predicted values.
Clustering: Group data points and flag points falling outside clusters as potential outliers.
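
A small illustration of IQR-based outlier detection, capping, and binning (toy ages and made-up bin edges):

import pandas as pd

df = pd.DataFrame({"age": [22, 25, 27, 29, 31, 33, 35, 120]})

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["age"] < lower) | (df["age"] > upper)]

# Option 1: cap (winsorize) extreme values rather than dropping the rows.
df["age_capped"] = df["age"].clip(lower, upper)

# Option 2: binning smooths noise by replacing exact ages with age groups.
df["age_group"] = pd.cut(df["age"], bins=[0, 25, 35, 60, 130],
                         labels=["<=25", "26-35", "36-60", "60+"])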
Cont’d
3. Inconsistent Data: Discrepancies/variations in data representation (e.g., codes, names, formats, or units).
❖Examples: "New York" vs. "NY", "Male" vs. "M", different date formats, temperature in Celsius and Fahrenheit in the same column.
❖Handling Strategies:
Standardize formats (e.g., dates: YYYY-MM-DD).
Fix typos (e.g., "USA" vs. "U.S.A.") and casing issues.
Data Type Conversion: Convert data to the appropriate type (e.g., string to numeric).
Unit Conversion: Convert all values to a consistent unit of measurement.
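
A brief sketch of these fixes in pandas (the values, aliases, and the Fahrenheit-to-Celsius rule are illustrative assumptions):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city":  ["New York", "NY", "new york", "Boston"],
    "temp":  [68.0, 20.0, 21.5, 70.0],   # mixed Fahrenheit / Celsius readings
    "unit":  ["F", "C", "C", "F"],
    "price": ["10", "12.5", "7", "9"],   # numbers stored as strings
})

# Standardize casing and map known aliases to one canonical form.
df["city"] = df["city"].str.title().replace({"Ny": "New York"})

# Data type conversion: string -> numeric.
df["price"] = df["price"].astype(float)

# Unit conversion: express every temperature in Celsius.
df["temp_c"] = np.where(df["unit"] == "F", (df["temp"] - 32) * 5 / 9, df["temp"])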


Cont’d

4. Remove Duplicate Records:
Duplicates are identical rows in the dataset that can skew analysis and model performance.
Use .duplicated() and .drop_duplicates() in pandas.
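
For example (toy rows with made-up names):

import pandas as pd

df = pd.DataFrame({"name": ["Abebe", "Sara", "Abebe"],
                   "score": [85, 90, 85]})

print(df.duplicated())            # flags the repeated row as a duplicate
df_clean = df.drop_duplicates()   # keeps the first occurrence by default

# Deduplicate on a subset of columns if only some fields define identity.
df_clean = df.drop_duplicates(subset=["name"], keep="first")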


Data Preprocessing: Feature Encoding

❖Feature encoding is a technique used to convert categorical variables (nominal or ordinal data) into numerical values.
❖Machine learning algorithms typically work with numerical data, so categorical variables need to be converted into a numerical format.
Encoding techniques:
❖One-Hot Encoding: Creates a binary column for each category.
E.g., Color: [Red, Blue] → [1, 0], [0, 1]
❖Ordinal Encoding: Assigns integers based on rank/order.
E.g., Education: [High School, Bachelor’s, Master’s] → [0, 1, 2]
❖Label Encoding: Converts categorical data (text labels) into numerical values by assigning a unique integer to each category.
E.g., Red = 0, Green = 1, Blue = 2
❖Binary Encoding: Converts category codes into binary digits and splits the digits across columns.
E.g., 1st category → 000, 2nd category → 001
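
A minimal sketch of these encodings with pandas and scikit-learn (toy data; binary encoding is typically done with the third-party category_encoders package and is omitted here):

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df = pd.DataFrame({
    "color":     ["Red", "Blue", "Green", "Red"],
    "education": ["High School", "Master's", "Bachelor's", "Bachelor's"],
})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding: integers that respect a meaningful order.
order = [["High School", "Bachelor's", "Master's"]]
df["education_ord"] = OrdinalEncoder(categories=order).fit_transform(df[["education"]]).ravel()

# Label encoding: an arbitrary unique integer per category.
df["color_label"] = LabelEncoder().fit_transform(df["color"])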
Data Preprocessing: Feature Scaling and Transformation

❖Feature scaling and transformation techniques are applied to numerical features to standardize their range or change their distribution. This is crucial for many machine learning algorithms that are sensitive to the scale of the input features.
Why are Feature Scaling and Transformation Important?
❖Algorithm Performance: Many algorithms, particularly those based on distance
calculations (like K-Nearest Neighbors, K-Means) or gradient descent (like Linear
Regression, Logistic Regression, Neural Networks), perform better and converge
faster when features are on a similar scale. Features with larger magnitudes can
dominate the learning process if not scaled.
❖Avoiding Bias: Scaling ensures that no single feature unduly influences the model
due to its large values.
❖Meeting Algorithm Assumptions: Some algorithms assume that features are
normally distributed or have a specific range. Transformations can help meet
these assumptions.
Data Preprocessing: Feature Scaling

❖Many machine learning models (e.g., SVM, k-NN) are sensitive to the scale of the data.
❖Feature scaling ensures that features contribute equally to the model.
❖For example, if one feature has a range of 0-1 and another has a range of 0-1000, the second feature may dominate the model's learning process.
❖Feature scaling is important for algorithms that rely on distance calculations (e.g., k-NN, SVM) or gradient descent optimization (e.g., linear regression).
Cont’d
❖Feature Scaling/Normalization Techniques
Normalization (Min-Max Scaling):
• Scales values to a range between 0 and 1.
• Formula: X_scaled = (x − min(x)) / (max(x) − min(x))
• This method is sensitive to outliers.
Standardization (Z-Score Scaling):
• Transforms features to have a mean of 0 and a standard deviation of 1 (mean = 0, std = 1).
• Formula: X_scaled = (x − mean) / std_dev
• This method is less sensitive to outliers than Min-Max scaling and is suitable when the data follows a Gaussian-like distribution.
Cont’d

Robust Scaling:
o Scales features using the median and the interquartile range (IQR). This method is robust to outliers because it uses statistics that are less affected by extreme values.
o Formula: X_scaled = (X − median) / IQR
Absolute Maximum Scaling:
o Scales features by dividing by the maximum absolute value, which maps them into the range [-1, 1].
o Formula: X_scaled = X / max(|X|)
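
The four scalers above map directly onto scikit-learn classes; a small sketch (the toy array is made up):

import numpy as np
from sklearn.preprocessing import (MinMaxScaler, StandardScaler,
                                   RobustScaler, MaxAbsScaler)

X = np.array([[1.0], [5.0], [10.0], [100.0]])   # one feature with a large value

X_minmax = MinMaxScaler().fit_transform(X)      # range [0, 1]; sensitive to the 100
X_std    = StandardScaler().fit_transform(X)    # mean 0, std 1
X_robust = RobustScaler().fit_transform(X)      # (x - median) / IQR
X_maxabs = MaxAbsScaler().fit_transform(X)      # x / max(|x|), range [-1, 1]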
Data Preprocessing: Feature Transformation

❖ Modifying the distribution of a feature. It can help reduce skewness, normalize the distribution, and make the data more interpretable.
❖ It is often used to make the data more suitable for analysis or modeling.
❖ Feature transformation techniques:
Log Transformation: Applies a logarithmic function to reduce skewness and can help make the distribution more symmetric.
Square Root Transformation: Applies the square root to reduce skewness.
Box-Cox Transformation: A family of power transformations that stabilize variance and make the data's distribution more normal-like.
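
A short sketch of these transformations (right-skewed toy values; note that Box-Cox requires strictly positive data):

import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer

x = np.array([1.0, 2.0, 3.0, 10.0, 200.0])   # right-skewed toy data

x_log  = np.log1p(x)      # log transform (log1p also handles zeros safely)
x_sqrt = np.sqrt(x)       # square-root transform

x_boxcox, lam = stats.boxcox(x)              # Box-Cox picks the best power lambda

# scikit-learn equivalent that can sit inside a preprocessing pipeline.
x_pt = PowerTransformer(method="box-cox").fit_transform(x.reshape(-1, 1))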
Methods of Dimensionality Reduction
❖Dimensionality reduction reduces the number of features (dimensions) while preserving important information.
❖This can be done by either feature selection (selecting a subset of the original features) or feature extraction (creating new features as combinations of the original ones).
❖Combating the curse of dimensionality: in high-dimensional spaces, data becomes sparse, making it difficult for models to find patterns and increasing the risk of overfitting.
❖Reducing data to 2 or 3 dimensions allows for plotting and visual exploration of the data's structure.
How does dimensionality reduction improve performance? (Worked mathematical example shown on the slide.)
Methods of Dimensionality Reduction (diagram slides, including Linear Discriminant Analysis).
Dimensionality reduction – Feature Selection

❖Feature selection is the process of choosing a subset of relevant features from the original dataset.
Keep only the most relevant features to reduce overfitting and improve performance.
❖In feature selection, we are interested in finding the k of the n total features that give us the most information, and we discard the other (n − k) dimensions.
Importance of Feature Selection
❖Identifying Important Features: Feature selection helps in understanding which features are most influential in predicting the target variable.
❖Reduces Overfitting: Fewer features mean a simpler model that is less likely to fit noise.
❖Improves Accuracy: Irrelevant or redundant features can confuse models.
❖Reduces Training Time: Fewer dimensions mean faster computations.
❖Enhances Interpretability: Models with fewer features are often easier to understand and interpret.
❖Curse of Dimensionality: As the number of features increases, the amount of data needed to generalize well grows exponentially.
❖Remove Redundant Features: Features that provide similar information (highly correlated features).
❖Remove Irrelevant Features: Features that have little to no impact on the target variable.
Feature Selection Techniques:

❖Feature selection methods can be broadly categorized into three types:
Filter Methods
Wrapper Methods
Embedded Methods
Feature Selection: Filter Methods
❖These methods select/evaluate features based on statistical measures of their relationship with the target variable, independent of the machine learning algorithm used. They are computationally inexpensive.
Correlation Matrix: Analyze the correlation between features and the target variable, as well as the correlation between features themselves (to identify multicollinearity).
Statistical Tests: Use tests such as the Chi-squared test (for categorical features) or ANOVA (for numerical features).
Variance Threshold: Remove features with very low variance, as they have little predictive power.
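
A hedged sketch of filter methods on a built-in dataset (the threshold and k values are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

X, y = load_iris(return_X_y=True, as_frame=True)

# Variance threshold: drop near-constant features.
X_vt = VarianceThreshold(threshold=0.1).fit_transform(X)

# Univariate statistical test (ANOVA F-test) between each feature and the target.
X_best = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Correlation between the features themselves, to spot redundancy/multicollinearity.
feature_correlations = X.corr()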
Feature Selection: Wrapper Methods

❖Use a specific machine learning model to evaluate the usefulness of feature subsets.
❖These methods evaluate subsets of features by training and evaluating a machine learning model on each subset, treating feature selection as a search problem.
❖They are computationally more expensive than filter methods but can capture interactions between features.
Forward Selection: Start with an empty set of features and iteratively add the feature that best improves model performance.
Backward Elimination: Start with all features and iteratively remove the least significant feature based on model performance.
Recursive Feature Elimination (RFE): Recursively build a model and remove the least important feature(s) at each iteration.
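
A sketch of RFE and forward selection with scikit-learn (the dataset, estimator, and number of features to keep are illustrative choices):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Recursive Feature Elimination: repeatedly drop the least important feature.
X_rfe = RFE(estimator=model, n_features_to_select=5).fit_transform(X, y)

# Forward selection: start empty and greedily add the most helpful feature.
sfs = SequentialFeatureSelector(model, n_features_to_select=5, direction="forward")
X_forward = sfs.fit_transform(X, y)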
Feature Selection: Embedded Methods

❖These methods perform feature selection as part of the model training process; the selection is built into the algorithm itself.
Lasso Regression (L1 regularization): Adds a penalty on the absolute values of the coefficients, which can shrink some coefficients to exactly zero, effectively performing feature selection.
Ridge Regression (L2 regularization): Adds a penalty on the squared values of the coefficients. While it shrinks coefficients, it does not typically force them to zero.
Tree-based models (e.g., Random Forests, Gradient Boosting): These models inherently provide measures of feature importance, which can be used for selection.
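
As a rough sketch (the alpha value and dataset are arbitrary; a larger alpha drives more Lasso coefficients to zero):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# L1 regularization can shrink some coefficients exactly to zero.
lasso = Lasso(alpha=1.0).fit(X_scaled, y)
kept = np.flatnonzero(lasso.coef_)           # indices of features with non-zero weights

# Tree ensembles expose built-in feature importances.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]   # most to least important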
Feature Visualization Techniques for Selection:

❖Feature visualization plays a crucial role in understanding the data and supporting the feature selection process.
Histograms: Understand the distribution and skewness of individual features.
Box Plots: Visualize the distribution, identify outliers in numerical features, and compare distributions across categories.
Correlation Matrix Heatmap: Visualize correlations between all pairs of numerical features; helps identify highly correlated features (potential redundancy) and features correlated with the target.
Cont’d
Scatter Plots: Visualize relationships between pairs of numerical features.
Pair Plots (Scatter Plot Matrix): Display scatter plots for all pairs of features, along with histograms of individual features. Useful for exploring relationships in smaller datasets.
Bar Charts: Visualize the frequency distribution of categorical features or the relationship between a categorical feature and a numerical target (e.g., the mean of the target for each category).
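
A minimal sketch of two of these plots with seaborn (the Iris dataset stands in for your own data):

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

data = load_iris(as_frame=True).frame        # features plus the target column

# Correlation heatmap: spot highly correlated (redundant) feature pairs.
sns.heatmap(data.corr(), annot=True, cmap="coolwarm")
plt.show()

# Pair plot: scatter plots for every feature pair plus per-feature histograms.
sns.pairplot(data, hue="target")
plt.show()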
Dimensionality Reduction – Feature Extraction
Feature extraction creates new features as combinations of the original ones. Common techniques:
1. Principal Component Analysis (PCA): An unsupervised linear transformation that projects the data onto the directions of maximum variance (the principal components).

2. Linear Discriminant Analysis (LDA): A supervised linear transformation; LDA maximizes the separability between the known classes.

3. t-Distributed Stochastic Neighbor Embedding (t-SNE): An unsupervised non-linear technique used mainly for visualizing high-dimensional data in two or three dimensions.
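
A compact sketch of all three techniques on the Iris data (the choice of dataset and of 2 components is illustrative):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)     # PCA and t-SNE are scale-sensitive

# PCA: unsupervised, keeps the directions of maximum variance.
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# LDA: supervised, maximizes separation between the known classes.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_scaled, y)

# t-SNE: non-linear, mainly for 2-D/3-D visualization of structure.
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X_scaled)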
