3-Data Preprocessing


Roadmap of ML Journey

Data Pre-Processing/Data Munging/Data Wrangling

• Data preprocessing follows EDA and involves making data modifications based on
the insights gained during EDA.
• It focuses on cleaning, transforming, selecting features and preparing the data to
make it suitable for modelling.
• It aims to ensure data quality, handle missing values, deal with outliers, and convert
data into a format that can be fed into machine learning algorithms.
• The output of data preprocessing is a cleaned and transformed dataset that is ready
for modelling. This often involves creating a structured and processed dataset that is
free from data quality issues.
Data Pre-Processing

• Data preprocessing involves the following tasks:


• Data Cleaning
• Handling Missing Value
• Handling Outliers
• Data Transformation
• Feature Scaling
• Feature Encoding
• Feature Engineering
• Handling Imbalanced Data
• Under Sampling
• Over Sampling
• Data Reduction
• Dimensionality Reduction
• Instance Reduction
• Data Splitting
Data Pre-Processing

• Data Cleaning:
• Handling Missing Data
• Handling Outliers
Data Pre-Processing: Data Cleaning

• Handling Missing Data:


• The presence of missing data can significantly impact the quality and reliability of analyses.
Some common techniques for handling missing values:
• Removal of Missing Values: Remove entire rows with missing values. This is suitable when the missing
values are sporadic and removing a few rows doesn't significantly impact the dataset.
• Imputation Techniques:
• Mean, Median, or Mode Imputation: Fill missing values with the mean, median, or mode of the non-missing
values in the variable. Mean and median imputation suit numerical data; mode imputation also works for categorical data.
• Forward Fill (or Backward Fill): Propagate the last observed value forward (or the next observed value
backward) to fill in missing values in time-series or sequential data.
• K-Nearest Neighbors (KNN) Imputation: Estimate missing values based on the values of their k-nearest
neighbors. This method considers the similarity between data points.
• Predictive Modelling:
• Use Machine Learning Models: Treat the feature with missing values as the target variable, train a model on the
rows where it is observed, and use the trained model to predict the missing entries.
• The choice of method depends on the nature of the data, the extent of missingness, and the
assumptions made about the missing data mechanism. It is advisable to explore and understand the
reasons for missingness before choosing an appropriate technique.
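A minimal sketch of the techniques above using pandas and scikit-learn; the small DataFrame and its "age" and "income" columns are made up for illustration:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical dataset with missing values
df = pd.DataFrame({"age": [25, np.nan, 31, 40, np.nan],
                   "income": [50000, 62000, np.nan, 81000, 58000]})

# Removal: drop rows that contain any missing value
dropped = df.dropna()

# Mean imputation: fill each column's NaNs with the column mean
mean_imputed = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df),
                            columns=df.columns)

# Forward fill: propagate the last observed value (useful for sequential data)
ffilled = df.ffill()

# KNN imputation: estimate NaNs from the 2 most similar rows
knn_imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                           columns=df.columns)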
Data Pre-Processing: Data Cleaning

• Handling Outliers:
• Outliers are data points that deviate significantly from the rest of the dataset. They can be
identified through visual inspection (box plots, scatter plots, histograms) or by statistical
methods such as the Z-score or the IQR (Interquartile Range) method. Some common techniques to
handle outliers are:
• Removal of Outlier Values: Remove entire rows with outlier values. This is suitable when the outlier
values are sporadic and removing a few rows doesn't significantly impact the dataset.
• Transformation Techniques: Transformations can help make the data distribution more symmetric and
reduce the impact of outliers. Common transformations include taking the data's logarithm, square
root, or cube root.
• Clipping or Capping: Set a threshold beyond which values are replaced with the threshold value. This
helps to limit the impact of extreme values without removing them entirely.
• Imputation: Replace outliers with the mean or median of the variable. Use machine learning models
to predict and replace outlier values.
• Ensemble Methods: Use ensemble techniques that combine multiple models to mitigate the impact of
outliers.
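A minimal sketch of IQR-based detection, removal, clipping, and a log transform, using pandas and NumPy on a made-up numeric column:

import numpy as np
import pandas as pd

# Hypothetical numeric column with one extreme value
s = pd.Series([12, 14, 13, 15, 14, 13, 120])

# Detect outliers with the IQR rule
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (s < lower) | (s > upper)

# Removal: keep only the non-outlier rows
cleaned = s[~is_outlier]

# Clipping / capping: replace values beyond the bounds with the bounds themselves
clipped = s.clip(lower=lower, upper=upper)

# Transformation: a log transform compresses large values and reduces outlier impact
logged = np.log1p(s)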
Data Pre-Processing

• Data Transformation:
• Feature Scaling: Scale the numerical features to a common scale to avoid issues caused by
different ranges or units. Common scaling methods include:
• Standardization / Z-Score Normalization: X_new = (X - mean) / std. The result is not bounded to a fixed range.
• Normalization / Min-Max Scaling: X_new = (X - X_min) / (X_max - X_min). This scales values to the range [0, 1].
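A minimal sketch of both scaling methods with scikit-learn; the toy matrix is made up:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standardization: (X - mean) / std per column; not bounded to a fixed range
X_std = StandardScaler().fit_transform(X)

# Min-max normalization: (X - min) / (max - min) per column; scaled to [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)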
Data Pre-Processing

• Data Transformation:
• Feature Encoding: Convert categorical features into numerical representations that machine
learning algorithms can understand.
• One-hot encoding: Convert categorical variables into binary vectors, creating a binary column for each
category.
• Label encoding: Label Encoding is appropriate when encoding target variables, especially for
categorical variables with no inherent order. E.g., Red:1, Blue:2, Yellow:3
• Ordinal encoding: Ordinal Encoding is suitable when categorical variables have an inherent order or
ranking. In ordinal encoding, labels are translated to numbers based on their ordinal relationship to
one another. E.g., Small:1, Medium:2, Large:3
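A minimal sketch of the three encodings using pandas and scikit-learn; the "colour" and "size" columns are hypothetical:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df = pd.DataFrame({"colour": ["Red", "Blue", "Yellow", "Blue"],
                   "size": ["Small", "Large", "Medium", "Small"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["colour"], prefix="colour")

# Label encoding: an arbitrary integer per category (typically used for target variables)
colour_labels = LabelEncoder().fit_transform(df["colour"])

# Ordinal encoding: integers that respect an explicitly declared order
ordinal = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
size_codes = ordinal.fit_transform(df[["size"]])   # Small=0, Medium=1, Large=2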
Data Pre-Processing

• Data Transformation:
• Feature Engineering: Create new features or derive meaningful information from existing
features to enhance the predictive power of the dataset.
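A minimal sketch of deriving new features with pandas; the raw columns ("total_spend", "n_orders", "signup_date") and the derived features are purely illustrative:

import pandas as pd

df = pd.DataFrame({"total_spend": [120.0, 300.0, 75.0],
                   "n_orders": [3, 10, 5],
                   "signup_date": pd.to_datetime(["2021-01-15", "2020-06-01", "2022-03-20"])})

# Ratio feature: average value per order
df["avg_order_value"] = df["total_spend"] / df["n_orders"]

# Date components and elapsed time as new features
df["signup_year"] = df["signup_date"].dt.year
df["tenure_days"] = (pd.Timestamp("2023-01-01") - df["signup_date"]).dt.days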
Data Pre-Processing

• Handling Imbalanced Data:


• Address class imbalance issues, where one class is significantly underrepresented compared to
others.
• Under Sampling: Under-sampling is a technique used to balance the class distribution in a dataset by
reducing the number of instances in the majority class.
• Random Under sampling: Randomly remove instances from the majority class until a more balanced
distribution is achieved.
• Cluster-Based Under-sampling: Identify clusters in the majority class and retain only the centroids of these
clusters.
• Tomek Links: Identify pairs of instances (one from the majority class and one from the minority class) that are
nearest neighbors but of different classes. Remove the majority class instance of each pair.
• Edited Nearest Neighbors: Identify instances in the majority class that are misclassified by their nearest
neighbors (based on a specified metric) and remove them.
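A minimal sketch of random under-sampling and Tomek links, assuming the imbalanced-learn (imblearn) package is available; the dataset is synthetic:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler, TomekLinks

# Synthetic dataset: roughly 90% majority class, 10% minority class
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Random under-sampling: drop majority-class rows until the classes are balanced
X_rus, y_rus = RandomUnderSampler(random_state=42).fit_resample(X, y)

# Tomek links: remove majority-class points that form cross-class nearest-neighbour pairs
X_tl, y_tl = TomekLinks().fit_resample(X, y)

print(Counter(y), Counter(y_rus), Counter(y_tl))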
Data Pre-Processing

• Handling Imbalanced Data:


• Over Sampling: Oversampling is a technique used to balance the class distribution in a dataset by
increasing the number of instances in the minority class.
• Random Oversampling: Randomly duplicate instances from the minority class until a more balanced
distribution is achieved.
• SMOTE (Synthetic Minority Over-sampling Technique): Generates synthetic instances for the minority class by
interpolating between existing instances.
• ADASYN (Adaptive Synthetic Sampling): Similar to SMOTE but introduces adaptive weights to generate more
synthetic examples for instances that are harder to learn.
• Borderline-SMOTE: Applies SMOTE selectively to instances near the decision boundary (borderline instances).
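A minimal sketch of random over-sampling and SMOTE, again assuming imbalanced-learn is installed and using a synthetic dataset:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Random over-sampling: duplicate minority-class rows
X_ros, y_ros = RandomOverSampler(random_state=42).fit_resample(X, y)

# SMOTE: synthesise new minority-class points by interpolating between neighbours
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)

print(Counter(y), Counter(y_ros), Counter(y_sm))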
Data Pre-Processing

• Data Reduction:
• Data reduction refers to the process of reducing the volume of the data while producing the same or
similar analytical results.
• The goal is to simplify the dataset while retaining its essential characteristics, making it more
manageable, efficient, and suitable for analysis or modelling.
• Data reduction techniques are particularly useful when dealing with large datasets or datasets
with many features.
• There are two main types of data reduction:
• Dimensionality Reduction: Reduce the number of input features while preserving important
information.
• Instance Reduction: Reduce the number of rows without losing information.
Data Pre-Processing

• Data Reduction:
• Dimensionality Reduction: Reduce the number of input features while preserving important
information.
• Principal Component Analysis (PCA): PCA is a technique that transforms the original features into a
new set of uncorrelated features called principal components. It allows for reducing the dataset's
dimensionality while retaining as much variance as possible.
• t-distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a nonlinear dimensionality reduction
technique that is particularly effective for visualizing high-dimensional data in lower-dimensional
spaces.
• Linear Discriminant Analysis (LDA): LDA is a supervised dimensionality reduction technique that aims
to maximize the separation between different classes in the dataset.
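A minimal PCA sketch with scikit-learn, using the built-in Iris dataset as a stand-in (4 features reduced to 2 components):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)            # 4 original features

# Project the data onto the first 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                        # (150, 2)
print(pca.explained_variance_ratio_)          # variance retained by each component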
Data Pre-Processing

• Data Reduction:
• Instance Reduction: Reduce the number of rows without losing information.
• Sampling: Random or systematic sampling can select a subset of instances from the dataset. This can
help reduce the dataset size while maintaining its statistical properties.
• Clustering: Clustering techniques, such as k-means clustering, can be applied to group similar
instances together. Instead of using all instances, representative centroids or cluster prototypes can be
used, reducing the number of instances while preserving the overall structure.
• Data Binning or Discretization: Continuous data can be converted into discrete intervals, reducing the
number of unique values and simplifying the dataset.
• Outlier Removal: Instances that are considered outliers can be removed from the dataset. Outliers
may introduce noise and distort the analysis.
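A minimal sketch of sampling, binning, and cluster-based reduction with pandas and scikit-learn; the single "value" column is synthetic:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({"value": np.random.default_rng(0).normal(size=10_000)})

# Sampling: keep a random 10% of the rows
sampled = df.sample(frac=0.1, random_state=0)

# Binning / discretization: replace continuous values with interval labels
df["value_bin"] = pd.cut(df["value"], bins=5)

# Cluster-based reduction: represent the data by 10 k-means centroids
centroids = KMeans(n_clusters=10, n_init=10, random_state=0).fit(df[["value"]]).cluster_centers_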
Data Pre-Processing

• Data Splitting:
• Splitting into Training and Test Sets
• Divide the dataset into training and testing subsets.
• The training set trains the machine learning model, while the testing set evaluates its performance on
unseen data.
• Common splits are 70-30, 80-20, or 90-10, depending on the dataset size.
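A minimal sketch of an 80-20 split with scikit-learn's train_test_split; the Iris dataset is a stand-in:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 80-20 split; stratify=y keeps the class proportions similar in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)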
Data Pre-Processing

• Data Splitting:
• Cross-Validation: Because the data points assigned to the training and testing sets change with each
random split, the measured accuracy can fluctuate from split to split. Cross-validation is used to obtain a
more stable performance estimate. Common cross-validation techniques include:
• K-fold cross-validation
• Stratified cross-validation
Data Pre-Processing

• Data Splitting:
• K-fold cross-validation: The most common form of cross-validation is k-fold cross-validation,
where the dataset is divided into k folds, and the model is trained and tested k times.
• E.g., Let us opt for 6-fold cross-validation.
• Then we divide the dataset into 6 parts {a, b, c, d, e, f} and repeat the process 6 times:
• 1st pass: Train: {a, b, c, d, e} Test:{f}
• 2nd pass: Train: {a, b, c, d, f} Test:{e}
• 3rd pass: Train: {a, b, c, f, e} Test:{d}
• 4th pass: Train: {a, b, f, d, e} Test:{c}
• 5th pass: Train: {a, f, c, d, e} Test:{b}
• 6th pass: Train: {f, b, c, d, e} Test:{a}
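A minimal sketch of 6-fold cross-validation with scikit-learn; the logistic regression model and the Iris dataset are placeholders:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 6 folds: train on 5 folds and test on the held-out fold, repeated 6 times
cv = KFold(n_splits=6, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(scores)          # one accuracy value per pass
print(scores.mean())   # averaged performance estimate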
Data Pre-Processing

• Data Splitting:
• Stratified cross-validation:
• In standard k-fold cross-validation, there is no guarantee that each fold will maintain the same class
distribution as the original dataset.
• This can be problematic when dealing with imbalanced datasets, as it might result in some folds
having a significantly different class distribution from others.
• Stratified cross-validation addresses this issue by ensuring that each fold maintains a similar class
distribution to the original dataset.
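A minimal sketch showing that StratifiedKFold preserves the class ratio in every fold; the 90/10 label vector is made up:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)            # imbalanced labels: 90% vs 10%

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # each test fold keeps roughly the 90/10 ratio of the full dataset
    print(np.bincount(y[test_idx]))          # e.g. [18  2]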
Data Pre-Processing

• Feature Selection:
• Feature selection is a crucial step in the data preprocessing phase of the machine learning life
cycle.
• It involves choosing a subset of relevant features from the original set of features to improve
model performance, reduce overfitting, and speed up training.
• The choice of feature selection method depends on the specific characteristics of the dataset
and the problem at hand.
• Sometimes, a combination of multiple techniques may be used for more effective feature
selection.
• There are various techniques for feature selection, and here are some commonly used methods:
Data Pre-Processing

• Feature Selection:
• Filter Method:
• Variance Threshold: If the values of a feature do not vary much from one data point to another, it
indicates low variance. Features with low variance may not provide much information. This method
removes features with variance below a certain threshold.
• Correlation-Based Methods: Features with high correlation are more linearly dependent and hence
have almost the same effect on the dependent variable. Features that are highly correlated with each
other may not contribute much additional information. So, when two features have a high correlation,
we can drop one of the two features.
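A minimal sketch of a variance threshold and a correlation filter; the DataFrame and the 0.1 / 0.95 cut-offs are illustrative choices:

import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100),
                   "b": rng.normal(size=100),
                   "constant": np.full(100, 1.0)})
df["a_copy"] = df["a"] * 2 + rng.normal(scale=0.01, size=100)   # nearly a duplicate of "a"

# Variance threshold: drop features whose variance falls below the cut-off
selector = VarianceThreshold(threshold=0.1)
kept = df.columns[selector.fit(df).get_support()]               # "constant" is dropped

# Correlation filter: drop one of each pair of features with |corr| > 0.95
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]   # ["a_copy"]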
Data Pre-Processing

• Feature Selection:
• Wrapper Methods:
• The wrapper method treats feature selection as a search problem: different combinations of features
are prepared, evaluated, and compared against one another.
• It assesses the quality of learning with different subsets of features against an evaluation criterion,
so the output is the model's performance for each candidate set of features.
• The user can then select the set of features for which the model's performance is best.
• Because evaluating all possible combinations of features is usually infeasible, wrapper methods
typically follow a greedy search strategy, iteratively training the algorithm on candidate subsets of features.
Data Pre-Processing

• Feature Selection:
• Wrapper Methods:
• Forward Selection: An iterative method in which we start with no features in the model. In each
iteration, we add the feature that best improves the model, and we stop when adding a new feature
no longer improves the model's performance.
• Backward Elimination: We start with all the features and, at each iteration, remove the least
significant feature, provided doing so improves the model's performance. We repeat this until no
further improvement is observed from removing features.
• Recursive Feature Elimination: A greedy optimization algorithm that aims to find the best-performing
feature subset. It repeatedly builds models, sets aside the best or worst performing feature at each
iteration, and constructs the next model with the remaining features until all features are exhausted.
It then ranks the features according to the order of their elimination.
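A minimal sketch of recursive feature elimination with scikit-learn's RFE; the decision tree estimator, the breast cancer dataset, and the target of 10 features are arbitrary choices:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)          # 30 features

# Repeatedly fit the model and discard the weakest feature until 10 remain
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=10)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; larger ranks were eliminated earlier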
Data Pre-Processing

• Feature Selection:
• Embedded Methods: Embedded methods combine the qualities of filter and wrapper methods.
Embedded methods are algorithm-based feature selection methods where feature selection is
integrated with the machine learning algorithm that helps extract relevant features.
• LASSO (Least Absolute Shrinkage and Selection Operator): LASSO adds a penalty term to the linear
regression cost function, promoting sparsity in the model coefficients and automatically selecting
important features.
• Decision Tree-based Methods: Decision trees can be used to evaluate feature importance, and tree-
based ensemble models like Random Forest or Gradient Boosting often have built-in feature
importance measures.
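A minimal sketch of both embedded approaches with scikit-learn; the diabetes dataset and alpha=1.0 are placeholder choices:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# LASSO: the L1 penalty drives the coefficients of unimportant features to zero
lasso = Lasso(alpha=1.0).fit(X_scaled, y)
selected = np.flatnonzero(lasso.coef_ != 0)          # indices of the features LASSO kept

# Tree-based importance: Random Forest exposes feature_importances_
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(selected)
print(forest.feature_importances_)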
Data Pre-Processing

• Feature Selection:
• Difference between Filter and Wrapper methods.
• Filter methods measure the relevance of features by their correlation with dependent variables, while
wrapper methods measure the usefulness of a subset of features by actually training a model on it.
• Filter methods are much faster than wrapper methods, as they do not involve training models; wrapper
methods, on the other hand, are computationally very expensive.
• Filter methods use statistical methods to evaluate a subset of features, while wrapper methods use
cross-validation.
• Filter methods may fail to find the best subset of features on many occasions, whereas wrapper methods
are more likely to find a well-performing subset.
• Using the subset of features from the wrapper methods makes the model more prone to overfitting as
compared to using a subset of features from the filter methods.
