Data Leakage


What is Data Leakage?

Data leakage in the context of machine learning refers to the unintentional or inappropriate inclusion of information in the training data that the model would not have access to during the actual deployment or prediction phase.
Types of Data Leakage

Target Leakage
Train-Test Contamination
What is Target leakage?

Target leakage refers to the inadvertent inclusion of information in the features of an ML model that either directly represents the target variable or serves as a strong proxy for it.

In short, target leakage occurs when the data used for training contains information about what is being predicted.
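
A minimal sketch of how a proxy feature can be caught, assuming each feature can be annotated with the date its value becomes known. The feature names, the dates, and the helper `split_by_availability` are hypothetical and serve only as illustration.

```python
from datetime import date

# Hypothetical: the moment at which the model has to make its prediction.
PREDICTION_DATE = date(2024, 3, 1)

# Hypothetical metadata: the date each feature value becomes known.
feature_recorded_on = {
    "age": date(2024, 1, 10),
    "weight": date(2024, 2, 20),
    "took_antibiotic": date(2024, 3, 5),  # recorded after the outcome is known
}

def split_by_availability(recorded_on, prediction_date):
    """Separate features known by prediction time from those recorded later."""
    safe, leaky = [], []
    for name, recorded in recorded_on.items():
        if recorded <= prediction_date:
            safe.append(name)
        else:
            leaky.append(name)
    return safe, leaky

safe_features, leaky_features = split_by_availability(
    feature_recorded_on, PREDICTION_DATE
)
print("keep:", safe_features)
print("drop:", leaky_features)
```

A feature that is only recorded after the outcome is known acts as a proxy for the target, so it is dropped here even though it would look highly predictive during training.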
What is Train-Test Contamination?

Train-test contamination typically refers to a situation where information from the testing dataset inadvertently leaks into the training process, for example when preprocessing statistics are computed on the full dataset before it is split.

In short, train-test contamination occurs when the data used for training contains information about the testing set.
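
One way this can happen is through missing value imputation. The sketch below uses made-up numbers and hypothetical helpers (`fit_mean`, `impute`) to contrast a fill value computed on train and test together with one computed on the training rows only.

```python
from statistics import mean

# Hypothetical feature column; None marks a missing value.
train = [1.0, 2.0, None, 3.0]
test = [100.0, None]

def fit_mean(values):
    """Mean of the observed (non-missing) values."""
    return mean(v for v in values if v is not None)

def impute(values, fill):
    """Replace missing entries with the given fill value."""
    return [fill if v is None else v for v in values]

# Contaminated: the fill value is computed from train and test together,
# so the training rows now carry information about the test set.
leaky_fill = fit_mean(train + test)
leaky_train = impute(train, leaky_fill)

# Clean: the fill value comes from the training rows only
# and is then reused unchanged on the test rows.
clean_fill = fit_mean(train)
clean_train = impute(train, clean_fill)
clean_test = impute(test, clean_fill)

print("leaky fill:", leaky_fill)
print("clean fill:", clean_fill)
```

In the contaminated variant the large test value pulls the fill value far away from anything seen in training, which shows that the training data has absorbed information from the test set.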
What is the Fuss about Data Leakage?

Data leakage can cause the model to learn patterns that do not generalize effectively to unseen data.

Data leakage undermines the integrity of the training process by exposing the model to information it wouldn't realistically have access to when making predictions in real-world scenarios. This can lead to overoptimistic performance estimates during model evaluation but poor performance in practice.
What Causes Data Leakage?

Leakage typically arises due to mistakes made by data scientists throughout the various stages of the model lifecycle.
[Figure: reaction to getting good model accuracy, before and after learning about data leakage]
Tips to Detect Data Leakage

Review Features: Check for features revealing target variable information not available during prediction.

Unexpectedly High Performance: If model performance seems too good, it may signal data leakage.

Inconsistent Performance: Notice large performance gaps between training/validation and unseen data.

Model Interpretability: Use feature importance to identify suspicious features that might indicate data leakage.
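
The feature review above can be partly automated. The following is a rough sketch, not a definitive test: it flags numeric features whose Pearson correlation with the target is suspiciously close to 1. The threshold of 0.95, the column names, and the data are assumptions chosen for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)

def suspicious_features(columns, target, threshold=0.95):
    """Names of features that track the target almost perfectly."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    ]

# Hypothetical toy data.
target = [0, 1, 0, 1, 1, 0]
columns = {
    "age": [34, 51, 45, 29, 62, 48],
    "refund_issued": [0, 1, 0, 1, 1, 0],  # mirrors the target exactly
}

print(suspicious_features(columns, target))
```

A flagged feature is only a candidate for review. Whether it is a genuine leak still depends on whether its value would be available at prediction time.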
Tips to Mitigate Data Leakage

Remove features that reveal the target column.

Eliminate duplicate rows and columns before splitting the data.

Split the data before any preprocessing, including missing value imputation, scaling, normalization, and over/under-sampling. Do not preprocess the holdout set before the final model evaluation.

Incorporate domain knowledge during feature selection to ensure top features will be available during production.
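
The advice to remove duplicates before splitting can be sketched as follows. The rows, the split fraction, and the helper names (`deduplicate`, `train_test_split_rows`) are hypothetical; the point is only that an identical row cannot end up on both sides of the split once duplicates are gone.

```python
import random

def deduplicate(rows):
    """Drop exact duplicate rows, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def train_test_split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and cut off a test portion."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical rows of (feature, category, label); two rows appear twice.
rows = [
    (5.1, "a", 0), (4.9, "b", 1), (5.1, "a", 0), (6.2, "c", 1),
    (4.9, "b", 1), (5.8, "d", 0), (6.0, "e", 1), (5.5, "f", 0),
]

unique_rows = deduplicate(rows)
train_rows, test_rows = train_test_split_rows(unique_rows)

# No row should appear on both sides of the split.
overlap = set(train_rows) & set(test_rows)
print(len(unique_rows), "unique rows, overlap:", overlap)
```

If the split were made on the raw rows instead, a duplicated row could land in both sets, and the model would be scored on an example it had already seen.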

Use pipelines to maintain consistency and minimize data leakage during preprocessing and validation.

Perform data preprocessing within the cross-validation loop to prevent information leakage from the validation set into the training set.

Determine preprocessing parameters using the training data and apply them consistently to the validation and test data if needed.

Avoid using the same validation set for model selection, feature selection, and hyperparameter tuning.

Do not reuse cross-validation splits for multiple tasks. Treat feature selection, model tuning, and model selection as separate ML models requiring individual validation.
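
As an illustration of preprocessing inside the cross-validation loop, the sketch below refits a standardization step on the training portion of every fold and only applies it to the validation portion. It is written in plain Python with hypothetical helpers (`k_fold_indices`, `fit_scaler`, `transform`); a pipeline object from an ML library would typically handle this bookkeeping.

```python
from statistics import mean, pstdev

def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) for contiguous, non-overlapping folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples)
                     if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

def fit_scaler(values):
    """Learn mean and standard deviation from the given values only."""
    return mean(values), (pstdev(values) or 1.0)

def transform(values, params):
    """Standardize values with previously learned parameters."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

feature = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]  # hypothetical feature column
fold_params = []

for train_idx, val_idx in k_fold_indices(len(feature), n_folds=3):
    train_values = [feature[i] for i in train_idx]
    val_values = [feature[i] for i in val_idx]

    params = fit_scaler(train_values)             # fit on the training fold only
    scaled_train = transform(train_values, params)
    scaled_val = transform(val_values, params)    # reuse, never refit
    fold_params.append(params)
    # A model would be fitted on scaled_train and scored on scaled_val here.

print([mu for mu, _ in fold_params])
```

The learned mean differs from fold to fold, which is expected: each fold's parameters reflect only the rows the model is allowed to see in that fold.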

Shuffle data properly across folds during cross-validation, unless the data is time-ordered.

Ensure the validation setup aligns with the problem you intend to solve with the model.

If using model predictions as features, ensure they come from out-of-sample data, especially for ensemble methods like stacked predictions.

Use fit_transform on the training data and only transform on testing or new data.

Be cautious with time series data; avoid random sampling. Use TimeSeriesSplit for splitting.

Maintain a data dictionary and understand the meaning of each column, including unusual values or outliers.
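
The time series advice above can be illustrated with an expanding-window split, in which every test index comes after all training indices. This is a plain-Python sketch in the spirit of the TimeSeriesSplit mentioned above, not that library's implementation; the function name `expanding_window_splits` is hypothetical.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs where the test block always
    follows the training block in time."""
    test_size = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = n_samples - (n_splits - k + 1) * test_size
        train_idx = list(range(train_end))
        test_idx = list(range(train_end, train_end + test_size))
        yield train_idx, test_idx

# Ten observations in chronological order, three splits.
splits = list(expanding_window_splits(n_samples=10, n_splits=3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Because the training window only ever grows forward in time, the model is never fitted on observations that occur after the ones it is evaluated on, which a random shuffle cannot guarantee.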

While there are no foolproof, specific methods to detect data leakage in machine learning, practitioners can minimize the risk by adhering to best practices.
