SMOTE For Imbalanced Classification With Python
Imbalanced classification involves developing predictive models on classification datasets that have
a severe class imbalance.
The challenge of working with imbalanced datasets is that most machine learning techniques will
ignore, and in turn have poor performance on, the minority class, although typically it is
performance on the minority class that is most important.
One approach to addressing imbalanced datasets is to oversample the minority class. The simplest
approach involves duplicating examples in the minority class, although these examples don’t add
any new information to the model. Instead, new examples can be synthesized from the existing
examples. This is a type of data augmentation for the minority class and is referred to as the
Synthetic Minority Oversampling Technique, or SMOTE for short.
In this tutorial, you will discover the SMOTE for oversampling imbalanced classification datasets.
How the SMOTE synthesizes new examples for the minority class.
How to correctly fit and evaluate machine learning models on SMOTE-transformed training
datasets.
How to use extensions of the SMOTE that generate synthetic examples along the class
decision boundary.
Kick-start your project with my new book Imbalanced Classification with Python, including step-by-step tutorials and the Python source code files for all examples.
Tutorial Overview
This tutorial is divided into five parts; they are:
A problem with imbalanced classification is that there are too few examples of the minority class for
a model to effectively learn the decision boundary.
One way to solve this problem is to oversample the examples in the minority class. This can be
achieved by simply duplicating examples from the minority class in the training dataset prior to
fitting a model. This can balance the class distribution but does not provide any additional
information to the model.
An improvement on duplicating examples from the minority class is to synthesize new examples
from the minority class. This is a type of data augmentation for tabular data and can be very
effective.
Perhaps the most widely used approach to synthesizing new examples is called the Synthetic Minority Oversampling Technique, or SMOTE for short. The technique was described by Nitesh Chawla, et al. in their 2002 paper named for the technique, titled “SMOTE: Synthetic Minority Over-sampling Technique.”
SMOTE works by selecting examples that are close in the feature space, drawing a line between the
examples in the feature space and drawing a new sample at a point along that line.
Specifically, a random example from the minority class is first chosen. Then k of the nearest
neighbors for that example are found (typically k=5). A randomly selected neighbor is chosen and a
synthetic example is created at a randomly selected point between the two examples in feature
space.
… SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.
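As a rough illustration of that interpolation step, a synthetic point can be written as a + rand(0, 1) * (b − a). A minimal sketch in code (the helper and variable names below are hypothetical, assuming a and b are NumPy feature vectors):

from numpy import random

def synthesize(a, b):
    # pick a random point along the line segment joining minority example a
    # and its selected neighbor b (a convex combination of the two)
    alpha = random.rand()
    return a + alpha * (b - a)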
This procedure can be used to create as many synthetic examples for the minority class as are required. The paper suggests first using random undersampling to trim the number of examples in the majority class, then using SMOTE to oversample the minority class to balance the class distribution.
The combination of SMOTE and under-sampling performs better than plain under-sampling.
The approach is effective because new synthetic examples from the minority class are created that
are plausible, that is, are relatively close in feature space to existing examples from the minority
class.
Our method of synthetic over-sampling works to cause the classifier to build larger decision regions that contain nearby minority class points.
A general downside of the approach is that synthetic examples are created without considering the
majority class, possibly resulting in ambiguous examples if there is a strong overlap for the classes.
Now that we are familiar with the technique, let’s look at a worked example for an imbalanced
classification problem.
Imbalanced-Learn Library
In these examples, we will use the implementations provided by the imbalanced-learn Python
library, which can be installed via pip as follows:
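The exact command may vary with your environment (a virtual environment or sudo may be needed); a typical invocation is:

pip install imbalanced-learn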
You can confirm that the installation was successful by printing the version of the installed library:
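A minimal version of this check, assuming the package is importable as imblearn, might be:

# check version number
import imblearn
print(imblearn.__version__)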
Running the example will print the version number of the installed library; for example:
0.5.0
SMOTE for Balancing Data
First, we can use the make_classification() scikit-learn function to create a synthetic binary classification dataset with 10,000 examples and a 1:100 class distribution.
...
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
We can use the Counter object to summarize the number of examples in each class to confirm the
dataset was created correctly.
...
# summarize class distribution
counter = Counter(y)
print(counter)
Finally, we can create a scatter plot of the dataset and color the examples for each class a different
color to clearly see the spatial nature of the class imbalance.
...
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
Tying this all together, the complete example of generating and plotting a synthetic binary
classification problem is listed below.
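One way to tie these pieces together into a single runnable script is sketched below (details of the original listing may differ):

# Generate and plot a synthetic imbalanced classification dataset
from collections import Counter
from numpy import where
from matplotlib import pyplot
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()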
Running the example first summarizes the class distribution, confirming the 1:100 ratio, in this case with about 9,900 examples in the majority class and 100 in the minority class.
A scatter plot of the dataset is created showing the large mass of points that belong to the majority
class (blue) and a small number of points spread out for the minority class (orange). We can see
some measure of overlap between the two classes.
Next, we can oversample the minority class using SMOTE and plot the transformed dataset.
We can use the SMOTE implementation provided by the imbalanced-learn Python library in the
SMOTE class.
The SMOTE class acts like a data transform object from scikit-learn in that it must be defined and
configured, fit on a dataset, then applied to create a new transformed version of the dataset.
For example, we can define a SMOTE instance with default parameters that will balance the
minority class and then fit and apply it in one step to create a transformed version of our dataset.
...
# transform the dataset
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
Once transformed, we can summarize the class distribution of the new transformed dataset, which we would expect to now be balanced through the creation of many new synthetic examples in the minority class.
...
# summarize the new class distribution
counter = Counter(y)
print(counter)
A scatter plot of the transformed dataset can also be created and we would expect to see many
more examples for the minority class on lines between the original examples in the minority class.
Tying this together, the complete example of applying SMOTE to the synthetic dataset and then summarizing and plotting the transformed result is listed below.
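A sketch of the complete script, reusing the dataset definition and plotting code from the previous example, might look like this:

# Oversample and plot an imbalanced dataset with SMOTE
from collections import Counter
from numpy import where
from matplotlib import pyplot
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)
# transform the dataset
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()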
Running the example first creates the dataset and summarizes the class distribution, showing the
1:100 ratio.
Then the dataset is transformed using the SMOTE and the new class distribution is summarized,
showing a balanced distribution now with 9,900 examples in the minority class.
The scatter plot of the transformed dataset shows many more examples in the minority class created along the lines between the original examples in the minority class.
The original paper on SMOTE suggested combining SMOTE with random undersampling of the
majority class.
The imbalanced-learn library supports random undersampling via the RandomUnderSampler class.
We can update the example to first oversample the minority class to have 10 percent the number of examples of the majority class (e.g. about 1,000), then use random undersampling to reduce the number of examples in the majority class so that it has only about twice the number of examples of the minority class (e.g. about 2,000).
To implement this, we can specify the desired ratios as arguments to the SMOTE and RandomUnderSampler classes.
...
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
We can then chain these two transforms together into a Pipeline. The Pipeline can then be applied to a dataset, performing each transform in turn and returning a final dataset with the accumulated transforms applied to it, in this case oversampling followed by undersampling.
...
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)
The pipeline can then be fit and applied to our dataset just like a single transform:
...
# transform the dataset
X, y = pipeline.fit_resample(X, y)
We would expect some SMOTE oversampling of the minority class, although not as much as before
where the dataset was balanced. We also expect fewer examples in the majority class via random
undersampling.
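A sketch of the complete script for this combined oversampling and undersampling, again reusing the earlier dataset and plotting code, might be:

# Oversample with SMOTE and random undersample the majority class
from collections import Counter
from numpy import where
from matplotlib import pyplot
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)
# define the resampling pipeline
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)
# transform the dataset
X, y = pipeline.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()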
Running the example first creates the dataset and summarizes the class distribution.
Next, the dataset is transformed, first by oversampling the minority class, then undersampling the
majority class. The final class distribution after this sequence of transforms matches our
expectations with a 1:2 ratio or about 2,000 examples in the majority class and about 1,000
examples in the minority class.
Finally, a scatter plot of the transformed dataset is created, showing the oversampled minority class
and the undersampled majority class.
Now that we are familiar with transforming imbalanced datasets, let’s look at using SMOTE when fitting and evaluating classification models.
SMOTE for Classification
First, we use our binary classification dataset from the previous section, then fit and evaluate a decision tree algorithm.
The algorithm is defined with any required hyperparameters (we will use the defaults), then we will use repeated stratified k-fold cross-validation to evaluate the model. We will use three repeats of 10-fold cross-validation, meaning that 10-fold cross-validation is applied three times, fitting and evaluating 30 models on the dataset.
The dataset is stratified, meaning that each fold of the cross-validation split will have the same class distribution as the original dataset, in this case, a 1:100 ratio. We will evaluate the model using the ROC area under the curve (AUC) metric. This can be optimistic for severely imbalanced datasets but will still show a relative change with better performing models.
...
# define model
model = DecisionTreeClassifier()
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
Once fit, we can calculate and report the mean of the scores across the folds and repeats.
...
print('Mean ROC AUC: %.3f' % mean(scores))
We would not expect a decision tree fit on the raw imbalanced dataset to perform very well.
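A sketch of the complete baseline evaluation, combining the dataset definition with the snippets above, might be:

# Decision tree evaluated on an imbalanced dataset
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# define model
model = DecisionTreeClassifier()
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % mean(scores))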
Running the example evaluates the model and reports the mean ROC AUC.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or
differences in numerical precision. Consider running the example a few times and compare the
average outcome.
In this case, we can see that a ROC AUC of about 0.76 is reported.
Now, we can try the same model and the same evaluation method, although using a SMOTE-transformed version of the dataset.
The correct application of oversampling during k-fold cross-validation is to apply the method to the
training dataset only, then evaluate the model on the stratified but non-transformed test set.
This can be achieved by defining a Pipeline that first transforms the training dataset with SMOTE
then fits the model.
...
# define pipeline
steps = [('over', SMOTE()), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)
Tying this together, the complete example of evaluating a decision tree with SMOTE oversampling
on the training dataset is listed below.
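A sketch of this complete example might be:

# Decision tree on an imbalanced dataset with SMOTE oversampling
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# define pipeline: SMOTE is applied only to the training folds within each CV split
steps = [('over', SMOTE()), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % mean(scores))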
Running the example evaluates the model and reports the mean ROC AUC score across the folds and repeats.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or
differences in numerical precision. Consider running the example a few times and compare the
average outcome.
In this case, we can see a modest improvement in performance from a ROC AUC of about 0.76 to
about 0.80.
As mentioned in the paper, it is believed that SMOTE performs better when combined with
undersampling of the majority class, such as random undersampling.
As in the previous section, we will first oversample the minority class with SMOTE to about a 1:10
ratio, then undersample the majority class to achieve about a 1:2 ratio.
...
# define pipeline
model = DecisionTreeClassifier()
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('over', over), ('under', under), ('model', model)]
pipeline = Pipeline(steps=steps)
# decision tree on imbalanced dataset with SMOTE oversampling and random undersampling
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# define pipeline
model = DecisionTreeClassifier()
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('over', over), ('under', under), ('model', model)]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % mean(scores))
Running the example evaluates the model with the pipeline of SMOTE oversampling and random undersampling, and reports the mean ROC AUC.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or
differences in numerical precision. Consider running the example a few times and compare the
average outcome.
In this case, we can see that the reported ROC AUC shows an additional lift to about 0.83.
You could explore testing different ratios of the minority class and majority class (e.g. changing the
sampling_strategy argument) to see if a further lift in performance is possible.
Another area to explore would be to test different values of the k-nearest neighbors selected in the
SMOTE procedure when each new synthetic example is created. The default is k=5, although larger
or smaller values will influence the types of examples created, and in turn, may impact the
performance of the model.
For example, we could grid search a range of values of k, such as values from 1 to 7, and evaluate
the pipeline for each value.
...
# values to evaluate
k_values = [1, 2, 3, 4, 5, 6, 7]
for k in k_values:
    # define pipeline
    ...
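The body of the loop is elided above; a sketch of how the full experiment could be assembled, varying the k_neighbors argument of SMOTE inside the pipeline, is:

# grid search k value for SMOTE oversampling for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# values to evaluate
k_values = [1, 2, 3, 4, 5, 6, 7]
for k in k_values:
    # define pipeline
    model = DecisionTreeClassifier()
    over = SMOTE(sampling_strategy=0.1, k_neighbors=k)
    under = RandomUnderSampler(sampling_strategy=0.5)
    steps = [('over', over), ('under', under), ('model', model)]
    pipeline = Pipeline(steps=steps)
    # evaluate pipeline
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    print('> k=%d, Mean ROC AUC: %.3f' % (k, mean(scores)))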
Running the example will perform SMOTE oversampling with different k values for the KNN used in
the procedure, followed by random undersampling and fitting a decision tree on the resulting
training dataset.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or
differences in numerical precision. Consider running the example a few times and compare the
average outcome.
In this case, the results suggest that k=3 might be good with a ROC AUC of about 0.84, and k=7 might also be good with a ROC AUC of about 0.85.
This highlights that both the amount of oversampling and undersampling performed
(sampling_strategy argument) and the number of examples selected from which a partner is chosen
to create a synthetic example (k_neighbors) may be important parameters to select and tune for
your dataset.
Now that we are familiar with how to use SMOTE when fitting and evaluating classification models,
let’s look at some extensions of the SMOTE procedure.
In this section, we will review some extensions to SMOTE that are more selective regarding the
examples from the minority class that provide the basis for generating new synthetic examples.
Borderline-SMOTE
A popular extension to SMOTE involves selecting those instances of the minority class that are
misclassified, such as with a k-nearest neighbor classification model.
We can then oversample just those difficult instances, providing more resolution only where it may
be required.
The examples on the borderline and the ones nearby […] are more apt to be misclassified than the ones far from the borderline, and thus more important for classification.
These examples that are misclassified are likely ambiguous and in a region along the edge or border of the decision boundary where class membership may overlap. As such, this modification of SMOTE is called Borderline-SMOTE and was proposed by Hui Han, et al. in their 2005 paper titled “Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning.”
The authors describe a version of the method that oversamples just the borderline cases in the minority class, referred to as Borderline-SMOTE1, as well as a version that also uses the nearest majority class neighbors of those borderline minority examples when creating synthetic examples, referred to as Borderline-SMOTE2.
We can demonstrate the technique on the synthetic binary classification problem used in the
previous sections.
Instead of generating new synthetic examples for the minority class blindly, we would expect the
Borderline-SMOTE method to only create synthetic examples along the decision boundary between
the two classes.
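A sketch of such an example, using the BorderlineSMOTE class (listed in the API section below) in place of SMOTE, might be:

# Borderline-SMOTE for an imbalanced dataset
from collections import Counter
from numpy import where
from matplotlib import pyplot
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)
# transform the dataset
oversample = BorderlineSMOTE()
X, y = oversample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()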
Running the example first creates the dataset and summarizes the initial class distribution, showing
a 1:100 relationship.
The Borderline-SMOTE is applied to balance the class distribution, which is confirmed with the
printed class summary.
Finally, a scatter plot of the transformed dataset is created. The plot clearly shows the effect of the selective approach to oversampling. Examples along the decision boundary of the minority class are oversampled intensely (orange).
The plot shows that those examples far from the decision boundary are not oversampled. This
includes both examples that are easier to classify (those orange points toward the top left of the
plot) and those that are overwhelmingly difficult to classify given the strong class overlap (those
orange points toward the bottom right of the plot).
Borderline-SMOTE SVM
Hien Nguyen, et al. suggest an alternative to Borderline-SMOTE where an SVM algorithm is used instead of a KNN to identify misclassified examples on the decision boundary.
Their approach is summarized in the 2009 paper titled “Borderline Over-sampling For Imbalanced Data Classification.” An SVM is used to locate the decision boundary defined by the support vectors, and examples in the minority class that are close to the support vectors become the focus for generating synthetic examples.
… the borderline area is approximated by the support vectors obtained after training a standard SVMs classifier on the original training set. New instances will be randomly created along the lines joining each minority class support vector with a number of its nearest neighbors using the interpolation
In addition to using an SVM, the technique attempts to select regions where there are fewer
examples of the minority class and tries to extrapolate towards the class boundary.
If majority class instances count for less than a half of its nearest neighbors, new instances will be created with extrapolation to expand minority class area toward the majority class.
This variation can be implemented via the SVMSMOTE class from the imbalanced-learn library.
The example below demonstrates this alternative approach to Borderline SMOTE on the same
imbalanced dataset.
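A sketch using the SVMSMOTE class, following the same pattern as the previous examples, might be:

# Borderline-SMOTE with SVM for an imbalanced dataset
from collections import Counter
from numpy import where
from matplotlib import pyplot
from sklearn.datasets import make_classification
from imblearn.over_sampling import SVMSMOTE
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)
# transform the dataset
oversample = SVMSMOTE()
X, y = oversample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()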
Running the example first summarizes the raw class distribution, then the balanced class
distribution after applying Borderline-SMOTE with an SVM model.
A scatter plot of the dataset is created showing the directed oversampling along the decision
boundary with the majority class.
We can also see that unlike Borderline-SMOTE, more examples are synthesized away from the
region of class overlap, such as toward the top left of the plot.
Adaptive Synthetic Sampling (ADASYN)
Another approach involves generating synthetic examples inversely proportional to the density of the examples in the minority class. That is, generate more synthetic examples in regions of the feature space where the density of minority examples is low, and fewer or none where the density is high.
This modification to SMOTE is referred to as the Adaptive Synthetic Sampling Method, or ADASYN, and was proposed by Haibo He, et al. in their 2008 paper named for the method, titled “ADASYN: Adaptive Synthetic Sampling Approach For Imbalanced Learning.”
ADASYN is based on the idea of adaptively generating minority data samples according to their distributions: more synthetic data is generated for minority class samples that are harder to learn compared to those minority samples that are easier to learn.
Unlike Borderline-SMOTE, a discriminative model is not created. Instead, examples in the minority class are weighted according to their density, then those examples with the lowest density are the focus for the SMOTE synthetic example generation process.
We can implement this procedure using the ADASYN class in the imbalanced-learn library.
The example below demonstrates this alternative approach to oversampling on the imbalanced
binary classification dataset.
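A sketch using the ADASYN class, again following the same pattern, might be:

# ADASYN oversampling for an imbalanced dataset
from collections import Counter
from numpy import where
from matplotlib import pyplot
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)
# transform the dataset
oversample = ADASYN()
X, y = oversample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()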
Running the example first creates the dataset and summarizes the initial class distribution, then the
updated class distribution after oversampling was performed.
A scatter plot of the transformed dataset is created. Like Borderline-SMOTE, we can see that
synthetic sample generation is focused around the decision boundary as this region has the lowest
density.
Unlike Borderline-SMOTE, we can see that the examples that have the most class overlap have the
most focus. On problems where these low density examples might be outliers, the ADASYN
approach may put too much attention on these areas of the feature space, which may result in
worse model performance.
It may help to remove outliers prior to applying the oversampling procedure, and this might be a
helpful heuristic to use more generally.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
Learning from Imbalanced Data Sets, 2018.
Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
Papers
SMOTE: Synthetic Minority Over-sampling Technique, 2002.
Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning, 2005.
Borderline Over-sampling For Imbalanced Data Classification, 2009.
ADASYN: Adaptive Synthetic Sampling Approach For Imbalanced Learning, 2008.
API
imblearn.over_sampling.SMOTE API.
imblearn.over_sampling.SMOTENC API.
imblearn.over_sampling.BorderlineSMOTE API.
imblearn.over_sampling.SVMSMOTE API.
imblearn.over_sampling.ADASYN API.
Articles
Oversampling and undersampling in data analysis, Wikipedia.
Summary
In this tutorial, you discovered the SMOTE for oversampling imbalanced classification datasets.
How the SMOTE synthesizes new examples for the minority class.
How to correctly fit and evaluate machine learning models on SMOTE-transformed training
datasets.
How to use extensions of the SMOTE that generate synthetic examples along the class
decision boundary.
Markus January 17, 2020 at 10:52 pm #
Hi
For calculating ROC AUC, the examples make use of the mean function and not roc_auc_score, why?
Thanks
Jason Brownlee January 18, 2020 at 8:48 am #
The ROC AUC scores are calculated automatically via the cross-validation process in
scikit-learn.
Ram pratapa April 1, 2020 at 6:13 pm #
Hi Jason,
Is there any way to use smote for multilabel problem.
Jason Brownlee April 2, 2020 at 5:44 am #
Yes, you must specify to the SMOTE config which are the positive/negative classes and how much to oversample them.
Camara Mamadou January 21, 2020 at 12:52 am #
Hi Jason,
How to get predictions on a holdout data test after getting best results of a classifier by SMOTE
oversampling?
Best regards!
Mamadou.
Jason Brownlee January 21, 2020 at 7:15 am #
Recall SMOTE is only applied to the training set when your model is fit.
Akil February 20, 2020 at 11:47 pm #
Hi Jason,
As you said, SMOTE is applied to training only, won’t that affect the accuracy of the test set?
Jason Brownlee February 21, 2020 at 8:23 am #
Yes, the model will have a better idea of the boundary and perform better on the
test set – at least on some datasets.
Just a clarifying question: As per what Akil mentioned above, and the code
below, i am trying to understand if the SMOTE is NOT being applied to validation
data (during CV) if the model is defined within a pipeline and it is being applied even
on the validation data if I use oversample.fit_resample(X, y). I want to make sure if
it’s working as expected.
I saw a drastic difference in say, accuracy when I ran SMOTE with and without
pipeline.
# define pipeline
steps = [('over', SMOTE()), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % mean(scores))
SMOTE is only applied on the training set, even when used in a pipeline,
even when evaluated via cross-validation.
P.S:
And here:
http://machinelearningmastery.com/machine-learning-performance-improvement-cheat-sheet/
Rafael Eder January 21, 2020 at 3:17 pm #
Hi !
SMOTE works for imbalanced image datasets too ?
Best Regards;
Jason Brownlee January 22, 2020 at 6:17 am #
Rafael Eder January 22, 2020 at 10:07 am #
Yours books and blog help me a lot ! Thank you very much !
Jason Brownlee January 22, 2020 at 1:55 pm #
brian January 31, 2020 at 12:28 am #
Hi Jason, thanks for another series of excellent tutorials. I have encountered an error when
running
X, y = pipeline.fit_resample(X, y)
“ValueError: The specified ratio required to remove samples from the minority class while trying to
generate new samples. Please increase the ratio.”
Thanks.
brian January 31, 2020 at 1:50 am #
as a followup it seems I’ve not understood how SMOTE and undersampling function.
Now I understand I had the ratios for SMOTE() and RandomUnderSampler() “sampling_strategy”
incorrect.
Jason Brownlee January 31, 2020 at 7:57 am #
Volkan Yurtseven July 22, 2020 at 7:32 am #
Hi
When used with a gridsearchcv, does Smote apply the oversampling to whole train set or
does it disregard the validation set?
Jason Brownlee July 22, 2020 at 7:38 am #
You can use it as part of a Pipeline to ensure that SMOTE is only applied to the
training dataset, not val or test.
Hi Jason,
Why do you first oversample with SMOTE and then undersample the majority class
afterwards in your pipelines? Wouldn’t it be more effective the other way around?
Thanks!
It is an approach that has worked well for me. Perhaps try the reverse on your
dataset and compare the results.
Hi Jason,
I had a question regarding the consequences of applying SMOTE only to the train
set. If we apply SMOTE only to the train set but not to validation set or test set, the
three sets will not be stratified. For example, if the train set transformed to a 50:50
distribution for class 1 and class 2, validation and test sets still maintain their original
distribution 10:90, let’s say. Is this not a concern at all since we just care about
baking the highest-performing MODEL which will be based only on the train set? If
we apply SMOTE to only the train set wouldn’t the model also assume that the real-
world data also assumes a 50:50 distribution between class 1 and class 2?
No, you would stratify the split of the data before resampling. Then use a metric (not
accuracy) that effectively evaluates the capability of natural looking data (val and test
sets).
This is critical. Changing the nature of test and val sets would make the test harness
invalid.
Jason Brownlee January 31, 2020 at 7:55 am #
Jeong miae March 21, 2020 at 10:56 am #
2. After making balanced data with these techniques, could I use deep learning algorithms such as CNNs rather than classical machine learning algorithms?
Jason Brownlee March 22, 2020 at 6:47 am #
Jeong miae March 22, 2020 at 5:32 pm #
In fact, I’d like to find other methods apart from data augmentation to improve the model’s performance, so I wanted to try oversampling. But, as far as I understand from your answer, I can’t use oversampling such as SMOTE on image data. Am I right to understand?
Thank you again for your kind answer.
Correct, SMOTE does not make sense for image data, at least off the cuff.
Valdemar February 11, 2020 at 2:06 am #
Hello Jason,
In your ML cheat sheet you have advice to invent more data if you have not enough. Can you
suggest methods or libraries which are good fit to do that?
Imblearn seems to be a good way to balance data. What about if you wish to increase the entire
dataset size as to have more samples and potentially improve model?
Jason Brownlee February 11, 2020 at 5:15 am #
Frank February 28, 2020 at 11:41 pm #
Thank you for the great tutorial, as always super detailed and helpful.
I’m working through the wine quality dataset (white) and decided to use SMOTE; the output feature balances are below. I’ve managed to use a regression model (KNN) that I believe does the task well, but I’m interested to get your take on how to deal with similar class imbalance on multiclass problems as above?
Jason Brownlee February 29, 2020 at 7:13 am #
Yes, SMOTE can be used for multi-class, but you must specify the positive and
negative classes.
Akshay October 15, 2020 at 1:53 am #
What does positive and negative means for multi-class? Based on the
problem/domain, it can vary but let’s say if I identify which classes are positive and which
are negative, what next?
Jason Brownlee October 15, 2020 at 6:15 am #
You can apply SMOTE directly for multi-class, or you can specify the preferred balance of the classes to SMOTE.
Thomas March 1, 2020 at 6:33 am #
Therefore isn’t it a problem in cross_val_score that the sampling will be applied on each validation set?
Thanks
Jason Brownlee March 2, 2020 at 6:07 am #
Sorry, I don’t follow your question. Can you please rephrase or elaborate?
Yong March 1, 2020 at 6:19 pm #
you mentioned that : ” As in the previous section, we will first oversample the minority class
with SMOTE to about a 1:10 ratio, then undersample the majority class to achieve about a 1:2 ratio.”
Why? What is the idea behind this operation, and why can this operation improve the performance?
Jason Brownlee March 2, 2020 at 6:16 am #
Vijay M March 2, 2020 at 8:37 pm #
Sir Jason,
Can we use the above code for images
Jason Brownlee March 3, 2020 at 5:58 am #
Ernest Montañà March 18, 2020 at 3:40 am #
# define pipeline
steps = [('over', SMOTE()), ('model', RandomForestClassifier(n_estimators=100, criterion='gini',
    max_depth=None, random_state=1))]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)
acc = cross_val_score(pipeline, X_new, Y, scoring='accuracy', cv=cv, n_jobs=-1)
I assume the SMOTE is performed for each cross validation split, therefore there is no data leaking,
am I correct? Thank you
Jason Brownlee March 18, 2020 at 6:13 am #
AP February 19, 2021 at 1:59 am #
Hello Jason,
Thank you for the post. I have some questions. My dataset consists NaN values and I am
not allowed to drop them due to less no. of records. If I impute values with mean or median
before splitting data or cross validation, there will be information leakage. To solve that
problem, I need to use pipeline including SMOT and a model, and need to apply cross
validation. Now, my question is, what if I have huge data set and I want to apply feature
engineering (PCA or RFE) and want to explore all the steps step by step? If I define every
steps in pipeline, how can I explore, where is the real problem in which method? Also I need
more computation power to do trial and error methods on huge dataset. What is your
suggestion for that?
My second question is that I do not understand the SMOTE procedure that you described initially.
” SMOTE first selects a minority class instance a at random and finds its k nearest minority
class neighbors. The synthetic instance is then created by choosing one of the k nearest
neighbors b at random and connecting a and b to form a line segment in the feature space.
The synthetic instances are generated as a convex combination of the two chosen instances
a and b. ”
I couldn’t picture what you meant. Because of that I did not understand Borderline-SMOTE as well. Could you please rephrase it and, if possible, explain it with a small example?
Jason Brownlee February 19, 2021 at 6:04 am #
You must fit the imputer on the train set and apply to train and test within cv, a
pipeline will help.
You can also step the k-fold CV manually and implement the pipeline manually – this might be preferred so you can keep track of what changes are made and any issues that might occur.
SMOTE works by drawing lines between close examples in feature space and picking a
random point on the line as the new instance.
David March 29, 2020 at 1:35 am #
Hi! A quick question, SMOTE should be applied before or after data preparation (like
Standardization for example) ? Or it’s irrelevant?
Thank you!
Jason Brownlee March 29, 2020 at 6:01 am #
Probably after.
San April 2, 2020 at 6:13 am #
How to use SMOTE or any other technique related with SMOTE such as ADASYN,
Borderline SMOTE, when a dataset has classes with only a few instances?
Some of the classes in my dataset has only 1 instance & some have 2 instances. When using these
SMOTE techniques I get the error ‘Expected n_neighbors <= n_samples, but n_samples = 2,
n_neighbors = 6'.
Is there any way to overcome this error? With RandomOverSampler the code works fine, but it doesn’t seem to give a good performance. And I’m unable to use all the SMOTE-based oversampling techniques due to this error.
Jason Brownlee April 2, 2020 at 6:41 am #
I don’t think modeling a problem with one instance or a few instances of a class is
appropriate.
Garv April 8, 2020 at 9:59 pm #
Hello I did tuning of smote parameters( k,sampling strategy) and took roc_auc as scoring on
training data but how along with cross val score my model is evaluated on testing data (that ideally
should not be the one on which smote should apply)
can you help me with how to apply best model on testing data(code required)
# Using Decision Tree
Xtrain1 = Xtrain.copy()
ytrain1 = ytrain.copy()
k_val = [i for i in range(2, 9)]
p_proportion = [i for i in np.arange(0.2, 0.5, 0.1)]
k_n = []
proportion = []
score_m = []
score_var = []
modell = []
for k in k_val:
    for p in p_proportion:
        oversample = SMOTE(sampling_strategy=p, k_neighbors=k, random_state=1)
        Xtrain1, ytrain1 = oversample.fit_resample(Xtrain, ytrain)
        model = DecisionTreeClassifier()
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
        scores = cross_val_score(model, X1, y1, scoring='roc_auc', cv=cv, n_jobs=-1)
        k_n.append(k)
        proportion.append(p)
        score_m.append(np.mean(scores))
        score_var.append(np.var(scores))
        modell.append('DecisionTreeClassifier')
scorer = pd.DataFrame({'model': modell, 'k': k_n, 'proportion': proportion, 'scores': score_m, 'score_var': score_var})
print(scorer)
models.append(model)
models_score.append(scorer[scorer['scores'] == max(scorer['scores'])].values[0])
models_var.append(scorer[scorer['score_var'] == min(scorer['score_var'])].values[0])
Jason Brownlee April 9, 2020 at 8:02 am #
Kabilan April 10, 2020 at 7:02 am #
Hey Jason,
What kind of an approach can we use to over-sample time series data?
Jason Brownlee April 10, 2020 at 8:38 am #
John White April 10, 2020 at 7:11 pm #
Hello Jason,
Do you currently have any ideas on how to oversample time series data off the top of your
head? I’d like to do some research/experiment on it in the meantime.Thank you!
Jason Brownlee April 11, 2020 at 6:14 am #
Kabilan April 10, 2020 at 11:40 pm #
Jason Brownlee April 11, 2020 at 6:21 am #
You’re welcome.
Vamshi April 11, 2020 at 7:56 am #
Hi Jason Brownie,
Thank you for the great description over handling imbalanced datasets using SMOTE and its
alternative methods. I know that SMOTE is only for multi Class Dataset but I am curious to know if
you have any idea of of using SMOTE for multi label Datasets?? or Do you have any other method or
ideas apart from SMOTE in order to handle imbalanced multi label datasets.
Jason Brownlee April 11, 2020 at 7:58 am #
Great question!
I’m not aware of an approach off hand for multi-label, perhaps check the literature?
Vamshi April 11, 2020 at 8:10 am #
Jason Brownlee April 11, 2020 at 11:51 am #
Vamshi April 11, 2020 at 8:21 am #
Jason Brownlee April 11, 2020 at 11:53 am #
Nice.
Jooje April 16, 2020 at 4:23 am #
Hi! Thanks for the great tutorial. Can SMOTE be used with high-dimensional embeddings for text representation? If so, is any preprocessing/dimensionality reduction required before applying SMOTE?
Jason Brownlee April 16, 2020 at 6:06 am #
Not sure off the cuff, perhaps experiment to see if this makes sense.
rahul malik April 23, 2020 at 8:49 am #
Hi Jason, I have 3 input text columns, of which 2 are categorical and 1 is unstructured text. Can you please help me with how to do sampling? The output column is categorical and is imbalanced.
Jason Brownlee April 23, 2020 at 1:34 pm #
Perhaps use a label or one hot encoding for the categorical inputs and a bag of words
for the text data.
rahul malik April 23, 2020 at 10:53 pm #
Jason Brownlee April 24, 2020 at 5:43 am #
You may have to experiment, perhaps different smote instances, perhaps run
the pipeline manually, etc.
Iraj April 30, 2020 at 8:28 am #
Hi,
SMOTE requires 6 examples of each class. I have a dataset of 30 examples of class 0, and 1 example of class 1. Please advise if there is any solution.
Thank you
Jason Brownlee April 30, 2020 at 11:36 am #
Perhaps try and get more examples from the minority class?
John Sammut May 2, 2020 at 9:25 am #
Hello Jason,
How can one apply the same ratio of oversampling (1:10) followed by under-sampling (1:2) in a
pipeline when there are 3 classes?
The sampling strategy cannot be set to float for multi-class. What would you recommend?
Thank you.
John
Jason Brownlee May 3, 2020 at 6:05 am #
Thanks.
First step is to group classes into positive and negative, then apply the sampling.
Srisha May 4, 2020 at 12:35 pm #
Could you shed some light on how one could leverage the parameter sampling_strategy in
SMOTE?
Jason Brownlee May 4, 2020 at 1:28 pm #
Mohamad May 7, 2020 at 11:04 pm #
Hi Jason,
Thank you very much for this article, it’s so helpful (as always).
I have an inquiry:
Now my data are highly imbalanced (99.5%:0.05%). I am having over than 40,000 samples with
multiple features (36) for my classification problem. I oversampled with SMOTE to have balanced
data, but the classifier is getting highly biased toward the oversampled data. I assumed that its
because of the “sampling_strategy”. So I tried {0.25, 0.5, 0.75,1} for the “sampling_strategy”. Its
either getting highly biased towards the abundant or the rare class.
Jason Brownlee May 8, 2020 at 6:36 am #
john sen May 11, 2020 at 3:45 am #
Please tell me how I can apply two techniques, first SMOTE and then a one-class learning algorithm, on the same dataset for a better result.
Jason Brownlee May 11, 2020 at 6:08 am #
You can apply smote to the training set, then apply the one class classifier directly.
john sen May 12, 2020 at 3:54 am #
Sir, then what should I try for the best result by using SMOTE plus one more algorithm, which makes a hybrid approach to handle imbalanced data?
Jason Brownlee May 12, 2020 at 6:50 am #
Use trial and error to discover what works well/best for your dataset.
Arnaud May 11, 2020 at 3:47 am #
Hi Jason,
First, thanks for your material, it’s of great value!
I have a supervised classification problem with unbalanced class to predict (Event = 1/100 Non
Event).
I have the intuition that using resampling methods such as SMOTE (or down/up/ROSE) with Naive Bayes models affects prior probabilities, and thus leads to lower performance when applied on the test set. Is that correct?
Thanks.
Jason Brownlee May 11, 2020 at 6:09 am #
You’re welcome!
Yes.
Teixeira May 12, 2020 at 1:31 am #
Hi Dr.
Could SMOTE be applied to data that will be used for feeding an LSTM? (Since the order matters, it
can interfere with the data right?)
Thanks in advance!
Jason Brownlee May 12, 2020 at 6:46 am #
Teixeira May 13, 2020 at 2:04 am #
First of all, thanks for the response. Sorry, I think I don’t understand. Maybe I am wrong, but SMOTE could be applied to tabular data before the transformation into sliding windows. Even in this case, is it not recommended to apply SMOTE?
Thanks!
Jason Brownlee May 13, 2020 at 6:39 am #
Thank you, Jason. Would you be able to point out an example of those time-series aware data generation methods?
John D May 14, 2020 at 10:04 am #
Jason,
I have a highly imbalanced binary (yes/no) classification dataset. The dataset currently has appx
0.008% ‘yes’.
I came across 2 methods to deal with the imbalance. I ran the following steps after I had run MinMaxScaler on the variables:
sm = SMOTE(random_state=42)
X_sm , y_sm = sm.fit_sample(X_scaled, y)
This increases the number of rows from 2.4million rows to 4.8 million rows and the imbalance is now
50%.
After these steps I need to split data into Train Test datasets….
Should I run the X_test, y_test on unsampled data. This would mean, I split the data and do
upsampling/undersampling only on the train data.
Thanks again.
Jason Brownlee May 14, 2020 at 1:27 pm #
No, the sampling is applied on the training dataset only, not the test set. E.g. split first
then sample.
Shivam May 16, 2020 at 4:50 pm #
Hello Jason, great article. One issue I am facing while using SMOTE-NC for categorical data: I have only one feature for categorization.
sm = SMOTENC(random_state=27,categorical_features=[0,])
X_new = np.array(X_train.values.tolist())
Y_new = np.array(y_train.values.tolist())
print(X_new.shape) # (10500,)
print(Y_new.shape) # (10500,)
X_new = np.reshape(X_new, (-1, 1)) # SMOTE requires a 2-D array, hence changing the shape of X_new
print(X_new.shape) # (10500, 1)
sm.fit_sample(X_new, Y_new)
ValueError: Found array with 0 feature(s) (shape=(10500, 0)) while a minimum of 1 is required.
Can you please suggest how to deal with SMOTE if there is only one feature ?
Jason Brownlee May 17, 2020 at 6:31 am #
sukhpal May 26, 2020 at 7:23 pm #
Jason Brownlee May 27, 2020 at 7:45 am #
John D May 27, 2020 at 8:19 am #
What is the criteria to UnderSample the majority class and Upsample the minority class.
OR
What is the criteria to Upsample the minority class only.
Jason Brownlee May 27, 2020 at 1:25 pm #
SUKHPAL May 28, 2020 at 3:51 pm #
Jason Brownlee May 29, 2020 at 6:20 am #
sukhpal May 30, 2020 at 7:01 pm #
Jason Brownlee May 31, 2020 at 6:20 am #
Suyash June 25, 2020 at 10:32 pm #
We should apply oversampling only on the training set, am I right? What should be done to implement oversampling only on the training set if we also want to use a stratified approach?
Jason Brownlee June 26, 2020 at 5:35 am #
In the first example I am getting you used to the API and the effect of the method.
suyash June 27, 2020 at 11:38 pm #
Can you please refer me to the tutorial where we are implementing SMOTE on training data only and evaluating the model? I also want to know whether RepeatedStratifiedKFold works on the training dataset only.
Jason Brownlee June 28, 2020 at 5:51 am #
Yes the section “SMOTE for Classification” in the above tutorial uses a pipeline
to ensure SMOTE is only applied on training data.
suyash June 28, 2020 at 12:10 am #
cross_val_score oversamples the data of the training set only and does not oversample the test data. Am I right?
Jason Brownlee June 28, 2020 at 5:52 am #
When using a pipeline the transform is only applied to the training dataset,
which is correct.
You’re welcome.
Jose Q June 28, 2020 at 7:29 am #
Hi Jason!
Thank you for such a great post!
I am working with an imbalanced data set (500:1). I want to get the best recall performance and I
have tried with several classification algorithms, hyper parameters, and Over/Under sampling
techniques. I will try SMOTE now !!!
From the last question, I understand that using CV and pipelines you oversample only the training
set, right?
I have another question. My imbalanced data set is about 5 million records from 11 months. It is not
a time series. I used data from the first ten months for training, and data from the eleventh month for
testing in order to explain it easier to my users, but I feel that it is not correct, and I guess I should
use a random test split from the entire data set. Is this correct?
Jason Brownlee June 29, 2020 at 6:22 am #
You’re welcome.
My best advice is to evaluate candidate models under the same conditions you expect to use
them. If there is a temporal element to your data and how you expect to use the model in the
future, try and capture that in your test harness.
Jose Q June 30, 2020 at 4:15 am #
Thank you
Jose Q July 1, 2020 at 3:09 am #
Hi Jason,
I followed your ideas at:
https://machinelearningmastery.com/framework-for-imbalanced-classification-projects/
I tried oversampling with SMOTE, but my computer just can’t handle it.
Then I tried using Decision Trees and XGB for imbalanced data sets after reading your posts:
https://machinelearningmastery.com/cost-sensitive-decision-trees-for-imbalanced-classification/
https://machinelearningmastery.com/xgboost-for-imbalanced-classification/
but I still get low values for recall.
I am doing random undersample so I have 1:1 class relationship and my computer can manage it.
Then I am doing XGB/Decision trees varying max_depth and varying weight to give more importance
to the positive class. My assumption is that I won’t overfit the model as soon as I use CV with
several folds and iterations. Is that right?
Thanks
Jason Brownlee July 1, 2020 at 5:55 am #
Perhaps. Assumptions can lead to poor results, test everything you can think of.
Jose Q July 1, 2020 at 3:33 pm #
Thank you
Jason Brownlee July 2, 2020 at 6:14 am #
You’re welcome.
xplorer4us July 8, 2020 at 5:32 pm #
Hi Jason, excellent explanations on SMOTE, very easy to understand and with tons of
examples!
I tried to download the free mini-course on Imbalance Classification, and I didn’t receive the PDF file.
May I please ask for your help with this? Thanks in advance!
Jason Brownlee July 9, 2020 at 6:38 am #
Thanks.
xplorer4us July 9, 2020 at 4:42 pm #
Jason Brownlee July 10, 2020 at 5:50 am #
You’re welcome.
Landry July 13, 2020 at 8:01 pm #
I have one inquiry: I have an intuition that SMOTE performs badly on datasets with high dimensionality, i.e. when we have many features in our dataset. Is it true?
Jason Brownlee July 14, 2020 at 6:18 am #
Hmmm, that would be my intuition too, but always test. Intuitions breakdown in high
dimensions, or with ml in general. Test everything.
Volkan Yurtseven July 22, 2020 at 7:34 am #
Hi
When used with a gridsearchcv, does Smote apply the oversampling to whole train set or does it
disregard the validation set?
Jason Brownlee July 22, 2020 at 7:38 am #
You can use it as part of a Pipeline to ensure that SMOTE is only applied to the training
dataset, not val or test.
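For illustration, here is a minimal sketch of that arrangement (the estimator and parameter grid are assumptions for illustration, not from the tutorial): because SMOTE sits inside an imblearn Pipeline, GridSearchCV resamples only the training folds and leaves each validation fold untouched.
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

pipeline = Pipeline(steps=[('smote', SMOTE()), ('model', DecisionTreeClassifier())])
param_grid = {'model__max_depth': [3, 5, 7]}  # hypothetical grid for illustration
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
grid = GridSearchCV(pipeline, param_grid, scoring='roc_auc', cv=cv, n_jobs=-1)
# grid.fit(X, y)  # pass the original, un-resampled data; SMOTE runs per training fold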
REPLY &
Volkan Yurtseven July 23, 2020 at 6:52 am #
Hi Jason,
Do you mean that if I use it in imblearn's own Pipeline class, that would be enough? No extra
parameter is needed?
X_smote, y_smote = pipe.fit_resample(X_train, y_train)
REPLY &
Jason Brownlee July 23, 2020 at 2:36 pm #
Yes.
REPLY &
Diego July 23, 2020 at 12:39 pm #
Hi Jason,
Let’s say you train a pipeline using a train dataset and it has 3 steps: MinMaxScaler, SMOTE and
LogisticRegression.
Thanks.
REPLY &
Jason Brownlee July 23, 2020 at 2:46 pm #
The pipeline is fit and then the pipeline can be used to make predictions on new data.
Yes, call pipeline.predict() to ensure the data is prepared correctly prior to being passed to the
model.
REPLY &
SAM V July 29, 2020 at 3:52 pm #
Hi Jason, should SMOTE sampling be done before or after data cleaning, pre-processing, and feature
engineering? I just want to know when we should apply SMOTE, and why.
REPLY &
Jason Brownlee July 30, 2020 at 6:16 am #
Probably after.
REPLY &
Gaël August 6, 2020 at 7:26 pm #
Hi, great article! I think there is a typo in section “SMOTE for Balancing Data”: “the large
mass of points that belong to the minority class (blue)” –> should be majority I guess
REPLY &
Jason Brownlee August 7, 2020 at 6:24 am #
Thanks! Fixed.
REPLY &
Maria November 6, 2020 at 2:30 am #
https://stackoverflow.com/questions/58825053/smote-function-not-working-in-make-pipeline
REPLY &
Jason Brownlee November 6, 2020 at 6:02 am #
REPLY &
Luna August 6, 2020 at 7:30 pm #
Hi Jason,
TypeError: All intermediate steps should be transformers and implement fit and transform or be the
string ‘passthrough’ ‘SMOTE(k_neighbors=5, n_jobs=None, random_state=None,
sampling_strategy=’auto’)’ (type ) doesn’t
REPLY &
Jason Brownlee August 7, 2020 at 6:24 am #
Perhaps confirm the content of your pipeline ends with a predictive model.
REPLY &
george August 12, 2020 at 1:30 pm #
Hi Jason,
If all my predictors are binary, can I still use SMOTE? It seems SMOTE only works when the predictors are
numeric? Are there any methods other than random undersampling or oversampling? Thanks
REPLY &
Jason Brownlee August 12, 2020 at 1:37 pm #
Great question, I believe you can use an extension of SMOTE for categorical inputs
called SMOTE-NC:
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTENC.html
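As a hedged sketch (the column positions below are assumptions for illustration): SMOTE-NC takes the indices of the categorical columns and handles them by majority vote among neighbours rather than interpolation. Note that SMOTE-NC expects a mix of continuous and categorical features, so if every predictor is categorical it will not apply.
from imblearn.over_sampling import SMOTENC

categorical_idx = [0, 2, 5]  # hypothetical positions of the categorical/binary columns
oversample = SMOTENC(categorical_features=categorical_idx, random_state=1)
# X_res, y_res = oversample.fit_resample(X_train, y_train)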
REPLY &
Franco August 19, 2020 at 7:18 am #
I wonder whether, if we upsampled the minority class from 100 to 9,900 with a bootstrap (with replacement,
of course), we would get similar results to SMOTE … I'll put it on my to-do list.
REPLY &
Jason Brownlee August 19, 2020 at 1:34 pm #
Thanks!
Probably not, as we are generating entirely new samples with SMOTE. Nevertheless, run the
experiment and compare the results!
REPLY &
Franco August 19, 2020 at 3:52 pm #
REPLY &
Jason Brownlee August 20, 2020 at 6:35 am #
You’re welcome.
REPLY &
SaHaR August 22, 2020 at 12:35 am #
Hi, Jason
Thank you for your great article. It is really informative, as always. Recently I read an article about the
classification of a multiclass and imbalanced dataset. They used SMOTE on both the training and test
sets, and I think that was not a correct methodology: the test dataset should not be manipulated.
Please tell me if I am wrong, and would you recommend a reference about the drawbacks and
challenges of using SMOTE?
Thank you
REPLY &
Jason Brownlee August 22, 2020 at 6:17 am #
Thanks!
REPLY &
Vivek August 28, 2020 at 4:04 am #
Hi Jason
Q1. Do we apply SMOTE on the train set after doing the train/test split?
I guess doing SMOTE first and then splitting may result in data leakage, as the same instances may be present
in both the train and test sets.
Q2. I understand why SMOTE is better than randomly oversampling the minority class. But say for a
class imbalance of 1:100, why not just randomly undersample the majority class? I am not sure how SMOTE
helps here!
Thanks
Vivek
REPLY &
Jason Brownlee August 28, 2020 at 6:55 am #
Try many methods and discover what works best for your dataset.
REPLY &
Shehab August 29, 2020 at 7:13 am #
Hi Jason,
What if you have an imbalanced dataset that matches the realistic class distribution in production?
Say Class A has 1000 rows, Class B 400, and Class C 60. What are the negative effects of
having an imbalanced dataset like this? Say I use a classifier like Naive Bayes: since the prior
probability is important, by oversampling Class C I mess up the prior probability and stray
farther away from the realistic probabilities in production. Should I try to get more data, or augment
the data that I have while maintaining this imbalanced distribution, or change the distribution by
oversampling the minority classes?
Thanks
REPLY &
Jason Brownlee August 29, 2020 at 8:10 am #
I recommend testing a suite of techniques in order to discover what works best for your specific
dataset.
REPLY &
Daniel September 10, 2020 at 7:34 pm #
Hello,
Thanks for your work, it is really useful. I have a question about the combination of SMOTE and
active learning.
I am trying to generate a dataset using active learning techniques. From a pool of unlabelled data I
select the new points to label using the uncertainty in each iteration. My problem is that the class
distribution is imbalanced (1000:1); my current algorithm can't find enough points in the Yes class. Do you
think I could use SMOTE to generate new points of the Yes class?
I am thinking about using Borderline-SMOTE to generate new points and then label them. How can I
be sure that the new points are not going to be concentrated in a small region?
I am not sure if I have explained the problem well. I need to find the feasible zone using the labeller
in a smart way, because labelling is expensive. Can you give me any advice?
Thanks.
Daniel
REPLY &
Jason Brownlee September 11, 2020 at 5:55 am #
REPLY &
Bilal September 26, 2020 at 9:44 pm #
I do SMOTE on the whole dataset, then normalize the dataset. After that I apply cross-validation. Is it
right that, within cross-validation, SMOTE will resample only the training set? The code is
here:
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict

oversample = SMOTE()
X, Y = oversample.fit_resample(X, Y)          # resamples the entire dataset, before cross-validation
normalized = StandardScaler()
normalized_X = normalized.fit_transform(X)    # scales the entire dataset, before cross-validation
clf_entropy = DecisionTreeClassifier(random_state=42)
y_pred = cross_val_predict(clf_entropy, normalized_X, Y, cv=15)
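For comparison, a sketch of a leakage-free arrangement (not part of the original snippet): the scaler, SMOTE, and the model go inside an imblearn Pipeline, so scaling is fit and resampling is applied on the training folds only, and the held-out fold in each split stays untouched.
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict

pipeline = Pipeline(steps=[
    ('scale', StandardScaler()),               # fit on the training folds only
    ('smote', SMOTE()),                        # resamples the training folds only
    ('model', DecisionTreeClassifier(random_state=42)),
])
# y_pred = cross_val_predict(pipeline, X, Y, cv=15)  # X, Y are the original, untouched data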
REPLY &
Jason Brownlee September 27, 2020 at 6:53 am #
REPLY &
Vidya October 6, 2020 at 1:01 pm #
Hi Jason.
Thanks for your post. I have two questions regarding the SMOTE + undersampling example above.
1. "under = RandomUnderSampler(sampling_strategy=0.5)": why would we undersample the majority
class to a 1:2 ratio rather than have an equal representation of both classes?
2. If I were to have imbalanced data such that the minority class is 50%, wouldn't I need to use PR-curve
AUC or F1 as the metric, instead of ROC AUC?
"scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)"
Thanks !!
REPLY &
Jason Brownlee October 6, 2020 at 1:59 pm #
It is a good idea to try a suite of different rebalancing ratios and see what works. I found
this ratio on this dataset after some trial and error.
REPLY &
Vidya October 7, 2020 at 12:51 pm #
REPLY &
Jason Brownlee October 7, 2020 at 1:52 pm #
Thanks!
REPLY &
Vidya October 7, 2020 at 1:34 pm #
Jason, I am trying out the various balancing methods on imbalanced data. However, I am yet
to feel convinced about how balancing the training data set allows the algorithm to learn and work fairly
well on the imbalanced test data. Is this then dependent on how good the features are? Meaning, if I
see that after various methods of balancing the training data set the model does not generalise well on
the test data, do I need to relook at the feature creation?
Thanks!!
REPLY &
Jason Brownlee October 7, 2020 at 1:53 pm #
Hard to say; the best we can do is use controlled experiments to discover what works
best for a given dataset.
REPLY &
Vidya October 8, 2020 at 3:16 pm #
Thanks !
REPLY &
Jason Brownlee October 9, 2020 at 6:39 am #
You’re welcome.
REPLY &
Sophie October 7, 2020 at 2:09 pm #
Hi Jason,
Thank you so much for your explanation. I have a question about fitting the model with SMOTE:
Why do you use .fit_resample instead of .fit_sample? What is the difference between the two functions?
Also, is there any way to know the indexes of the original dataset after SMOTE oversampling? How can I
tell which rows in the SMOTE-upsampled dataset come from the original dataset?
Thanks!
REPLY &
Jason Brownlee October 8, 2020 at 8:19 am #
Sorry, the difference between the functions is not clear from the API:
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
REPLY &
Fatima October 10, 2020 at 7:30 am #
Hi, I applied the code from the "SMOTE for Balancing Data" section. I have 27 features in my data, so when
I defined the dataset in make_classification I wrote n_features=27 instead of 2. Is that correct? And
can I apply SMOTE for balancing data when the goal of my model is prediction?
Thanks!
REPLY &
Jason Brownlee October 10, 2020 at 8:15 am #
If you have your own data, you don’t need to use make_classification as it is a function
for creating a synthetic dataset.
REPLY &
Fatima October 10, 2020 at 11:07 pm #
OK. I want to apply SMOTE. My data contains 1,469 rows, and the class label has
Risk = 1219 and NoRisk = 250, so it is imbalanced, and I want to apply oversampling (SMOTE) to
balance the data.
Firstly, I ran code that showed me a diagram of the class label, then I applied SMOTE,
(Over-sampling: SMOTE):
smote = SMOTE(ratio=’minority’)
X_sm, y_sm = smote.fit_sample(X, y)
It gave me an error:
TypeError: __init__() got an unexpected keyword argument ‘ratio’
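A likely fix, assuming a recent imbalanced-learn release: the ratio argument was renamed sampling_strategy, and fit_sample() was replaced by fit_resample(), so the call becomes something like the sketch below.
from imblearn.over_sampling import SMOTE

# Sketch for newer imbalanced-learn versions ('ratio' -> 'sampling_strategy',
# fit_sample() -> fit_resample()).
smote = SMOTE(sampling_strategy='minority')
X_sm, y_sm = smote.fit_resample(X, y)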
REPLY &
Jason Brownlee October 11, 2020 at 6:50 am #
REPLY &
Fatima October 15, 2020 at 4:09 am #
Hi Jason, I applied SMOTE on my data and solved the imbalance. The next step is deep learning (DL).
For DL, do I have to save the new (balanced) data and then run the DL algorithms on the new data?
Thanks!
REPLY &
Jason Brownlee October 15, 2020 at 6:19 am #
Only the training set should be balanced, not the test set.
You can transform the data in memory before fitting your model. Or you can save it if that is
easier for you.
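A minimal sketch of that workflow (the split size and model are illustrative): split first, balance only the training portion, and leave the test set with its original class distribution.
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Split first so the test set keeps the original class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Balance only the training data, in memory.
X_train_bal, y_train_bal = SMOTE().fit_resample(X_train, y_train)

# model.fit(X_train_bal, y_train_bal)   # e.g. a deep learning or sklearn model
# Evaluate on the untouched X_test, y_test.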
REPLY &
Samuel Smets October 16, 2020 at 6:39 pm #
Dear Jason,
I tried to implement SMOTE in my project, but cross_val_score kept returning nan.
I can't figure out why it returns nan. In your article you describe that you do get a result for this
code snippet.
Thanks a lot!
Samuel
REPLY &
Jason Brownlee October 17, 2020 at 5:59 am #
That’s surprising; perhaps configure the cross-validation to raise an error instead of returning nan, and
inspect the result.
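For example, a quick sketch assuming scikit-learn's cross_val_score is being used: passing error_score='raise' surfaces the underlying exception instead of silently reporting nan scores.
from sklearn.model_selection import cross_val_score

# Raise the hidden exception rather than reporting nan for failed folds.
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv,
                         n_jobs=-1, error_score='raise')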
REPLY &
Amit Pathak March 20, 2021 at 1:11 am #
REPLY &
Jason Brownlee March 20, 2021 at 5:25 am #
REPLY &
deva October 17, 2020 at 10:33 pm #
import numpy as np
import matplotlib.pyplot as plt
import smote_variants as sv
from sklearn import metrics
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, train_test_split

cv = StratifiedKFold(n_splits=10, shuffle=True)
classifier = AdaBoostClassifier(n_estimators=200)

y = df['label'].values
X = df.drop('label', axis=1).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=0, stratify=y)

# Oversample the training data only.
oversampler = sv.CCR()
X_samp, y_samp = oversampler.sample(X_train, y_train)
X_train = X_samp
y_train = y_samp

tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)

plt.figure(figsize=(10, 10))
i = 0
for train, test in cv.split(X_train, y_train):
    probas_ = classifier.fit(X_train[train], y_train[train]).predict_proba(X_train[test])
    # Compute the ROC curve and the area under the curve for this fold.
    fpr, tpr, thresholds = metrics.roc_curve(y_train[test], probas_[:, 1])
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = metrics.auc(fpr, tpr)
    aucs.append(roc_auc)
    plt.plot(fpr, tpr, lw=1, alpha=0.3,
             label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
    i += 1

plt.xlim([-0.01, 1.01])
plt.ylim([-0.01, 1.01])
plt.xlabel('False Positive Rate', fontsize=18)
plt.ylabel('True Positive Rate', fontsize=18)
plt.title('Cross-Validation ROC of ADABOOST', fontsize=18)
plt.legend(loc="lower right", prop={'size': 15})
plt.show()
I am confused because oversampling followed by AdaBoost works well on the training data, but performance on the test set is not good.
https://ibb.co/yPSrLx2
REPLY &
deva October 17, 2020 at 10:43 pm #
REPLY &
Jason Brownlee October 18, 2020 at 6:09 am #
Well done!
REPLY &
nabila March 24, 2021 at 3:18 am #
Hi Jason, can I ask? I applied the SMOTE bagging SVM and SMOTE boosting SVM methods but always
get errors. Can you help me find the code in Python?
I don’t have the capacity to debug your code, sorry. Perhaps these suggestions will help:
https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-code
REPLY &
Karrtik Iyer November 11, 2020 at 8:34 pm #
Hi Jason, thanks for the above example. Quick question: for SMOTE you used oversampling followed by
random undersampling. If we use ADASYN or SVMSMOTE instead, do you suggest we also use random
undersampling, as we do in the case of SMOTE?
REPLY &
Jason Brownlee November 12, 2020 at 6:38 am #
Perhaps try a few different combinations and discover what works well/best for your
specific dataset.
REPLY &
Marlon Lohrbach November 30, 2020 at 4:48 am #
Hi Jason,
I hope you are doing well! Is there a need to upsample with SMOTE() if I use StratifiedKFold or
RepeatedStratifiedKFold? I think that my stratified folding already takes care of class imbalance. So is
there a situation where you would prefer SMOTE over stratified folding?
Cheers
REPLY &
Jason Brownlee November 30, 2020 at 6:40 am #
SMOTE can be used with or without stratified CV, they address different problems –
sampling the training dataset vs evaluating the model.
REPLY &
Michael Tamillow December 9, 2020 at 4:41 am #
I don’t believe this technique “actually” works in many cases. You can read Jonas Peters’
work to understand why. It is really an example of Machine Learning Hocus-Pocus, or the creative
side of Data Science which defines “works” as “I tried it and saw an improvement” anecdotal
evidence. It is bad overall to not rigorously evaluate such methods through analytical and logical
approaches.
REPLY &
Jason Brownlee December 9, 2020 at 6:32 am #
REPLY &
Mohammad January 2, 2021 at 8:00 pm #
Hi Jason,
Thanks for all of these heuristic alternatives you suggested for balancing datasets.
REPLY &
Jason Brownlee January 3, 2021 at 5:55 am #
You’re welcome.
REPLY &
Ammar Sani January 3, 2021 at 2:50 am #
Hi Dr Jason. I saw at few articles, authors were compared imbalanced class and overlapped
class. Do you have an article for that?
REPLY &
Jason Brownlee January 3, 2021 at 5:58 am #
Almost all classes overlap – if not the problem would be trivial (e.g. linearly separable).
REPLY &
Ammar Sani January 4, 2021 at 1:19 pm #
Thanks Dr.
Then I started reading other sources just to strengthen and verify my understanding. I found this
article: https://link.springer.com/chapter/10.1007/978-3-642-13059-5_22 describing the
difference between imbalance and overlap.
Maybe because my fundamentals are not really strong, I don't really understand what they
taught in this article. So I came to your blog as usual (it really helps a newbie like me), to find an
article about the difference between overlap and imbalance. Unfortunately, I could
not find any.
REPLY &
Jason Brownlee January 4, 2021 at 1:42 pm #
Thanks for sharing, I’m not familiar with the article sorry.
REPLY &
Ammar Sani January 5, 2021 at 2:01 pm #
OK Dr. Jason
REPLY &
Keith January 26, 2021 at 5:32 pm #
Hi Jason thanks for this very informative post. But just wondering, does it make sense for
me to tune the model hyperparameters on an over/undersampled data set, like this?
REPLY &
Jason Brownlee January 27, 2021 at 6:03 am #
Perhaps.
REPLY &
David February 2, 2021 at 2:13 pm #
Please specify which modules are needed. It took me an hour to find the damn where()
function in numpy.
REPLY &
Jason Brownlee February 3, 2021 at 6:12 am #
This tutorial will show you how to setup your development environment:
https://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/
REPLY &
David February 2, 2021 at 2:15 pm #
My above comment looks too negative. THIS IS AWESOME; just please specify which
modules to import.
REPLY &
Jason Brownlee February 3, 2021 at 6:12 am #
The complete code example at the end of each section has the import statements with
the code.
This will help you copy the code from the tutorial:
https://machinelearningmastery.com/faq/single-faq/how-do-i-copy-code-from-a-tutorial
REPLY &
JAIKISHAN February 13, 2021 at 1:13 pm #
Hi Jason,
That was a very useful tutorial.
Thank u.
REPLY &
Jason Brownlee February 13, 2021 at 1:20 pm #
You’re welcome!
REPLY &
Aya February 16, 2021 at 10:40 am #
Thanks, but I am confused about applying SMOTE to the training data versus applying it to X and y as in
the examples, and about what the difference is between them.
REPLY &
Jason Brownlee February 16, 2021 at 1:38 pm #
REPLY &
Aya February 17, 2021 at 12:42 am #
OK, those are X and y (features and target), but why are you applying SMOTE to them? Does applying
SMOTE on the training data mean splitting X and y into train and test sets and then applying SMOTE to
X_train and y_train?
REPLY &
Jason Brownlee February 17, 2021 at 5:29 am #
The above example shows you how to use the SMOTE class and the effect it has – so
you feel comfortable with it and can start using it on your own project.
REPLY &
MS March 3, 2021 at 9:56 pm #
Hi,Jason
Can we implement SMOTENC with FAMD (prince) in an imblearn pipeline? If yes, can you provide me
with a reference regarding the approach and code?
REPLY &
Jason Brownlee March 4, 2021 at 5:48 am #
REPLY &
MS March 4, 2021 at 9:30 pm #
Thanks
REPLY &
Ethan March 16, 2021 at 1:14 pm #
Hi Jason, thanks for the great content on SMOTE. I have a categorical variable in my data,
which is location. I can use that in resampling thanks to SMOTENC. But is there a way to implement
SMOTE so that I can obtain homogeneity with respect to the minority class across locations? That is, SMOTE
would generate synthetic data in locations that initially have few instances of the minority class.
REPLY &
Jason Brownlee March 17, 2021 at 5:58 am #
You might need to implement the algorithm yourself to have such fine grained control
over where the algorithm chooses to resample.
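One rough way to sketch that idea (purely illustrative, not a library feature; it assumes binary 0/1 labels and a separate groups array holding the location of each row): apply SMOTE separately within each location that has enough minority examples, so the synthetic points stay inside that location.
import numpy as np
from imblearn.over_sampling import SMOTE

def per_location_smote(X, y, groups, k_neighbors=5):
    X_parts, y_parts = [], []
    for g in np.unique(groups):
        mask = groups == g
        X_g, y_g = X[mask], y[mask]
        # SMOTE needs more minority samples than neighbours; otherwise keep the group as-is.
        if np.bincount(y_g, minlength=2).min() > k_neighbors:
            X_g, y_g = SMOTE(k_neighbors=k_neighbors).fit_resample(X_g, y_g)
        X_parts.append(X_g)
        y_parts.append(y_g)
    return np.vstack(X_parts), np.concatenate(y_parts)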
REPLY &
Ethan March 17, 2021 at 7:19 am #
REPLY &
Jason Brownlee March 17, 2021 at 8:05 am #
You’re welcome.
REPLY &
Anthony March 20, 2021 at 12:50 am #
REPLY &
Jason Brownlee March 20, 2021 at 5:23 am #
REPLY &
hou March 22, 2021 at 12:22 pm #
So what should I do if the testing data is imbalanced? I split the data set into a 70% training set
and a 30% testing set. After I use SMOTE to balance the training set, I want to test the model on the
testing set, but the AUC is very low due to the imbalanced testing set. What should I do? Thank you very
much!
REPLY &
Jason Brownlee March 23, 2021 at 4:55 am #
REPLY &
sanket March 27, 2021 at 6:06 pm #
Hi Jason,
This was a very succinct article on class imbalance. Thanks a lot for the article and the links to the original
paper.
REPLY &
Jason Brownlee March 29, 2021 at 6:01 am #
Thanks!
REPLY &
Dorian March 31, 2021 at 2:07 am #
Hi, great article, but please do not recommend using sudo privileges when installing python
packages from pip! You are basically giving admin privileges to some random script pulled from the
internet which is really not good practice, and even dangerous. For more references, look here:
https://askubuntu.com/a/802594
Thanks a lot!
REPLY &
Jason Brownlee March 31, 2021 at 6:06 am #
REPLY &
Minh April 1, 2021 at 1:44 pm #
Hello Jason,
I'm a newbie here. I'm dealing with a time series forecasting regression problem. That means the
prediction model is required to learn from a series of past observations to predict the next value in
the sequence.
I'm using the 1998 World Cup web site dataset (it consists of all the requests made to the 1998 World
Cup web site between April 30, 1998 and July 26, 1998). Here is the FTP link:
ftp://ita.ee.lbl.gov/html/contrib/WorldCup.html
I preprocess the dataset by aggregating all logs that occur within the same minute into one
accumulative record.
I want to ask: is my dataset imbalanced, and why?
Thanks for your help.
REPLY &
Jason Brownlee April 2, 2021 at 5:35 am #
No. Typically imbalance is for classification tasks, and you said your problem is
regression (predicting a numerical value).
REPLY &
m.cihat April 15, 2021 at 12:23 am #
sm = SMOTE(random_state=42)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
preds = model.predict(X_test)
You said SMOTE should be applied only to the training set. So is the code above wrong?
REPLY &
m.cihat April 15, 2021 at 12:23 am #
REPLY &
Jason Brownlee April 15, 2021 at 5:29 am #
I try not to comment on other people’s code – they can do whatever they like.
REPLY &
Jason Brownlee April 15, 2021 at 5:27 am #
Yes. Fatally.
REPLY &
Salah May 1, 2021 at 3:16 am #
Hi, I’d like to thank you for your blog. It’s been a real help for me. As a beginner, I’d
like to ask you a question, please. Does applying SMOTE with cross-validation result in a biased
model? I mean, when you set up the pipeline to apply SMOTE and then fit the model, does cross-validation
run the validation step on the original data or on the oversampled data? I saw in a
Stack Overflow post that SMOTE should be applied only to the training set and the
model should be tested only on the original data. Does cross-validation meet this criterion too?
Thanks.
REPLY &
Jason Brownlee May 1, 2021 at 6:10 am #
When using SMOTE in a pipeline it is only applied to the training set, never the test set
within a cross-validation evaluation/test harness.
REPLY &
Jainey May 7, 2021 at 1:14 pm #
Hi, first of all, I just want to say thanks for your contribution. And I have a question:
scores = cross_val_score(pipeline, X, y, scoring=’roc_auc’, cv=cv, n_jobs=-1)
score = mean(scores)
It seems to mean little when you calculate cross_val_score on your training data; I mean, AUC
matters when you calculate it on your testing data. I have a high AUC in cross-validation but 0.5 on testing
data.
REPLY &
Jason Brownlee May 8, 2021 at 6:32 am #
Sorry, I don’t understand your question. Perhaps you could rephrase it?