Overestimation of OOB score, probable bug in resampling? #655


Closed
SvenWarnke opened this issue Dec 4, 2019 · 9 comments · Fixed by #656 or #661
Labels
Type: Bug Indicates an unexpected problem or unintended behavior

Comments


SvenWarnke commented Dec 4, 2019

Description

When I calculate the out-of-bag score, it comes out quite high even when there is no relationship between features and labels. I assume something goes wrong in keeping track of which samples are out of bag for each tree, so samples get evaluated on trees where they were in fact in the bag.

Steps/Code to Reproduce

Example:

```python
import numpy as np
from imblearn import ensemble

X = np.arange(1000).reshape(-1, 1)
y = np.random.binomial(1, 0.5, size=1000)

rf = ensemble.BalancedRandomForestClassifier(oob_score=True)
rf.fit(X, y)
rf.oob_score_
```

The output is 0.838.

Expected Results

Since there is no relationship between X and y (y is just independent coin flips), the OOB score should be around 0.5.

Actual Results

Something in the range of 0.8, which is a highly significant deviation for a sample size of 1000.

Versions

Windows-10-10.0.18362-SP0
Python 3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
NumPy 1.16.5
SciPy 1.3.1
Scikit-Learn 0.21.3
Imbalanced-Learn 0.5.0

@glemaitre
Member

Indeed, I think we have an issue there. We resample during the parallel fit (in your example it should only amount to a shuffling of the data) and then generate the bootstrap. However, the way the OOB score is currently computed does not take the resampling into account. It means that some samples counted as out of bag are actually used by the trees, which explains the over-optimistic score obtained.

We should reimplement `set_oob_score` to take the resampling of X into account.
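A minimal sketch of the mechanism (my own illustration, not the library's internals): once a resampling step sits in front of the bootstrap, the OOB mask for each tree must be expressed in the *original* sample indices, otherwise in-bag samples get counted as out of bag.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
n_samples = 100
X = np.arange(n_samples).reshape(-1, 1)
y = rng.binomial(1, 0.5, size=n_samples)

# Stand-in for the resampling that happens before bootstrapping
# (here just a random subset; the real sampler balances the classes).
resampled = rng.choice(n_samples, size=60, replace=False)

# The bootstrap is drawn from the *resampled* indices, not the original ones.
bootstrap = rng.choice(resampled, size=len(resampled), replace=True)

tree = DecisionTreeClassifier(random_state=0).fit(X[bootstrap], y[bootstrap])

# Correct OOB mask, in original indices: only samples absent from this
# tree's bootstrap are truly unseen by it. Computing the mask against
# positions in the resampled array instead would mark in-bag samples as OOB.
oob_mask = np.ones(n_samples, dtype=bool)
oob_mask[bootstrap] = False

# Sanity check: no in-bag sample is marked out of bag.
assert not oob_mask[bootstrap].any()
```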

@SvenWarnke
Author

Thanks @glemaitre for the super fast reply. However, reinstalling (via conda) did not change anything for me, and installing from GitHub broke for me:

```
ModuleNotFoundError                       Traceback (most recent call last)
in <module>
      4 import scipy; print("SciPy", scipy.__version__)
      5 import sklearn; print("Scikit-Learn", sklearn.__version__)
----> 6 import imblearn; print("Imbalanced-Learn", imblearn.__version__)

~\Anaconda3\envs\imblearn\lib\site-packages\imblearn\__init__.py in <module>
     32 Module which allowing to create pipeline with scikit-learn estimators.
     33 """
---> 34 from . import combine
     35 from . import ensemble
     36 from . import exceptions

~\Anaconda3\envs\imblearn\lib\site-packages\imblearn\combine\__init__.py in <module>
      3 """
      4
----> 5 from ._smote_enn import SMOTEENN
      6 from ._smote_tomek import SMOTETomek
      7

~\Anaconda3\envs\imblearn\lib\site-packages\imblearn\combine\_smote_enn.py in <module>
      8 from sklearn.utils import check_X_y
      9
---> 10 from ..base import BaseSampler
     11 from ..over_sampling import SMOTE
     12 from ..over_sampling.base import BaseOverSampler

~\Anaconda3\envs\imblearn\lib\site-packages\imblearn\base.py in <module>
     14 from sklearn.utils.multiclass import check_classification_targets
     15
---> 16 from .utils import check_sampling_strategy, check_target_type
     17
     18

~\Anaconda3\envs\imblearn\lib\site-packages\imblearn\utils\__init__.py in <module>
      5 from ._docstring import Substitution
      6
----> 7 from ._validation import check_neighbors_object
      8 from ._validation import check_target_type
      9 from ._validation import check_sampling_strategy

~\Anaconda3\envs\imblearn\lib\site-packages\imblearn\utils\_validation.py in <module>
     11
     12 from sklearn.base import clone
---> 13 from sklearn.neighbors._base import KNeighborsMixin
     14 from sklearn.neighbors import NearestNeighbors
     15 from sklearn.utils.multiclass import type_of_target

ModuleNotFoundError: No module named 'sklearn.neighbors._base'
```

When will this fix be available via conda-forge or PyPi?

@glemaitre
Member

The package was updated on PyPI 20 minutes ago. Be aware that running imbalanced-learn 0.6 requires scikit-learn 0.22 (both available on PyPI). imbalanced-learn will shortly be available on conda-forge as well (probably by the end of the day).

@glemaitre
Member

I think that scikit-learn is still not available on the conda channel (but it should be there soon as well).

@SvenWarnke
Author

Works. Thanks a lot!

@SvenWarnke
Author

Hi @glemaitre, after some investigation it appears that the bug is not completely gone. The OOB score is what it should be in the balanced case (the example in my initial post), but in an unbalanced case it is still far too high.
```python
import numpy as np
from imblearn import ensemble
from sklearn import metrics

X = np.arange(1000).reshape(-1, 1)
y = np.random.binomial(1, 0.1, size=1000)

rf = ensemble.BalancedRandomForestClassifier(oob_score=True, n_estimators=1000)
rf.fit(X, y)

print(rf.oob_score_)

conf_mat = metrics.confusion_matrix(y_true=y, y_pred=rf.oob_decision_function_.argmax(axis=1))

print(conf_mat)
```
gives me

```
0.972
[[880  26]
 [  2  92]]
```

The confusion matrix shows that the classifier identifies the minority class almost perfectly while doing fine on the majority class. This should not be possible here, since the labels are independent of the features.
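As a sanity check (my own comparison, not part of the fix), an honest held-out evaluation of the same kind of data should land near chance, which is what a correct OOB estimate is supposed to emulate. A sketch using plain scikit-learn:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import balanced_accuracy_score

rng = np.random.RandomState(42)
X = np.arange(1000).reshape(-1, 1)
y = rng.binomial(1, 0.1, size=1000)

# Each sample is predicted by a model that never saw it during training.
pred = cross_val_predict(RandomForestClassifier(random_state=0), X, y, cv=5)

# With no feature/label relationship this stays near 0.5, far from the
# balanced performance implied by the confusion matrix above.
print(balanced_accuracy_score(y, pred))
```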

@SvenWarnke
Author

My versions are now
Windows-10-10.0.18362-SP0
Python 3.8.0 | packaged by conda-forge | (default, Nov 22 2019, 19:04:36) [MSC v.1916 64 bit (AMD64)]
NumPy 1.17.3
SciPy 1.3.2
Scikit-Learn 0.22
Imbalanced-Learn 0.6.0

@glemaitre glemaitre reopened this Dec 6, 2019
@glemaitre glemaitre added the Type: Bug Indicates an unexpected problem or unintended behavior label Dec 6, 2019
@glemaitre
Member

OK, found it: a bad interaction with the new `max_samples` parameter of the trees.
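For context, an illustration of that parameter (a sketch of my reading, not the actual fix): scikit-learn 0.22 introduced `max_samples` to cap the bootstrap size per tree, so any OOB bookkeeping has to agree with the number of samples each tree actually drew.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.arange(200).reshape(-1, 1)
y = np.tile([0, 1], 100)

# With max_samples=0.5 each bootstrap draws only 100 of the 200 samples,
# so the per-tree OOB set is larger than with a full-size bootstrap; the
# OOB computation must account for this smaller draw.
rf = RandomForestClassifier(n_estimators=25, max_samples=0.5,
                            oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)
```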

@glemaitre
Member

The fix is there: #661
