Overestimation of OOB score, probable bug in resampling? #655


Closed
SvenWarnke opened this issue Dec 4, 2019 · 9 comments · Fixed by #656 or #661
Labels
Type: Bug Indicates an unexpected problem or unintended behavior

Comments


SvenWarnke commented Dec 4, 2019

Description

When I calculate the out-of-bag score, it comes out quite high even when there is no relationship between features and labels. I assume something goes wrong in keeping track of which samples are out of bag for each tree, so samples get evaluated on trees where they were in fact in the bag.

Steps/Code to Reproduce

Example:

```python
import numpy as np
from imblearn import ensemble

X = np.arange(1000).reshape(-1, 1)
y = np.random.binomial(1, 0.5, size=1000)

rf = ensemble.BalancedRandomForestClassifier(oob_score=True)
rf.fit(X, y)
rf.oob_score_
```

The output is 0.838.

Expected Results

Since there is no relationship between X and y (y is just independent coin flips), the OOB score should be around 0.5.

Actual Results

Something in the range of 0.8, which is a highly significant deviation for a sample size of 1000.

Versions

Windows-10-10.0.18362-SP0
Python 3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
NumPy 1.16.5
SciPy 1.3.1
Scikit-Learn 0.21.3
Imbalanced-Learn 0.5.0

@glemaitre
Member

Indeed, I think we have an issue there. We resample during the parallel fit (in your example it should only amount to a shuffling of the data) and then generate the bootstrap. However, the way the OOB score is currently computed does not take the resampling into account. It means that some samples counted as out of bag are actually used by the trees, which explains the over-optimistic score obtained.

We should reimplement `set_oob_score` to take the resampling of X into account.
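A minimal sketch of the mechanism (my own illustration, not the library's internals): once a resampling step sits in front of the bootstrap, the OOB mask for each tree must be expressed in the *original* sample indices, otherwise in-bag samples get counted as out of bag.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
n_samples = 100
X = np.arange(n_samples).reshape(-1, 1)
y = rng.binomial(1, 0.5, size=n_samples)

# Stand-in for the resampling that happens before bootstrapping
# (here just a random subset; the real sampler balances the classes).
resampled = rng.choice(n_samples, size=60, replace=False)

# The bootstrap is drawn from the *resampled* indices, not the original ones.
bootstrap = rng.choice(resampled, size=len(resampled), replace=True)

tree = DecisionTreeClassifier(random_state=0).fit(X[bootstrap], y[bootstrap])

# Correct OOB mask, in original indices: only samples absent from this
# tree's bootstrap are truly unseen by it. Computing the mask against
# positions in the resampled array instead would mark in-bag samples as OOB.
oob_mask = np.ones(n_samples, dtype=bool)
oob_mask[bootstrap] = False

# Sanity check: no in-bag sample is marked out of bag.
assert not oob_mask[bootstrap].any()
```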

@SvenWarnke
Author

Thanks @glemaitre for the super fast reply. However, reinstalling (via conda) did not change anything for me, and installing from GitHub broke for me:

```
ModuleNotFoundError                       Traceback (most recent call last)
in <module>
      4 import scipy; print("SciPy", scipy.__version__)
      5 import sklearn; print("Scikit-Learn", sklearn.__version__)
----> 6 import imblearn; print("Imbalanced-Learn", imblearn.__version__)

~\Anaconda3\envs\imblearn\lib\site-packages\imblearn\__init__.py in <module>
     32 Module which allowing to create pipeline with scikit-learn estimators.
     33 """
---> 34 from . import combine
     35 from . import ensemble
     36 from . import exceptions

~\Anaconda3\envs\imblearn\lib\site-packages\imblearn\combine\__init__.py in <module>
      3 """
      4
----> 5 from ._smote_enn import SMOTEENN
      6 from ._smote_tomek import SMOTETomek
      7

~\Anaconda3\envs\imblearn\lib\site-packages\imblearn\combine\_smote_enn.py in <module>
      8 from sklearn.utils import check_X_y
      9
---> 10 from ..base import BaseSampler
     11 from ..over_sampling import SMOTE
     12 from ..over_sampling.base import BaseOverSampler

~\Anaconda3\envs\imblearn\lib\site-packages\imblearn\base.py in <module>
     14 from sklearn.utils.multiclass import check_classification_targets
     15
---> 16 from .utils import check_sampling_strategy, check_target_type
     17
     18

~\Anaconda3\envs\imblearn\lib\site-packages\imblearn\utils\__init__.py in <module>
      5 from ._docstring import Substitution
      6
----> 7 from ._validation import check_neighbors_object
      8 from ._validation import check_target_type
      9 from ._validation import check_sampling_strategy

~\Anaconda3\envs\imblearn\lib\site-packages\imblearn\utils\_validation.py in <module>
     11
     12 from sklearn.base import clone
---> 13 from sklearn.neighbors._base import KNeighborsMixin
     14 from sklearn.neighbors import NearestNeighbors
     15 from sklearn.utils.multiclass import type_of_target

ModuleNotFoundError: No module named 'sklearn.neighbors._base'
```

When will this fix be available via conda-forge or PyPi?

@glemaitre
Member

The package was updated on PyPI 20 minutes ago. Be aware that running imbalanced-learn 0.6 requires scikit-learn 0.22 (both available on PyPI). imbalanced-learn will shortly be available on conda-forge as well (probably by the end of the day).

@glemaitre
Member

I think that scikit-learn is still not available on the conda channel (but it should be there soon as well).

@SvenWarnke
Author

Works. Thanks a lot!

@SvenWarnke
Author

Hi @glemaitre, after some investigation it appears that the bug is not completely gone. The OOB score is what it should be in the balanced case (the example in my initial post), but in an unbalanced case it is still far too high.
```python
import numpy as np
from imblearn import ensemble
from sklearn import metrics

X = np.arange(1000).reshape(-1, 1)
y = np.random.binomial(1, 0.1, size=1000)

rf = ensemble.BalancedRandomForestClassifier(oob_score=True, n_estimators=1000)
rf.fit(X, y)

print(rf.oob_score_)

conf_mat = metrics.confusion_matrix(y_true=y, y_pred=rf.oob_decision_function_.argmax(axis=1))

print(conf_mat)
```
gives me

```
0.972
[[880  26]
 [  2  92]]
```

The confusion matrix shows that the classifier identifies the minority class almost perfectly while doing fine on the majority class. This should not be possible here, since the labels are independent of the features.
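As a sanity check (my own comparison, not part of the fix), an honest held-out evaluation of the same kind of data should land near chance, which is what a correct OOB estimate is supposed to emulate. A sketch using plain scikit-learn:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import balanced_accuracy_score

rng = np.random.RandomState(42)
X = np.arange(1000).reshape(-1, 1)
y = rng.binomial(1, 0.1, size=1000)

# Each sample is predicted by a model that never saw it during training.
pred = cross_val_predict(RandomForestClassifier(random_state=0), X, y, cv=5)

# With no feature/label relationship this stays near 0.5, far from the
# balanced performance implied by the confusion matrix above.
print(balanced_accuracy_score(y, pred))
```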

@SvenWarnke
Author

My versions are now
Windows-10-10.0.18362-SP0
Python 3.8.0 | packaged by conda-forge | (default, Nov 22 2019, 19:04:36) [MSC v.1916 64 bit (AMD64)]
NumPy 1.17.3
SciPy 1.3.2
Scikit-Learn 0.22
Imbalanced-Learn 0.6.0

@glemaitre glemaitre reopened this Dec 6, 2019
@glemaitre glemaitre added the Type: Bug Indicates an unexpected problem or unintended behavior label Dec 6, 2019
@glemaitre
Member

OK, found it: a bad interaction with the new `max_samples` parameter of the trees.
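For context, an illustration of that parameter (a sketch of my reading, not the actual fix): scikit-learn 0.22 introduced `max_samples` to cap the bootstrap size per tree, so any OOB bookkeeping has to agree with the number of samples each tree actually drew.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.arange(200).reshape(-1, 1)
y = np.tile([0, 1], 100)

# With max_samples=0.5 each bootstrap draws only 100 of the 200 samples,
# so the per-tree OOB set is larger than with a full-size bootstrap; the
# OOB computation must account for this smaller draw.
rf = RandomForestClassifier(n_estimators=25, max_samples=0.5,
                            oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)
```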

@glemaitre
Member

The fix is there: #661
