Overestimation of OOB score, probable bug in resampling? #655
Comments
Indeed, I think that we have an issue there. We resample during the parallel fit (in your example, it should only correspond to a shuffling of the data) and then generate the bootstrap. However, the way the OOB score is currently computed does not take the sampling into account. It means that some samples counted as OOB are actually used by the trees, which explains the over-optimistic score. We should reimplement `set_oob_score` to take the sampling of X into account.
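To make the failure mode concrete, here is a minimal, hypothetical NumPy sketch (not imbalanced-learn's actual implementation) of how per-tree OOB membership has to be tracked when a resampling step precedes the bootstrap: the in-bag set must be mapped back to the original index space, otherwise samples that the tree actually saw can be mistakenly treated as out of bag.

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy data: 100 samples, 90 in the majority class (0) and 10 in the minority class (1).
y = np.array([0] * 90 + [1] * 10)
all_indices = np.arange(y.shape[0])

# Step 1: per-tree resampling (here: random undersampling of the majority class).
minority = all_indices[y == 1]
majority = rng.choice(all_indices[y == 0], size=minority.size, replace=False)
resampled = np.concatenate([majority, minority])

# Step 2: bootstrap drawn from the *resampled* set.
bootstrap = rng.choice(resampled, size=resampled.size, replace=True)

# The samples truly out of bag for this tree are the original indices that do
# not appear in the bootstrap; the bootstrap indices must be interpreted in the
# original index space, not as positions in the resampled array.
in_bag = np.unique(bootstrap)
true_oob = np.setdiff1d(all_indices, in_bag)

print(f"in-bag (unique): {in_bag.size}, true OOB: {true_oob.size}")
```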
Thanks @glemaitre for the super fast reply. However, reinstalling (via conda) did not change anything for me, and installing from GitHub broke for me:

```
ModuleNotFoundError                       Traceback (most recent call last)
~\Anaconda3\envs\imblearn\lib\site-packages\imblearn\__init__.py in
~\Anaconda3\envs\imblearn\lib\site-packages\imblearn\combine\__init__.py in
~\Anaconda3\envs\imblearn\lib\site-packages\imblearn\combine\_smote_enn.py in
~\Anaconda3\envs\imblearn\lib\site-packages\imblearn\base.py in
~\Anaconda3\envs\imblearn\lib\site-packages\imblearn\utils\__init__.py in
~\Anaconda3\envs\imblearn\lib\site-packages\imblearn\utils\_validation.py in
```

When will this fix be available via conda-forge or PyPI?
The package was updated on PyPI 20 minutes ago. Be aware that running imbalanced-learn 0.6 requires scikit-learn 0.22 (both are available on PyPI). Imbalanced-learn will shortly be available on conda-forge (probably by the end of the day).
I think that scikit-learn is still not available on the conda channel (but it should be soon as well).
Works. Thanks a lot!
Hi @glemaitre, after some investigation it appears that the bug is not completely gone. The OOB score is what it should be in the balanced case (the example in my initial post), but in an imbalanced case it is still far too high:

```python
import numpy as np
from sklearn import metrics
from imblearn import ensemble

X = np.arange(1000).reshape(-1, 1)
# The label vector is not shown in the original comment; an imbalanced y that
# is independent of X is assumed here so the snippet runs.
rng = np.random.RandomState(0)
y = rng.binomial(n=1, p=0.1, size=1000)

rf = ensemble.BalancedRandomForestClassifier(oob_score=True, n_estimators=1000)
rf.fit(X, y)
print(rf.oob_score_)

conf_mat = metrics.confusion_matrix(
    y_true=y, y_pred=rf.oob_decision_function_.argmax(axis=1)
)
print(conf_mat)
```

The confusion matrix shows that the classifier classifies the minority class perfectly while being OK on the negative class. This should not be possible in this case.
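As an independent check of the OOB estimate (not part of the original report), one could compare it against a stratified cross-validated balanced accuracy on the same uninformative data; the two should roughly agree, so a large gap points at the OOB bookkeeping. A sketch, assuming imbalanced-learn >= 0.6 and scikit-learn >= 0.22:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.ensemble import BalancedRandomForestClassifier

rng = np.random.RandomState(0)
X = np.arange(1000).reshape(-1, 1)
# Assumed imbalanced, feature-independent target (roughly 10% positives).
y = rng.binomial(n=1, p=0.1, size=1000)

clf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(
    clf, X, y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="balanced_accuracy",
)
# With no signal in X, this should hover around 0.5, unlike the inflated OOB score.
print(scores.mean())
```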
My versions are now |
OK, found it: some bad interaction with the new parameters.
The fix is here: #661
Description
When I calculate the out-of-bag (OOB) score, it comes out quite high even though there is no connection between the features and the labels. I assume that something goes wrong in keeping track of which samples are out of bag for each tree, so samples get evaluated on trees where they were in fact in the bag.
Steps/Code to Reproduce
Example:
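The original snippet is not reproduced in this thread; a minimal sketch of the kind of reproduction described (features unrelated to coin-flip labels, OOB scoring enabled) might look like this:

```python
import numpy as np
from imblearn import ensemble

# Assumed reconstruction of the reported setup: 1000 samples whose labels are
# independent coin flips, so there is no feature/label relationship.
rng = np.random.RandomState(0)
X = np.arange(1000).reshape(-1, 1)
y = rng.binomial(n=1, p=0.5, size=1000)

rf = ensemble.BalancedRandomForestClassifier(oob_score=True, n_estimators=1000)
rf.fit(X, y)
print(rf.oob_score_)
```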
The reported output is 0.838.
Expected Results
Since there is no relationship between X and y (y are just independent coin flips), the OOB score should be around 0.5.
Actual Results
Something in the range of 0.8, which is a highly significant deviation for a sample size of 1000.
Versions
Windows-10-10.0.18362-SP0
Python 3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
NumPy 1.16.5
SciPy 1.3.1
Scikit-Learn 0.21.3
Imbalanced-Learn 0.5.0