
[MRG] Add class_weight parameter to CalibratedClassifierCV #17541


Open · wants to merge 48 commits into main

Conversation

@amascia commented Jun 9, 2020

Reference Issues/PRs

To the best of my knowledge, no existing issues/PRs address this feature.

What does this implement/fix? Explain your changes.

This PR adds a class_weight parameter to the CalibratedClassifierCV class, complementing the existing sample_weight parameter of its fit method.
Passing class_weight='balanced' to CalibratedClassifierCV balances the dataset on which the calibration method (sigmoid or isotonic) is trained, provided no sample_weight is passed.
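As an illustrative sketch (not the PR's actual internal code): when class_weight="balanced" is passed and no sample_weight is given, per-sample weights can be derived from the class frequencies with the existing compute_sample_weight utility:

```python
# Illustrative only: how class_weight="balanced" could translate into
# per-sample weights when the user passes no sample_weight explicitly.
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

y = np.array([0, 0, 0, 0, 1, 1])  # imbalanced labels: 4 negatives, 2 positives
sw = compute_sample_weight("balanced", y)
# Balanced weight for class c is n_samples / (n_classes * n_samples_c):
# negatives get 6 / (2 * 4) = 0.75, positives get 6 / (2 * 2) = 1.5.
print(sw)
```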

Any other comments?

Please find attached 4 scripts that test this added parameter: 2 for binary classification (with cv='prefit' and cv=2) and 2 for multiclass classification (with cv='prefit' and cv=2).

class_weighted_calibratedClassifierCV.zip

@adrinjalali
Member

Overall this looks okay. Try merging with the master branch; some of the failing tests should then pass.

You can also parameterize and refactor the test.

It would also be nice if you could add a section to the user guide or add an example to the example gallery regarding the effect of this parameter.
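As a hypothetical illustration of the parameterization suggestion, the tests could be collapsed into one pytest-parameterized function (the test name is invented for this sketch, and the proposed class_weight="balanced" option is emulated by passing balanced per-sample weights through the existing sample_weight argument):

```python
import numpy as np
import pytest
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight


@pytest.mark.parametrize("method", ["sigmoid", "isotonic"])
@pytest.mark.parametrize("cv", [2, 3])
def test_calibration_with_balanced_weights(method, cv):
    # Hypothetical test: emulates class_weight="balanced" by passing
    # balanced per-sample weights explicitly to fit.
    X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)
    sw = compute_sample_weight("balanced", y)
    clf = CalibratedClassifierCV(
        LogisticRegression(max_iter=1000), method=method, cv=cv
    )
    clf.fit(X, y, sample_weight=sw)
    proba = clf.predict_proba(X)
    assert proba.shape == (len(y), 2)
    assert np.allclose(proba.sum(axis=1), 1.0)  # rows are probability vectors
```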

@amascia
Author

amascia commented Jul 17, 2020

Hi,
Thank you for your comments. I've merged my branch with sklearn's master branch and updated the calibration section of the user guide with this feature. Please tell me if there is anything else to do.

Member

@adrinjalali left a comment

Tests are failing.

@amascia requested a review from adrinjalali August 3, 2020 12:48
@adrinjalali
Member

@joshuacwnewton would you feel comfortable reviewing this one?

Contributor

@joshuacwnewton left a comment

Please note: I'm not a core maintainer. I'm leaving my review to help move things along, but two core maintainers will need to approve this PR for it to be merged. Thanks for your patience!

Thanks for the contribution, @amascia! This is a great start. I've left a few suggestions below. 🙂

I've also noticed that some continuous integration tests are failing for this PR. To fix some of the tests, please see my comments below. However, for some of the other tests, I think the FutureWarnings come from #16474, which added a @_deprecate_positional_args decorator to CalibratedClassifierCV. In other words, I believe those other failing tests are unrelated to your changes.

@amascia
Author

amascia commented Sep 1, 2020

I have updated the PR with your comments, thank you @joshuacwnewton and @adrinjalali

Member

@jjerphan left a comment

LGTM.
Thank you, @amascia. This is a nice first time contribution.

@cmarmo
Contributor

cmarmo commented Dec 16, 2021

@lucyleeow might be interested in reviewing?

Member

@lucyleeow left a comment

This is a useful addition, thanks! Just had some questions.

@amascia requested review from lucyleeow and cmarmo January 10, 2022 10:39
Contributor

@cmarmo left a comment

Thanks @amascia. I have some comments about documentation. I'm afraid I don't have "acceptance clearance".
Perhaps @adrinjalali, who first checked your PR, might want to finalize the review? Thanks!

@cmarmo added this to the 1.1 milestone Jan 10, 2022
@ogrisel
Member

ogrisel commented Jan 13, 2022

Thanks to both the author and the reviewers for the PR. It looks very clean. However, it's not obvious to me why passing class_weight="balanced" to the calibrator is a good idea from a mathematical point of view. So before considering merging this PR, I would really like to see an empirical study showing that passing class_weight="balanced" to the calibrator (especially with cv="prefit") actually improves the shape of the calibration curve (with both uniform and quantile binning) and the Brier score of a binary classifier itself fitted with class_weight="balanced".

To make the study interesting, the dataset should be large enough, have a severe class imbalance (e.g. 1 to 10 or more between the positive and negative class), and not be too easy. One way to generate such a dataset would be something like:

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=100_000,
    weights=[0.95, 0.05],  # ~20x more negative than positive samples
    n_features=10,
    n_informative=10,
    n_redundant=0,
    n_repeated=0,
    random_state=0,
)
X = X[:, :8]  # hide some informative features to make the problem more difficult

Then one could evaluate estimators such as RandomForestClassifier or make_pipeline(SplineTransformer(), LogisticRegression()).
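A minimal sketch of such a study, under the assumption that the proposed class_weight="balanced" option is emulated by passing balanced per-sample weights (via the existing compute_sample_weight utility) to the calibrator's fit; a smaller sample size and a plain LogisticRegression are used here only to keep the sketch fast:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# Imbalanced dataset in the spirit of the one proposed above, but smaller.
X, y = make_classification(
    n_samples=20_000, weights=[0.95, 0.05], n_features=10,
    n_informative=10, n_redundant=0, n_repeated=0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

scores = {}
for balanced in (False, True):
    cal = CalibratedClassifierCV(
        LogisticRegression(max_iter=1000), method="sigmoid", cv=2
    )
    # Emulate class_weight="balanced" with explicit per-sample weights.
    sw = compute_sample_weight("balanced", y_train) if balanced else None
    cal.fit(X_train, y_train, sample_weight=sw)
    proba = cal.predict_proba(X_test)[:, 1]
    scores[balanced] = brier_score_loss(y_test, proba)
    print(f"balanced reweighting={balanced}: Brier score={scores[balanced]:.4f}")
```

A full study would additionally plot calibration curves and repeat the comparison over several CV splits and estimators.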

Ideally, this study can be done with a notebook shared on gist.github.com.

If we find situations where passing class_weight="balanced" to the calibrator improves the shape of the calibration curve or calibration-sensitive metrics such as the log loss or the Brier score (or threshold-sensitive "hard" metrics such as f1_score), then we can probably proceed to merge the PR as is, maybe after converting the notebook into an example.
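For reference, calibration_curve supports both binning strategies mentioned above; a quick self-contained check on synthetic, well-calibrated scores (the data here is invented purely for illustration):

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(size=5_000)  # synthetic predicted probabilities
# Draw labels so that P(y=1 | y_prob) = y_prob, i.e. perfectly calibrated.
y_true = (rng.uniform(size=5_000) < y_prob).astype(int)

for strategy in ("uniform", "quantile"):
    frac_pos, mean_pred = calibration_curve(
        y_true, y_prob, n_bins=10, strategy=strategy
    )
    # For a well-calibrated model the two arrays track each other closely.
    print(strategy, float(np.abs(frac_pos - mean_pred).max()))
```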

If on the other hand we find out that passing class_weight="balanced" is always detrimental to calibration, I think I would still be in favor of accepting the PR, but we should consider documenting in the docstring that passing class_weight to the calibrated classifier is rarely a good way to improve the calibration of classifiers for imbalanced classification problems.

In my experience, passing class_weight="balanced" to the raw classifiers themselves (without CalibratedClassifierCV) does not seem to make a significant change in the Brier score on such imbalanced datasets with the classifiers proposed above (either the effect is within the CV sampling noise or it is actually detrimental!). So we are already misleading our users in other parts of scikit-learn with this class_weight="balanced" option.

Somewhat related discussion: #13227 (in particular the example).

@adrinjalali
Member

Anybody working on the study? Removing it from the milestone for now, but happy to be pinged for reviews once the study is done.

@ogrisel could you please write why you think it might not be mathematically sane?

@adrinjalali removed this from the 1.1 milestone Apr 7, 2022
@daustria

Is this stalled PR available? I would like to try to pick up the work: carrying out the study described above and porting @amascia's changes to the new codebase (I've already taken a look and it seems they can go in without many modifications).

@lucyleeow
Member

lucyleeow commented Aug 17, 2024

Hi @daustria, please take a look at #17541 (comment). This PR is stalled, but there is uncertainty about whether to add this feature, which needs to be decided first. Thank you.

@daustria

daustria commented Aug 17, 2024

Yes, I saw the comment, and I was thinking of trying my hand at making such a study. Unless maybe it is not something for volunteers to do?

Edit: no longer working on this, feel free to take over the work.

@lucyleeow
Member

Please go ahead, look forward to seeing your findings!

9 participants