FIX CalibratedClassifierCV to handle correctly sample_weight when ensemble=False #20638

Conversation

JulienB-78
Contributor

Reference Issues/PRs

Fixes #20610

What does this implement/fix? Explain your changes.

sample_weight is now passed to cross_val_predict via the fit_params dict such that the weights are used when training the classifier for each cv split.

The test checking that the calibration is correct when passing sample_weight has been updated to verify that the weights are actually effective. Previously, the test only checked that the predictions of CalibratedClassifierCV with and without sample_weight differed, not that the calibration itself was working.
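As a minimal sketch of the behaviour this fix enables (illustrative, not the actual patch; the choice of LogisticRegression as base estimator is an assumption for the example):

```python
# Sketch of the fixed behaviour: with ensemble=False, sample_weight passed to
# fit is forwarded to the internal cross_val_predict via fit_params, so each
# CV fit of the base classifier trains with the weights.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Imbalanced two-class dataset: 100 samples of class 0, 1000 of class 1.
X, y = make_blobs(n_samples=(100, 1000), center_box=(-1, 1), random_state=42)
# Upweight the minority class to compensate for the imbalance (10x vs 1x).
sample_weight = 9.0 * (y == 0) + 1.0

calib = CalibratedClassifierCV(LogisticRegression(), ensemble=False, cv=3)
calib.fit(X, y, sample_weight=sample_weight)
proba = calib.predict_proba(X)
```

With the fix, the calibrated probabilities reflect the reweighted class balance instead of silently ignoring the weights during the cross-validated fits.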

Any other comments?

@JulienB-78
Contributor Author

There are currently 4 check failures:

  • 3 in the Azure pipeline
  • 1 related to Changelog

Am I supposed to fix them?
If yes, suggestions are welcome because I don't really know where to start for any of them...

@glemaitre
Member

glemaitre commented Jul 31, 2021

The failure regarding OpenMP can be ignored (see #20640).

Please add an entry to the change log at doc/whats_new/v1.0.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:. This will solve the failure with "Check Changelog".

Member

@glemaitre glemaitre left a comment

It looks good. I am thinking that we could add an additional check.

JulienB-78 and others added 4 commits August 8, 2021 14:40
add stratify=y when splitting train and test sets

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
…ulienB-78/scikit-learn into fix_calibrationclassifier_with_weights
@JulienB-78
Contributor Author

The pull request is still open and seems stuck. I asked questions about a month ago regarding some requests that were made and received no answer (summer break, I guess).

What's next?

Member

@glemaitre glemaitre left a comment

I will tag this PR for 1.0.1 to integrate it in the next bug fix release

@@ -146,6 +146,10 @@ Changelog
as `base_estimator` in :class:`calibration.CalibratedClassifierCV`.
:pr:`20087` by :user:`Clément Fauchereau <clement-f>`.

- |Fix| Fixed :class:`calibration.CalibratedClassifierCV` to handle correctly
Member

We would need to move this fix under a 1.0.1 section because it is a bug fix that will be integrated in the next bug-fix release.

Contributor Author

v1.0.1.rst is not created yet.

I should wait for its creation to add:

  • |Fix| Fixed :class:`calibration.CalibratedClassifierCV` to handle correctly
    `sample_weight` when `ensemble=False`.
    :pr:`20638` by :user:`Julien Bohné <JulienB-78>`.

or is there another way?

Member

v1.0.1.rst is not created yet.

It should go in the v1.0.rst file; it will only be a new section. You can have a look at the 0.24.rst file to see the 0.24.2 section.
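For illustration, such a bug-fix section in a whats_new file typically looks like the sketch below (the section label and module header are assumptions based on the conventions of the surrounding file; the entry itself is the one from this PR):

```rst
.. _changes_1_0_1:

Version 1.0.1
=============

Changelog
---------

:mod:`sklearn.calibration`
..........................

- |Fix| Fixed :class:`calibration.CalibratedClassifierCV` to handle correctly
  `sample_weight` when `ensemble=False`.
  :pr:`20638` by :user:`Julien Bohné <JulienB-78>`.
```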

Contributor Author

I have created the 1.0.1 section.

Some builds fail, but I guess they are unrelated. Should I just wait?

Member

Indeed, I relaunched the build. The failure was due to an error while installing the Docker image; I assume it was a random failure on the Azure side.

@glemaitre glemaitre added this to the 1.0.1 milestone Sep 16, 2021
@glemaitre glemaitre self-requested a review September 24, 2021 08:40
Member

@glemaitre glemaitre left a comment

OK, so here are a couple more comments. I also found a quite important underlying bug; I will open a follow-up issue.

X, y = make_blobs((100, 1000), center_box=(-1, 1), random_state=42)

# Compute weights to compensate for the imbalance of the dataset
sample_weight = 9 * (y == 0) + 1
Member

Could you make the same change?


@pytest.mark.parametrize("method", ["sigmoid", "isotonic"])
def test_sample_weight_class_imbalanced_ensemble_equivalent(method):
X, y = make_blobs((100, 1000), center_box=(-1, 1), random_state=42)
Member

Could you add a small docstring mentioning what we try to achieve here?

X, y, sample_weight, stratify=y, random_state=42
)

scaler = StandardScaler()
Member

you can add a comment as before

scaler = StandardScaler()
X_train = scaler.fit_transform(
X_train
) # compute mean, std and transform training data as well
Member

remove the comment

scaler = StandardScaler()
X_train = scaler.fit_transform(
X_train
) # compute mean, std and transform training data as well
Member

remove the comment

Member

Can we now remove this comment as well?

@glemaitre glemaitre changed the title Fix CalibratedClassifierCV to handle correctly sample_weight when ensemble=False FIX CalibratedClassifierCV to handle correctly sample_weight when ensemble=False Sep 24, 2021
JulienB-78 and others added 3 commits September 26, 2021 09:56
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
JulienB-78 and others added 3 commits September 26, 2021 09:57
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@@ -166,6 +166,12 @@ def test_sample_weight(data, method, ensemble):
X_train, y_train, sw_train = X[:n_samples], y[:n_samples], sample_weight[:n_samples]
X_test = X[n_samples:]

scaler = StandardScaler()
Member

Suggested change
scaler = StandardScaler()
# FIXME: ideally we should create a `Pipeline` with the `StandardScaler`
# followed by the `LinearSVC`. However, `Pipeline` does not expose
# `sample_weight` and it will be silently ignored.
scaler = StandardScaler()

X, y = make_blobs((100, 1000), center_box=(-1, 1), random_state=42)

# Compute weights to compensate for the imbalance of the dataset
sample_weight = 9 * (y == 0) + 1
Member

Suggested change
sample_weight = 9 * (y == 0) + 1
weights = np.array([0.9, 0.1])
sample_weight = weights[(y == 1).astype(np.int64)]
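As a standalone illustration of the suggested indexing trick (toy labels are an assumption for the example): `weights` holds one weight per class, and fancy indexing with the 0/1 label array expands it to one weight per sample.

```python
# weights[i] is the weight for class i; indexing with the integer-cast
# boolean mask (y == 1) picks the right per-class weight for every sample.
import numpy as np

y = np.array([0, 1, 1, 0, 1])   # toy labels, for illustration only
weights = np.array([0.9, 0.1])  # weight for class 0, weight for class 1
sample_weight = weights[(y == 1).astype(np.int64)]
print(sample_weight)  # -> [0.9 0.1 0.1 0.9 0.1]
```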


# Compute weights to compensate for the imbalance of the dataset
weights = np.array([0.9, 0.1])
sample_weight = weights[(y == 1).astype(int)]
Member

Suggested change
sample_weight = weights[(y == 1).astype(int)]
sample_weight = weights[(y == 1).astype(np.int64)]


# Compute weights to compensate for the imbalance of the dataset
weights = np.array([0.9, 0.1])
sample_weight = weights[(y == 1).astype(int)]
Member

Suggested change
sample_weight = weights[(y == 1).astype(int)]
sample_weight = weights[(y == 1).astype(np.int64)]

Comment on lines +234 to +238
scaler = StandardScaler()
X_train = scaler.fit_transform(
X_train
) # compute mean, std and transform training data as well
X_test = scaler.transform(X_test)
Member

Suggested change
scaler = StandardScaler()
X_train = scaler.fit_transform(
X_train
) # compute mean, std and transform training data as well
X_test = scaler.transform(X_test)
# FIXME: ideally we should create a `Pipeline` with the `StandardScaler`
# followed by the `LinearSVC`. However, `Pipeline` does not expose
# `sample_weight` and it will be silently ignored.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
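As an aside on the FIXME: Pipeline.fit itself can route fit parameters to a named step via the `<step>__<param>` convention; the problem noted in the review is that a meta-estimator forwarding a plain `sample_weight` to a Pipeline cannot use that routing, so the weights are ignored. A minimal sketch of the explicit routing (step names and dataset are assumptions for the example):

```python
# Pipeline's own fit-parameter routing: `clf__sample_weight` sends the weights
# to the fit of the "clf" step, while the scaler fits unweighted.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
sample_weight = np.ones(len(y))

pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X, y, clf__sample_weight=sample_weight)
pred = pipe.predict(X)
```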

@glemaitre
Member

@JulienB-78 While reviewing your PR, I found some other bugs, and to find them I wrote a couple of tests better suited to testing the sample_weight behaviour. Since it builds on your work, I added your changelog entry and added you as a co-author while committing: #21179

I will close this PR since we need your fix at the same time as the fix in the sigmoid calibration, but you will be acknowledged in the other PR.

We will most probably have to open another PR to check the behaviour with Pipeline.

@glemaitre glemaitre closed this Sep 29, 2021

Successfully merging this pull request may close these issues.

CalibratedClassifierCV does not handle well sample_weight when ensemble=False
2 participants