FIX Draw indices using sample_weight in Bagging #31414


Open · wants to merge 15 commits into base: main

Conversation


@antoinebaker antoinebaker commented May 22, 2025

Part of #16298 and alternative to #31165.

What does this implement/fix? Explain your changes.

In Bagging estimators, sample_weight is now used to draw the sample indices and is no longer forwarded to the underlying estimators. Bagging estimators now pass the statistical repeated/weighted equivalence test when bootstrap=True (the default, i.e. drawing with replacement).
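For context, the equivalence being tested has a deterministic analogue: for estimators that handle sample_weight exactly (Ridge is one such case), fitting with integer weights must match fitting on a dataset whose rows are repeated accordingly. A minimal illustration, not taken from the PR itself:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Weighted fit vs. fit on a dataset with rows repeated according to weights.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 4.0])
w = np.array([1, 2, 3])

X_rep = np.repeat(X, w, axis=0)
y_rep = np.repeat(y, w)

coef_weighted = Ridge(alpha=1.0).fit(X, y, sample_weight=w).coef_
coef_repeated = Ridge(alpha=1.0).fit(X_rep, y_rep).coef_
assert np.allclose(coef_weighted, coef_repeated)
```

The statistical version of the test checks the same property for randomized estimators, by comparing distributions of fitted coefficients over many seeds.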

Compared to #31165, this PR better decouples the two different uses of sample_weight:

  • sample_weight passed to bagging_estimator.fit is used as probabilities when drawing the indices/rows
  • sample_weight passed to base_estimator.fit is used to represent the drawn indices (more memory efficient than indexing); this is possible only if base_estimator.fit supports sample_weight (through metadata routing or natively)
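A minimal sketch of the two-step scheme described above (a hypothetical helper, not the PR's actual code): draw indices with probability proportional to sample_weight, then re-encode the draw as per-row counts that a base estimator can consume as sample_weight.

```python
import numpy as np

def draw_bootstrap_counts(sample_weight, n_samples_draw, rng):
    """Draw bootstrap indices with probability proportional to sample_weight,
    then re-encode the draw as per-row repetition counts (usable as
    sample_weight by the base estimator, instead of indexing the rows)."""
    p = sample_weight / sample_weight.sum()
    indices = rng.choice(
        len(sample_weight), size=n_samples_draw, replace=True, p=p
    )
    # bincount turns the drawn indices into per-row repetition counts.
    return np.bincount(indices, minlength=len(sample_weight))

rng = np.random.default_rng(0)
sw = np.array([0.0, 1.0, 2.0, 1.0])
counts = draw_bootstrap_counts(sw, n_samples_draw=100, rng=rng)
assert counts[0] == 0        # zero-weight rows are never drawn
assert counts.sum() == 100   # exactly n_samples_draw rows were drawn
```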

#31165 introduced a new sampling_strategy argument to choose between indexing and weighting for row sampling, but it would be better to do this in a dedicated follow-up PR.

cc @ogrisel @GaetandeCast


github-actions bot commented May 22, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: f449770.


antoinebaker commented May 22, 2025

BaggingRegressor(estimator=Ridge(), max_samples=100) now passes the statistical repeated/weighted equivalence test

[figure: statistical equivalence test results]

The same holds for BaggingClassifier(estimator=LogisticRegression(), max_samples=100) and for varying values of max_samples.

@antoinebaker

However it fails (as expected) for bootstrap=False (drawing without replacement), for example for BaggingRegressor(estimator=Ridge(), bootstrap=False, max_samples=10):

[figure: statistical equivalence test results]


ogrisel commented May 23, 2025

However it fails (as expected) for bootstrap=False (draw without replacement).

Could you please document this known limitation, both in the docstring of the __init__ method for the bootstrap parameter and in the docstring of the fit method for the sample_weight parameter?

Something like: "Note that the expected frequency semantics of the sample_weight parameter are only fulfilled when sampling with replacement (bootstrap=True)."

Maybe we should raise a warning when calling BaggingClassifier(bootstrap=False, max_samples=0.5).fit(X, y, sample_weight=sample_weight) when sample_weight is not None. The warning is already implemented and tested: https://github.com/scikit-learn/scikit-learn/pull/31414/files#diff-b7c01e77fe68ded1e41868f4a7e142190f935261624d4abdb299913ef944cbbbR676-R682.
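The warning discussed above could look roughly like the following (a hypothetical helper, with invented names and wording; the actual message and its placement in scikit-learn may differ):

```python
import warnings

def check_bootstrap_sample_weight(bootstrap, sample_weight):
    # Hypothetical helper: warn when weights are combined with sampling
    # without replacement, where the frequency semantics are not fulfilled.
    if not bootstrap and sample_weight is not None:
        warnings.warn(
            "sample_weight is only fully taken into account when sampling "
            "with replacement (bootstrap=True); with bootstrap=False the "
            "expected frequency semantics are not fulfilled.",
            UserWarning,
        )

check_bootstrap_sample_weight(bootstrap=False, sample_weight=[1.0, 2.0])  # warns
```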


@ogrisel ogrisel left a comment


Here is a pass of review. Could you please add a non-regression test using a small dataset with specifically engineered weights? For instance, you could use a dataset with 100 data points: 98 with zero weight, one with a weight of 1, and one with a weight of 2:

import numpy as np

X = np.arange(100).reshape(-1, 1)
y = (X.ravel() < 99).astype(np.int32)  # ravel so that y is 1D
sample_weight = np.zeros(shape=X.shape[0])
sample_weight[0] = 1
sample_weight[-1] = 2

Then you could fit a BaggingRegressor and a BaggingClassifier with a fake test estimator that simply records the values passed as X, y and sample_weight as fitted attributes, so that assertions can be written in the test.

Ideally this test should pass both with metadata routing enabled and disabled.


ogrisel commented May 28, 2025

BTW @antoinebaker once this PR has been finalized with tests, it would be great to open a similar PR for random forests. I suppose their incorrect handling of sample weights stems from the same root cause, and a similar fix should be applicable.

antoinebaker and others added 2 commits June 2, 2025 09:08
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

@ogrisel ogrisel left a comment


Thanks, @antoinebaker. Here is another pass of feedback but otherwise LGTM.


ogrisel commented Jun 3, 2025

but otherwise LGTM.

Actually no: I tried to run https://github.com/snath-xoc/sample-weight-audit-nondet/blob/main/reports/sklearn_estimators_sample_weight_audit_report.ipynb against this branch and I still get a p-value lower than 1e-33 for this branch. It's an improvement over the < 1e-54 I measured on main but still, the bug does not seem fixed for classifiers.

I confirm the bug is fixed for the regressor though. So I must be missing something.

@antoinebaker

I confirm the bug is fixed for the regressor though. So I must be missing something.

Did you specify max_samples as an integer, e.g. max_samples=10? Otherwise you might get a different number of samples in the repeated/weighted datasets: #31165 (comment)
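To illustrate the point (a small arithmetic sketch, not code from the PR): with a fractional max_samples, the number of rows drawn is proportional to the number of rows in the dataset, which differs between a dataset with repeated rows and its weighted equivalent.

```python
import numpy as np

# A dataset with repeated rows vs. its weighted equivalent.
X_rep = np.arange(5).repeat([1, 2, 1, 3, 1])  # 8 rows after repetition
X_wgt = np.arange(5)                          # 5 rows, weights [1, 2, 1, 3, 1]

max_samples = 0.5
print(int(max_samples * len(X_rep)))  # 4 rows drawn from the repeated data
print(int(max_samples * len(X_wgt)))  # 2 rows drawn from the weighted data
```

An integer max_samples sidesteps the mismatch, since both fits then draw the same number of rows.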


ogrisel commented Jun 3, 2025

I forgot about the max_samples thing. Let me try again.

EDIT: I confirm this works as expected.

antoinebaker and others added 2 commits June 3, 2025 17:58
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>


ogrisel commented Jun 10, 2025

@antoinebaker I pushed 28a2bde to make the sample weight semantics consistent between max_samples passed as an absolute or a relative value. I re-ran the statistical tests, and they now always pass, whatever the value of max_samples.

I had to change the code a bit to raise ValueError with explicit messages for degenerate cases, and updated the tests accordingly. I think I prefer this behavior.

Let's see if the CI is green after this commit and I will do a proper review of the PR.

@ogrisel ogrisel moved this to In Progress in Losses and solvers Jun 10, 2025

@ogrisel ogrisel left a comment


Assuming CI is green, the diff LGTM besides the following details:


ogrisel commented Jun 10, 2025

Maybe @jeremiedbb and @snath-xoc would like to review this PR.

@antoinebaker

@antoinebaker I pushed 28a2bde to make the sample weight semantics consistent between max_samples passed as an absolute or a relative value. I re-ran the statistical tests, and they now always pass, whatever the value of max_samples.

I like the new semantics and raising an error if sw_sum < 1.
