
FIX Use sample weight to draw samples in Bagging estimators #31165


Open
wants to merge 17 commits into main

Conversation

antoinebaker
Contributor

Part of #16298.

What does this implement/fix? Explain your changes.

In Bagging estimators, sample_weight is now used to draw the samples and is no longer forwarded to the underlying estimators.
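For intuition, here is a minimal sketch of the idea (not the actual BaseBagging code, and the helper name is made up): the weights are turned into sampling probabilities when drawing the bootstrap indices, and the sub-estimator then only sees the indexed rows.

import numpy as np

def draw_bootstrap_indices(n_samples, n_draws, sample_weight=None, seed=0):
    # Toy illustration only: bias the bootstrap draw by sample_weight instead
    # of forwarding the weights to the underlying estimator.
    rng = np.random.default_rng(seed)
    p = None  # uniform draw when no weights are given
    if sample_weight is not None:
        p = np.asarray(sample_weight, dtype=float)
        p = p / p.sum()  # normalize the weights into probabilities
    return rng.choice(n_samples, size=n_draws, replace=True, p=p)

# A sample with weight 10 is drawn ~10x more often than a sample with weight 1;
# a sample with weight 0 is never drawn.
indices = draw_bootstrap_indices(5, n_draws=5, sample_weight=[0, 1, 1, 1, 10])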


github-actions bot commented Apr 9, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: b118ffe.

Comment on lines -601 to -611
def test_bagging_sample_weight_unsupported_but_passed():
    estimator = BaggingClassifier(DummyZeroEstimator())
    rng = check_random_state(0)

    estimator.fit(iris.data, iris.target).predict(iris.data)
    with pytest.raises(ValueError):
        estimator.fit(
            iris.data,
            iris.target,
            sample_weight=rng.randint(10, size=(iris.data.shape[0])),
        )
Contributor Author

I think this test is no longer needed. It is now okay to have a base estimator that does not support sample_weight.

@antoinebaker
Contributor Author

antoinebaker commented Apr 9, 2025

BaggingRegressor(estimator=Ridge()) now passes the statistical test for sample weight equivalence with repeated data.

[figure: sample weight equivalence test result for BaggingRegressor(estimator=Ridge())]

But BaggingClassifier(estimator=LogisticRegression()) doesn't :(

[figure: sample weight equivalence test result for BaggingClassifier(estimator=LogisticRegression())]

Any idea @snath-xoc why it fails on the classifier?

@antoinebaker
Contributor Author

I think the statistical sample weight equivalence is only guaranteed when samples are drawn with replacement (bootstrap=True). Should we maybe warn the user if they use sample_weight with bootstrap=False?

@antoinebaker
Contributor Author

I might need your help @adrinjalali @StefanieSenger for fixing the metadata test failure 🙏

test_error_on_missing_requests_for_sub_estimator[BaggingRegressor]
Failed: DID NOT RAISE <class 'sklearn.exceptions.UnsetMetadataPassedError'>

Now BaggingRegressor.fit is supposed to consume sample_weight for drawing samples and not route/pass it to the underlying estimators.

Is the test adapted to this case? If not, should we write a test for the new intended behavior?

@snath-xoc
Contributor

Hi @antoinebaker, strange that LogisticRegression is giving such an output, is it still the case? If so I can take a quick look.

@antoinebaker
Contributor Author

antoinebaker commented Apr 11, 2025

Hi @antoinebaker, strange that LogisticRegression is giving such an output, is it still the case? If so I can take a quick look.

Ah, good news, I think I found the reason 😃

The culprit is max_samples, which by default is max_samples=1.0 (meaning drawing n_samples samples).
For the repeated/weighted estimators to be equivalent, we need to draw

  1. the same number of samples
  2. with replacement

Condition 2 requires using bootstrap=True.
Condition 1 requires using an integer max_samples: a float max_samples (relative size) leads to a different number of samples between the weighted and repeated datasets.

Now all these estimators pass the statistical test:

# max_samples: int (here max_samples ~ n_samples)
BaggingRegressor(estimator=Ridge(), max_samples=100)
BaggingClassifier(estimator=LogisticRegression(), max_samples=100)
# subsampling (max_samples ~ 0.1 n_samples) and n_estimators=100
BaggingRegressor(estimator=Ridge(), max_samples=10, n_estimators=100)
BaggingClassifier(estimator=LogisticRegression(), max_samples=10, n_estimators=100)

And these fail as they should:

# bootstrap=False (drawing without replacement)
BaggingRegressor(estimator=Ridge(), bootstrap=False, max_samples=10, n_estimators=100)
BaggingClassifier(estimator=LogisticRegression(), bootstrap=False, max_samples=10, n_estimators=100)
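For reference, a minimal sketch of the kind of repeated-vs-weighted comparison behind these checks (toy data and hypothetical settings; not the actual statistical test used in scikit-learn's common checks):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, random_state=0)
rng = np.random.RandomState(0)
sw = rng.randint(0, 3, size=X.shape[0])  # small integer weights

# Weighted fit: sample_weight biases the bootstrap draw.
est_weighted = BaggingRegressor(Ridge(), max_samples=50, random_state=0)
est_weighted.fit(X, y, sample_weight=sw)

# Repeated fit: materialize the weights by repeating the rows.
X_rep, y_rep = np.repeat(X, sw, axis=0), np.repeat(y, sw)
est_repeated = BaggingRegressor(Ridge(), max_samples=50, random_state=0)
est_repeated.fit(X_rep, y_rep)

# The two fits only agree in distribution (over many random_state values), so
# the statistical test compares predictions across seeds rather than exactly.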

@antoinebaker
Contributor Author

To illustrate, BaggingRegressor(estimator=Ridge(), max_samples=100)

[figure: sample weight equivalence test result for BaggingRegressor(estimator=Ridge(), max_samples=100)]

BaggingClassifier(estimator=LogisticRegression(), max_samples=100)

[figure: sample weight equivalence test result for BaggingClassifier(estimator=LogisticRegression(), max_samples=100)]

@antoinebaker
Contributor Author

antoinebaker commented Apr 11, 2025

Apart from the test failures #31165 (comment) that require metadata experts, I think this is ready for review.

@antoinebaker antoinebaker marked this pull request as ready for review April 11, 2025 09:29
antoinebaker and others added 2 commits April 11, 2025 14:55
Co-authored-by: Stefanie Senger <91849487+StefanieSenger@users.noreply.github.com>
Co-authored-by: Stefanie Senger <91849487+StefanieSenger@users.noreply.github.com>
Contributor

@StefanieSenger StefanieSenger left a comment

I might need your help @adrinjalali @StefanieSenger for fixing the metadata test failure 🙏
test_error_on_missing_requests_for_sub_estimator[BaggingRegressor]
Failed: DID NOT RAISE <class 'sklearn.exceptions.UnsetMetadataPassedError'>
Now BaggingRegressor.fit is supposed to consume sample_weight for drawing samples and not route/pass it to the underlying estimators.
Is the test adapted to this case? If not, should we write a test for the new intended behavior?

Thanks for tagging, @antoinebaker. I have done some research and I think I have found the cause in test_metaestimators_metadata_routing.py, but I'd like to double-check with someone whether I got it right and what the solution could be. I will describe what I found (sorry, it is a bit lengthy):

test_error_on_missing_requests_for_sub_estimator fails because sample_weight is no longer added as a key to fit_params in BaseBagging.fit() (i.e. it is no longer automatically routed). As a result, the process_routing() function returns an EmptyRequest() and never gets a chance to raise the UnsetMetadataPassedError that this test waits for. Other metadata routing tests (I looked into test_setting_request_on_sub_estimator_removes_error) take the same shortcut and pass without having run a real test.

With routing enabled (as well as without), sample_weight, when passed, is removed from fit_params and handled inside BaseBagging to draw the samples for fitting the sub-estimators. As a result, a param named sample_weight can no longer be routed, and probably that is just what this PR is supposed to do?

In this case, only the tests would need fixing. Not sure how yet, but I was thinking about not using sample_weight as a param name in the metadata testing file anymore, because the name clashes create bad shortcuts and also make it hard to see if a test had actually tested what it was supposed to test. Does that make sense, @adrinjalali?

An alternative would be to re-add sample_weight into fit_params in BaseBagging._fit():

        if _routing_enabled():
            if sample_weight is not None:
                fit_params["sample_weight"] = sample_weight
            routed_params = process_routing(self, "fit", **fit_params)

The tests then pass, BUT if I understand correctly, sample_weight is then taken into account twice (once while weighting the samples for fitting the sub-estimators, and a second time within the sub-estimators), and we don't want that, right?

(Another little thing, out of scope for this PR: in process_routing() there is a comment before the early return of EmptyRequest() stating that it was meant for cases where routing is not enabled, but now this branch is also executed when routing is enabled (as the test failure shows), so we should fix this comment.)

@adrinjalali
Member

@StefanieSenger's analysis is mostly correct. The test expects all metadata to always be routed if a method is declared as "routing", and with this PR that is no longer true.

I propose the following patch to fix / improve the tests:

diff --git a/sklearn/tests/test_metaestimators_metadata_routing.py b/sklearn/tests/test_metaestimators_metadata_routing.py
index ae2a186a3c..6a84d3b8dd 100644
--- a/sklearn/tests/test_metaestimators_metadata_routing.py
+++ b/sklearn/tests/test_metaestimators_metadata_routing.py
@@ -330,7 +330,7 @@ METAESTIMATORS: list = [
         "y": y,
         "preserves_metadata": False,
         "estimator_routing_methods": [
-            "fit",
+            ("fit", ["metadata"]),
             "predict",
             "predict_proba",
             "predict_log_proba",
@@ -349,7 +349,7 @@ METAESTIMATORS: list = [
         "X": X,
         "y": y,
         "preserves_metadata": False,
-        "estimator_routing_methods": ["fit", "predict"],
+        "estimator_routing_methods": [("fit", ["metadata"]), "predict"],
     },
     {
         "metaestimator": RidgeCV,
@@ -459,7 +459,13 @@ The keys are as follows:
 - X: X-data to fit and predict
 - y: y-data to fit
 - estimator_routing_methods: list of all methods to check for routing metadata
-  to the sub-estimator
+  to the sub-estimator. Each value is either a str or a tuple:
+    - str: the name of the method, all metadata in this method must be routed to the
+      sub-estimator
+    - tuple: the name of the method, the second element is a list of metadata keys
+      to be passed to the sub-estimator. This is useful if certain metadata such as
+      `sample_weight` are never routed and only consumed, such as in `BaggingClassifier`
+      and `BaggingRegressor`.
 - preserves_metadata:
     - True (default): the metaestimator passes the metadata to the
       sub-estimator without modification. We check that the values recorded by
@@ -561,6 +567,22 @@ def get_init_args(metaestimator_info, sub_estimator_consumes):
         (cv, cv_registry),
     )
 
+def process_routing_methods(estimator_routing_methods):
+    """Process estimator_routing_methods and return a dict.
+
+    The dictionary is of the form {"method": ["metadata", ...]}.
+    By default, the list includes `sample_weight` and `metadata`.
+    """
+    res = dict()
+    for method_spec in estimator_routing_methods:
+        if isinstance(method_spec, str):
+            method = method_spec
+            metadata = ["sample_weight", "metadata"]
+        else:
+            method, metadata = method_spec
+        res[method] = metadata
+    return res
+
 
 def set_requests(obj, *, method_mapping, methods, metadata_name, value=True):
     """Call `set_{method}_request` on a list of methods from the sub-estimator.
@@ -662,10 +684,10 @@ def test_error_on_missing_requests_for_sub_estimator(metaestimator):
     metaestimator_class = metaestimator["metaestimator"]
     X = metaestimator["X"]
     y = metaestimator["y"]
-    routing_methods = metaestimator["estimator_routing_methods"]
+    routing_methods = process_routing_methods(metaestimator["estimator_routing_methods"])
 
-    for method_name in routing_methods:
-        for key in ["sample_weight", "metadata"]:
+    for method_name, metadata_keys in routing_methods.items():
+        for key in metadata_keys:
             kwargs, (estimator, _), (scorer, _), *_ = get_init_args(
                 metaestimator, sub_estimator_consumes=True
             )
@@ -721,12 +743,12 @@ def test_setting_request_on_sub_estimator_removes_error(metaestimator):
     metaestimator_class = metaestimator["metaestimator"]
     X = metaestimator["X"]
     y = metaestimator["y"]
-    routing_methods = metaestimator["estimator_routing_methods"]
+    routing_methods =   process_routing_methods(metaestimator["estimator_routing_methods"])
     method_mapping = metaestimator.get("method_mapping", {})
     preserves_metadata = metaestimator.get("preserves_metadata", True)
 
-    for method_name in routing_methods:
-        for key in ["sample_weight", "metadata"]:
+    for method_name, metadata_keys in routing_methods.items():
+        for key in metadata_keys:
             val = {"sample_weight": sample_weight, "metadata": metadata}[key]
             method_kwargs = {key: val}
 
@@ -797,7 +819,7 @@ def test_non_consuming_estimator_works(metaestimator):
     metaestimator_class = metaestimator["metaestimator"]
     X = metaestimator["X"]
     y = metaestimator["y"]
-    routing_methods = metaestimator["estimator_routing_methods"]
+    routing_methods = process_routing_methods(metaestimator["estimator_routing_methods"])
 
     for method_name in routing_methods:
         kwargs, (estimator, _), (_, _), (_, _) = get_init_args(

@ogrisel
Member

ogrisel commented Apr 15, 2025

I think the statistical sample weight equivalence is only guaranteed when samples are drawn with replacement (bootstrap=True). Should we maybe warn the user if they use sample_weight with bootstrap=False?

I think we should at least document this in the docstring entries for the bootstrap parameter in __init__ and in the docstring for sample_weight in the fit method.

We could also raise a warning when bootstrap=False and sample_weight is not None.
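A minimal sketch of such a guard (the helper name and message are placeholders, not the final implementation):

import warnings

def _warn_bootstrap_sample_weight(bootstrap, sample_weight):
    # Hypothetical helper to be called from BaseBagging fit: the equivalence
    # with repeated samples only holds when drawing with replacement.
    if sample_weight is not None and not bootstrap:
        warnings.warn(
            "sample_weight is used to draw the samples, but the equivalence "
            "with repeating samples only holds with bootstrap=True.",
            UserWarning,
        )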

@ogrisel
Member

ogrisel commented Apr 15, 2025

I wonder if we shouldn't keep the code of main as an option, since it is typically 2x more memory efficient than the new code.

Maybe we could introduce a new hyper-parameter resampling that takes values in {"concrete", "using_sample_weight", "auto"}.

  • "concrete" would use the new code of this PR;
  • "using_sample_weight" would use the old code (bincount of the sampled indices);
  • "auto" (the default) would:
    • branch to "concrete" if the Bagging model is itself provided non-None sample_weight or if the base estimator does not accept sample_weight.
    • branch to "using_sample_weight" otherwise, for the sake of keeping memory efficient code by default.

Unfortunately, it would be expensive to test that BaggingClassifier(resampling="using_sample_weight").fit(X, y, sample_weight=None) and BaggingClassifier(resampling="concrete").fit(X, y, sample_weight=None) are identically distributed.

But we could at least check that they qualitatively match as part of the review process of this PR, using the following plot:

import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import ShuffleSplit, ValidationCurveDisplay

# X, y: the dataset under test
fig, ax = plt.subplots()
shared = dict(
    param_name="n_estimators",
    param_range=[1, 5, 10, 50],
    scoring="neg_log_loss",
    cv=ShuffleSplit(n_splits=10),
    fit_params={"sample_weight": None},  # to be explicit that we would not expect the curves to match otherwise
    ax=ax,
)
ValidationCurveDisplay.from_estimator(
    BaggingClassifier(resampling="using_sample_weight"), X, y, **shared
)

ValidationCurveDisplay.from_estimator(
    BaggingClassifier(resampling="concrete"), X, y, **shared
)
plt.legend()

and check that the two sets of curves (train and test for each estimator variant) and their error bars overlap. If not, increase n_splits and retry, to check that any discrepancy disappears in the limit of large n_splits values.

Co-authored-by: Stefanie Senger <91849487+StefanieSenger@users.noreply.github.com>
@antoinebaker
Contributor Author

I wonder if we shouldn't keep the code of main as an option, since it is typically 2x more memory efficient than the new code.

@ogrisel Could you elaborate on why the code of main is more memory efficient? Along the same lines, I did not grasp this comment in main:

# Note: Row sampling can be achieved either through setting sample_weight or
# by indexing. The former is more efficient. Therefore, use this method
# if possible, otherwise use indexing.

In my understanding, the current sampling with weights, when max_samples < n_samples, will pass smaller X_, y_ arrays to the sub-estimators and should be more compute efficient. I guess _safe_indexing merely creates views, so it should not impact memory? I feel I am missing something ;)

@antoinebaker
Contributor Author

Thanks @StefanieSenger for the detailed analysis and @adrinjalali for the patch, it did fix the metadata routing test :)

@ogrisel
Member

ogrisel commented Apr 17, 2025

In my understanding, the current sampling with weights, when max_samples < n_samples, will pass smaller X_, y_ arrays to the sub-estimators and should be more compute efficient.

I agree that when max_samples << n_samples, using indexing is more time efficient than using weighting (and the memory overhead is limited if max_samples is small enough).

I guess _safe_indexing merely creates views, so it should not impact memory?

No, it's not possible to create views with fancy (integer) indexing: NumPy is forced to allocate memory to store the result of the indexing operation (contrary to slicing):

>>> import numpy as np
>>> a = np.random.randn(12, 3)
>>> idx = np.asarray([3, 5, 3, 0])
>>> b = a.take(idx, axis=0)  # integer indexing does not create a view
>>> np.may_share_memory(a, b)
False
>>> c = a[3:5]  # slicing creates a view
>>> np.may_share_memory(a, c)
True
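To make the two strategies concrete, here is a toy comparison (illustration only, not the BaseBagging code): main encodes a bootstrap draw as per-sample counts via np.bincount and keeps X untouched, while this PR materializes the drawn rows by integer indexing, which allocates a copy.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
indices = rng.integers(0, X.shape[0], size=X.shape[0])  # one bootstrap draw

# main: no row copy, the draw becomes a sample_weight for the sub-estimator.
curr_sample_weight = np.bincount(indices, minlength=X.shape[0]).astype(float)

# this PR: the drawn rows are materialized, so memory for a full copy is used.
X_resampled = X[indices]
print(np.may_share_memory(X, X_resampled))  # False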

@antoinebaker
Contributor Author

No, it's not possible to create views with fancy (integer) indexing: NumPy is forced to allocate memory to store the result of the indexing operation (contrary to slicing)

Ah, thanks for the explanation!
