
FIX Use sample weight to draw samples in Bagging estimators #31165


Open
wants to merge 17 commits into main

Conversation

antoinebaker
Contributor

Part of #16298.

What does this implement/fix? Explain your changes.

In Bagging estimators, sample_weight is now used to draw the samples and is no longer forwarded to the underlying estimators.
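For intuition, here is a minimal sketch of the idea (not the actual BaseBagging code, and the helper name is made up): the weights are turned into sampling probabilities when drawing the bootstrap indices, and the sub-estimator then only sees the indexed rows.

import numpy as np

def draw_bootstrap_indices(n_samples, n_draws, sample_weight=None, seed=0):
    # Toy illustration only: bias the bootstrap draw by sample_weight instead
    # of forwarding the weights to the underlying estimator.
    rng = np.random.default_rng(seed)
    p = None  # uniform draw when no weights are given
    if sample_weight is not None:
        p = np.asarray(sample_weight, dtype=float)
        p = p / p.sum()  # normalize the weights into probabilities
    return rng.choice(n_samples, size=n_draws, replace=True, p=p)

# A sample with weight 10 is drawn ~10x more often than a sample with weight 1;
# a sample with weight 0 is never drawn.
indices = draw_bootstrap_indices(5, n_draws=5, sample_weight=[0, 1, 1, 1, 10])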


github-actions bot commented Apr 9, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: b118ffe.

Comment on lines -601 to -611
def test_bagging_sample_weight_unsupported_but_passed():
    estimator = BaggingClassifier(DummyZeroEstimator())
    rng = check_random_state(0)

    estimator.fit(iris.data, iris.target).predict(iris.data)
    with pytest.raises(ValueError):
        estimator.fit(
            iris.data,
            iris.target,
            sample_weight=rng.randint(10, size=(iris.data.shape[0])),
        )
Contributor Author

I think this test is no longer needed. It is now okay to have a base estimator that does not support sample_weight.

@antoinebaker
Contributor Author

antoinebaker commented Apr 9, 2025

BaggingRegressor(estimator=Ridge()) now passes the statistical test for sample weight equivalence with repeated data.

[figure: sample weight equivalence test result for BaggingRegressor(estimator=Ridge())]

But BaggingClassifier(estimator=LogisticRegression()) doesn't :(

[figure: sample weight equivalence test result for BaggingClassifier(estimator=LogisticRegression())]

Any idea @snath-xoc why it fails on the classifier?

@antoinebaker
Contributor Author

I think the statistical sample weight equivalence is only guaranteed when samples are drawn with replacement (bootstrap=True). Should we maybe warn the user if they use sample_weight with bootstrap=False?

@antoinebaker
Contributor Author

I might need your help @adrinjalali @StefanieSenger for fixing the metadata test failure 🙏

test_error_on_missing_requests_for_sub_estimator[BaggingRegressor]
Failed: DID NOT RAISE <class 'sklearn.exceptions.UnsetMetadataPassedError'>

Now BaggingRegressor.fit is supposed to consume sample_weight for drawing samples and not route/pass it to the underlying estimators.

Is the test adapted to this case? If not, should we write a test for the new intended behavior?

@snath-xoc
Contributor

Hi @antoinebaker, strange that LogisticRegression is giving such an output, is it still the case? If so I can take a quick look.

@antoinebaker
Contributor Author

antoinebaker commented Apr 11, 2025

Hi @antoinebaker, strange that LogisticRegression is giving such an output, is it still the case? If so I can take a quick look.

Ah, good news, I think I found the reason 😃

The culprit is max_samples, which by default is max_samples=1.0 (meaning drawing n_samples samples).
For the repeated/weighted estimators to be equivalent, we need to draw

  1. the same number of samples
  2. with replacement

Condition 2 requires using bootstrap=True.
Condition 1 requires using an integer max_samples: a float max_samples (relative size) leads to a different number of samples between the weighted and repeated datasets.

Now all these estimators pass the statistical test:

# max_samples: int (here max_samples ~ n_samples)
BaggingRegressor(estimator=Ridge(), max_samples=100)
BaggingClassifier(estimator=LogisticRegression(), max_samples=100)
# subsampling (max_samples ~ 0.1 n_samples) and n_estimators=100
BaggingRegressor(estimator=Ridge(), max_samples=10, n_estimators=100)
BaggingClassifier(estimator=LogisticRegression(), max_samples=10, n_estimators=100)

And these fail as they should:

# bootstrap=False (drawing without replacement)
BaggingRegressor(estimator=Ridge(), bootstrap=False, max_samples=10, n_estimators=100)
BaggingClassifier(estimator=LogisticRegression(), bootstrap=False, max_samples=10, n_estimators=100)
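For reference, a minimal sketch of the kind of repeated-vs-weighted comparison behind these checks (toy data and hypothetical settings; not the actual statistical test used in scikit-learn's common checks):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, random_state=0)
rng = np.random.RandomState(0)
sw = rng.randint(0, 3, size=X.shape[0])  # small integer weights

# Weighted fit: sample_weight biases the bootstrap draw.
est_weighted = BaggingRegressor(Ridge(), max_samples=50, random_state=0)
est_weighted.fit(X, y, sample_weight=sw)

# Repeated fit: materialize the weights by repeating the rows.
X_rep, y_rep = np.repeat(X, sw, axis=0), np.repeat(y, sw)
est_repeated = BaggingRegressor(Ridge(), max_samples=50, random_state=0)
est_repeated.fit(X_rep, y_rep)

# The two fits only agree in distribution (over many random_state values), so
# the statistical test compares predictions across seeds rather than exactly.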

@antoinebaker
Contributor Author

To illustrate, BaggingRegressor(estimator=Ridge(), max_samples=100)

[figure: sample weight equivalence test result for BaggingRegressor(estimator=Ridge(), max_samples=100)]

BaggingClassifier(estimator=LogisticRegression(), max_samples=100)

[figure: sample weight equivalence test result for BaggingClassifier(estimator=LogisticRegression(), max_samples=100)]

@antoinebaker
Contributor Author

antoinebaker commented Apr 11, 2025

Apart from the test failures #31165 (comment) that require metadata experts, I think this is ready for review.

@antoinebaker antoinebaker marked this pull request as ready for review April 11, 2025 09:29
antoinebaker and others added 2 commits April 11, 2025 14:55
Co-authored-by: Stefanie Senger <91849487+StefanieSenger@users.noreply.github.com>
Co-authored-by: Stefanie Senger <91849487+StefanieSenger@users.noreply.github.com>
Contributor

@StefanieSenger StefanieSenger left a comment

I might need your help @adrinjalali @StefanieSenger for fixing the metadata test failure 🙏
test_error_on_missing_requests_for_sub_estimator[BaggingRegressor]
Failed: DID NOT RAISE <class 'sklearn.exceptions.UnsetMetadataPassedError'>
Now BaggingRegressor.fit is supposed to consume sample_weight for drawing samples and not route/pass it to the underlying estimators.
Is the test adapted to this case? If not, should we write a test for the new intended behavior?

Thanks for tagging, @antoinebaker. I have done some research and I think I have found the cause in test_metaestimators_metadata_routing.py, but I'd like to double-check with someone whether I got it right and what the solution could be. I will describe what I found (sorry, it is a bit lengthy):

test_error_on_missing_requests_for_sub_estimator fails because sample_weight is no longer added as a key to fit_params in BaseBagging.fit() (i.e. it is no longer automatically routed). As a result, the process_routing() function returns an EmptyRequest() and never gets a chance to raise the UnsetMetadataPassedError that this test waits for. Other metadata routing tests (I looked into test_setting_request_on_sub_estimator_removes_error) take the same shortcut and pass without having run a real test.

With routing enabled (as well as without), sample_weight, when passed, is removed from fit_params and handled inside BaseBagging to draw the samples for fitting the sub-estimators. As a result, a param named sample_weight can no longer be routed, and probably that is just what this PR is supposed to do?

In this case, only the tests would need fixing. Not sure how yet, but I was thinking about not using sample_weight as a param name in the metadata testing file anymore, because the name clashes create bad shortcuts and also make it hard to see if a test had actually tested what it was supposed to test. Does that make sense, @adrinjalali?

An alternative would be to re-add sample_weight into fit_params in BaseBagging._fit():

        if _routing_enabled():
            if sample_weight is not None:
                fit_params["sample_weight"] = sample_weight
            routed_params = process_routing(self, "fit", **fit_params)

The tests then pass, BUT if I understand correctly, sample_weight is then taken into account twice (once while weighting the samples for fitting the sub-estimators, and a second time within the sub-estimators), and we don't want that, right?

(Another little thing, out of scope for this PR: in process_routing() there is a comment before the early return of EmptyRequest() stating that it was meant for cases where routing is not enabled, but now this branch is also executed when routing is enabled (as the test failure shows), so we should fix this comment.)

@adrinjalali
Member

@StefanieSenger's analysis is mostly correct. The test expects all metadata to always be routed if a method is declared as "routing", and with this PR that is no longer true.

I propose the following patch to fix / improve the tests:

diff --git a/sklearn/tests/test_metaestimators_metadata_routing.py b/sklearn/tests/test_metaestimators_metadata_routing.py
index ae2a186a3c..6a84d3b8dd 100644
--- a/sklearn/tests/test_metaestimators_metadata_routing.py
+++ b/sklearn/tests/test_metaestimators_metadata_routing.py
@@ -330,7 +330,7 @@ METAESTIMATORS: list = [
         "y": y,
         "preserves_metadata": False,
         "estimator_routing_methods": [
-            "fit",
+            ("fit", ["metadata"]),
             "predict",
             "predict_proba",
             "predict_log_proba",
@@ -349,7 +349,7 @@ METAESTIMATORS: list = [
         "X": X,
         "y": y,
         "preserves_metadata": False,
-        "estimator_routing_methods": ["fit", "predict"],
+        "estimator_routing_methods": [("fit", ["metadata"]), "predict"],
     },
     {
         "metaestimator": RidgeCV,
@@ -459,7 +459,13 @@ The keys are as follows:
 - X: X-data to fit and predict
 - y: y-data to fit
 - estimator_routing_methods: list of all methods to check for routing metadata
-  to the sub-estimator
+  to the sub-estimator. Each value is either a str or a tuple:
+    - str: the name of the method, all metadata in this method must be routed to the
+      sub-estimator
+    - tuple: the name of the method, the second element is a list of metadata keys
+      to be passed to the sub-estimator. This is useful if certain metadata such as
+      `sample_weight` are never routed and only consumed, such as in `BaggingClassifier`
+      and `BaggingRegressor`.
 - preserves_metadata:
     - True (default): the metaestimator passes the metadata to the
       sub-estimator without modification. We check that the values recorded by
@@ -561,6 +567,22 @@ def get_init_args(metaestimator_info, sub_estimator_consumes):
         (cv, cv_registry),
     )
 
+def process_routing_methods(estimator_routing_methods):
+    """Process estimator_routing_methods and return a dict.
+
+    The dictionary is of the form {"method": ["metadata", ...]}.
+    By default, the list includes `sample_weight` and `metadata`.
+    """
+    res = dict()
+    for method_spec in estimator_routing_methods:
+        if isinstance(method_spec, str):
+            method = method_spec
+            metadata = ["sample_weight", "metadata"]
+        else:
+            method, metadata = method_spec
+        res[method] = metadata
+    return res
+
 
 def set_requests(obj, *, method_mapping, methods, metadata_name, value=True):
     """Call `set_{method}_request` on a list of methods from the sub-estimator.
@@ -662,10 +684,10 @@ def test_error_on_missing_requests_for_sub_estimator(metaestimator):
     metaestimator_class = metaestimator["metaestimator"]
     X = metaestimator["X"]
     y = metaestimator["y"]
-    routing_methods = metaestimator["estimator_routing_methods"]
+    routing_methods = process_routing_methods(metaestimator["estimator_routing_methods"])
 
-    for method_name in routing_methods:
-        for key in ["sample_weight", "metadata"]:
+    for method_name, metadata_keys in routing_methods.items():
+        for key in metadata_keys:
             kwargs, (estimator, _), (scorer, _), *_ = get_init_args(
                 metaestimator, sub_estimator_consumes=True
             )
@@ -721,12 +743,12 @@ def test_setting_request_on_sub_estimator_removes_error(metaestimator):
     metaestimator_class = metaestimator["metaestimator"]
     X = metaestimator["X"]
     y = metaestimator["y"]
-    routing_methods = metaestimator["estimator_routing_methods"]
+    routing_methods =   process_routing_methods(metaestimator["estimator_routing_methods"])
     method_mapping = metaestimator.get("method_mapping", {})
     preserves_metadata = metaestimator.get("preserves_metadata", True)
 
-    for method_name in routing_methods:
-        for key in ["sample_weight", "metadata"]:
+    for method_name, metadata_keys in routing_methods.items():
+        for key in metadata_keys:
             val = {"sample_weight": sample_weight, "metadata": metadata}[key]
             method_kwargs = {key: val}
 
@@ -797,7 +819,7 @@ def test_non_consuming_estimator_works(metaestimator):
     metaestimator_class = metaestimator["metaestimator"]
     X = metaestimator["X"]
     y = metaestimator["y"]
-    routing_methods = metaestimator["estimator_routing_methods"]
+    routing_methods = process_routing_methods(metaestimator["estimator_routing_methods"])
 
     for method_name in routing_methods:
         kwargs, (estimator, _), (_, _), (_, _) = get_init_args(

@ogrisel
Member

ogrisel commented Apr 15, 2025

I think the statistical sample weight equivalence is only guaranteed when samples are drawn with replacement (bootstrap=True). Should we maybe warn the user if they use sample_weight with bootstrap=False?

I think we should at least document this in the docstring entries for the bootstrap parameter in __init__ and in the docstring for sample_weight in the fit method.

We could also raise a warning when bootstrap=False and sample_weight is not None.
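A minimal sketch of such a guard (the helper name and message are placeholders, not the final implementation):

import warnings

def _warn_bootstrap_sample_weight(bootstrap, sample_weight):
    # Hypothetical helper to be called from BaseBagging fit: the equivalence
    # with repeated samples only holds when drawing with replacement.
    if sample_weight is not None and not bootstrap:
        warnings.warn(
            "sample_weight is used to draw the samples, but the equivalence "
            "with repeating samples only holds with bootstrap=True.",
            UserWarning,
        )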

@ogrisel
Member

ogrisel commented Apr 15, 2025

I wonder if we shouldn't keep the code of main as an option, since it is typically 2x more memory efficient than the new code.

Maybe we could introduce a new hyper-parameter resampling that takes values in {"concrete", "using_sample_weight", "auto"}.

  • "concrete" would use the new code of this PR;
  • "using_sample_weight" would use the old code (bincount of the sampled indices);
  • "auto" (the default) would:
    • branch to "concrete" if the Bagging model is itself provided non-None sample_weight or if the base estimator does not accept sample_weight.
    • branch to "using_sample_weight" otherwise, for the sake of keeping memory efficient code by default.

Unfortunately, it would be expensive to test that BaggingClassifier(resampling="using_sample_weight").fit(X, y, sample_weight=None) and BaggingClassifier(resampling="concrete").fit(X, y, sample_weight=None) are identically distributed.

But we could at least check that they qualitatively match as part of the review process of this PR, using the following plot:

import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import ShuffleSplit, ValidationCurveDisplay

# X, y: the dataset under test
fig, ax = plt.subplots()
shared = dict(
    param_name="n_estimators",
    param_range=[1, 5, 10, 50],
    scoring="neg_log_loss",
    cv=ShuffleSplit(n_splits=10),
    fit_params={"sample_weight": None},  # to be explicit that we would not expect the curves to match otherwise
    ax=ax,
)
ValidationCurveDisplay.from_estimator(
    BaggingClassifier(resampling="using_sample_weight"), X, y, **shared
)

ValidationCurveDisplay.from_estimator(
    BaggingClassifier(resampling="concrete"), X, y, **shared
)
plt.legend()

and check that the two sets of curves (train and test for each estimator variant) and their error bars overlap. If not, increase n_splits and retry, to check that any discrepancy disappears in the limit of large n_splits values.

Co-authored-by: Stefanie Senger <91849487+StefanieSenger@users.noreply.github.com>
@antoinebaker
Contributor Author

I wonder if we shouldn't keep the code of main as an option, since it is typically 2x more memory efficient than the new code.

@ogrisel Could you elaborate on why the code of main is more memory efficient? Along the same lines, I did not grasp this comment in main:

# Note: Row sampling can be achieved either through setting sample_weight or
# by indexing. The former is more efficient. Therefore, use this method
# if possible, otherwise use indexing.

In my understanding, the current sampling with weights, when max_samples < n_samples, will pass smaller X_, y_ arrays to the sub-estimators and should be more compute efficient. I guess _safe_indexing merely creates views, so it should not impact memory? I feel I am missing something ;)

@antoinebaker
Contributor Author

Thanks @StefanieSenger for the detailed analysis and @adrinjalali for the patch, it did fix the metadata routing test :)

@ogrisel
Member

ogrisel commented Apr 17, 2025

In my understanding, the current sampling with weights, when max_samples < n_samples, will pass smaller X_, y_ arrays to the sub-estimators and should be more compute efficient.

I agree that when max_samples << n_samples, using indexing is more time efficient than using weighting (and the memory overhead is limited if max_samples is small enough).

I guess _safe_indexing merely creates views, so it should not impact memory?

No, it's not possible to create views with fancy (integer) indexing: NumPy is forced to allocate memory to store the result of the indexing operation (contrary to slicing):

>>> import numpy as np
>>> a = np.random.randn(12, 3)
>>> idx = np.asarray([3, 5, 3, 0])
>>> b = a.take(idx, axis=0)  # integer indexing does not create a view
>>> np.may_share_memory(a, b)
False
>>> c = a[3:5]  # slicing creates a view
>>> np.may_share_memory(a, c)
True
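To make the two strategies concrete, here is a toy comparison (illustration only, not the BaseBagging code): main encodes a bootstrap draw as per-sample counts via np.bincount and keeps X untouched, while this PR materializes the drawn rows by integer indexing, which allocates a copy.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
indices = rng.integers(0, X.shape[0], size=X.shape[0])  # one bootstrap draw

# main: no row copy, the draw becomes a sample_weight for the sub-estimator.
curr_sample_weight = np.bincount(indices, minlength=X.shape[0]).astype(float)

# this PR: the drawn rows are materialized, so memory for a full copy is used.
X_resampled = X[indices]
print(np.may_share_memory(X, X_resampled))  # False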

@antoinebaker
Contributor Author

No, it's not possible to create views with fancy (integer) indexing: NumPy is forced to allocate memory to store the result of the indexing operation (contrary to slicing)

Ah, thanks for the explanation!
