FIX Use sample weight to draw samples in Bagging estimators #31165
Conversation
```python
def test_bagging_sample_weight_unsupported_but_passed():
    estimator = BaggingClassifier(DummyZeroEstimator())
    rng = check_random_state(0)

    estimator.fit(iris.data, iris.target).predict(iris.data)
    with pytest.raises(ValueError):
        estimator.fit(
            iris.data,
            iris.target,
            sample_weight=rng.randint(10, size=(iris.data.shape[0])),
        )
```
This test is no longer needed I think. It is now okay to have a base estimator that does not support `sample_weight`. But any idea @snath-xoc on why it fails on the classifier?
I think the statistical sample weight equivalence is only guaranteed when samples are drawn with replacement (`bootstrap=True`).
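For intuition, the equivalence in question is that fitting with an integer `sample_weight` should behave like fitting on a dataset where each row is repeated that many times; drawing bootstrap indices with probabilities proportional to the weights reproduces this in expectation, but only with replacement. A minimal sketch of the deterministic core of that property (illustrative code, not the actual statistical test):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.integers(1, 4, size=20)  # integer weights

# Fitting with sample_weight ...
weighted = Ridge().fit(X, y, sample_weight=w)
# ... matches fitting on the dataset with each row repeated w[i] times.
repeated = Ridge().fit(np.repeat(X, w, axis=0), np.repeat(y, w, axis=0))

assert np.allclose(weighted.coef_, repeated.coef_)
```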
I might need your help @adrinjalali @StefanieSenger for fixing the metadata test failure 🙏

```
test_error_on_missing_requests_for_sub_estimator[BaggingRegressor]
Failed: DID NOT RAISE <class 'sklearn.exceptions.UnsetMetadataPassedError'>
```

Now `BaggingRegressor.fit` is supposed to consume `sample_weight` for drawing samples and not route/pass `sample_weight` to underlying estimators. Is the test adapted for this case? If not, should we write a test for the new intended behavior?
Hi @antoinebaker, strange that `LogisticRegression` is giving such an output, is it still the case? If so I can take a quick look.
Ah, good news, I think I found the reason 😃 The culprit is … Condition 2 requires using …

Now all these estimators pass the statistical test:

```python
# max_samples: int (here max_samples ~ n_samples)
BaggingRegressor(estimator=Ridge(), max_samples=100)
BaggingClassifier(estimator=LogisticRegression(), max_samples=100)

# subsampling (max_samples ~ 0.1 * n_samples) and n_estimators=100
BaggingRegressor(estimator=Ridge(), max_samples=10, n_estimators=100)
BaggingClassifier(estimator=LogisticRegression(), max_samples=10, n_estimators=100)
```

And these fail as they should:

```python
# bootstrap=False (drawing without replacement)
BaggingRegressor(estimator=Ridge(), bootstrap=False, max_samples=10, n_estimators=100)
BaggingClassifier(estimator=LogisticRegression(), bootstrap=False, max_samples=10, n_estimators=100)
```
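For reference, a sketch of the resampling mechanism being tested (an illustration of the idea, not the actual `_bagging.py` code, and `draw_weighted_indices` is a hypothetical helper name): indices are drawn with probabilities proportional to `sample_weight` instead of the weights being forwarded to the sub-estimators.

```python
import numpy as np

def draw_weighted_indices(sample_weight, n_samples, max_samples, rng):
    # Normalize the weights into a probability distribution and draw
    # max_samples indices with replacement, so heavily weighted rows
    # appear more often in the bootstrap sample.
    p = sample_weight / sample_weight.sum()
    return rng.choice(n_samples, size=max_samples, replace=True, p=p)

rng = np.random.default_rng(0)
w = np.array([1.0, 1.0, 1.0, 7.0])
idx = draw_weighted_indices(w, n_samples=4, max_samples=10, rng=rng)
print(idx)  # index 3 should appear ~70% of the time
```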
Apart from the test failures #31165 (comment) that require metadata experts, I think this is ready for review.
> I might need your help @adrinjalali @StefanieSenger for fixing the metadata test failure 🙏
>
> ```
> test_error_on_missing_requests_for_sub_estimator[BaggingRegressor]
> Failed: DID NOT RAISE <class 'sklearn.exceptions.UnsetMetadataPassedError'>
> ```
>
> Now `BaggingRegressor.fit` is supposed to consume `sample_weight` for drawing samples and not route/pass `sample_weight` to underlying estimators. Is the test adapted for this case? If not, should we write a test for the new intended behavior?
Thanks for tagging, @antoinebaker. I have done some research and I think I have found the cause in `test_metaestimators_metadata_routing.py`, but I'd like to double-check with someone if I got it right and what the solution could be. I will describe what I found (sorry, it is a bit lengthy):

`test_error_on_missing_requests_for_sub_estimator` fails because `sample_weight` is no longer added as a key to `fit_params` in `BaseBagging.fit()` (i.e. it is no longer automatically routed). As a result, the `process_routing()` function returns an `EmptyRequest()` and never gets a chance to raise the `UnsetMetadataPassedError` that we are waiting for in this test. Other metadata routing tests (I looked into `test_setting_request_on_sub_estimator_removes_error`) take the same shortcut and pass without having run a real test.
With routing enabled (as well as without), `sample_weight`, when passed, is removed from `fit_params` and handled inside `BaseBagging` to draw samples for fitting the sub-estimators. As a result, a param named `sample_weight` can no longer be routed, and probably this is just what this PR is supposed to do? In this case, only the tests would need fixing. I am not sure how yet, but I was thinking about no longer using `sample_weight` as a param name in the metadata testing file, because the name clashes create bad shortcuts and also make it hard to see whether a test actually tested what it was supposed to test. Does that make sense, @adrinjalali?
An alternative would be to re-add `sample_weight` into `fit_params` in `BaseBagging._fit()`:

```python
if _routing_enabled():
    if sample_weight is not None:
        fit_params["sample_weight"] = sample_weight
    routed_params = process_routing(self, "fit", **fit_params)
```
The tests then pass, BUT if I understand correctly, `sample_weight` is then taken into account twice (once while weighting the samples for fitting the sub-estimators, and a second time within the sub-estimators), and we don't want that, right?
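A toy demonstration of that double counting (illustrative only, not scikit-learn internals): if a weighted row is both more likely to be drawn and then fitted with its weight by the sub-estimator, its effective weight is roughly squared.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([1.0, 1.0, 2.0])  # the third sample carries weight 2

# Draw a large bootstrap with probabilities proportional to w
# (what Bagging does in this PR) ...
idx = rng.choice(3, size=100_000, replace=True, p=w / w.sum())
# ... and then also apply w inside the sub-estimator (what re-adding
# sample_weight to fit_params would cause):
effective = np.bincount(idx, minlength=3) * w
# ~[1/6, 1/6, 2/3] instead of the intended [1/4, 1/4, 1/2]
print(effective / effective.sum())
```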
(Another little thing, out of scope for this PR: in `process_routing()` there is a comment before the early return of `EmptyRequest()` stating that this was meant for cases when routing is not enabled, but now this branch is also executed when routing is enabled (as the test failure shows), and we should fix this comment.)
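A condensed illustration of the shortcut described above (hypothetical snippet assuming this PR's branch): with routing enabled, the call below raised `UnsetMetadataPassedError` on `main` because `Ridge` never called `set_fit_request(sample_weight=True)`, but with the PR `sample_weight` is consumed by the bagging estimator itself, `process_routing()` receives no metadata, and nothing raises.

```python
import numpy as np
import sklearn
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import Ridge

sklearn.set_config(enable_metadata_routing=True)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 2)), rng.normal(size=50)
w = rng.uniform(1.0, 2.0, size=50)

# On main: UnsetMetadataPassedError. With this PR: fits without error,
# using w only to draw the bootstrap samples.
BaggingRegressor(Ridge()).fit(X, y, sample_weight=w)
```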
@StefanieSenger's analysis is mostly correct. The test expects all metadata to always be routed if a method is declared as "routing", and with this PR that is no longer true. I propose the following patch to fix / improve the tests:

```diff
diff --git a/sklearn/tests/test_metaestimators_metadata_routing.py b/sklearn/tests/test_metaestimators_metadata_routing.py
index ae2a186a3c..6a84d3b8dd 100644
--- a/sklearn/tests/test_metaestimators_metadata_routing.py
+++ b/sklearn/tests/test_metaestimators_metadata_routing.py
@@ -330,7 +330,7 @@ METAESTIMATORS: list = [
         "y": y,
         "preserves_metadata": False,
         "estimator_routing_methods": [
-            "fit",
+            ("fit", ["metadata"]),
             "predict",
             "predict_proba",
             "predict_log_proba",
@@ -349,7 +349,7 @@ METAESTIMATORS: list = [
         "X": X,
         "y": y,
         "preserves_metadata": False,
-        "estimator_routing_methods": ["fit", "predict"],
+        "estimator_routing_methods": [("fit", ["metadata"]), "predict"],
     },
     {
         "metaestimator": RidgeCV,
@@ -459,7 +459,13 @@ The keys are as follows:
 - X: X-data to fit and predict
 - y: y-data to fit
 - estimator_routing_methods: list of all methods to check for routing metadata
-  to the sub-estimator
+  to the sub-estimator. Each value is either a str or a tuple:
+  - str: the name of the method, all metadata in this method must be routed to the
+    sub-estimator
+  - tuple: the name of the method, the second element is a list of metadata keys
+    to be passed to the sub-estimator. This is useful if certain metadata such as
+    `sample_weight` are never routed and only consumed, such as in `BaggingClassifier`
+    and `BaggingRegressor`.
 - preserves_metadata:
   - True (default): the metaestimator passes the metadata to the
     sub-estimator without modification. We check that the values recorded by
@@ -561,6 +567,22 @@ def get_init_args(metaestimator_info, sub_estimator_consumes):
         (cv, cv_registry),
     )
 
+def process_routing_methods(estimator_routing_methods):
+    """Process estimator_routing_methods and return a dict.
+
+    The dictionary is of the form {"method": ["metadata", ...]}.
+    By default, the list includes `sample_weight` and `metadata`.
+    """
+    res = dict()
+    for method_spec in estimator_routing_methods:
+        if isinstance(method_spec, str):
+            method = method_spec
+            metadata = ["sample_weight", "metadata"]
+        else:
+            method, metadata = method_spec
+        res[method] = metadata
+    return res
+
 
 def set_requests(obj, *, method_mapping, methods, metadata_name, value=True):
     """Call `set_{method}_request` on a list of methods from the sub-estimator.
@@ -662,10 +684,10 @@ def test_error_on_missing_requests_for_sub_estimator(metaestimator):
     metaestimator_class = metaestimator["metaestimator"]
     X = metaestimator["X"]
     y = metaestimator["y"]
-    routing_methods = metaestimator["estimator_routing_methods"]
+    routing_methods = process_routing_methods(metaestimator["estimator_routing_methods"])
 
-    for method_name in routing_methods:
-        for key in ["sample_weight", "metadata"]:
+    for method_name, metadata_keys in routing_methods.items():
+        for key in metadata_keys:
             kwargs, (estimator, _), (scorer, _), *_ = get_init_args(
                 metaestimator, sub_estimator_consumes=True
             )
@@ -721,12 +743,12 @@ def test_setting_request_on_sub_estimator_removes_error(metaestimator):
     metaestimator_class = metaestimator["metaestimator"]
     X = metaestimator["X"]
     y = metaestimator["y"]
-    routing_methods = metaestimator["estimator_routing_methods"]
+    routing_methods = process_routing_methods(metaestimator["estimator_routing_methods"])
     method_mapping = metaestimator.get("method_mapping", {})
     preserves_metadata = metaestimator.get("preserves_metadata", True)
 
-    for method_name in routing_methods:
-        for key in ["sample_weight", "metadata"]:
+    for method_name, metadata_keys in routing_methods.items():
+        for key in metadata_keys:
             val = {"sample_weight": sample_weight, "metadata": metadata}[key]
             method_kwargs = {key: val}
 
@@ -797,7 +819,7 @@ def test_non_consuming_estimator_works(metaestimator):
     metaestimator_class = metaestimator["metaestimator"]
     X = metaestimator["X"]
     y = metaestimator["y"]
-    routing_methods = metaestimator["estimator_routing_methods"]
+    routing_methods = process_routing_methods(metaestimator["estimator_routing_methods"])
 
     for method_name in routing_methods:
         kwargs, (estimator, _), (_, _), (_, _) = get_init_args(
```
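For clarity, this is the normalization performed by the `process_routing_methods` helper added in the patch (illustrative call):

```python
routing_methods = process_routing_methods([("fit", ["metadata"]), "predict"])
print(routing_methods)
# {'fit': ['metadata'], 'predict': ['sample_weight', 'metadata']}
```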
I think we should at least document this in the docstring entries for the …

We could also raise a warning when …
I wonder if we shouldn't keep the code of … Maybe we could introduce a new hyper-parameter …

Unfortunately, it would be expensive to test that … But we could at least check that they do qualitatively as part of the review process of this PR, using the following plot:

```python
import matplotlib.pyplot as plt
import sklearn

# X, y assumed already defined; `resampling` is the proposed new
# hyper-parameter discussed above. Note that ValidationCurveDisplay
# lives in sklearn.model_selection.
fig, ax = plt.subplots()
shared = dict(
    param_name="n_estimators",
    param_range=[1, 5, 10, 50],
    scoring="neg_log_loss",
    cv=sklearn.model_selection.ShuffleSplit(n_splits=10),
    # to be explicit that we would not expect the curves to match otherwise
    fit_params={"sample_weight": None},
    ax=ax,
)
sklearn.model_selection.ValidationCurveDisplay.from_estimator(
    sklearn.ensemble.BaggingClassifier(resampling="using_sample_weight"), X, y, **shared
)
sklearn.model_selection.ValidationCurveDisplay.from_estimator(
    sklearn.ensemble.BaggingClassifier(resampling="concrete"), X, y, **shared
)
plt.legend()
```

and check that the 2 sets of curves (train and test for each estimator variant) and error bars overlap. If not, increase …
@ogrisel Could you elaborate why the code of `sklearn/ensemble/_bagging.py`, lines 162 to 164 (at 5059058), …

In my understanding, the current sampling with weights, when …
Thanks @StefanieSenger for the detailed analysis and @adrinjalali for the patch, it did fix the metadata routing test :)
I agree that when …
No, it's not possible to create views with fancy (integer) indexing: NumPy is forced to allocate memory to store the result of the indexing operation (contrary to slicing):

```python
>>> import numpy as np
>>> a = np.random.randn(12, 3)
>>> idx = np.asarray([3, 5, 3, 0])
>>> b = a.take(idx, axis=0)  # integer indexing does not create a view
>>> np.may_share_memory(a, b)
False
>>> c = a[3:5]  # slicing creates a view
>>> np.may_share_memory(a, c)
True
```
Ah, thanks for the explanation!
Part of #16298.
What does this implement/fix? Explain your changes.
In `Bagging` estimators, `sample_weight` is now used to draw the samples and is no longer forwarded to the underlying estimators.
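A minimal usage sketch of the new behavior (assuming this PR's branch; `KNeighborsClassifier` is picked here only because its `fit` does not accept `sample_weight`):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
w = rng.integers(1, 10, size=X.shape[0]).astype(float)

# sample_weight now biases the bootstrap draws and is not forwarded to the
# sub-estimators, so a base estimator without sample_weight support works.
clf = BaggingClassifier(KNeighborsClassifier()).fit(X, y, sample_weight=w)
print(clf.score(X, y))
```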