FIX Draw indices using sample_weight in Bagging #31414
Conversation
Could you please document this known limitation, both in the docstring of the […]? Something like: "Note that the expected frequency semantics for the […]"
Here is a pass of review. Could you please add a non-regression test using a small dataset with specifically engineered weights? For instance, you could use a dataset with 100 data points: 98 with a zero weight, one with a weight of 1, and one with a weight of 2:
import numpy as np

X = np.arange(100).reshape(-1, 1)
y = (X.ravel() < 99).astype(np.int32)  # ravel so that y is 1D
sample_weight = np.zeros(shape=X.shape[0])
sample_weight[0] = 1
sample_weight[-1] = 2
Then you could fit a BaggingRegressor and a BaggingClassifier with a fake test estimator that just records the values passed as X, y, and sample_weight as fitted attributes, so that you can write assertions in the test. Ideally this test should pass both with metadata routing enabled and disabled.
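For illustration, here is a minimal sketch of what such a non-regression test could look like. The FitParamRecorder helper and the assertions are hypothetical, and whether the sub-estimators receive indexed rows or the full data plus per-row counts in sample_weight depends on the implementation, so the sketch checks both cases:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.ensemble import BaggingRegressor


class FitParamRecorder(RegressorMixin, BaseEstimator):
    # Hypothetical test helper: records the arguments passed to fit.
    def fit(self, X, y, sample_weight=None):
        self.fit_X_ = np.asarray(X)
        self.fit_y_ = np.asarray(y)
        self.fit_sample_weight_ = sample_weight
        return self

    def predict(self, X):
        # Constant prediction; only the recorded fit arguments matter here.
        return np.zeros(np.asarray(X).shape[0])


X = np.arange(100).reshape(-1, 1)
y = (X.ravel() < 99).astype(np.int32)
sample_weight = np.zeros(shape=X.shape[0])
sample_weight[0] = 1
sample_weight[-1] = 2

bagging = BaggingRegressor(
    estimator=FitParamRecorder(), n_estimators=10, random_state=0
)
bagging.fit(X, y, sample_weight=sample_weight)

for sub_estimator in bagging.estimators_:
    if sub_estimator.fit_sample_weight_ is not None:
        # Draws forwarded as per-row counts: only rows 0 and 99 may be drawn.
        drawn = np.flatnonzero(sub_estimator.fit_sample_weight_)
    else:
        # Draws materialized by indexing: X values equal the row indices here.
        drawn = np.unique(sub_estimator.fit_X_)
    assert set(drawn) <= {0, 99}
```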
BTW @antoinebaker once this PR has been finalized with tests, it would be great to open a similar PR for random forests. I suppose their incorrect handling of sample weights stems from the same root cause and a similar fix should be applicable.
Thanks, @antoinebaker. Here is another pass of feedback but otherwise LGTM.
Actually no: I tried to run https://github.com/snath-xoc/sample-weight-audit-nondet/blob/main/reports/sklearn_estimators_sample_weight_audit_report.ipynb against this branch and I still get a p-value lower than 1e-33. It's an improvement over the < 1e-54 I measured on main. I confirm the bug is fixed for the regressor, though. So I must be missing something.
Did you specify […]?
I forgot about the […]. EDIT: I confirm this works as expected.
The CI test failure seems unrelated to this PR (forbidden request in […]).
@antoinebaker I pushed 28a2bde to make the sample weight semantics consistent between […]. I had to change the code a bit to raise a ValueError when the total sum of the sample weights is too small. Let's see if the CI is green after this commit and I will do a proper review of the PR.
Assuming CI is green, the diff LGTM besides the following details: […]
Maybe @jeremiedbb and @snath-xoc would like to review this PR.
I like the new semantics and raising an error if sw_sum < 1.
From an initial pass this LGTM, and all tests pass for me as well. Will wait for @jeremiedbb to review, but otherwise looks good to go!
LGTM. Just a nitpick and a remark that we can't really do anything about :)
if sample_weight is not None:
    sample_weight = _check_sample_weight(sample_weight, X, dtype=None)
if sample_weight is not None and not self.bootstrap:
    warn(
        f"When fitting {self.__class__.__name__} with sample_weight "
        f"it is recommended to use bootstrap=True, got {self.bootstrap}."
    )
Suggested change:

if sample_weight is not None:
    sample_weight = _check_sample_weight(sample_weight, X, dtype=None)
    if not self.bootstrap:
        warn(
            f"When fitting {self.__class__.__name__} with sample_weight "
            f"it is recommended to use bootstrap=True, got {self.bootstrap}."
        )
sw_sum = np.sum(sample_weight)
if sw_sum <= 1:
    raise ValueError(
        f"The total sum of sample weights is {sw_sum}, which prevents "
        "resampling with a fractional value for max_samples="
        f"{max_samples}. Either pass max_samples as an integer or "
        "use a larger sample_weight."
    )
max_samples = max(int(max_samples * sw_sum), 1)
Something doesn't feel right about this approach.
If max_samples is a float in [0, 1], I'd interpret it as a fraction of the sum of the weights, and so expect to draw samples whose weights sum on average to max_samples * sw_sum. For instance, if max_samples=0.5, I'd expect to draw samples such that the sum of their weights is on average half the total sum of the weights. This is not the case here since we're turning it into an int that is the number of samples to draw. That's why there's this issue with small weights in particular.

That being said, I don't have any alternative to propose. At least the docstring is clear about how max_samples is related to the actual number of samples drawn. So I guess this is good enough for us.
> This is not the case here since we're turning it into an int that is the number of samples to draw. That's why there's this issue with small weights in particular.

I am not sure I follow. Does your comment specifically refer to the edge case where sw_sum >= 1 but int(max_samples * sw_sum) == 0, in which case the max operator uses 1 instead? I think this is really an edge case that we can live with. We could raise a warning, but the user wouldn't be able to do anything about it. Furthermore, I expect this case to be very rare in practice.

Besides this extreme edge case, I think your expectation that we "draw samples such that the sum of their weights is on average half the total sum of the weights" should be met, no?
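To make the edge case concrete, here is a small numeric sketch of the size computation being discussed (effective_n_draws is an illustrative stand-in for the snippet under review, not scikit-learn API):

```python
import numpy as np


def effective_n_draws(max_samples, sample_weight):
    # Illustrative stand-in for the snippet under review: a fractional
    # max_samples is scaled by the total weight, not by the number of rows.
    sw_sum = np.sum(sample_weight)
    if sw_sum <= 1:
        raise ValueError("The total sum of sample weights is too small.")
    return max(int(max_samples * sw_sum), 1)


# Typical case: 100 rows of unit weight and max_samples=0.5 -> 50 draws.
assert effective_n_draws(0.5, np.ones(100)) == 50

# The edge case discussed above: sw_sum >= 1 but int(max_samples * sw_sum)
# rounds down to 0, so the max() operator bumps the number of draws to 1.
assert effective_n_draws(0.5, np.array([1.5])) == 1  # int(0.75) == 0
```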
Part of #16298 and alternative to #31165.
What does this implement/fix? Explain your changes.
In Bagging estimators, sample_weight is now used to draw the samples and is no longer forwarded to the underlying estimators. Bagging estimators now pass the statistical repeated/weighted equivalence test when bootstrap=True (the default, i.e. draw with replacement).

Compared to #31165, this better decouples two different usages of sample_weight (see the sketch below):
- sample_weight passed to bagging_estimator.fit is used as probabilities to draw the indices/rows;
- sample_weight passed to base_estimator.fit is used to represent the drawn indices (more memory efficient than indexing), which is possible only if base_estimator.fit supports sample_weight (through metadata routing or natively).
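For illustration, a minimal sketch of these two roles (the variable names are mine, not the PR's code; the second step applies only when the base estimator supports sample_weight):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_draws = 5, 8
sample_weight = np.array([0.0, 1.0, 2.0, 0.0, 1.0])

# 1) sample_weight passed to bagging_estimator.fit acts as draw
#    probabilities for the bootstrap indices.
indices = rng.choice(
    n_samples, size=n_draws, replace=True, p=sample_weight / sample_weight.sum()
)

# 2) The drawn indices are represented as per-row counts and passed to
#    base_estimator.fit as sample_weight (more memory efficient than
#    indexing X and y).
counts = np.bincount(indices, minlength=n_samples)
# base_estimator.fit(X, y, sample_weight=counts)
```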
#31165 introduced a new sampling_strategy argument to choose indexing/weighting for row sampling, but it would be better to do this in a dedicated follow-up PR.

cc @ogrisel @GaetandeCast