-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
FIX Draw indices using sample_weight in Bagging #31414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
FIX Draw indices using sample_weight in Bagging #31414
Conversation
Could you please document this known limitation, both in the docstring of the Something like: "Note that the expected frequency semantics for the
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a pass of review. Could you please add a non-regression test using a small dataset with specifically engineered weights? For instance, you could have a dataset with 100 datapoints, with 98 data points with a null weight, 1 data point, with a weight of 1 and 1 with a weight of 2:
X = np.arange(100).reshape(-1, 1)
y = (X < 99).astype(np.int32)
sample_weight = np.zeros(shape=X.shape[0])
sample_weight[0] = 1
sample_weight[-1] = 2
Then you could fit a BaggingRegressor
and a BaggleClassifier
with a fake test estimator that just records the values passed as X
, y
and sample_weight
as fitted attribute to be able to write assertions in the test.
Ideally this test should pass both with metadata routing enabled and disabled.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
BTW @antoinebaker once this PR has been finalized with tests, it would be great to open a similar PR for random forests. I suppose their bad handling of sample weights stems from the same root cause and a similar fix should be applicable. |
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @antoinebaker. Here is another pass of feedback but otherwise LGTM.
Actually no: I tried to run https://github.com/snath-xoc/sample-weight-audit-nondet/blob/main/reports/sklearn_estimators_sample_weight_audit_report.ipynb against this branch and I still get a p-value lower than 1e-33 for this branch. It's an improvement over the < 1e-54 I measured on I confirm the bug is fixed for the regressor though. So I must be missing something. |
Did you specify |
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
I forgot about the EDIT: I confirm this works as expected. |
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
The CI test failure seems unrelated to this PR (forbidden request in |
@antoinebaker I pushed 28a2bde to make the sample weight semantics consistent between I had to change the code a bit to raise Let's see if the CI is green after this commit and I will do a proper review of the PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming CI is green, the diff LGTM besides the following details:
Maybe @jeremiedbb and @snath-xoc would like to review this PR. |
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
I like the new semantic and raising an error if sw_sum < 1. |
Part of #16298 and alternative to #31165.
What does this implement/fix? Explain your changes.
In Bagging estimators,
sample_weight
is now used to draw the samples and no longer forwarded to the underlying estimators. Bagging estimators now pass the statistical repeated/weighted equivalence test whenbootstrap=True
(the default, ie draw with replacement).Compared to #31165, it better decouples two different usages of
sample_weight
:sample_weight
inbagging_estimator.fit
are used as probabilities to draw the indices/rowssample_weight
inbase_estimator.fit
are used to represent the indices (more memory efficient than indexing), this is possible only ifbase_estimator.fit
supportssample_weight
(through metadata routing or natively).#31165 introduced a new
sampling_strategy
argument to choose indexing/weighting for row sampling, but it would be better to do this in a dedicated follow up PR.cc @ogrisel @GaetandeCast