FIX Draw indices using sample_weight in Bagging #31414
Merged: ogrisel merged 16 commits into scikit-learn:main from antoinebaker:bagging_sample_weight_in_choice on Jun 16, 2025.
Commits (16):

0092620 draw indices using sample_weight (antoinebaker)
289a3d2 changelog (antoinebaker)
d0f3ddb move consumes sw (antoinebaker)
804d863 test warning (antoinebaker)
f01c278 docstring (antoinebaker)
bc4e499 Apply suggestions from code review (antoinebaker)
cfb7844 Update doc/whats_new/upcoming_changes/sklearn.ensemble/31414.fix.rst (antoinebaker)
318941b add test (antoinebaker)
d33374a Apply suggestions from code review (antoinebaker)
8ee4723 Apply suggestions from code review (antoinebaker)
0acfe42 test shapes (antoinebaker)
f676935 Merge branch 'main' into bagging_sample_weight_in_choice (antoinebaker)
0a2b38d Merge branch 'main' into bagging_sample_weight_in_choice (ogrisel)
28a2bde Make max_samples a fraction of sample_weight.sum() instead of X.shape[0] (ogrisel)
f449770 Apply suggestions from code review (antoinebaker)
0318a51 Update sklearn/ensemble/_bagging.py (antoinebaker)
doc/whats_new/upcoming_changes/sklearn.ensemble/31414.fix.rst (7 additions, 0 deletions)
@@ -0,0 +1,7 @@
- :class:`ensemble.BaggingClassifier`, :class:`ensemble.BaggingRegressor`
  and :class:`ensemble.IsolationForest` now use `sample_weight` to draw
  the samples instead of forwarding them multiplied by a uniformly sampled
  mask to the underlying estimators. Furthermore, `max_samples` is now
  interpreted as a fraction of `sample_weight.sum()` instead of `X.shape[0]`
  when passed as a float.
  By :user:`Antoine Baker <antoinebaker>`.
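To make the changed semantics concrete, here is a minimal usage sketch (the toy data and parameter values are illustrative; the bootstrap-size arithmetic in the comments follows the changelog entry above):

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
sample_weight = np.array([1.0, 2.0, 1.0, 2.0])  # sums to 6

# With this fix, bootstrap indices are drawn using sample_weight, and
# max_samples=0.5 targets about int(0.5 * sample_weight.sum()) = 3 draws
# per estimator, rather than int(0.5 * X.shape[0]) = 2 uniform draws.
clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=10,
    max_samples=0.5,
    random_state=0,
)
clf.fit(X, y, sample_weight=sample_weight)
```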
Conversations:
Something doesn't feel right about this approach. If `max_samples` is a float in [0, 1], I'd interpret it as a fraction of the sum of the weights, and so expect to draw samples whose weights sum on average to `max_samples * sw_sum`. For instance, if `max_samples=0.5`, I'd expect to draw samples such that the sum of their weights is on average half the total sum of the weights. This is not the case here, since we turn it into an int that is the number of samples to draw. That's why there's this issue with small weights in particular.

That being said, I don't have any alternative to propose. At least the docstring is clear about how `max_samples` relates to the actual number of samples drawn, so I guess this is good enough for us.
I am not sure I follow. Does your comment specifically refer to the edge case where `sw_sum >= 1` but `int(max_samples * sw_sum) == 0`, in which case the `max` operator uses 1 instead? I think this is really an edge case and we can live with it. We could raise a warning, but the user wouldn't be able to do anything about it. Furthermore, I expect this case to be very rare in practice.

Besides this extreme edge case, I think your expectation that we "draw samples such that the sum of their weight is on average half the total sum of the weights" should be met, no?
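A tiny numeric sketch of that edge case (plain numpy; the `max(1, ...)` clamp is as described in the comment above):

```python
import numpy as np

sample_weight = np.array([0.6, 0.7])  # sw_sum = 1.3 >= 1
max_samples = 0.5

sw_sum = sample_weight.sum()
# int(0.5 * 1.3) == 0, so the max operator clamps the bootstrap size to 1.
n_draw = max(1, int(max_samples * sw_sum))
print(n_draw)  # 1
```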
No, it's not about this edge case.

Take for instance `sw = np.array([1.2, 3.4, 4.7, 5.6, 2.2, 2.9])`. We have `sw.sum() = 20`. If I set `max_samples=0.5`, intuitively I'd expect to draw samples such that the sum of their weights is close to 10 on average. But here `max_samples * sw_sum = 10`, so we'll sample 10 points, and on average the sum of their weights is `10 * sw.mean() = 33.33`, more than 3 times my expectation.

Conversely, if the sample weights sum to a value less than `n_samples`, we'll draw points such that the sum of their weights is less than expected. Actually, I think the expected sum of weights is `int(max_samples * sw_sum) * sw_mean`, which only equals `int(max_samples * sw_sum)` if `sw_mean = 1`. To get the expected sum of weights we should then draw `int(max_samples * n_samples)` points, which leads to an average sum of weights of `max_samples * sw_sum`. But this was the previous implementation, and it used to break the equivalence between weighted and repeated samples.
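For reference, a sketch reproducing the back-of-envelope numbers above (the `round()` only guards against floating-point error in this illustration):

```python
import numpy as np

sw = np.array([1.2, 3.4, 4.7, 5.6, 2.2, 2.9])
sw_sum = sw.sum()  # 20.0
max_samples = 0.5

n_draw = int(round(max_samples * sw_sum))  # 10 indices drawn per bootstrap
print(n_draw * sw.mean())  # ~33.33, the comment's estimate of the average
                           # total weight of the drawn samples, vs. 10
```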
I don't think that's what this PR does. What we do is:

1. draw `indices` (with replacement) with `max_samples * sw_sum ~= 10` elements, using `sample_weight` (`sklearn/ensemble/_bagging.py:90`);
2. use `indices_as_sample_weight = np.bincount(indices)` (`sklearn/ensemble/_bagging.py:172`) as the `sample_weight` param of the base estimator, to simulate fitting on this resampling using sample weights. Note that we do not reuse the `sample_weight` values passed by the user a second time for this step. This avoids double accounting.

Personally, I don't think there is a problem in the current state of the PR, and the statistical tests seem to confirm this.
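A self-contained sketch of the two steps described above (using `numpy.random.Generator` in place of the actual scikit-learn internals, so names and details are illustrative; `round()` is only a floating-point guard):

```python
import numpy as np

rng = np.random.default_rng(0)
sw = np.array([1.2, 3.4, 4.7, 5.6, 2.2, 2.9])
max_samples = 0.5
n_draw = int(round(max_samples * sw.sum()))  # ~10 elements

# Step 1: draw indices with replacement, using sample_weight as probabilities.
indices = rng.choice(len(sw), size=n_draw, replace=True, p=sw / sw.sum())

# Step 2: pass the multiplicities, not the original weights (avoiding double
# accounting), as the sample_weight of the base estimator.
indices_as_sample_weight = np.bincount(indices, minlength=len(sw))
# base_estimator.fit(X, y, sample_weight=indices_as_sample_weight)
```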
As I said, I don't have a better alternative to offer, so I'm okay with this.

The issue for me comes from the fact that a parameter is tied to `n_samples` and not to the weight sum. That's why we're able to have the equivalence between weighted and repeated, but it's a lot harder with a rescaling of the weights. Here you'd get an error if you normalize your weights in advance (then `sw_sum = 1`), which feels like non-optimal behavior to me.
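A sketch of that normalization pitfall (plain numpy; whether it surfaces as an error or as a clamp to a single draw is per the discussion above):

```python
import numpy as np

sw = np.array([1.2, 3.4, 4.7, 5.6, 2.2, 2.9])
sw_normalized = sw / sw.sum()  # now sw_sum == 1.0
max_samples = 0.5

n_draw = int(max_samples * sw_normalized.sum())
print(n_draw)  # 0: the requested bootstrap size collapses after normalization
```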
Lack of invariance to weight re-scaling (multiplication by a positive constant) is probably one of the properties that distinguishes the "frequency" weight semantics from other kinds of weight semantics. I am personally still not clear whether this is a bug or a feature. One way to decide would be to review the known downstream uses of scikit-learn's `sample_weight` (for instance: #30564 (comment)) to see if any of them would break because of the lack of invariance to weight rescaling.
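A small sketch of the rescaling sensitivity in question, at the level of the bootstrap-size formula (plain numpy; `round()` is only a floating-point guard):

```python
import numpy as np

sw = np.array([1.2, 3.4, 4.7, 5.6, 2.2, 2.9])
max_samples = 0.5

# Multiplying all weights by a positive constant changes the number of
# indices drawn, so the procedure is not invariant to weight rescaling.
for scale in (0.2, 1.0, 10.0):
    n_draw = max(1, int(round(max_samples * (scale * sw).sum())))
    print(scale, n_draw)  # 0.2 -> 2, 1.0 -> 10, 10.0 -> 100
```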