ENH add support for sample_weight in KBinsDiscretizer(strategy="quantile") #22048
Conversation
Please add an entry to the change log at doc/whats_new/v1.1.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
```diff
@@ -16,16 +16,21 @@
 @pytest.mark.parametrize(
-    "strategy, expected",
+    "strategy, expected, sample_weight",
```
The test will need to be updated with the change that I asked for earlier.
I updated the test cases that I previously pushed. Unfortunately, I'm doubtful of their relevance. Would they be acceptable as is, or should I find more specific cases?
I would like to have 2 specific checks that verify the real behaviour of passing weights:
- Check that passing None and an array of ones are equivalent.
- Check that passing a zero weight for a sample is equivalent to ignoring that sample, and check the resulting bins under these conditions.
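The two requested checks can be sketched against a toy inverted-CDF weighted percentile. This is only an illustration of the test semantics; `toy_weighted_percentile` is a hypothetical helper written for this sketch, not scikit-learn's internal `_weighted_percentile`:

```python
import numpy as np


def toy_weighted_percentile(data, q, weights=None):
    """Smallest value whose cumulative weight reaches q percent of the total.

    A toy inverted-CDF weighted percentile, only for illustrating the checks.
    """
    data = np.asarray(data, dtype=float)
    weights = np.ones_like(data) if weights is None else np.asarray(weights, dtype=float)
    order = np.argsort(data)
    data, weights = data[order], weights[order]
    cdf = np.cumsum(weights) / weights.sum()
    return data[np.searchsorted(cdf, q / 100.0)]


data = [1.0, 2.0, 3.0, 4.0]

# Check 1: weights=None and an array of ones give the same result.
assert toy_weighted_percentile(data, 50) == toy_weighted_percentile(data, 50, [1, 1, 1, 1])

# Check 2: a zero weight behaves as if the sample were absent entirely,
# so the outlier 100.0 does not shift the median.
assert toy_weighted_percentile(
    [1.0, 2.0, 3.0, 4.0, 100.0], 50, [1, 1, 1, 1, 0]
) == toy_weighted_percentile(data, 50)
```

Both properties hold by construction here; the point of the real tests is to pin down the same contract for `KBinsDiscretizer.fit`.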
Unfortunately, I fail to understand why the case where sample_weight=None with n_bins=[2, 3, 3, 3] and the case where sample_weight=[1, 1, 1, 1] with n_bins=[2, 3, 3, 3] are not equivalent (in the test method test_fit_transform_n_bins_array). The behaviors of np.percentile and _weighted_percentile with sample_weight=[1, 1, 1, 1] are supposed to be equivalent, aren't they?
Indeed, I just checked:

```python
>>> import numpy as np
>>> from sklearn.utils.stats import _weighted_percentile
>>> data = np.random.randn(100)
>>> np.percentile(data, [25, 50, 75])
array([-0.76701739,  0.0604021 ,  0.79485777])
>>> [_weighted_percentile(data, np.ones_like(data), p) for p in [25, 50, 75]]
[-0.7855144180141034, 0.05197426163774039, 0.7586492302901591]
```
It seems that this problem has been known for a while but not yet fixed:
- _weighted_percentile does not lead to the same result than np.median #17370
- Make _weighted_percentile more robust #6189
Old attempts to fix this problem or related problems:
- [WIP] BUG make _weighted_percentile(data, ones, 50) consistent with numpy.median(data) #17377
- ENH improve _weighted_percentile to provide several interpolation #17768
So for now, we can just comment out this case with a TODO comment linking to #17370 to explain why the test is disabled.
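The mismatch above is consistent with the two functions using different quantile definitions: np.percentile interpolates linearly between order statistics, while an inverted-CDF rule returns an actual sample value. A numpy-only sketch of that difference (an illustration of the likely cause, not scikit-learn's exact code):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.sort(rng.normal(size=100))

# Linear interpolation (np.percentile's default): the median of an
# even-sized sample averages the two middle order statistics.
linear_median = np.percentile(data, 50)

# Inverted CDF: the smallest value whose empirical CDF reaches 0.5,
# i.e. an actual element of the sample.
cdf = np.arange(1, data.size + 1) / data.size
inverted_median = data[np.searchsorted(cdf, 0.5)]

# The two definitions generally disagree on even-sized samples,
# which matches the discrepancy shown in the comment above.
assert inverted_median in data
assert linear_median != inverted_median
```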
I commented out the faulty test case in 30edabc (lines 139-151).
Just noting this PR as reviewed.
… use of already existing functions.
The implementation looks good now. I would like 2 new tests to check specifically the behaviour of using sample_weight:
- Check that passing None and an array of ones are equivalent.
- Check that passing a zero weight for a sample is equivalent to ignoring that sample, and check the resulting bins under these conditions.
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
…acing of conditions to avoid repetitions
In commit e7e9eee: I added new test cases according to the suggestions in #22048 (review) and made a few corrections according to the suggestions of @glemaitre.
… statement and adding a test case for the exception raise on this same check
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Hi @Seladus, will you be continuing this PR?
@benlawson I gave up on this because I didn't remember what was wrong with my code, but seeing that some are interested in the feature is motivating. I'll try to dive back into it as soon as I can.
Hello. Are you working on this @Seladus? |
Hello, I am not currently working on this (I can't find the time to do it properly).
/take |
Reference Issues/PRs
Fixes #20758
What does this implement/fix?
Adds support for sample_weight in KBinsDiscretizer(strategy="quantile"). That is to say: I added a sample_weight parameter to the fit method of KBinsDiscretizer. It can be set to control the weight associated with each sample, and is only honoured when the strategy is set to "quantile".
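As intuition for what weighted quantile binning should do (a sketch of the semantics, not this PR's implementation): an integer sample weight w can be read as repeating that sample w times before the quantile bin edges are computed. Note this repetition view is only one valid definition; interpolation choices can make weighted percentile implementations differ at the margins, as discussed in the review above.

```python
import numpy as np

data = np.array([0.0, 1.0, 2.0, 3.0])
weights = np.array([1, 1, 3, 1])  # the third sample counts three times

# For integer weights, weighting is equivalent to repetition:
expanded = np.repeat(data, weights)           # [0., 1., 2., 2., 2., 3.]
edges = np.percentile(expanded, [0, 50, 100])  # 2-bin edges from quantiles

# The heavily weighted value 2.0 pulls the median edge toward it.
assert np.allclose(edges, [0.0, 2.0, 3.0])
```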