
ENH add support for sample_weight in KBinsDiscretizer(strategy="quantile") #22048


Conversation

Seladus
Contributor

@Seladus Seladus commented Dec 21, 2021

Reference Issues/PRs

Fixes #20758

What does this implement/fix?

This adds support for sample_weight in KBinsDiscretizer(strategy="quantile"). Concretely, I added a sample_weight parameter to the fit method of KBinsDiscretizer; it controls the weight associated with each sample and is only supported when the strategy is set to "quantile".
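To illustrate the idea (a hypothetical pure-NumPy sketch, not the PR's code — the actual implementation goes through scikit-learn's internal `_weighted_percentile` helper, and the function name and inverted-CDF convention here are assumptions), the quantile strategy's bin edges become weighted percentiles of each feature:

```python
import numpy as np

def weighted_quantile_edges(x, sample_weight, n_bins):
    # Hypothetical sketch of weighted quantile bin edges for one feature,
    # using the inverted weighted empirical CDF: each edge is the smallest
    # sample whose cumulative weight reaches the target quantile.
    order = np.argsort(x)
    x_sorted = np.asarray(x, dtype=float)[order]
    w_sorted = np.asarray(sample_weight, dtype=float)[order]
    cdf = np.cumsum(w_sorted) / w_sorted.sum()
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.searchsorted(cdf, quantiles, side="left")
    return x_sorted[np.minimum(idx, len(x_sorted) - 1)]

x = np.array([0.0, 1.0, 2.0, 3.0])
edges_uniform = weighted_quantile_edges(x, [1, 1, 1, 1], n_bins=2)
# a heavily weighted sample pulls the bin edges toward itself
edges_skewed = weighted_quantile_edges(x, [9, 1, 1, 1], n_bins=2)
```

With uniform weights the middle edge sits at the middle of the data, while concentrating most of the mass on the first sample pulls that edge onto it.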

@glemaitre glemaitre changed the title [MRG] Fix : Support sample_weight in KBinsDiscretizer(strategy="quantile") ENH add support for sample_weight in KBinsDiscretizer(strategy="quantile") Dec 22, 2021
Member

@glemaitre glemaitre left a comment


Please add an entry to the change log at doc/whats_new/v1.1.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

@@ -16,16 +16,21 @@


@pytest.mark.parametrize(
-    "strategy, expected",
+    "strategy, expected, sample_weight",
Member


The test will need to be updated with the change that I asked earlier.

Contributor Author


I updated the test cases that I previously pushed. Unfortunately, I have doubts about their relevance. Would they be acceptable as is, or should I find more specific cases?

Member


I would like to have 2 specific checks that verify the real behaviour of passing weights:

  • Check that passing None and an array of ones are equivalent.
  • Check that passing some zero weights is equivalent to ignoring those samples, and check the resulting bins in these conditions.
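A rough sketch of what those two checks could look like, using a minimal inverted-CDF stand-in for the weighted percentile (the helper `wq` is an assumption for illustration; the real code path is scikit-learn's internal `_weighted_percentile`):

```python
import numpy as np

def wq(a, w, q):
    # hypothetical stand-in for a weighted percentile (inverted-CDF
    # convention): smallest value whose weighted CDF reaches q / 100
    order = np.argsort(a)
    a = np.asarray(a, dtype=float)[order]
    w = np.asarray(w, dtype=float)[order]
    cdf = np.cumsum(w) / w.sum()
    return a[min(np.searchsorted(cdf, q / 100.0, side="left"), len(a) - 1)]

rng = np.random.RandomState(42)
data = rng.randn(30)

# Check 1: an array of ones matches the unweighted (inverted-CDF) median
unit_val = wq(data, np.ones_like(data), 50)
plain_val = np.sort(data)[14]  # inverted-CDF median of 30 points

# Check 2: zero-weight samples are ignored entirely
keep = data > 0
zero_val = wq(data, keep.astype(float), 50)
drop_val = wq(data[keep], np.ones(int(keep.sum())), 50)

assert unit_val == plain_val
assert zero_val == drop_val
```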

Contributor Author


Unfortunately, I fail to understand why the case where sample_weight=None with n_bins=[2, 3, 3, 3] and the case where sample_weight=[1, 1, 1, 1] with n_bins=[2, 3, 3, 3] are not equivalent (in the test method test_fit_transform_n_bins_array).

The behaviors of np.percentile and _weighted_percentile with sample_weight = [1, 1, 1, 1] are supposed to be equivalent, aren't they?

Member

@ogrisel ogrisel Jan 5, 2022


Indeed, I just checked:

>>> import numpy as np
>>> from sklearn.utils.stats import _weighted_percentile
>>> data = np.random.randn(100)
>>> np.percentile(data, [25, 50, 75])
array([-0.76701739,  0.0604021 ,  0.79485777])
>>> [_weighted_percentile(data, np.ones_like(data), p) for p in [25, 50, 75]]
[-0.7855144180141034, 0.05197426163774039, 0.7586492302901591]

It seems that this problem has been known for a while but not yet fixed:

Old attempts to fix this problem or related problems:

So for now, we can just comment out this case with a TODO comment linking to #17370 to explain why the test is commented out.
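For context, the disagreement above comes from the percentile definition rather than the weights: np.percentile interpolates linearly between order statistics by default, while weighted percentiles typically follow the inverted-CDF (step function) definition and always return an actual sample value. A small demonstration (assumes NumPy >= 1.22 for the method keyword):

```python
import numpy as np

rng = np.random.RandomState(0)
data = rng.randn(100)

# default definition: linear interpolation between order statistics
lin = np.percentile(data, 25)
# step-function definition: no interpolation, returns an actual sample
# value, which is the convention weighted percentiles generally follow
inv = np.percentile(data, 25, method="inverted_cdf")

print(lin, inv)  # the two definitions generally disagree
```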

Contributor Author


I commented out the faulty test case in 30edabc (l. 139-151).

Member

@glemaitre glemaitre left a comment


Just noting this PR as reviewed.

Member

@glemaitre glemaitre left a comment


The implementation looks good now. I would like 2 new tests to check specifically the behaviour of using sample_weight:

  • Check that passing None and an array of ones are equivalent.
  • Check that passing some zero weights is equivalent to ignoring those samples, and check the resulting bins in these conditions.

Seladus and others added 2 commits December 23, 2021 16:50
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@Seladus
Contributor Author

Seladus commented Dec 23, 2021

In commit e7e9eee:

I added new test cases following the suggestions in #22048 (review) and made a few corrections based on @glemaitre's suggestions.

Unfortunately, I fail to understand why the case where sample_weight=None with n_bins=[2, 3, 3, 3] and the case where sample_weight=[1, 1, 1, 1] with n_bins=[2, 3, 3, 3] are not equivalent (in the test method test_fit_transform_n_bins_array).

The behaviors of np.percentile and _weighted_percentile with sample_weight = [1, 1, 1, 1] are equivalent, aren't they?

@Seladus Seladus requested a review from glemaitre January 1, 2022 17:35
Seladus and others added 2 commits January 5, 2022 15:04
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

@SangamSwadiK
Contributor

Hi @Seladus, will you be continuing this PR?

@benlawson
Contributor

What else needs to be done for this? It looks like a couple of checks failed (Azure Pipelines), but the logs are gone. Can these be re-run?

I'm very excited for this feature and happy to help if needed.

[screenshot: failed Azure Pipelines checks]

@Seladus
Contributor Author

Seladus commented Aug 12, 2022

@benlawson I gave up on this because I didn't remember what was wrong with my code, but seeing that people are interested in the feature is motivating. I'll try to dive back into it as soon as I can.

@DeaMariaLeon
Contributor

Hello. Are you working on this @Seladus?

@Seladus
Contributor Author

Seladus commented Nov 1, 2022

Hello. Are you working on this @Seladus?

Hello, I am not currently working on this (I can't find time to do it properly).

@DeaMariaLeon
Contributor

DeaMariaLeon commented Nov 2, 2022

/take


Successfully merging this pull request may close these issues.

Support sample_weight in KBinsDiscretizer(strategy="quantile")
7 participants