
ENH add support for sample_weight in KBinsDiscretizer(strategy="quantile") #22048


Conversation

Seladus
Contributor

@Seladus Seladus commented Dec 21, 2021

Reference Issues/PRs

Fixes #20758

What does this implement/fix?

This adds support for sample_weight in KBinsDiscretizer(strategy="quantile"). Concretely, I added a sample_weight parameter to the fit method of KBinsDiscretizer; it controls the weight associated with each sample and is only supported when the strategy is set to "quantile".
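To illustrate the idea (a hypothetical pure-NumPy sketch, not the PR's code — the actual implementation goes through scikit-learn's internal `_weighted_percentile` helper, and the function name and inverted-CDF convention here are assumptions), the quantile strategy's bin edges become weighted percentiles of each feature:

```python
import numpy as np

def weighted_quantile_edges(x, sample_weight, n_bins):
    # Hypothetical sketch of weighted quantile bin edges for one feature,
    # using the inverted weighted empirical CDF: each edge is the smallest
    # sample whose cumulative weight reaches the target quantile.
    order = np.argsort(x)
    x_sorted = np.asarray(x, dtype=float)[order]
    w_sorted = np.asarray(sample_weight, dtype=float)[order]
    cdf = np.cumsum(w_sorted) / w_sorted.sum()
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.searchsorted(cdf, quantiles, side="left")
    return x_sorted[np.minimum(idx, len(x_sorted) - 1)]

x = np.array([0.0, 1.0, 2.0, 3.0])
edges_uniform = weighted_quantile_edges(x, [1, 1, 1, 1], n_bins=2)
# a heavily weighted sample pulls the bin edges toward itself
edges_skewed = weighted_quantile_edges(x, [9, 1, 1, 1], n_bins=2)
```

With uniform weights the middle edge sits at the middle of the data, while concentrating most of the mass on the first sample pulls that edge onto it.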

@glemaitre glemaitre changed the title [MRG] Fix : Support sample_weight in KBinsDiscretizer(strategy="quantile") ENH add support for sample_weight in KBinsDiscretizer(strategy="quantile") Dec 22, 2021
Member

@glemaitre glemaitre left a comment


Please add an entry to the change log at doc/whats_new/v1.1.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

@@ -16,16 +16,21 @@


@pytest.mark.parametrize(
-    "strategy, expected",
+    "strategy, expected, sample_weight",
Member


The test will need to be updated with the change that I asked earlier.

Contributor Author


I updated the test cases that I previously pushed. Unfortunately, I have doubts about their relevance. Would they be acceptable as is, or should I find more specific cases?

Member


I would like to have 2 specific checks that verify the real behaviour of passing weights:

  • Check that passing None and an array of ones are equivalent.
  • Check that passing some zero weights is equivalent to ignoring those samples, and check the resulting bins in these conditions.
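A rough sketch of what those two checks could look like, using a minimal inverted-CDF stand-in for the weighted percentile (the helper `wq` is an assumption for illustration; the real code path is scikit-learn's internal `_weighted_percentile`):

```python
import numpy as np

def wq(a, w, q):
    # hypothetical stand-in for a weighted percentile (inverted-CDF
    # convention): smallest value whose weighted CDF reaches q / 100
    order = np.argsort(a)
    a = np.asarray(a, dtype=float)[order]
    w = np.asarray(w, dtype=float)[order]
    cdf = np.cumsum(w) / w.sum()
    return a[min(np.searchsorted(cdf, q / 100.0, side="left"), len(a) - 1)]

rng = np.random.RandomState(42)
data = rng.randn(30)

# Check 1: an array of ones matches the unweighted (inverted-CDF) median
unit_val = wq(data, np.ones_like(data), 50)
plain_val = np.sort(data)[14]  # inverted-CDF median of 30 points

# Check 2: zero-weight samples are ignored entirely
keep = data > 0
zero_val = wq(data, keep.astype(float), 50)
drop_val = wq(data[keep], np.ones(int(keep.sum())), 50)

assert unit_val == plain_val
assert zero_val == drop_val
```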

Contributor Author


Unfortunately, I fail to understand why the case where sample_weight=None with n_bins=[2, 3, 3, 3] and the case where sample_weight=[1, 1, 1, 1] with n_bins=[2, 3, 3, 3] are not equivalent (in the test method test_fit_transform_n_bins_array).

The behaviors of np.percentile and _weighted_percentile with sample_weight = [1, 1, 1, 1] are supposed to be equivalent, aren't they?

Member

@ogrisel ogrisel Jan 5, 2022


Indeed, I just checked:

>>> import numpy as np
>>> from sklearn.utils.stats import _weighted_percentile
>>> data = np.random.randn(100)
>>> np.percentile(data, [25, 50, 75])
array([-0.76701739,  0.0604021 ,  0.79485777])
>>> [_weighted_percentile(data, np.ones_like(data), p) for p in [25, 50, 75]]
[-0.7855144180141034, 0.05197426163774039, 0.7586492302901591]

It seems that this problem has been known for a while but not yet fixed:

Old attempts to fix this problem or related problems:

So for now, we can just comment out this case with a TODO comment linking to #17370 to explain why the test is commented out.
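For context, the disagreement above comes from the percentile definition rather than the weights: np.percentile interpolates linearly between order statistics by default, while weighted percentiles typically follow the inverted-CDF (step function) definition and always return an actual sample value. A small demonstration (assumes NumPy >= 1.22 for the method keyword):

```python
import numpy as np

rng = np.random.RandomState(0)
data = rng.randn(100)

# default definition: linear interpolation between order statistics
lin = np.percentile(data, 25)
# step-function definition: no interpolation, returns an actual sample
# value, which is the convention weighted percentiles generally follow
inv = np.percentile(data, 25, method="inverted_cdf")

print(lin, inv)  # the two definitions generally disagree
```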

Contributor Author


I commented out the faulty test case in 30edabc (l. 139-151).

Member

@glemaitre glemaitre left a comment


Just noting this PR as reviewed.

Member

@glemaitre glemaitre left a comment


The implementation looks good now. I would like 2 new tests to check specifically the behaviour of using sample_weight:

  • Check that passing None and an array of ones are equivalent.
  • Check that passing some zero weights is equivalent to ignoring those samples, and check the resulting bins in these conditions.

Seladus and others added 2 commits December 23, 2021 16:50
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@Seladus
Contributor Author

Seladus commented Dec 23, 2021

In commit e7e9eee:

I added new test cases following the suggestions in #22048 (review) and made a few corrections based on @glemaitre's suggestions.

Unfortunately, I fail to understand why the case where sample_weight=None with n_bins=[2, 3, 3, 3] and the case where sample_weight=[1, 1, 1, 1] with n_bins=[2, 3, 3, 3] are not equivalent (in the test method test_fit_transform_n_bins_array).

The behaviors of np.percentile and _weighted_percentile with sample_weight = [1, 1, 1, 1] are equivalent, aren't they?

@Seladus Seladus requested a review from glemaitre January 1, 2022 17:35
Seladus and others added 2 commits January 5, 2022 15:04
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>

@SangamSwadiK
Contributor

Hi @Seladus, will you be continuing this PR?

@benlawson
Contributor

What else needs to be done for this? It looks like a couple of checks failed (Azure Pipelines), but the logs are gone. Can these be re-run?

I'm very excited for this feature and happy to help if needed.

[screenshot: failed Azure Pipelines checks]

@Seladus
Contributor Author

Seladus commented Aug 12, 2022

@benlawson I gave up on this because I didn't remember what was wrong with my code, but seeing that people are interested in the feature is motivating. I'll try to dive back into it as soon as I can.

@DeaMariaLeon
Contributor

Hello. Are you working on this @Seladus?

@Seladus
Contributor Author

Seladus commented Nov 1, 2022

Hello. Are you working on this @Seladus?

Hello, I am not currently working on this (I can't find time to do it properly).

@DeaMariaLeon
Contributor

DeaMariaLeon commented Nov 2, 2022

/take


Successfully merging this pull request may close these issues.

Support sample_weight in KBinsDiscretizer(strategy="quantile")
7 participants