Fix sample weight passing in KBinsDiscretizer #29907

Merged
merged 92 commits into scikit-learn:main on Feb 8, 2025

Conversation

@snath-xoc
Contributor

snath-xoc commented Sep 22, 2024

Fixes #29906 and cross-linked to meta-issue #16298

Sample weight was previously not passed through to resample, which caused an error in KBinsDiscretizer when subsample was specified. Sample weight is now passed through and handled in _indexing/resample so that both the resampled X and the corresponding sample weight values are returned.
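A minimal usage sketch of the fixed behaviour (this is not code from the PR; the dataset and parameter values are illustrative): fitting with sample_weight together with a subsample value smaller than n_samples, which previously failed.

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(0)
X = rng.normal(size=(1_000, 1))
sample_weight = rng.randint(0, 10, size=X.shape[0])

# subsample < n_samples triggers the resampling path that previously failed
kbd = KBinsDiscretizer(n_bins=5, strategy="quantile", subsample=500, random_state=0)
kbd.fit(X, sample_weight=sample_weight)
print(kbd.bin_edges_[0])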

Here is a notebook testing:

  1. When we perform weighted resampling but don't pass weights to the bin estimator
  2. When we perform uniform resampling and pass weights to the bin estimator

To do:

  • Implement the symmetrized version of _weighted_percentile (in KBinsDiscretizer only) and enable it when quantile_method="averaged_inverted_cdf" (a rough sketch of the symmetrization idea follows this list)
  • Modify tests to cover averaged_inverted_cdf; the consequences are far-reaching, with new failures popping up.
  • Fix the failing _averaged_weighted_percentile tests
  • Error with _percentile_dispatcher: I do not get it locally, so I am wondering where it comes from (perhaps a version issue with np.percentile?)
  • Add to changelog: 29907.enhancement.rst under sklearn.utils already exists (due to the resample change); shall I just add averaged_inverted_cdf there? Likewise, the addition of the quantile_method param should be added to 29907.enhancement.rst on top of the existing enhancement of sample_weight support under strategy="uniform"
  • Remove test_KBD_inverse_transform_Xt_deprecation?
    • @ogrisel: let's better do that in a dedicated PR instead.
  • Need review of changelog due to the many changes
  • Add reference to gist notebooks somewhere?
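For reference, a rough sketch of the symmetrization idea (not the PR's implementation; the function name is illustrative, and the weighted quantile support of np.quantile requires NumPy >= 2.0): compute the inverted-CDF weighted percentile of the data and of the negated data at the complementary rank, and average the two.

import numpy as np

def averaged_weighted_percentile(a, weights, percentile_rank):
    # inverted-CDF weighted percentile of the data ...
    lower = np.quantile(
        a, percentile_rank / 100, weights=weights, method="inverted_cdf"
    )
    # ... averaged with its symmetric counterpart computed on the negated data
    upper = -np.quantile(
        -a, 1 - percentile_rank / 100, weights=weights, method="inverted_cdf"
    )
    return (lower + upper) / 2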


github-actions bot commented Sep 22, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 16c2dbd. Link to the linter CI: here

@snath-xoc
Contributor Author

snath-xoc commented Sep 22, 2024

This hopefully helps sample weight handling in the non-deterministic case (i.e., when subsample is specified). Nevertheless, in the deterministic case, when subsample is at the default 200_000 value, the weighted and repeated fits disagree for certain n_bins, e.g. n_bins=4:

from sklearn.datasets import make_blobs
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np

rng = np.random.RandomState(42)

# Five centres
centres = np.array([[0, 0], [0, 5], [3, 1], [2, 4], [8, 8]])
X, _ = make_blobs(
    n_samples=100,
    cluster_std=0.5,
    centers=centres,
    random_state=10,
)

# Create a dataset with repetitions and corresponding sample weights
sample_weight = rng.randint(0, 10, size=X.shape[0])
X_resampled_by_weights = np.repeat(X, sample_weight, axis=0)

# Cannot check for the uniform strategy with sample weights
kbins_weighted = KBinsDiscretizer(
    n_bins=4, strategy="quantile", random_state=10
).fit(X, sample_weight=sample_weight)
kbins_repeated = KBinsDiscretizer(
    n_bins=4, strategy="quantile", random_state=10
).fit(X_resampled_by_weights)

np.testing.assert_allclose(
    np.hstack(kbins_weighted.bin_edges_),
    np.hstack(kbins_repeated.bin_edges_),
)

This fails with the following assertion error:

AssertionError: 
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 1 / 10 (10%)
Max absolute difference among violations: 0.00820405
Max relative difference among violations: 0.03938403
 ACTUAL: array([-0.988864,  0.200105,  2.255357,  3.353415,  9.112525, -0.871686,
        0.764596,  4.028643,  5.232663,  8.390888])
 DESIRED: array([-0.988864,  0.208309,  2.255357,  3.353415,  9.112525, -0.871686,
        0.764596,  4.028643,  5.232663,  8.390888])

This is not the case for n_bins = 5, and the relative difference is small, so it may be a trivial discrepancy.

@snath-xoc
Contributor Author

I performed more tests and added a "use_weights_in_resampling" option so we can easily check both approaches before making a decision. I also realised that the result is sensitive to the subsample size: before, I was using 50 in the tests, but I have now changed it to 200, which led to different results. Attached are some plots summarising the results. The standalone notebook can be found in this gist: https://gist.github.com/snath-xoc/ef5dbe53ca4f02655fe99173e9ab1198

[four plots summarising the results]

@glevv
Contributor

glevv commented Oct 10, 2024

So, basically, for a small number of bins there is no statistical difference between the methods.

But I would say that no statistical test will have enough power to find a statistically significant difference between 10 samples.

@jeremiedbb
Member

@jeremiedbb any idea why the KMeans warning occurs?

With the weighted resampling it may happen that the number of distinct points we get after resampling is smaller than the number of clusters we ask KMeans for. It's more likely, for instance, if some weight is very big compared to the others, such that the corresponding point is picked many times, leading to a lot of duplicates after sampling. I would expect the same behavior if we used repeated points instead of sample weights.
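A minimal sketch (not from the PR; the dataset and values are made up) of how a dominant weight can leave fewer distinct points than requested clusters after weighted resampling, triggering the KMeans warning:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.uniform(size=(20, 1))
sample_weight = np.ones(20)
sample_weight[0] = 1_000  # one point dominates the sampling probabilities

# weighted resampling: indices drawn with probability proportional to the weights
p = sample_weight / sample_weight.sum()
X_sub = X[rng.choice(len(X), size=10, replace=True, p=p)]
print("distinct points in the subsample:", len(np.unique(X_sub)))

# asking KMeans for more clusters than distinct points emits a ConvergenceWarning
KMeans(n_clusters=8, n_init=1, random_state=0).fit(X_sub)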

@snath-xoc
Contributor Author

So, basically, for a small number of bins there is no statistical difference between the methods.

But I would say that no statistical test will have enough power to find a statistically significant difference between 10 samples.

Yes, not so notably, although it is worth mentioning that the t-test is carried out independently for each bin while iterating through 500 seeds. So we calculate the t-test on 500 samples for each bin (the titles on the plots only display the average).

A summary table can be seen in the gist, and it seems like for some bins the t-test shows significant differences when using weighted quantiles/k-means as compared to simply using weighted resampling.
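An illustrative sketch of that procedure (this is not the gist's exact code; the dataset, subsample size and number of seeds are assumptions), comparing bin edges obtained with sample_weight against edges obtained on weight-repeated data, with one t-test per bin edge across the seeds:

import numpy as np
from scipy.stats import ttest_ind
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 2))
sample_weight = rng.randint(0, 10, size=X.shape[0])
X_repeated = np.repeat(X, sample_weight, axis=0)

weighted_edges, repeated_edges = [], []
for seed in range(500):
    params = dict(n_bins=4, strategy="quantile", subsample=200, random_state=seed)
    weighted_edges.append(
        np.hstack(
            KBinsDiscretizer(**params).fit(X, sample_weight=sample_weight).bin_edges_
        )
    )
    repeated_edges.append(
        np.hstack(KBinsDiscretizer(**params).fit(X_repeated).bin_edges_)
    )

weighted_edges = np.asarray(weighted_edges)
repeated_edges = np.asarray(repeated_edges)

# one t-test per bin edge, across the 500 seeds
pvalues = [
    ttest_ind(weighted_edges[:, j], repeated_edges[:, j]).pvalue
    for j in range(weighted_edges.shape[1])
]
print(np.round(pvalues, 3))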

@jeremiedbb
Member

@snath-xoc I played a bit with your notebook and designed a dataset which I suspected would show the discrepancy more reliably. The idea is to have a small number of points with a large weight that matters a lot for the computation of the bin edges.

I made a 1d array with 4 blobs containing 1000 points of weight 1 each, and a fifth blob far from the others containing only 10 points but with a weight of 100 each (so 1000 effective points). Without taking the weights into account, it looks like this:
[image]
and like this when taking the weights into account:
[image]

Then for the statistical test, I only checked the resampling + quantile computations, outside of KBinsDiscretizer, to iterate faster, but I think we could adapt it to the full KBinsDiscretizer now. The results show a significant difference between the 2 options, clearly visible qualitatively in the plots

  • weighted resampling
    [image]
  • weighted quantiles
    [image]

and quantitatively in the results of the statistical tests:
[image]

Looking at the individual p-values per bin edge, we can see that the last edge is the main problematic one, as expected due to the construction of the dataset:
[image]

The code to reproduce this is available here https://gist.github.com/jeremiedbb/be5ee6d91ab0aa0c589a5517aaf74be8
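For reference, a rough sketch of that kind of dataset (the blob locations, spreads and weights below are illustrative assumptions, not the exact values from the gist):

import numpy as np

rng = np.random.RandomState(0)

# four blobs of 1000 points, each point with weight 1
light_blobs = [rng.normal(loc=loc, scale=0.5, size=1000) for loc in (0, 3, 6, 9)]
# a distant blob of only 10 points, each with weight 100 (i.e. 1000 effective points)
heavy_blob = rng.normal(loc=30, scale=0.5, size=10)

X = np.concatenate(light_blobs + [heavy_blob])
sample_weight = np.concatenate([np.ones(4 * 1000), np.full(10, 100.0)])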

Based on these results and on the fact that we haven't found a case where the other option is clearly better, I'm more confident that doing a weighted resampling followed by unweighted quantiles is the correct way to deal with sample weights here.

@ogrisel
Member

ogrisel commented Oct 14, 2024

Based on these results and on the fact that we haven't found a case where the other option is clearly better, I'm more confident that doing a weighted resampling followed by unweighted quantiles is the correct way to deal with sample weights here.

Thank you very much for the follow-up analysis @jeremiedbb. I agree with this conclusion.

@snath-xoc
Contributor Author

Agreed as well, thanks @jeremiedbb. I suppose if this is the case we need to check the impact on HistGradientBoostingRegressor, since it also uses resampling (and if so we may have to adapt the sample weight passing, etc.). What do you guys think?

@jeremiedbb
Member

I suppose if this is the case we need to check the impact on HistGradientBoostingRegressor

Sure, but let's fix one estimator at a time ;)
One thing to consider for future estimators, to make it easier for us to settle on the best solution, is to focus on edge-case datasets, for which the statistical test is more likely to have strong results.

@snath-xoc
Contributor Author

Yes of course... I meant that if we incorporate the changes, i.e. add weighted resampling into the resample function, then this may cause HistGradientBoostingRegressor to break, so I'm not sure if we should do it now or first do further tests?

@glevv
Contributor

glevv commented Oct 14, 2024

A quick search gives QuantileTransformer, HalvingGridSearch, HistGradientBoosting and KBinsDiscretizer among the users of the resample function.

These changes will break backwards compatibility for sure, and I think a bunch of functions in the test suite.

Should this change be discussed among the core developers?

@jeremiedbb
Member

I think we can make the change such that it doesn't impact other estimators for now. We won't pass sample_weight to resample in the other estimators for now, so their behavior shouldn't change.

@ogrisel
Member

ogrisel commented Oct 17, 2024

These changes will break backwards compatibility for sure, and I think a bunch of functions in the test suite.

This is a bugfix. So backward compatibility is not really a concern, I believe. But we do need to document the fix in the changelog.

@ogrisel
Member

ogrisel commented Oct 17, 2024

I meant that if we incorporate the changes, i.e. add weighted resampling into the resample function, then this may cause HistGradientBoostingRegressor to break, so I'm not sure if we should do it now or first do further tests?

The behavior of resample with the default sample_weight=None should not be impacted. So other estimators will not be any more or less broken by the merge of this PR, as long as they are not updated to pass a non-None sample_weight to their internal calls to resample.
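A small sketch of that invariant (illustrative; it relies only on sample_weight defaulting to None in the updated resample):

import numpy as np
from sklearn.utils import resample

X = np.arange(10).reshape(-1, 1)
# passing sample_weight=None explicitly is the same as not passing it at all,
# so existing callers of resample are unaffected
assert np.array_equal(
    resample(X, random_state=0),
    resample(X, random_state=0, sample_weight=None),
)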

But I agree with @jeremiedbb: let's only resolve the problem for KBinsDiscretizer in this PR. Once this is finalized, reviewed and merged, let's generalize the propagation of the sample_weight argument to the other estimators that rely on the resample function, each in its own dedicated PR, to ease the review process.

@ogrisel ogrisel changed the title Fix sample weight passing in KbinsDiscretizer Fix sample weight passing in KBinsDiscretizer Oct 24, 2024
@ogrisel
Member

ogrisel commented Oct 25, 2024

@snath-xoc I did another quick pass of review.

There was an old comment by @jeremiedbb that was not addressed. Please look at the diff view of the PR to make sure you catch all such comments:

https://github.com/scikit-learn/scikit-learn/pull/29907/files

Once all review comments are addressed / answered, please don't forget to press the "Ready for review" button to remove the "Draft" status of this PR, and don't hesitate to ping people who already interacted with this PR (or related PRs/issues) to get them to review it.

@ogrisel
Member

ogrisel commented Oct 25, 2024

We also need to expand the tests for the resample function to test it with weights. You could do a small KS-test on the means of repeated/reweighted data with fixed random seeds and check that the p-value is large enough, with a toy dataset such as:

data = np.asarray([-1, 0, 1])
sample_weight = np.asarray([0, 100, 1])

for seed in range(100):
    mean_repeated = resample(
        data.repeat(sample_weight), replace=True, random_state=seed
    ).mean()
    mean_reweighted = resample(
        data, sample_weight=sample_weight, replace=True, random_state=seed
    ).mean()
    # collect all the means for all seed values and compare the distributions with KS.
    ...

@snath-xoc
Contributor Author

snath-xoc commented Oct 25, 2024

We also need to expand the tests for the resample function to test it with weights. You could do a small KS-test on the means of repeated/reweighted data with fixed random seeds and check that the p-value is large enough, with a toy dataset such as:

data = np.asarray([-1, 0, 1])
sample_weight = np.asarray([0, 100, 1])

for seed in range(100):
    mean_repeated = resample(
        data.repeat(sample_weight), replace=True, random_state=seed
    ).mean()
    mean_reweighted = resample(
        data, sample_weight=sample_weight, replace=True, random_state=seed
    ).mean()
    # collect all the means for each seed and compare the distributions with KS.
    ...

O.K., sounds good, I can add something to the test_indexing.py file.

@ogrisel
Member

ogrisel commented Oct 25, 2024

We also need a changelog entry.

Since this PR is contributing both an enhancement of the resample function (itself part of the public API) and a fix for KBinsDiscretizer, it's going to be challenging to do that in a single PR.

We should probably split it in two: one PR for the enhancement of the resample function, and a subsequently merged PR for the KBinsDiscretizer fix (we can keep this one for that).

Each PR will have its own file in the doc/whats_new/upcoming_changes/ folder, see: https://github.com/scikit-learn/scikit-learn/tree/main/doc/whats_new/upcoming_changes#changelog-instructions

@snath-xoc
Contributor Author

snath-xoc commented Oct 25, 2024

I played around with the tests and have pushed the one that seems to work best. So far the kstest is not passing and we get extremely low p-values (e.g. 1.0245697148897385e-13); reproducer here:

from scipy.stats import kstest, ttest_ind

from sklearn.utils import resample
import numpy as np

data = np.array([-1, 0, 1, 100])
sample_weight = np.asarray([0, 100, 1, 20])

mean_repeated = []
mean_reweighted = []

for seed in range(100):
    mean_repeated.append(
        resample(data.repeat(sample_weight), replace=True, random_state=seed).mean()
    )
    mean_reweighted.append(
        resample(
            data, sample_weight=sample_weight, replace=True, random_state=seed
        ).mean()
    )

mean_repeated = np.asarray(mean_repeated)
mean_reweighted = np.asarray(mean_reweighted)

test_result = kstest(mean_repeated, mean_reweighted)

assert test_result.pvalue > 0.05

The same goes if I use

data = np.array([-1, 0, 1])
sample_weight = np.asarray([0, 100, 1])

The t-test, instead, works overall and gives less unstable p-values in my experimentation, so I am using that for now.
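For reference, a sketch of the t-test variant (it reuses the mean_repeated / mean_reweighted arrays built in the snippet above, and the 0.05 threshold is illustrative):

from scipy.stats import ttest_ind

# a large p-value means the weighted resampling is statistically indistinguishable
# from resampling the weight-repeated data
test_result = ttest_ind(mean_repeated, mean_reweighted)
assert test_result.pvalue > 0.05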

Member

@jeremiedbb jeremiedbb left a comment


Mostly nitpicks and doc comments. Otherwise LGTM.

@jeremiedbb
Member

jeremiedbb commented Feb 6, 2025

@snath-xoc, this PR is almost ready to be merged. There's just this #29907 (comment) that hasn't been addressed. We need an additional changelog entry, which you should put in a file named doc/whats_new/upcoming_changes/changed-models/29907.fix.rst. It should say that sample weight support has been fixed in KBinsDiscretizer, which may change the results even when not using sample weights (without changing the statistical properties).

@snath-xoc
Contributor Author

@snath-xoc, this PR is almost ready to be merged. There's just this #29907 (comment) that hasn't been addressed. We need an additional changelog entry, which you should put in a file named doc/whats_new/upcoming_changes/changed-models/29907.fix.rst. It should say that sample weight support has been fixed in KBinsDiscretizer, which may change the results even when not using sample weights (without changing the statistical properties).

Whoops, sorry, I think the PR was so long that I lost track of some of the comments; thank you for the reminder. I have updated the changelog, please check if it makes sense now.

@ogrisel
Member

ogrisel commented Feb 7, 2025

@snath-xoc there are still open suggestions and unresolved threads above, for instance: #29907 (review)

More generally, it's good to always scan the diff view of the PR: https://github.com/scikit-learn/scikit-learn/pull/29907/files (and press the "Load diff" links in case some files with large diffs are collapsed) to review all the open discussion threads.

If some open comment threads do not have embedded code suggestions but have been addressed manually in another commit, reply in the open thread with the commit hash that addresses them, and then mark the thread as resolved by pressing "Resolve conversation".

@ogrisel
Member

ogrisel commented Feb 7, 2025

And for comments with an explicit code suggestion, feel free to use the GitHub buttons to accept the suggested changes (e.g. "Commit suggestion" to accept a single suggestion, or "Add suggestion to batch" several times to group suggestions into a single commit and then "Commit batched suggestion" at the end).

And then don't forget to pull those commits into your local branch before working on further code changes locally, to keep your local branch in sync with your PR.

@ogrisel
Member

ogrisel commented Feb 7, 2025

For reference, I ran our statistical test notebook against this branch, and it passes at the 5% level:

✅ KBinsDiscretizer: (min_p_value: 0.071, mean_p_value=0.492)

Note that we run a 30-way test, so we should really compare the minimum p-value to 0.05/30 ≈ 0.0017, although this might be a bit conservative as the individual tests are not independent.

Still, it's a good sign that this statistical test failed to find an obvious bug in this PR.

Copy link
Member

@jeremiedbb jeremiedbb left a comment


I directly pushed the last suggestion. It was a really minor nitpick.

LGTM! Thanks @snath-xoc for this PR, which was a lot more complicated than expected 😄

@jeremiedbb jeremiedbb enabled auto-merge (squash) February 7, 2025 17:41
@jeremiedbb jeremiedbb merged commit 1b7dea1 into scikit-learn:main Feb 8, 2025
31 checks passed
@snath-xoc
Contributor Author

@snath-xoc there are still open suggestions and unresolved threads above, for instance: #29907 (review)

More generally, it's good to always scan the diff view of the PR: https://github.com/scikit-learn/scikit-learn/pull/29907/files (and press the "Load diff" links in case some files with large diffs are collapsed) to review all the open discussion threads.

If some open comment threads do not have embedded code suggestions but have been addressed manually in another commit, reply in the open thread with the commit hash that addresses them, and then mark the thread as resolved by pressing "Resolve conversation".

Great, yes, sorry about that. I think I committed the actively suggested reviews but left some longer threads hanging; hopefully nothing went missing.

@ogrisel
Member

ogrisel commented Feb 8, 2025

Hum, we forgot to do a full doc build to see the list of impacted examples that need to be updated to avoid raising the warning. Apparently there are 6 of them:

https://app.circleci.com/pipelines/github/scikit-learn/scikit-learn/65115/workflows/909d8adf-1ad2-46e7-9030-8b4cb93b717c/jobs/297476?invite=true#step-105-645838_102

Comment on lines +4 to +5
- :func: `_averaged_weighted_percentile` now added which implements
an averaged inverted cdf calculation of percentiles.
Member


@snath-xoc I missed that in the review, but we should not mention private API names in the changelog. It's confusing: the primary users of the changelog are library users, and they should not be incentivized to use private API in their code, since it can break without a deprecation cycle when upgrading scikit-learn.

Can you please open a new PR to delete those lines?

Contributor Author


O.K. will do, thanks for letting me know!

ogrisel added a commit to ogrisel/scikit-learn that referenced this pull request Feb 8, 2025
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>

Successfully merging this pull request may close these issues.

Incorrect sample weight handling in KBinsDiscretizer