Fix sample weight passing in KBinsDiscretizer
#29907
Conversation
This hopefully improves sample weight handling in the non-deterministic case (i.e., when `subsample` is specified). Nevertheless, in the deterministic case, when `subsample` is at the default value of 200_000, we get an error for certain values of `n_bins`, e.g. `n_bins=4`:

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np

rng = np.random.RandomState(42)

# Five centres
centres = np.array([[0, 0], [0, 5], [3, 1], [2, 4], [8, 8]])
X, _ = make_blobs(
    n_samples=100,
    cluster_std=0.5,
    centers=centres,
    random_state=10,
)

# Create a dataset with repetitions and the corresponding sample weights
sample_weight = rng.randint(0, 10, size=X.shape[0])
X_resampled_by_weights = np.repeat(X, sample_weight, axis=0)

# Cannot check the uniform strategy with sample weights
kbins_weighted = KBinsDiscretizer(
    n_bins=4, strategy="quantile", random_state=10
).fit(X, sample_weight=sample_weight)
kbins_repeated = KBinsDiscretizer(
    n_bins=4, strategy="quantile", random_state=10
).fit(X_resampled_by_weights)

np.testing.assert_allclose(
    np.hstack(kbins_weighted.bin_edges_),
    np.hstack(kbins_repeated.bin_edges_),
)
```

which raises an assertion error. This is not the case for `n_bins=5`, and the rtol is low, so it may be trivial.
I performed more tests and added a `use_weights_in_resampling` option so we can easily check both before making a decision. I also realised that the result is sensitive to the subsampling portion: before, I was using 50 in the tests, but I have now changed it to 200, which led to different results. Attached are some plots summarising the results. The standalone notebook can be found in this gist: https://gist.github.com/snath-xoc/ef5dbe53ca4f02655fe99173e9ab1198
So, basically, for a small number of bins there is no statistical difference between the methods. But I would say that no statistical test will have enough power to find a statistically significant difference between 10 samples.
With the weighted resampling, it may happen that the number of distinct points we get after resampling is smaller than the number of clusters we ask KMeans for. This is more likely, for instance, if some weight is very big compared to the others, such that the corresponding point is picked many times, leading to a lot of duplicates after sampling. I would expect the same behaviour if we used repeated points instead of sample weights.
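To make this concrete, here is a small hypothetical illustration of that failure mode (not from the thread; it uses plain NumPy weighted sampling rather than the `resample` utility discussed in this PR):

```python
# Hypothetical illustration: when one weight dominates, weighted resampling can
# return very few distinct values, so KMeans may not be able to find as many
# centroids as the requested number of bins.
import numpy as np

rng = np.random.RandomState(0)
X = np.linspace(0, 1, num=20)
sample_weight = np.ones_like(X)
sample_weight[0] = 1000.0  # one point carries almost all of the weight

p = sample_weight / sample_weight.sum()
subsample = rng.choice(X, size=10, replace=True, p=p)
# Typically only 1 or 2 distinct values survive, fewer than e.g. n_bins=4.
print(np.unique(subsample).size)
```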
Yes, not so notably, although it is worth mentioning that the t-test is carried out independently for each bin while iterating through 500 seeds. So we calculate the t-test on 500 samples for each bin (the titles on the plots only display the average). A summary table can be seen in the gist, and it seems that for some bins the t-test shows significant differences when using weighted quantile/kmeans compared to simply using weighted resampling.
@snath-xoc I played a bit with your notebook and designed a dataset which I suspected would show the discrepancy more reliably. The idea is to have a small number of points with a large weight that matter a lot for the computation of the bin edges. I made a 1d array with 4 blobs containing 1000 points of weight 1 each, and a fifth blob, far from the others, containing only 10 points but with a weight of 100 (so 1000 effective points). Then for the statistical test, I only checked the resampling + quantile computations, outside of the KBinsDiscretizer, to iterate faster, but I think we could adapt it for the full KBinsDiscretizer now. The results show a significant difference between the 2 options, clearly visible qualitatively in the plots.
The code to reproduce this is available here: https://gist.github.com/jeremiedbb/be5ee6d91ab0aa0c589a5517aaf74be8. Based on these results, and on the fact that we haven't found a case where the other option is clearly better, I'm more confident that doing a weighted resampling followed by unweighted quantiles is the correct way to deal with sample weights here.
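For readers without access to the gist, here is a rough, hypothetical reconstruction of the experiment described above (the dataset shapes, sample sizes, and the per-edge KS comparison are assumptions, not the actual gist code): four unit-weight blobs of 1000 points plus a distant blob of 10 points with weight 100, then checking that weighted resampling followed by unweighted quantiles matches the weight-expanded (repeated) reference, which is the equivalence the thread relies on.

```python
# Hypothetical sketch, not the actual gist: 1d data with four blobs of 1000
# points (weight 1) and a distant blob of 10 points with weight 100.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.RandomState(0)
X = np.concatenate(
    [rng.normal(loc, 1.0, size=1000) for loc in (0, 5, 10, 15)]  # weight-1 blobs
    + [rng.normal(40, 1.0, size=10)]                             # heavy, distant blob
)
sample_weight = np.concatenate([np.ones(4000), np.full(10, 100.0)])
X_repeated = np.repeat(X, sample_weight.astype(int))
quantiles = np.linspace(0, 1, 5)  # bin edges for n_bins=4

def edges_weighted_resampling(seed, n_samples=2000):
    # Option under test: resample with probability proportional to the weights,
    # then compute plain unweighted quantiles on the subsample.
    r = np.random.RandomState(seed)
    p = sample_weight / sample_weight.sum()
    sub = r.choice(X, size=n_samples, replace=True, p=p)
    return np.quantile(sub, quantiles)

def edges_repeated(seed, n_samples=2000):
    # Reference: materialize the weights as repetitions and resample uniformly.
    r = np.random.RandomState(seed)
    sub = r.choice(X_repeated, size=n_samples, replace=True)
    return np.quantile(sub, quantiles)

a = np.array([edges_weighted_resampling(s) for s in range(200)])
b = np.array([edges_repeated(s) for s in range(200)])
# Per-edge KS test across seeds: high p-values mean the two procedures produce
# statistically indistinguishable bin edge distributions.
print([round(ks_2samp(a[:, j], b[:, j]).pvalue, 3) for j in range(quantiles.size)])
```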
Thank you very much for the follow-up analysis @jeremiedbb. I agree with this conclusion.
Agreed as well, thanks @jeremiedbb. I suppose if this is the case we need to check the impact on HistGradientBoostingRegressor, since it also uses resampling (and if so, we may have to adapt the sample weight passing, etc.). What do you guys think?
Sure, but let's fix one estimator at a time ;)
Yes of course... I meant that if we incorporate the changes, i.e. add weighted resampling into the `resample` function, then this may cause HistGradientBoostingRegressor to break, so I'm not sure if we should do it now or first do further tests?
From a quick search, these changes will break backwards compatibility for sure, and I think a bunch of functions in the test suite as well. Should this change be discussed among the core developers?
I think we can make the change such that it doesn't impact other estimators for now. In other estimators we won't pass `sample_weight` to `resample`.
This is a bugfix. So backward compatibility is not really a concern, I believe. But we do need to document the fix in the changelog.
I agree with @jeremiedbb: let's only resolve the problem for `KBinsDiscretizer`.
@snath-xoc I did another quick pass of review. There was an old comment by @jeremiedbb that was not addressed. Please look at the diff view of the PR to catch them all: https://github.com/scikit-learn/scikit-learn/pull/29907/files. Once all review comments are addressed / answered, please don't forget to press the "Ready for review" button to remove the "Draft" status of this PR, and don't hesitate to ping people who already interacted with this PR previously (or related PRs/issues) to get them to review it.
We also need to expand the tests for the `resample` function, for instance something like:

```python
import numpy as np
from sklearn.utils import resample

data = np.asarray([-1, 0, 1])
sample_weight = np.asarray([0, 100, 1])

for seed in range(100):
    mean_repeated = resample(
        data.repeat(sample_weight), replace=True, random_state=seed
    ).mean()
    mean_reweighted = resample(
        data, sample_weight=sample_weight, replace=True, random_state=seed
    ).mean()
    # Collect all the means for all seed values and compare the distributions with KS.
    ...
```
O.K., sounds good. I can add something to the test_indexing.py file.
We also need a changelog entry. Since this PR is contributing both an enhancement of `resample` and a bug fix for `KBinsDiscretizer`, we should probably split it in two: one entry for the enhancement and one for the fix. Each will have its own file in the `doc/whats_new/upcoming_changes` folder.
I played around with the tests and have pushed the one that seems to work best. So far the kstest is not passing and we get extremely low p-values:

```python
from scipy.stats import kstest, ttest_ind
import numpy as np
from sklearn.utils import resample

data = np.array([-1, 0, 1, 100])
sample_weight = np.asarray([0, 100, 1, 20])

mean_repeated = []
mean_reweighted = []
for seed in range(100):
    mean_repeated.append(
        resample(data.repeat(sample_weight), replace=True, random_state=seed).mean()
    )
    mean_reweighted.append(
        resample(
            data, sample_weight=sample_weight, replace=True, random_state=seed
        ).mean()
    )

mean_repeated = np.asarray(mean_repeated)
mean_reweighted = np.asarray(mean_reweighted)

test_result = kstest(mean_repeated, mean_reweighted)
assert test_result.pvalue > 0.05
```

The same goes if I use

```python
data = np.array([-1, 0, 1])
sample_weight = np.asarray([0, 100, 1])
```

instead. The t-test works overall though and gives less unstable p-values from experimentation, so I am using that for now.
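For reference, a sketch of what the t-test variant might look like (it assumes the `sample_weight` parameter this PR adds to `sklearn.utils.resample`; the test that actually landed in test_indexing.py may differ):

```python
# Sketch of the t-test variant mentioned above; assumes the sample_weight
# parameter added to sklearn.utils.resample on this branch.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.utils import resample

data = np.array([-1, 0, 1, 100])
sample_weight = np.asarray([0, 100, 1, 20])

mean_repeated = np.asarray([
    resample(data.repeat(sample_weight), replace=True, random_state=seed).mean()
    for seed in range(100)
])
mean_reweighted = np.asarray([
    resample(data, sample_weight=sample_weight, replace=True, random_state=seed).mean()
    for seed in range(100)
])

# Same collection of means as in the KS version, but compared with an
# independent two-sample t-test, which gave more stable p-values in practice.
assert ttest_ind(mean_repeated, mean_reweighted).pvalue > 0.05
```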
Mostly nitpicks and doc comments. Otherwise LGTM.
doc/whats_new/upcoming_changes/sklearn.utils/29907.enhancement.rst
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
@snath-xoc, this PR is almost ready to be merged. There's just this #29907 (comment) that hasn't been addressed. We need an additional changelog entry that you should put in a file named accordingly.
Whoops, sorry, I think the PR was so long that I lost track of some of the comments, thank you for the reminder. I have updated the changelog, please check if it makes sense now.
@snath-xoc there are still open suggestions and unresolved threads above, for instance: #29907 (review). More generally, it's good to always scan the diff view of the PR: https://github.com/scikit-learn/scikit-learn/pull/29907/files (and press the "Load diff" links in case some files with large diffs are collapsed) to review all the open discussion threads. If some open comment threads have no embedded code suggestions but have been addressed manually in another commit, reply in the open thread with the commit hash that addresses them, and then mark the thread as resolved by pressing "Resolve conversation".
And for comments with an explicit code suggestion, feel free to use the GitHub buttons to accept the suggested changes (e.g. "Commit suggestion" to accept a single suggestion, or "Add suggestion to batch" several times to group suggestions into a single commit, then "Commit batched suggestion" at the end). And then don't forget to pull those commits into your local branch before working on further code changes locally, to keep your local branch in sync with your PR.
For reference, I ran our statistical test notebook against this branch, and it passes at the 5% level. Note that we run a 30-way test, so we should really compare the p-values to 0.05/30 ≈ 0.0017, although this might be a bit conservative since the individual tests are not independent. Still, it's a good sign that this statistical test failed to find an obvious bug in this PR.
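For illustration, a small sketch of that multiple-testing correction (the p-values below are placeholders, not the notebook's results):

```python
# Illustrative Bonferroni-style correction for a 30-way test; the p-values
# here are made up, the real ones come from the statistical test notebook.
import numpy as np

rng = np.random.RandomState(0)
pvalues = rng.uniform(size=30)              # placeholder for the 30 observed p-values
alpha = 0.05
per_test_threshold = alpha / len(pvalues)   # 0.05 / 30 ~= 0.0017
print(per_test_threshold, np.sum(pvalues < per_test_threshold))
```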
I directly pushed the last suggestion. It was a really minor nitpick.
LGTM! Thanks @snath-xoc for this PR that was a lot more complicated than expected 😄
Great, yes, sorry about that. I think I committed the suggestions from the reviews but left some of the longer threads hanging; hopefully nothing went missing.
Hum, we forgot to do a full doc build to see the list of impacted examples that need to be updated to avoid raising the warning. Apparently there are 6 of them.
- :func:`_averaged_weighted_percentile` now added which implements an averaged inverted cdf calculation of percentiles.
@snath-xoc I missed that in the review, but we should not mention private API names in the changelog. It's confusing: the primary users of the changelog are library users, and they should not be incentivized to use private API in their code, since it can break without a deprecation cycle when upgrading scikit-learn.
Can you please open a new PR to delete those lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
O.K. will do, thanks for letting me know!
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Co-authored-by: antoinebaker <antoinebaker@users.noreply.github.com>
Fixes #29906 and cross-linked to meta-issue #16298
Sample weights were previously not passed through to `resample`, which threw an error in `KBinsDiscretizer` when `subsample` was specified. Sample weights are now passed through and handled in `_indexing`/`resample`, which returns both the resampled `X` and the resampled sample weight values.
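As a hypothetical usage illustration of the intended behaviour after this fix (not a snippet from this PR; the dataset and parameter values are made up):

```python
# Hypothetical illustration of the intended user-facing behaviour described
# above: sample_weight is forwarded to the subsampling step instead of raising.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 2))
sample_weight = rng.randint(0, 5, size=X.shape[0]).astype(float)

est = KBinsDiscretizer(
    n_bins=4, strategy="quantile", subsample=500, random_state=0
)
# Per the PR description, this previously errored when subsample was specified;
# with the fix, the weights are taken into account during the subsampling.
est.fit(X, sample_weight=sample_weight)
print(est.bin_edges_)
```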
Here is a notebook testing:
To do:
- KBinsDiscretizer
- See #29907 (comment) instead.