Skip to content

ENH add support for sample_weight in KBinsDiscretizer(strategy="quantile") #24935

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 37 commits into from
Dec 13, 2022

Conversation

DeaMariaLeon
Copy link
Contributor

Reference Issues/PRs

Fixes #22048 see also #20758

What does this implement/fix?

Adding support for sample_weight in KBinsDiscretizer(strategy="quantile"). That is to say : I added a parameter in the fit method of KBinsDiscretizer. This parameter, sample_weight can be set to control the weights associated to each sample only when the strategy is set to "quantile"

Additional comments

PR 22048 opened and almost finished by Seladus

Seladus and others added 24 commits December 21, 2021 13:58
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
… statement and adding a test case for the exception raise on this same check
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Merge remote-tracking branch 'upstream/main' into deas-kbins
Merge remote-tracking branch 'upstream/main' into deas-kbins
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should come up with a good test for the sample_weight case.
Somehow, I am thinking that we could come up with an array where we could decide easily the quantile and add some extreme values that could have an impact and pass a corresponding null weight to those samples. We will check in this case that zero-weight will be taken into account in the computation of the quantile.

@@ -233,9 +245,28 @@ def fit(self, X, y=None):
'`subsample` must be used with `strategy="quantile"`.'
)

elif self.strategy != "quantile" and sample_weight is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems more logical in this order.

Suggested change
elif self.strategy != "quantile" and sample_weight is not None:
elif sample_weight is not None and self.strategy != "quantile":

@@ -35,6 +50,20 @@ def test_valid_n_bins():
assert KBinsDiscretizer(n_bins=2).fit(X).n_bins_.dtype == np.dtype(int)


def test_invalid_sample_weight():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_invalid_sample_weight():
@pytest.mark.parametrize("strategy", ["uniform", "kmeans"])
def test_kbinsdiscretizer_wrong_strategy_with_weights(strategy):

Comment on lines 55 to 64
strategy = ["uniform", "kmeans"]
sample_weight = [1, 1, 1, 1]
for s in strategy:
est = KBinsDiscretizer(n_bins=3, strategy=s)
err_msg = (
"`sample_weight` was provided but it can be only used with"
f"strategy='quantile'. Got strategy={s!r} instead."
)
with pytest.raises(ValueError, match=err_msg):
est.fit_transform(X, sample_weight=sample_weight)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
strategy = ["uniform", "kmeans"]
sample_weight = [1, 1, 1, 1]
for s in strategy:
est = KBinsDiscretizer(n_bins=3, strategy=s)
err_msg = (
"`sample_weight` was provided but it can be only used with"
f"strategy='quantile'. Got strategy={s!r} instead."
)
with pytest.raises(ValueError, match=err_msg):
est.fit_transform(X, sample_weight=sample_weight)
sample_weight = np.ones(size=(X.shape[0]))
estimator = KBinsDiscretizer(n_bins=3, strategy=strategy)
err_msg = (
"`sample_weight` was provided but it can be only used with"
"strategy='quantile'."
)
with pytest.raises(ValueError, match=err_msg):
est.fit(X, sample_weight=sample_weight)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On line 55 sample_weight = np.ones(size=(X.shape[0])):
X is a list, so we can't use the attribute shape.
Also np.ones has a parameter shape, not size.
Maybe then sample_weight = np.ones(shape=len(X))?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is indeed the first line of my suggestion :)

Merge remote-tracking branch 'upstream/main' into deas-kbins
@glemaitre glemaitre self-requested a review November 23, 2022 16:20
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from these small changes. LGTM.

DeaMariaLeon and others added 3 commits November 24, 2022 19:54
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Merge remote-tracking branch 'upstream/main' into deas-kbins
@DeaMariaLeon
Copy link
Contributor Author

I did the last changes @glemaitre... Do I need to let somebody know or should I keep waiting for a second reviewer?

I was thinking that I just should wait, but I'm not sure.

Thank you.

@glemaitre
Copy link
Member

We have to wait for a second reviewer, I will add a tag to inform other core developers that we request a second review. Thanks, @DeaMariaLeon for the work.

@glemaitre glemaitre added the Waiting for Second Reviewer First reviewer is done, need a second one! label Dec 1, 2022
@DeaMariaLeon
Copy link
Contributor Author

I should have asked earlier. I was thinking that magically someone was going to see it. 🤓

Thank you @glemaitre for all the teaching. :)

@glemaitre
Copy link
Member

I should have asked earlier. I was thinking that magically someone was going to see it. 🤓

Indeed, I personally follow-up PR based on email notification that happens only when commenting in PRs (or pushing in it). So if you want to get my attention, you need to explicitly write a comment :)

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nitpicks and a formatting fix in the error message but other than that, LGTM!

Comment on lines 56 to 57
"""Check that we raise an error message when the wrong strategy
is used."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick PEP 257 style (first line should be a one-liner):

Suggested change
"""Check that we raise an error message when the wrong strategy
is used."""
"""Check that we raise an error when the wrong strategy is used."""

sample_weight = np.ones(shape=(len(X)))
est = KBinsDiscretizer(n_bins=3, strategy=strategy)
err_msg = (
"`sample_weight` was provided but it can be only used withstrategy='quantile'."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"`sample_weight` was provided but it can be only used withstrategy='quantile'."
"`sample_weight` was provided but it can be only used with "
"strategy='quantile'."

@@ -94,6 +146,27 @@ def test_fit_transform_n_bins_array(strategy, expected):
assert bin_edges.shape == (n_bins + 1,)


@pytest.mark.filterwarnings("ignore: Bins whose width are too small")
def test_kbinsdiscretizer_effect_sample_weight():
"""Check that we take into account `sample_weight` when computing the quantiles."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Check that we take into account `sample_weight` when computing the quantiles."""
"""Check the impact of `sample_weight` one computed quantiles."""



def test_kbinsdiscretizer_no_mutating_sample_weight():
"""Check to make sure that `sample_weight` is no changed in place."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Check to make sure that `sample_weight` is no changed in place."""
"""Make sure that `sample_weight` is not changed in place."""

Comment on lines 246 to 247
"`sample_weight` was provided but it can be only used with"
f"strategy='quantile'. Got strategy={self.strategy!r} instead."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"`sample_weight` was provided but it can be only used with"
f"strategy='quantile'. Got strategy={self.strategy!r} instead."
"`sample_weight` was provided but it can only be "
f"used with strategy='quantile'. Got strategy="
f"{self.strategy!r} instead."

Merge remote-tracking branch 'upstream/main' into deas-kbins
@DeaMariaLeon
Copy link
Contributor Author

Implemented latest corrections. It should be completed.

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we released 1.2 last week, we need to update the changelog for this feature. Otherwise, still LGTM.

@@ -680,6 +680,13 @@ Changelog
is now deprecated and will be removed in version 1.4. Use `sparse_output` instead.
:pr:`24412` by :user:`Rushil Desai <rusdes>`.

- |Enhancement| Added support for `sample_weight` in :class:`preprocessing.KBinsDiscretizer`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just released 1.2. We will need to move this section to the v1.3.rst file

Contains weight values to be associated with each sample.
Only possible when `strategy` is set to `"quantile"`.

.. versionadded:: 1.2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.. versionadded:: 1.2
.. versionadded:: 1.3

Merge remote-tracking branch 'upstream/main' into deas-kbins
@DeaMariaLeon
Copy link
Contributor Author

Changed release news files.

However... There is this note in the instructions for documentation contributors that I just saw:
"When editing reStructuredText (.rst) files, try to keep line length under 80 characters when possible (exceptions include links and tables)."
I tried to keep it under 80 characters but it looked weird.

Also: "Before submitting your pull request check if your modifications have introduced new sphinx warnings and try to fix them."
When or where should I see the warnings?

Thanks

Comment on lines 48 to 53
- |Enhancement| Added support for `sample_weight` in :class:`preprocessing.KBinsDiscretizer`.
This allows specifying the parameter `sample_weight` for each sample to be used while
fitting. The option is only available when `strategy` is set to `quantile`.
:pr:`24935` by :user:`Seladus <seladus>`,
:user:`Guillaume Lemaitre <glemaitre>`, and
:user:`Dea María Léon <deamarialeon>`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be less than 80 characters. This would be fine.

Suggested change
- |Enhancement| Added support for `sample_weight` in :class:`preprocessing.KBinsDiscretizer`.
This allows specifying the parameter `sample_weight` for each sample to be used while
fitting. The option is only available when `strategy` is set to `quantile`.
:pr:`24935` by :user:`Seladus <seladus>`,
:user:`Guillaume Lemaitre <glemaitre>`, and
:user:`Dea María Léon <deamarialeon>`.
- |Enhancement| Added support for `sample_weight` in
:class:`preprocessing.KBinsDiscretizer`. This allows specifying the parameter
`sample_weight` for each sample to be used while fitting. The option is only
available when `strategy` is set to `quantile`.
:pr:`24935` by :user:`Seladus <seladus>`, :user:`Guillaume Lemaitre <glemaitre>`, and
:user:`Dea María Léon <deamarialeon>`.

@DeaMariaLeon
Copy link
Contributor Author

Done.. I think.

@glemaitre glemaitre merged commit 1f3c1be into scikit-learn:main Dec 13, 2022
@glemaitre
Copy link
Member

All good. Thanks @DeaMariaLeon

@DeaMariaLeon
Copy link
Contributor Author

Thank you @glemaitre

@DeaMariaLeon DeaMariaLeon deleted the deas-kbins branch December 13, 2022 16:27
Vincent-Maladiere pushed a commit to Vincent-Maladiere/scikit-learn that referenced this pull request Dec 14, 2022
…quantile" (scikit-learn#24935)

Co-authored-by: seladus <clement.blancovolle@insa-rouen.fr>
Co-authored-by: Seladus <71873495+Seladus@users.noreply.github.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 3, 2023
…quantile" (scikit-learn#24935)

Co-authored-by: seladus <clement.blancovolle@insa-rouen.fr>
Co-authored-by: Seladus <71873495+Seladus@users.noreply.github.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:preprocessing Waiting for Second Reviewer First reviewer is done, need a second one!
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants