
ENH: refactored utils/validation._check_sample_weights() and added stronger sample_weights checks for all estimators #14653


Closed
wants to merge 5 commits into from

Conversation

maxwell-aladago
Contributor

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

@thomasjpfan
Member

thomasjpfan commented Aug 14, 2019

Thank you for the PR!

We can most likely use:

def _check_sample_weight(sample_weight, X, dtype=None):
    """Validate sample weights.

which returns a validated sample weight. _rescale_data would most likely need a slight update to take this validated sample weight.
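A minimal sketch of what using this helper could look like (the helper is a private scikit-learn utility, so its exact signature and behavior may differ across versions):

```python
# Sketch: how fit() might obtain validated weights before passing them on
# to something like _rescale_data. Assumes sklearn's private helper
# _check_sample_weight(sample_weight, X, dtype=None).
import numpy as np
from sklearn.utils.validation import _check_sample_weight

X = np.arange(10, dtype=np.float64).reshape(5, 2)

# None is expanded to an array of ones, one weight per sample
w = _check_sample_weight(None, X)
print(w)          # [1. 1. 1. 1. 1.]

# a scalar is broadcast to every sample the same way
w = _check_sample_weight(2.0, X)
print(w.shape)    # (5,)
```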

@amueller
Member

Please add a non-regression test that would fail at master but pass in this PR.

@maxwell-aladago
Contributor Author

maxwell-aladago commented Aug 14, 2019 via email

@NicolasHug
Member

Do we really need a non-regression test for this @amueller ? The new _check_sample_weight is clearly stricter and I feel that as long as the CI is green this is good?

@maxwell-aladago A simple non-regression test would be to pass a sample_weight array with a bad number of components I think.
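A hedged sketch of such a non-regression test (estimator choice and exact error message are illustrative; the point is only that a wrong-length sample_weight should raise ValueError at fit time):

```python
# Sketch of the suggested non-regression test: passing a sample_weight
# with the wrong number of entries should raise ValueError.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(10.0).reshape(5, 2)
y = np.arange(5.0)

try:
    # 3 weights for 5 samples: should be rejected during validation
    LinearRegression().fit(X, y, sample_weight=np.ones(3))
except ValueError as exc:
    print("raised:", exc)
```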

NicolasHug
NicolasHug previously approved these changes Aug 15, 2019
@maxwell-aladago
Contributor Author

Do we really need a non-regression test for this @amueller ? The new _check_sample_weight is clearly stricter and I feel that as long as the CI is green this is good?

@maxwell-aladago A simple non-regression test would be to pass a sample_weight array with a bad number of components I think.

I'll add the test shortly. Thank you

@NicolasHug NicolasHug dismissed their stale review August 16, 2019 10:43

many changes since approval

@maxwell-aladago
Contributor Author

The regression tests are now added, @NicolasHug

    if sample_weight.shape != (n_samples,):
        raise ValueError("sample_weight.shape == {}, expected {}!".format(
            sample_weight.shape, (n_samples,)))
    try:
        sample_weight = np.array(sample_weight, dtype=dtype)
Member

we could avoid making a copy (when possible) by adding copy=False, couldn't we?

Contributor Author

Yes, that's possible
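A small illustration of the point (np.asarray is the equivalent spelling of the suggested copy=False: when the input is already an ndarray of the right dtype, no new allocation is made):

```python
# np.array copies by default; np.asarray (i.e. copy only when needed)
# returns the input unchanged when dtype already matches.
import numpy as np

w = np.ones(4, dtype=np.float64)

copied = np.array(w, dtype=np.float64)    # always allocates a new array
shared = np.asarray(w, dtype=np.float64)  # no conversion needed: no copy

print(copied is w)   # False
print(shared is w)   # True
```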

Member

@rth rth left a comment

Just wondering, what's the motivation for this refactoring of _check_sample_weights? Previously, if sample_weight was None, we would allocate an array of ones and return it. In this version, we make that allocation and then continue with the standard list of checks on this array. Granted, they are fast, but still unnecessary.
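The early-return structure being described could be sketched as follows (a simplified stand-in for the private helper, not its actual implementation):

```python
# Sketch of the early-return rth describes: when sample_weight is None,
# allocate the ones array and return it immediately, skipping the
# shape/dtype checks that a freshly created array satisfies by construction.
import numbers
import numpy as np

def check_sample_weight_sketch(sample_weight, X, dtype=None):
    """Simplified stand-in for sklearn's private _check_sample_weight."""
    n_samples = len(X)
    if dtype not in (np.float32, np.float64):
        dtype = np.float64
    if sample_weight is None:
        # created here: nothing left to validate
        return np.ones(n_samples, dtype=dtype)
    if isinstance(sample_weight, numbers.Number):
        return np.full(n_samples, sample_weight, dtype=dtype)
    # user-provided array: shape and dtype still need checking
    sample_weight = np.asarray(sample_weight, dtype=dtype)
    if sample_weight.ndim != 1:
        raise ValueError("sample_weight must be 1D array or scalar")
    if sample_weight.shape != (n_samples,):
        raise ValueError("sample_weight.shape == {}, expected {}!".format(
            sample_weight.shape, (n_samples,)))
    return sample_weight
```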

        sample_weight = np.array(sample_weight, dtype=dtype)
    except ValueError as e:
        e.args = (e.args[0] + ". sample weights must be a scalar or "
                  "a 1D array of numeric types",)
Member

I'm not very enthusiastic about hacking error messages in try ... except. Why do we need this, or the try/except clause at all? Is it related to the array function protocol #14687?

Member

Are you sure? I see it in the diff for this PR and in a local checkout of the latest changes here.

@@ -93,6 +93,55 @@ def test_linear_regression_sample_weights():
    assert_almost_equal(inter1, coefs2[0])


def test_sample_weights():
Member

Part of this is already covered by check_sample_weights_invariance in common tests (see sklearn/utils/estimator_checks.py) and I would rather we added more checks there, rather than add estimator specific tests that would take effort to maintain in the long run.
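The invariance idea behind that common test can be illustrated without scikit-learn at all (a numpy-only sketch; the real check lives in sklearn/utils/estimator_checks.py):

```python
# The invariance idea: fitting with all-ones sample weights must match
# fitting with no weights. Shown here with plain weighted least squares,
# where sample weights enter as sqrt-weights on each row.
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = rng.rand(20)
w = np.ones(20)

coef_plain, *_ = np.linalg.lstsq(X, y, rcond=None)
sw = np.sqrt(w)
coef_ones, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

print(np.allclose(coef_plain, coef_ones))   # True
```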

Contributor Author

ok, will refactor the PR. Thanks

@maxwell-aladago maxwell-aladago changed the title added stronger validation to sample weight to LinearRegresssion ENH: refactored utils/validation._check_sample_weights() and added stronger sample_weights checks for all estimators Aug 21, 2019
@amueller
Member

related to #14702

@@ -636,6 +637,16 @@ def check_sample_weights_invariance(name, estimator_orig):
                     1, 1, 1, 1, 2, 2, 2, 2], dtype=np.dtype('int'))
    y = enforce_estimator_tags_y(estimator1, y)

    # sample weights greater than 1D raises ValueError
    sample_weight = [[1, 2]]
    with pytest.raises(ValueError):
Member

Sorry in advance about this :)

In estimator_checks.py, we are not using pytest (it would force third parties using this code to use pytest as well).

Therefore, you need to use assert_raises instead of pytest.raises (but this is the only place).
Also, it might be useful to use assert_raises_regex and match a partial string. It would enforce consistency regarding the error message raised.
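A hedged sketch of the suggested pattern, using the numpy.testing variant of the helper (scikit-learn ships its own version in its test utilities; the stub estimator and message here are purely illustrative):

```python
# assert_raises_regex checks both that the exception is raised and that
# its message matches a partial pattern, enforcing message consistency.
import numpy as np
from numpy.testing import assert_raises_regex

def fit_stub(sample_weight):
    """Hypothetical stand-in for an estimator's fit-time validation."""
    sample_weight = np.asarray(sample_weight)
    if sample_weight.ndim != 1:
        raise ValueError("sample_weight must be 1D array or scalar")

# passes: ValueError is raised and the message contains "must be 1D"
assert_raises_regex(ValueError, "must be 1D", fit_stub, [[1, 2]])
```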

Contributor Author

ok, thank you. I realised there was a soft dependency on pytest, and conda had a problem installing it.

    dtype = np.float64

    if sample_weight is None or isinstance(sample_weight, numbers.Number):
        if sample_weight is None:
            sample_weight = np.ones(n_samples, dtype=dtype)
        else:
    elif isinstance(sample_weight, numbers.Number):
Member

the else statement was fine, wasn't it?

Member

I would even write

if sample_weight is None:
    sample_weight = np.ones(...)
elif isinstance(..., Number):
    ...
else:
    return sample_weight

Basically remove the first

Contributor Author

The sample_weight may have the wrong number of dimensions or elements if it's already an array, so the further checks are necessary. We can only skip the checks below when sample_weight is created within the function (i.e., when sample_weight is None or a number).

@@ -1025,27 +1025,27 @@ def _check_sample_weight(sample_weight, X, dtype=None):
    """
    n_samples = _num_samples(X)

    if dtype is not None and dtype not in [np.float32, np.float64]:
    if hasattr(sample_weight, "dtype"):
Member

This is useless I think. When this is an array we will return it directly

Contributor Author

maxwell-aladago commented Aug 23, 2019

It could be an array of strings; returning it immediately can lead to problems later.

Edit: or it could have the wrong number of elements or dimensions.
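This point can be demonstrated with a numpy-only example (illustrative, not the PR's code): a string array passes a bare "it has a dtype, return it" check but fails only later, once numeric work starts, whereas converting up front produces an early, explicit error.

```python
# An ndarray with string dtype would slip through a check that only
# tests "is this already an array"; converting eagerly surfaces the
# problem at validation time instead of deep inside the estimator.
import numpy as np

w = np.array(["1.0", "2.0", "3.0"])      # still an ndarray, dtype kind 'U'
print(w.dtype.kind)                       # 'U'

converted = np.asarray(w, dtype=np.float64)   # fine: every entry parses
print(converted)                              # [1. 2. 3.]

try:
    np.asarray(np.array(["1.0", "oops"]), dtype=np.float64)
except ValueError as exc:
    print("early, explicit failure:", exc)
```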


    sample_weight = np.array(sample_weight, dtype=dtype)

    if sample_weight.ndim != 1:
Member

We already return sample_weight if it was an array. Should we make the check in this case as well?

Contributor Author

We don't return sample_weight if it's an array. We need to check that its dtype is one of np.float32 or np.float64.

Member

@rth rth left a comment

I probably worded my comment badly in #14653 (comment): checks can indeed be added to the common tests, however these will not pass until all estimators use _check_sample_weight, which is not yet the case.

To validate the refactoring of _check_sample_weights we need non-regression checks in sklearn/utils/tests/test_validation.py::test_check_sample_weight. Right now the added tests pass on master, I think? A few comments:

  • this removes check_array in favor of np.array. The former does a number of checks and tries to reduce copies.
  • the handling of the dtype parameter is rewritten; I'm not sure what the motivation is there.
  • I agree we can probably simplify the nesting of if conditions as @glemaitre suggested

@maxwell-aladago
Contributor Author

@rth, can you check the recent commit and let me know whether I am on the right path? Thank you

@jeremiedbb
Member

@maxwell-aladago Can you explain the initial motivation for refactoring _check_sample_weight? What breaks with the current implementation?

@adrinjalali adrinjalali deleted the branch scikit-learn:master January 22, 2021 10:54

8 participants