ENH: refactored utils/validation._check_sample_weights() and added stronger sample_weights checks for all estimators #14653

maxwell-aladago · 2019-08-14T15:12:34Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

thomasjpfan · 2019-08-14T16:15:23Z

Thank you for the PR!

We can most likely use:

scikit-learn/sklearn/utils/validation.py

Lines 1003 to 1004 in 1a14920

    def _check_sample_weight(sample_weight, X, dtype=None):
        """Validate sample weights.

which returns a validated sample weight. _rescale_data would most likely need a slight update to take this validated sample weight.
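To illustrate the suggestion, here is a minimal sketch of a rescaling step that consumes an already validated weight array. The name `rescale_data` and its dense NumPy implementation are assumptions made for this example, not the library's code. It relies on the standard fact that weighted least squares with weights w is equivalent to ordinary least squares on rows scaled by sqrt(w).

```python
import numpy as np


def rescale_data(X, y, sample_weight):
    """Scale each row of X and y by the square root of its weight.

    Hypothetical helper: assumes sample_weight was already validated
    as a 1D float array with one entry per row of X.
    """
    root = np.sqrt(sample_weight)
    return X * root[:, np.newaxis], y * root


X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
validated = np.asarray([1.0, 4.0, 9.0], dtype=np.float64)

X_scaled, y_scaled = rescale_data(X, y, validated)
```

Because the validation happens before this call, the helper itself needs no shape or dtype checks.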

amueller · 2019-08-14T19:54:51Z

Please add a non-regression test that would fail at master but pass in this PR.

maxwell-aladago · 2019-08-14T19:58:37Z

Sure. Thank you


NicolasHug · 2019-08-15T15:32:36Z

Do we really need a non-regression test for this @amueller ? The new _check_sample_weight is clearly stricter and I feel that as long as the CI is green this is good?

@maxwell-aladago A simple non-regression test would be to pass a sample_weight array with a bad number of components I think.
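A non-regression test along these lines could look roughly like the sketch below. `check_weight_shape` is a hypothetical stand-in for the validation under test, and the error message mirrors the one visible in the diff; the real test would call the estimator or the library helper instead.

```python
import numpy as np


def check_weight_shape(sample_weight, n_samples):
    """Hypothetical minimal check: weights must have shape (n_samples,)."""
    sample_weight = np.asarray(sample_weight, dtype=np.float64)
    if sample_weight.shape != (n_samples,):
        raise ValueError(
            "sample_weight.shape == {}, expected {}!".format(
                sample_weight.shape, (n_samples,)))
    return sample_weight


def test_bad_number_of_weights():
    n_samples = 4
    try:
        # Three weights for four samples must be rejected.
        check_weight_shape([1.0, 2.0, 3.0], n_samples)
    except ValueError as exc:
        assert "expected (4,)" in str(exc)
    else:
        raise AssertionError("a too-short sample_weight was accepted")


test_bad_number_of_weights()
```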

maxwell-aladago · 2019-08-15T15:38:27Z

> Do we really need a non-regression test for this @amueller ? The new _check_sample_weight is clearly stricter and I feel that as long as the CI is green this is good?
>
> @maxwell-aladago A simple non-regression test would be to pass a sample_weight array with a bad number of components I think.

I'll add the test shortly. Thank you

sklearn/linear_model/base.py

many changes since approval

maxwell-aladago · 2019-08-16T13:01:45Z

The regression tests are now added, @NicolasHug.

glemaitre · 2019-08-19T20:43:30Z

sklearn/utils/validation.py

-        if sample_weight.shape != (n_samples,):
-            raise ValueError("sample_weight.shape == {}, expected {}!"
+        try:
+            sample_weight = np.array(sample_weight, dtype=dtype)


We could avoid making a copy (when possible) by adding copy=False, couldn't we?

Yes, that's possible
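As a small illustration of the copy behaviour being discussed: `np.array` copies by default, while `np.asarray` returns the input unchanged when it is already an array of the requested dtype. The sketch uses `np.asarray` rather than `copy=False` because, from NumPy 2.0 on, `np.array(..., copy=False)` raises a `ValueError` when a copy cannot be avoided, whereas older versions copied silently.

```python
import numpy as np

weights = np.array([0.5, 1.0, 2.0], dtype=np.float64)

# Same dtype: np.asarray hands back the input without copying.
same = np.asarray(weights, dtype=np.float64)

# Different dtype: a conversion (and therefore a copy) is unavoidable.
converted = np.asarray(weights, dtype=np.float32)

# np.array copies by default even when nothing needs converting.
copied = np.array(weights, dtype=np.float64)

print(np.shares_memory(weights, same))       # True
print(np.shares_memory(weights, converted))  # False
print(np.shares_memory(weights, copied))     # False
```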

rth

Just wondering what's the motivation for this refactoring of _check_sample_weight? Previously, if sample_weight was None, we would allocate an array of ones and return that. In this version, we make that allocation and then continue with the standard list of checks on this array. Granted, they are fast, but still unnecessary.

rth · 2019-08-20T13:30:37Z

sklearn/utils/validation.py

+            sample_weight = np.array(sample_weight, dtype=dtype)
+        except ValueError as e:
+            e.args = (e.args[0] + ". sample weights must a scalar or "
+                                  "a 1D array numeric types",)


I'm not very enthusiastic about hacking error messages in try... except. Why do we need this, or the try/except clause at all? Is it related to the array function protocol #14687?

you sure? I see it in the diff for this PR and if I make a local checkout of latest changes here.

rth · 2019-08-20T14:02:35Z

sklearn/linear_model/tests/test_base.py

@@ -93,6 +93,55 @@ def test_linear_regression_sample_weights():
                assert_almost_equal(inter1, coefs2[0])


+def test_sample_weights():


Part of this is already covered by check_sample_weights_invariance in the common tests (see sklearn/utils/estimator_checks.py), and I would rather we add more checks there than add estimator-specific tests that would take effort to maintain in the long run.

ok, will refactor the PR. Thanks

amueller · 2019-08-21T18:28:01Z

related to #14702

glemaitre · 2019-08-23T07:56:14Z

sklearn/utils/estimator_checks.py

@@ -636,6 +637,16 @@ def check_sample_weights_invariance(name, estimator_orig):
                      1, 1, 1, 1, 2, 2, 2, 2], dtype=np.dtype('int'))
        y = enforce_estimator_tags_y(estimator1, y)

+        # sample weights greater than 1D raises ValueError
+        sample_weight = [[1, 2]]
+        with pytest.raises(ValueError):


Sorry in advance about this :)

In estimator_checks.py, we are not using pytest (it would force third parties using this code to use pytest as well).

Therefore, you need to use assert_raises instead of pytest.raises (but this is the only place).
Also, it might be useful to use assert_raises_regex and match a partial string. It would enforce consistency regarding the error message raised.

OK, thank you. I realised there was a soft dependency on pytest, and conda had a problem installing it.
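For reference, a pytest-free check that also matches part of the error message can be sketched with the standard library alone. Here `assert_raises_regex` is built from `unittest.TestCase.assertRaisesRegex`; `fit_with_weights` and its message are hypothetical stand-ins for an estimator's `fit`.

```python
import unittest

import numpy as np

# Borrow the bound assertion method from a throwaway TestCase so it can be
# called as a plain function, without pytest.
_dummy = unittest.TestCase()
assert_raises_regex = _dummy.assertRaisesRegex


def fit_with_weights(sample_weight):
    """Hypothetical stand-in for a fit method that rejects 2D weights."""
    if np.ndim(sample_weight) > 1:
        raise ValueError("Sample weights must be 1D array or scalar")


# Passes: a ValueError whose message matches the partial pattern is raised.
assert_raises_regex(ValueError, "must be 1D array or scalar",
                    fit_with_weights, [[1, 2]])
```

Matching only a fragment of the message keeps the check robust to small rewording while still pinning down which error was raised.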

glemaitre · 2019-08-23T07:57:33Z

sklearn/utils/validation.py

        dtype = np.float64

    if sample_weight is None or isinstance(sample_weight, numbers.Number):
        if sample_weight is None:
            sample_weight = np.ones(n_samples, dtype=dtype)
-        else:
+        elif isinstance(sample_weight, numbers.Number):


the else statement was fine, wasn't it?

I would even write

    if sample_weight is None:
        sample_weight = np.ones(...)
    elif isinstance(..., Number):
        ...
    else:
        return sample_weight

Basically remove the first

The sample_weight may have the wrong number of dimensions or elements if it's already an array, so the further checks are necessary. We can only skip the checks below if sample_weight is created within the function (i.e., when sample_weight is None or a number).
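The branching argued for here can be sketched as follows. This is an illustration under assumptions, not the PR's code: `check_sample_weight` takes `n_samples` directly, and the error messages are invented for the example. Weights built inside the function return early, while user-supplied array-likes go through the dtype, dimension and length checks.

```python
import numbers

import numpy as np


def check_sample_weight(sample_weight, n_samples, dtype=np.float64):
    """Hypothetical sketch of the branching discussed above."""
    if sample_weight is None:
        # Built here, so shape and dtype are correct by construction.
        return np.ones(n_samples, dtype=dtype)
    if isinstance(sample_weight, numbers.Number):
        return np.full(n_samples, sample_weight, dtype=dtype)

    # User-supplied array-likes still need every check. The conversion
    # itself raises ValueError for non-numeric input such as strings.
    sample_weight = np.asarray(sample_weight, dtype=dtype)
    if sample_weight.ndim != 1:
        raise ValueError("Sample weights must be 1D array or scalar")
    if sample_weight.shape != (n_samples,):
        raise ValueError(
            "sample_weight.shape == {}, expected {}!".format(
                sample_weight.shape, (n_samples,)))
    return sample_weight


print(check_sample_weight(None, 3))       # [1. 1. 1.]
print(check_sample_weight(2, 3))          # [2. 2. 2.]
print(check_sample_weight([1, 2, 3], 3))  # [1. 2. 3.]
```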

sklearn/utils/validation.py

glemaitre · 2019-08-23T08:02:51Z

sklearn/utils/validation.py

@@ -1025,27 +1025,27 @@ def _check_sample_weight(sample_weight, X, dtype=None):
    """
    n_samples = _num_samples(X)

-    if dtype is not None and dtype not in [np.float32, np.float64]:
+    if hasattr(sample_weight, "dtype"):


This is useless I think. When this is an array we will return it directly

It could be an array of strings, returning it immediately can lead to problems later.

Edit: or it could have the wrong number of elements or dimensions.

glemaitre · 2019-08-23T08:04:02Z

sklearn/utils/validation.py

+
+    sample_weight = np.array(sample_weight, dtype=dtype)
+
+    if sample_weight.ndim != 1:


We already return sample_weight if it was an array. Shall we make the check in this case as well?

We don't return sample_weight if it's an array. We need to check that its dtype is one of np.float32 or np.float64.

rth

I probably worded my comment badly in #14653 (comment): checks can be indeed added to common tests however these will not pass until all estimators use _check_sample_weight which is not yet the case.

To validate the refactoring of _check_sample_weight we need non-regression checks in sklearn/utils/tests/test_validation.py::test_check_sample_weight. Right now the added tests pass on master, I think? A few comments:

- This removes check_array in favor of np.array. The former does a number of checks and tries to reduce copies.
- The handling of the dtype parameter is rewritten; I'm not sure what the motivation is there.
- I agree we can probably simplify the nesting of if conditions as @glemaitre suggested.

sklearn/linear_model/base.py

maxwell-aladago · 2019-08-23T12:52:26Z

@rth, can you check the recent commit and let me know whether I am on the right path? Thank you

jeremiedbb · 2019-09-18T14:55:56Z

@maxwell-aladago Can you explain the initial motivation for refactoring _check_sample_weight? What breaks with the current implementation?

NicolasHug previously approved these changes Aug 15, 2019

rth reviewed Aug 15, 2019

sklearn/linear_model/base.py (outdated)

tests for stronger validation for linear regression weights

ac19cd6

glemaitre reviewed Aug 19, 2019

rth reviewed Aug 20, 2019

maxwell-aladago added 2 commits August 21, 2019 03:38

refactored validation and tested against all estimators

4c25185

added checks to calibration.py

7b80778

maxwell-aladago changed the title ~~added stronger validation to sample weight to LinearRegresssion~~ ENH: refactored utils/validation._check_sample_weights() and added stronger sample_weights checks for all estimators Aug 21, 2019

flake8 issue

e679df8

glemaitre reviewed Aug 23, 2019

sklearn/utils/validation.py (outdated)

glemaitre reviewed Aug 23, 2019

rth requested changes Aug 23, 2019

rth reviewed Aug 23, 2019

sklearn/linear_model/base.py (outdated)

removing tests for all estimators

bdeb2c2

github-actions bot added module:ensemble module:linear_model labels Mar 2, 2020

github-actions bot added the module:utils label Mar 2, 2020

adrinjalali closed this Jan 22, 2021

adrinjalali deleted the branch scikit-learn:master January 22, 2021 10:54
