MAINT Common sample_weight validation #14307


Merged · 35 commits into scikit-learn:master · Jul 19, 2019

Conversation


@rth rth commented Jul 12, 2019

This implements a helper function for sample weight validation, and uses it consistently across the code base when applicable.

Currently it's only applied to estimators (particularly in linear models and SVM) that already had some form of sample weight validation. In the future it might be worthwhile to add it to estimators that don't currently validate sample weights.

It is a prerequisite for optionally enforcing positive sample weights (#12464).

Also somewhat related to #14286 and #14294

The main motivation is to avoid implementing this yet another time for GLMs in #14300

Any thoughts on this @glemaitre, since you were looking into related issues lately?

    sample_weight = np.array(sample_weight)
    check_consistent_length(y, sample_weight)
else:
    sample_weight = np.ones_like(y)
rth (Member Author):

I'm pretty sure this didn't produce sample weights with the expected dtype when y was an integer array.
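A standalone illustration of the dtype pitfall (not the PR's code): `np.ones_like` inherits the dtype of its argument, so an integer target silently yields integer weights.

```python
import numpy as np

y = np.array([0, 1, 1])          # integer classification target
sw = np.ones_like(y)             # inherits y's integer dtype
print(sw.dtype.kind)             # 'i' -- not a floating dtype

# passing an explicit dtype avoids the problem
sw = np.ones(y.shape[0], dtype=np.float64)
print(sw.dtype)                  # float64
```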

if n_samples is not None and y is not None:
    raise ValueError('Only one of y, n_samples must be provided!')
elif y is not None:
    n_samples = y.shape[0]
rth (Member Author), Jul 12, 2019:

I think this should handle the cases where:

  • y is 1D
  • y is 2D (a 1D column)
  • y is 2D (multi-output)
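In all three of these cases `shape[0]` gives the number of samples, which is what the helper relies on (a standalone check, not the PR's code):

```python
import numpy as np

for y in (np.zeros(6),           # 1D
          np.zeros((6, 1)),      # 2D column
          np.zeros((6, 3))):     # 2D multi-output
    # the first axis is always the sample axis
    assert y.shape[0] == 6
```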

order=None, **kwargs):
"""Validate sample weights

Parameters
Member:

I would add strictly_positive=True by default. We might have some places where negative weights are meaningful.

Member:

We can introduce this in the next PR if we don't break any back-compatibility.

rth (Member Author):

Yes, I would rather make a separate PR for it; it might also involve defining a global setting require_positive_sample_weight set to either 'all', 'any' or False, cf. the related discussion in #12464 (comment).

@@ -980,3 +980,56 @@ def check_scalar(x, name, target_type, min_val=None, max_val=None):

if max_val is not None and x > max_val:
    raise ValueError('`{}`= {}, must be <= {}.'.format(name, x, max_val))


def _check_sample_weight(sample_weight, y=None, n_samples=None, dtype=None,
Member:

What about making it public, the same way as check_classification_target? If we have a common test, all estimators will need to add this part, so public seems wise to me.

rth (Member Author):

I don't think we should maintain backward compatibility on this or let users use it in their code. Utils are semi-private anyway; this just marks it as explicitly private. We can still use it in common checks, even if it's private, right?

    n_samples = y.shape[0]

if sample_weight is None or isinstance(sample_weight, numbers.Number):
    if order is None:
Member:

I don't see any reason for sample_weight to be a number; it's equivalent to all ones.
Only Ridge allows sample_weight to be a float, but I think that's useless, and we should rather disallow a float as sample_weight and not support it here.

rth (Member Author):

I also don't see the point of it, but this PR aims to do a refactoring without breaking backward compatibility. If we are going to disallow it (and potentially break some user code) I would rather it be considered in a separate PR.



def _check_sample_weight(sample_weight, y=None, n_samples=None, dtype=None,
order=None, **kwargs):
Member:

Some estimators (e.g. AdaBoost) might need the weights to be normalized (i.e. w / w.sum()). You could add a normalize parameter?

Member:

Or do we let each estimator do it, since it might be estimator-specific?

rth (Member Author):

Yeah, for now the approach was to add functionality as I encountered it in the code. AdaBoost seems to do something fairly specific; I haven't seen the need for it in other estimators, but if we encounter more, we could certainly add it. This would need to be applied to more estimators in the future (and that's also why I would rather not make the function public, as parameters may change).

Member:

I agree to make it private until all estimators use it and the functionalities are stable.

Member:

It seems to be specific to AdaBoost. KMeans uses a different normalization (weights have to sum to n_samples...).

    target variable. Either y or n_samples must be provided.
n_samples : int or None
    Expected number of samples. Either y or n_samples must be provided.
dtype : dtype
Member:

You need n_samples for the unsupervised case, when y isn't there?

rth (Member Author):

Yes, e.g. k-means.

jeremiedbb (Member), Jul 12, 2019:

Should we then pass X instead of n_samples?

else:
    if dtype is None:
        dtype = [np.float64, np.float32]
    sample_weight = check_array(
Member:

Are there cases where we don't want sample_weight.dtype to be the same as X.dtype?
Because if there aren't any, we can pass X as a parameter, force X.dtype, and get rid of the n_samples parameter.

rth (Member Author), Jul 12, 2019:

Are there cases where we don't want sample_weight.dtype to be the same as X.dtype ?

That's a very good question, but I'm not 100% sure where sample_weight is used across estimators. For instance, take the case of some regressor where X is float32 and y is float64. Right now sample_weight would be float64 by default. Would it make sense to also make it float32? I don't know, maybe. Would it break something, and do we currently have sufficient tests to detect it? I don't know either.

So here I went with the safer road of just adding this refactoring without changing dtypes. The second part could be added but I think it would deserve a more careful look as part of #11000 (and related issues).

    sample_weight, accept_sparse=False,
    ensure_2d=False, dtype=dtype, order=order, **kwargs
)
if sample_weight.ndim != 1:
Member:

You can use column_or_1d validation to have consistent warnings and errors raised.

rth (Member Author):

So my problem with column_or_1d is that it applies np.ravel even to 1D arrays, making the provided order irrelevant, as the output is then contiguous (at the cost of a copy).

Also it doesn't look like this unraveling was previously done for sample_weight in the estimators I looked at, meaning that a 1D-column sample_weight would previously fail, and with this would pass. Do we really want to add this additional degree of freedom?
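A standalone illustration of the contiguity point being made here: `np.ravel` returns a view for a contiguous 1D array, but silently copies a non-contiguous one, so any requested memory order is lost.

```python
import numpy as np

a = np.ones(10)                   # contiguous 1D array
assert np.shares_memory(a, np.ravel(a))        # view, no copy

b = np.ones(10)[::2]              # non-contiguous 1D view (strided)
assert not np.shares_memory(b, np.ravel(b))    # ravel made a copy
```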

if sample_weight.ndim != 1:
    raise ValueError("Sample weights must be 1D array or scalar")

if sample_weight.shape != (n_samples,):
Member:

You can use check_consistent_length

Member:

to have a consistent error raised.

rth (Member Author):

Yes, but y is not always provided, and check_consistent_length wouldn't work when only n_samples is provided.

Member:

Uhm, true. It is a pity that we cannot keep the error message in one location.

Member:

unless you check against X :)

rth (Member Author):

You are right, checking against X is probably best. Will add it.

Member:

Was this addressed?

rth (Member Author):

No. I just want to check the shape, while check_consistent_length does a bunch of generic and more complicated stuff. I'm also not convinced that the resulting generic error message,

Found input variables with inconsistent numbers of samples: %r

is an improvement over the current specific message,

samples_weight.shape == {}, expected {}!

(though I'm open to suggestions on how to improve the latter one).

rth (Member Author):

The difference is also that by this point sample_weight has been validated, so I am sure it's an array, while check_consistent_length doesn't make any assumptions hence the more complicated and generic solution.

Member:

Makes sense.



def test_check_sample_weight():
with pytest.raises(ValueError,
Member:

Do you want to group all the exceptions within the same parametrized test? ;)

rth (Member Author):

There are only two exceptions here, and the parameters used are not the same, so I'm not convinced that parametrization would improve readability.

rth and others added 11 commits July 12, 2019 13:24
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>

rth commented Jul 12, 2019

Thanks for reviews @glemaitre and @jeremiedbb . I think I addressed all comments. Please let me know if I missed something.

unless you check against X :)

You are right, checking against X is probably best. Will add it.

Actually I have looked into it, and it's not that simple. Checking against X works well if check_array was run on X and we are sure it's an array-like. However there is no guarantee that this has happened when we call _check_sample_weight in general.

In particular, X could be any iterable that, upon check_array(X), would return an array, in which case we don't know the number of samples. Of course we can check that it's an ndarray or sparse array and raise an error otherwise, but again there is no guarantee that we have tests covering this use case for the particular estimator that may have this issue.

I guess one could make an argument that y could also be an iterable, but here I used it where it was used before, so it's a net-zero change in that respect.

Although manually passing n_samples or y may require a tiny bit more work, in the end I prefer it as it's safer (and I don't want to deal with all the edge cases that may arise when checking against X).

jeremiedbb (Member):

My last argument: you can get n_samples from _num_samples(X), and if X is such that this would raise an error, it would have been caught later by check_array anyway (all estimators call check_array at some point). But if you think it's safer this way, leave it as is.


rth commented Jul 15, 2019

I have added a fix so that when dtype is not a float (e.g. an int), the resulting sample_weight will have the default float dtype (i.e. float64); see #14314 for more details.

CI is green now. Any other comments that I should address?
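Putting the discussion so far together, the helper's behavior can be sketched roughly as follows. This is a simplified, hypothetical reimplementation for illustration only (the name `check_sample_weight_sketch` is made up; the real, private function lives in sklearn.utils.validation and additionally goes through check_array with order handling):

```python
import numbers

import numpy as np


def check_sample_weight_sketch(sample_weight, X, dtype=None):
    """Simplified sketch of the sample_weight validation discussed above."""
    n_samples = len(X)  # the real helper uses _num_samples(X)

    # non-float dtypes (e.g. int) fall back to float64, per the fix above
    if dtype is None or dtype not in (np.float32, np.float64):
        dtype = np.float64

    if sample_weight is None:
        return np.ones(n_samples, dtype=dtype)
    if isinstance(sample_weight, numbers.Number):
        # scalar weights are broadcast to all samples
        return np.full(n_samples, sample_weight, dtype=dtype)

    sample_weight = np.asarray(sample_weight, dtype=dtype)
    if sample_weight.ndim != 1:
        raise ValueError("Sample weights must be 1D array or scalar")
    if sample_weight.shape != (n_samples,):
        raise ValueError("sample_weight.shape == {}, expected {}!".format(
            sample_weight.shape, (n_samples,)))
    return sample_weight
```

For example, `check_sample_weight_sketch(None, np.ones((5, 2)))` yields a float64 array of five ones, and a requested int dtype is converted to float64.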

sample_weight : {ndarray, Number or None}, shape (n_samples,)
Input sample weights.
X : nd-array, list or sparse matrix
Input data.
Member:

There should be a blank line between parameters, right?

otherwise it will be of 'C' order by default.

Parameters
----------
Member:

This section header should be Returns.

dtype of the validated `sample_weight`. Note that if `dtype` is not
one of `float32`, `float64`, the output will be of dtype `float64`.
order : 'F', 'C' or None (default=None)
    Whether an array will be forced to be Fortran- or C-style.
Member:

'F' is equivalent to 'C' for a 1D array. Don't you think we can drop this parameter and always enforce 'C'?
Letting a non-contiguous array pass is risky if it's passed to Cython with pointer manipulation.

rth (Member Author):

I agree.
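The point about 1D arrays is easy to verify in a standalone check: a one-dimensional NumPy array is simultaneously C- and Fortran-contiguous, so the order argument has no effect there.

```python
import numpy as np

a = np.ones(5, order='F')
# a 1D array satisfies both contiguity flags at once
assert a.flags['C_CONTIGUOUS'] and a.flags['F_CONTIGUOUS']
```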


rth commented Jul 15, 2019

Thanks @jeremiedbb . I addressed your above comments.

jeremiedbb (Member) left a comment:

LGTM!

@@ -428,8 +429,7 @@ def _ridge_regression(X, y, alpha, sample_weight=None, solver='auto',
" %d != %d" % (n_samples, n_samples_))

if has_sw:
if np.atleast_1d(sample_weight).ndim > 1:
raise ValueError("Sample weights must be 1D array or scalar")
sample_weight = _check_sample_weight(sample_weight, y)
Member:

It's probably equivalent, but y -> X?

"""
n_samples = _num_samples(X)

if dtype is not None and dtype not in [np.float32, np.float64]:
Member:

Add a test that when dtype=np.float32, the returned sample_weight is np.float32.

rth (Member Author):

Thanks @thomasjpfan, added the test.


rth commented Jul 18, 2019

@glemaitre , @thomasjpfan I think I have addressed all of your comments in this PR.

rth and others added 2 commits July 18, 2019 16:27
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
NicolasHug (Member) left a comment:

Nitpicks + 1 small concern

rth and others added 3 commits July 19, 2019 10:14

rth commented Jul 19, 2019

Thanks for the review @NicolasHug! Any other comments, or can this be merged?

NicolasHug (Member) left a comment:

2 nitpicks but LGTM, feel free to merge


# check numbers input
sample_weight = _check_sample_weight(2.0, X=np.ones((5, 2)))
assert_allclose(sample_weight, 2*np.ones(5))
Member:

Weird that we don't get a flake8 warning here

# int dtype will be converted to float64 instead
X = np.ones((5, 2), dtype=np.int)
sample_weight = _check_sample_weight(None, X, dtype=X.dtype)
assert sample_weight.dtype == np.float64
Member:

Maybe add a test for the shape-mismatch error? (It's implicitly tested in KMeans.)


rth commented Jul 19, 2019

Thanks @NicolasHug , I addressed your comments.

@thomasjpfan thomasjpfan merged commit fb169cd into scikit-learn:master Jul 19, 2019
@rth rth deleted the check-sample-weight branch July 19, 2019 15:25
thomasjpfan (Member):

Thank you @rth!
