MAINT Common sample_weight validation #14307


Merged · 35 commits into scikit-learn:master · Jul 19, 2019

Conversation


@rth rth commented Jul 12, 2019

This implements a helper function for sample weight validation, and uses it consistently across the code base when applicable.

Currently it's only applied to estimators (particularly in linear models and SVM) that already had some form of sample weight validation. In the future it might be worthwhile to add it to estimators that don't currently validate sample weights.

It is a prerequisite for optionally enforcing positive sample weights (#12464).

Also somewhat related to #14286 and #14294

The main motivation is to avoid implementing this yet another time for GLMs in #14300

Any thoughts on this @glemaitre, since you were looking into related issues lately?

    sample_weight = np.array(sample_weight)
    check_consistent_length(y, sample_weight)
else:
    sample_weight = np.ones_like(y)
rth (Member Author):

I'm pretty sure this didn't produce sample weights with the expected dtype when y was an integer array.
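A standalone illustration of the dtype pitfall (not the PR's code): `np.ones_like` inherits the dtype of its argument, so an integer target silently yields integer weights.

```python
import numpy as np

y = np.array([0, 1, 1])          # integer classification target
sw = np.ones_like(y)             # inherits y's integer dtype
print(sw.dtype.kind)             # 'i' -- not a floating dtype

# passing an explicit dtype avoids the problem
sw = np.ones(y.shape[0], dtype=np.float64)
print(sw.dtype)                  # float64
```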

if n_samples is not None and y is not None:
    raise ValueError('Only one of y, n_samples must be provided!')
elif y is not None:
    n_samples = y.shape[0]
rth (Member Author), Jul 12, 2019:

I think this should handle the cases where:

  • y is 1D
  • y is 2D (a 1D column)
  • y is 2D (multi-output)
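In all three of these cases `shape[0]` gives the number of samples, which is what the helper relies on (a standalone check, not the PR's code):

```python
import numpy as np

for y in (np.zeros(6),           # 1D
          np.zeros((6, 1)),      # 2D column
          np.zeros((6, 3))):     # 2D multi-output
    # the first axis is always the sample axis
    assert y.shape[0] == 6
```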

order=None, **kwargs):
"""Validate sample weights

Parameters
Member:

I would add strictly_positive=True by default. We might have some places where negative weights are meaningful.

Member:

We can introduce this in the next PR if we don't break any back-compatibility.

rth (Member Author):

Yes, I would rather make a separate PR for it; it might also involve defining a global setting require_positive_sample_weight set to either 'all', 'any' or False, cf. the related discussion in #12464 (comment).

@@ -980,3 +980,56 @@ def check_scalar(x, name, target_type, min_val=None, max_val=None):

if max_val is not None and x > max_val:
    raise ValueError('`{}`= {}, must be <= {}.'.format(name, x, max_val))


def _check_sample_weight(sample_weight, y=None, n_samples=None, dtype=None,
Member:

What about making it public, the same way as check_classification_target? If we have a common test, all estimators will need to add this part, so public seems wise to me.

rth (Member Author):

I don't think we should maintain backward compatibility on this or let users use it in their code. Utils are semi-private anyway; this just marks it as explicitly private. We can still use it in common checks, even if it's private, right?

    n_samples = y.shape[0]

if sample_weight is None or isinstance(sample_weight, numbers.Number):
    if order is None:
Member:

I don't see any reason for sample_weight to be a number; it's equivalent to all ones.
Only Ridge allows sample_weight to be a float, but I think that's useless, and we should rather disallow a float as sample_weight and not support it here.

rth (Member Author):

I also don't see the point of it, but this PR aims to do a refactoring without breaking backward compatibility. If we are going to disallow it (and potentially break some user code) I would rather it be considered in a separate PR.



def _check_sample_weight(sample_weight, y=None, n_samples=None, dtype=None,
order=None, **kwargs):
Member:

Some estimators (e.g. AdaBoost) might need the weights to be normalized (i.e. w / w.sum()). You could add a normalize parameter?

Member:

Or do we let each estimator do it, since it might be estimator-specific?

rth (Member Author):

Yeah, for now the approach was to add functionality as I encountered it in the code. AdaBoost seems to do something fairly specific; I haven't seen the need for it in other estimators, but if we encounter more, we could certainly add it. This would need to be applied to more estimators in the future (and that's also why I would rather not make the function public, as parameters may change).

Member:

I agree to make it private until all estimators use it and the functionalities are stable.

Member:

It seems to be specific to AdaBoost. KMeans uses a different normalization (weights have to sum to n_samples...).

    target variable. Either y or n_samples must be provided.
n_samples : int or None
    Expected number of samples. Either y or n_samples must be provided.
dtype : dtype
Member:

You need n_samples for the unsupervised case, when y isn't there?

rth (Member Author):

Yes, e.g. k-means.

jeremiedbb (Member), Jul 12, 2019:

Should we then pass X instead of n_samples?

else:
    if dtype is None:
        dtype = [np.float64, np.float32]
    sample_weight = check_array(
Member:

Are there cases where we don't want sample_weight.dtype to be the same as X.dtype?
Because if there aren't any, we can pass X as a parameter, force X.dtype, and get rid of the n_samples parameter.

rth (Member Author), Jul 12, 2019:

Are there cases where we don't want sample_weight.dtype to be the same as X.dtype ?

That's a very good question, but I'm not 100% sure where sample_weight is used across estimators. For instance, take the case of some regressor where X is float32 and y is float64. Right now sample_weight would be float64 by default. Would it make sense to also make it float32? I don't know, maybe. Would it break something, and do we currently have sufficient tests to detect it? I don't know either.

So here I went with the safer road of just adding this refactoring without changing dtypes. The second part could be added but I think it would deserve a more careful look as part of #11000 (and related issues).

    sample_weight, accept_sparse=False,
    ensure_2d=False, dtype=dtype, order=order, **kwargs
)
if sample_weight.ndim != 1:
Member:

You can use column_or_1d validation to have consistent warnings and errors raised.

rth (Member Author):

So my problem with column_or_1d is that it applies np.ravel even to 1D arrays, making the provided order irrelevant, as the output is then contiguous (at the cost of a copy).

Also it doesn't look like this unraveling was previously done for sample_weight in the estimators I looked at, meaning that a 1D-column sample_weight would previously fail, and with this would pass. Do we really want to add this additional degree of freedom?
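A standalone illustration of the contiguity point being made here: `np.ravel` returns a view for a contiguous 1D array, but silently copies a non-contiguous one, so any requested memory order is lost.

```python
import numpy as np

a = np.ones(10)                   # contiguous 1D array
assert np.shares_memory(a, np.ravel(a))        # view, no copy

b = np.ones(10)[::2]              # non-contiguous 1D view (strided)
assert not np.shares_memory(b, np.ravel(b))    # ravel made a copy
```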

if sample_weight.ndim != 1:
    raise ValueError("Sample weights must be 1D array or scalar")

if sample_weight.shape != (n_samples,):
Member:

You can use check_consistent_length

Member:

to have a consistent error raised.

rth (Member Author):

Yes, but y is not always provided, and check_consistent_length wouldn't work when only n_samples is provided.

Member:

Uhm, true. It is a pity that we cannot keep the error message in one location.

Member:

unless you check against X :)

rth (Member Author):

You are right, checking against X is probably best. Will add it.

Member:

Was this addressed?

rth (Member Author):

No. I just want to check the shape, while check_consistent_length does a bunch of generic and more complicated stuff. I'm also not convinced that the resulting generic error message,

Found input variables with inconsistent numbers of samples: %r

is an improvement over the current specific message,

samples_weight.shape == {}, expected {}!

(though I'm open to suggestions on how to improve the latter one).

rth (Member Author):

The difference is also that by this point sample_weight has been validated, so I am sure it's an array, while check_consistent_length doesn't make any assumptions hence the more complicated and generic solution.

Member:

Makes sense.



def test_check_sample_weight():
with pytest.raises(ValueError,
Member:

Do you want to group all the exceptions within the same parametrized test? ;)

rth (Member Author):

There are only two exceptions here, and the parameters used are not the same, so I'm not convinced that parametrization would improve readability.

rth and others added 11 commits July 12, 2019 13:24
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>

rth commented Jul 12, 2019

Thanks for reviews @glemaitre and @jeremiedbb . I think I addressed all comments. Please let me know if I missed something.

unless you check against X :)

You are right, checking against X is probably best. Will add it.

Actually I have looked into it, and it's not that simple. Checking against X works well if check_array was run on X and we are sure it's an array-like. However there is no guarantee that this has happened when we call _check_sample_weight in general.

In particular, X could be any iterable that, upon check_array(X), would return an array, in which case we don't know the number of samples. Of course we can check that it's an ndarray or sparse array and raise an error otherwise, but again there is no guarantee that we have tests covering this use case for the particular estimator that may have this issue.

I guess one could make an argument that y could also be an iterable, but here I used it where it was used before, so it's a net-zero change in that respect.

Although manually passing n_samples or y may require a tiny bit more work, in the end I prefer it as it's safer (and I don't want to deal with all the edge cases that may arise when checking against X).

jeremiedbb (Member):

My last argument: you can get n_samples from _num_samples(X), and if X is such that this would raise an error, it would have been caught later by check_array anyway (all estimators call check_array at some point). But if you think it's safer this way, leave it as is.


rth commented Jul 15, 2019

I have added a fix so that when dtype is not a float (e.g. an int), the resulting sample_weight will have the default float dtype (i.e. float64); see #14314 for more details.

CI is green now. Any other comments that I should address?
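Putting the discussion so far together, the helper's behavior can be sketched roughly as follows. This is a simplified, hypothetical reimplementation for illustration only (the name `check_sample_weight_sketch` is made up; the real, private function lives in sklearn.utils.validation and additionally goes through check_array with order handling):

```python
import numbers

import numpy as np


def check_sample_weight_sketch(sample_weight, X, dtype=None):
    """Simplified sketch of the sample_weight validation discussed above."""
    n_samples = len(X)  # the real helper uses _num_samples(X)

    # non-float dtypes (e.g. int) fall back to float64, per the fix above
    if dtype is None or dtype not in (np.float32, np.float64):
        dtype = np.float64

    if sample_weight is None:
        return np.ones(n_samples, dtype=dtype)
    if isinstance(sample_weight, numbers.Number):
        # scalar weights are broadcast to all samples
        return np.full(n_samples, sample_weight, dtype=dtype)

    sample_weight = np.asarray(sample_weight, dtype=dtype)
    if sample_weight.ndim != 1:
        raise ValueError("Sample weights must be 1D array or scalar")
    if sample_weight.shape != (n_samples,):
        raise ValueError("sample_weight.shape == {}, expected {}!".format(
            sample_weight.shape, (n_samples,)))
    return sample_weight
```

For example, `check_sample_weight_sketch(None, np.ones((5, 2)))` yields a float64 array of five ones, and a requested int dtype is converted to float64.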

sample_weight : {ndarray, Number or None}, shape (n_samples,)
Input sample weights.
X : nd-array, list or sparse matrix
Input data.
Member:

There should be a blank line between parameters, right?

otherwise it will be of 'C' order by default.

Parameters
----------
Member:

This section header should be Returns.

dtype of the validated `sample_weight`. Note that if `dtype` is not
one of `float32`, `float64`, the output will be of dtype `float64`.
order : 'F', 'C' or None (default=None)
    Whether an array will be forced to be Fortran- or C-style.
Member:

'F' is equivalent to 'C' for a 1D array. Don't you think we can drop this parameter and always enforce 'C'?
Letting a non-contiguous array pass is risky if it's passed to Cython with pointer manipulation.

rth (Member Author):

I agree.
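The point about 1D arrays is easy to verify in a standalone check: a one-dimensional NumPy array is simultaneously C- and Fortran-contiguous, so the order argument has no effect there.

```python
import numpy as np

a = np.ones(5, order='F')
# a 1D array satisfies both contiguity flags at once
assert a.flags['C_CONTIGUOUS'] and a.flags['F_CONTIGUOUS']
```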


rth commented Jul 15, 2019

Thanks @jeremiedbb . I addressed your above comments.

jeremiedbb (Member) left a comment:

LGTM!

@@ -428,8 +429,7 @@ def _ridge_regression(X, y, alpha, sample_weight=None, solver='auto',
" %d != %d" % (n_samples, n_samples_))

if has_sw:
if np.atleast_1d(sample_weight).ndim > 1:
raise ValueError("Sample weights must be 1D array or scalar")
sample_weight = _check_sample_weight(sample_weight, y)
Member:

It's probably equivalent, but y -> X?

"""
n_samples = _num_samples(X)

if dtype is not None and dtype not in [np.float32, np.float64]:
Member:

Add a test that when dtype=np.float32, the returned sample_weight is np.float32.

rth (Member Author):

Thanks @thomasjpfan, added the test.


rth commented Jul 18, 2019

@glemaitre , @thomasjpfan I think I have addressed all of your comments in this PR.

rth and others added 2 commits July 18, 2019 16:27
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
NicolasHug (Member) left a comment:

Nitpicks + 1 small concern

rth and others added 3 commits July 19, 2019 10:14

rth commented Jul 19, 2019

Thanks for the review @NicolasHug! Any other comments, or can this be merged?

NicolasHug (Member) left a comment:

2 nitpicks but LGTM, feel free to merge


# check numbers input
sample_weight = _check_sample_weight(2.0, X=np.ones((5, 2)))
assert_allclose(sample_weight, 2*np.ones(5))
Member:

Weird that we don't get a flake8 warning here

# int dtype will be converted to float64 instead
X = np.ones((5, 2), dtype=np.int)
sample_weight = _check_sample_weight(None, X, dtype=X.dtype)
assert sample_weight.dtype == np.float64
Member:

Maybe add a test for the shape-mismatch error? (It's implicitly tested in KMeans.)


rth commented Jul 19, 2019

Thanks @NicolasHug , I addressed your comments.

@thomasjpfan thomasjpfan merged commit fb169cd into scikit-learn:master Jul 19, 2019
@rth rth deleted the check-sample-weight branch July 19, 2019 15:25
thomasjpfan (Member):

Thank you @rth!
