MAINT Common sample_weight validation #14307
Conversation
    sample_weight = np.array(sample_weight)
    check_consistent_length(y, sample_weight)
else:
    sample_weight = np.ones_like(y)
I'm pretty sure this didn't produce sample weights with the expected dtype when y
was an integer array.
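For illustration, a minimal numpy sketch (not PR code) of the dtype issue: np.ones_like inherits the dtype of its argument, so integer targets would yield integer weights.

```python
import numpy as np

# np.ones_like preserves the dtype of y, so integer class labels
# would produce integer sample weights.
y = np.array([0, 1, 1, 0])           # integer class labels
w = np.ones_like(y)
assert w.dtype.kind == 'i'           # integer, not floating point

# A floating-point dtype has to be requested explicitly:
w = np.ones_like(y, dtype=np.float64)
assert w.dtype == np.float64
```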
sklearn/utils/validation.py
Outdated
if n_samples is not None and y is not None:
    raise ValueError('Only one of y, n_samples must be provided!')
elif y is not None:
    n_samples = y.shape[0]
I think this should handle the cases when,
- y is 1D
- y is 2D (a 1D column)
- y is 2D (multi-output)
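The three layouts listed above can be sketched as follows (a minimal numpy illustration, not PR code); in each case y.shape[0] is the number of samples, so the proposed code handles all of them:

```python
import numpy as np

# Minimal illustration of the three target layouts.
y_1d = np.array([1.0, 2.0, 3.0])   # 1D target
y_col = y_1d.reshape(-1, 1)        # 2D column, shape (n_samples, 1)
y_multi = np.ones((3, 4))          # 2D multi-output, shape (n_samples, n_outputs)

# y.shape[0] gives the number of samples in every case:
for y in (y_1d, y_col, y_multi):
    assert y.shape[0] == 3
```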
                         order=None, **kwargs):
    """Validate sample weights

    Parameters
I would add strictly_positive=True by default. We might have some places where negative weights are meaningful.
We can introduce this in the next PR if we don't break any back-compatibility.
Yes, I would rather make a separate PR for it; it might also involve defining a global setting require_positive_sample_weight set to either 'all', 'any' or False, cf. the related discussion in #12464 (comment).
sklearn/utils/validation.py
Outdated
@@ -980,3 +980,56 @@ def check_scalar(x, name, target_type, min_val=None, max_val=None):

    if max_val is not None and x > max_val:
        raise ValueError('`{}`= {}, must be <= {}.'.format(name, x, max_val))

def _check_sample_weight(sample_weight, y=None, n_samples=None, dtype=None, |
What about making it public, the same way as check_classification_target? If we have a common test, all estimators will need to add this part, so public seems wise to me.
I don't think we should maintain backward compatibility on this or let users use it in their code. Utils are semi-private anyway; this just marks it as explicitly private. We can still use it in common checks, even if it's private, right?
sklearn/utils/validation.py
Outdated
    n_samples = y.shape[0]

if sample_weight is None or isinstance(sample_weight, numbers.Number):
    if order is None:
I don't see any reason for sample_weight to be a number. It's equivalent to all ones. Only Ridge allows sample_weight to be a float, but I think it's useless and we should rather not allow a float as sample_weight and not support it here.
I also don't see the point of it, but this PR aims to do a refactoring without breaking backward compatibility. If we are going to disallow it (and potentially break some user code) I would rather it be considered in a separate PR.
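For context, a scalar weight is indeed equivalent to a constant weight array, which in turn behaves like unit weights in most weighted computations (a numpy sketch, not PR code):

```python
import numpy as np

n_samples = 4
x = np.array([1.0, 2.0, 3.0, 4.0])

# A scalar sample_weight w behaves like np.full(n_samples, w) ...
w_array = np.full(n_samples, 2.0)

# ... and a constant weight cancels out of a weighted mean, so it is
# equivalent to unit weights there:
assert np.average(x, weights=w_array) == np.average(x, weights=np.ones(n_samples))
```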
sklearn/utils/validation.py
Outdated
def _check_sample_weight(sample_weight, y=None, n_samples=None, dtype=None,
                         order=None, **kwargs):
Some estimators (e.g. AdaBoost) might need the weights to be normalized (i.e. w / w.sum()). You could add a normalize parameter?
Or do we let each estimator do it, since it might be done on a per-estimator basis?
Yeah, for now the approach was to add functionality as I encountered it in the code. AdaBoost seems to do something fairly specific; I haven't seen the need for it in other estimators. But if we encounter more, we could certainly add it. This would need to be applied to more estimators in the future (and that's also why I would rather not make the function public, as parameters may change).
I agree to make it private until all estimators use it and the functionalities are stable.
It seems to be specific to AdaBoost. KMeans uses a different normalization (weights have to sum to n_samples...).
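To make the difference concrete, here is a sketch of the two normalization conventions mentioned (AdaBoost-style sum-to-one vs. KMeans-style sum-to-n_samples); neither is implemented in this PR:

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0, 4.0])
n_samples = w.shape[0]

# AdaBoost-style normalization: weights sum to 1.
w_sum_one = w / w.sum()
assert np.isclose(w_sum_one.sum(), 1.0)

# KMeans-style normalization: weights sum to n_samples, so that
# unit weights are the neutral element.
w_sum_n = w * (n_samples / w.sum())
assert np.isclose(w_sum_n.sum(), n_samples)
```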
sklearn/utils/validation.py
Outdated
    target variable. Either y or n_samples must be provided.
n_samples : int or None
    expected number of samples. Either y or n_samples must be provided.
dtype : dtype
You need n_samples in case y isn't there, for unsupervised estimators?
yes, e.g. k-means.
should we then pass X instead of n_samples ?
else:
    if dtype is None:
        dtype = [np.float64, np.float32]
    sample_weight = check_array(
Are there cases where we don't want sample_weight.dtype to be the same as X.dtype? Because if there aren't any, we can pass X as a parameter, force X.dtype, and get rid of the n_samples parameter.
Are there cases where we don't want sample_weight.dtype to be the same as X.dtype?

That's a very good question, but I'm not 100% sure where sample_weight is used across estimators. For instance, take the case of some regressor where X is float32 and y is float64. Right now sample_weight would be float64 by default. Would it make sense to also make it float32? I don't know, maybe. Would it break something, and do we currently have sufficient tests to detect it? I don't know either.
So here I went with the safer road of just adding this refactoring without changing dtypes. The second part could be added, but I think it would deserve a more careful look as part of #11000 (and related issues).
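The dtype rule discussed here can be sketched as follows (a simplified, hypothetical stand-in, not the sklearn implementation): float32 is kept, and every other requested dtype falls back to float64.

```python
import numpy as np

def weight_dtype(dtype):
    # Hypothetical helper mirroring the rule described in the docstring:
    # only float32 and float64 are allowed for sample weights; anything
    # else (including integer dtypes, or None) falls back to float64.
    if dtype is None or dtype not in (np.float32, np.float64):
        return np.float64
    return dtype

assert weight_dtype(np.float32) == np.float32
assert weight_dtype(np.float64) == np.float64
assert weight_dtype(np.int64) == np.float64
assert weight_dtype(None) == np.float64
```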
        sample_weight, accept_sparse=False,
        ensure_2d=False, dtype=dtype, order=order, **kwargs
    )
    if sample_weight.ndim != 1:
You can use column_or_1d validation to have consistent warnings and errors raised.
So my problem with column_or_1d is that it applies np.ravel even to 1D arrays, making the provided order irrelevant, as the output is then contiguous (at the cost of a copy). Also, it doesn't look like this unraveling was previously done for sample_weight in the estimators I looked at, meaning that a 1D-column-like sample_weight would previously have failed, and with this would pass. Do we really want to add this additional degree of freedom?
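For reference, a small numpy sketch (not PR code) of the ravel behavior in question: ravel flattens a column silently, and it has to copy when the input is not contiguous.

```python
import numpy as np

# A (n, 1) column is silently flattened to shape (n,):
w_col = np.ones((5, 1))
assert np.ravel(w_col).shape == (5,)

# For a non-contiguous input, ravel must copy, which is the cost
# mentioned above:
w_strided = np.ones(10)[::2]
w_r = np.ravel(w_strided)
assert w_r.flags['C_CONTIGUOUS']
assert not np.shares_memory(w_r, w_strided)
```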
if sample_weight.ndim != 1:
    raise ValueError("Sample weights must be 1D array or scalar")

if sample_weight.shape != (n_samples,):
You can use check_consistent_length to have a consistent error raised.
Yes, but y is not always provided, and check_consistent_length wouldn't work when only n_samples is provided.
uhm True. It is a pity that we cannot keep the error message in one location.
unless you check against X :)
You are right, checking against X is probably best. Will add it.
Was this addressed?
No. I just want to check the shape, while check_consistent_length does a bunch of generic and more complicated stuff. I'm also not convinced that the resulting generic error message, "Found input variables with inconsistent numbers of samples: %r", is an improvement over the current specific message, "sample_weight.shape == {}, expected {}!" (though I'm open to suggestions on how to improve the latter).
The difference is also that by this point sample_weight has been validated, so I am sure it's an array, while check_consistent_length doesn't make any assumptions, hence the more complicated and generic solution.
Makes sense.
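To illustrate the trade-off discussed above, here are two hypothetical stand-ins (neither is the sklearn code) contrasting the generic consistent-length message with the specific shape message:

```python
import numpy as np

def generic_length_check(*arrays):
    # Stand-in mimicking the generic style of check_consistent_length.
    lengths = [len(a) for a in arrays]
    if len(set(lengths)) > 1:
        raise ValueError("Found input variables with inconsistent numbers "
                         "of samples: %r" % lengths)

def specific_shape_check(sample_weight, n_samples):
    # Stand-in for the direct shape comparison defended above; it assumes
    # sample_weight is already a validated 1D ndarray.
    if sample_weight.shape != (n_samples,):
        raise ValueError("sample_weight.shape == {}, expected {}!".format(
            sample_weight.shape, (n_samples,)))

w = np.ones(4)
try:
    specific_shape_check(w, 5)
except ValueError as exc:
    message = str(exc)
assert "expected (5,)" in message
```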
def test_check_sample_weight():
    with pytest.raises(ValueError,
Do you want to group all the exceptions within the same parametrized test? ;)
There are only two exceptions here, and the parameters used are not the same, so I'm not convinced that parametrization would improve readability.
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Thanks for the reviews @glemaitre and @jeremiedbb. I think I addressed all comments. Please let me know if I missed something.
Actually I have looked into it, and it's not that simple. Checking against In particular, I guess one could make an argument that In the end, although manually passing
My last argument:
I have added a fix. CI is green now. Any other comments that I should address?
sample_weight : {ndarray, Number or None}, shape (n_samples,)
    Input sample weights.
X : nd-array, list or sparse matrix
    Input data.
There should be a blank line between parameters, right ?
sklearn/utils/validation.py
Outdated
    otherwise it will be of 'C' order by default.

Parameters
----------
Returns
sklearn/utils/validation.py
Outdated
    dtype of the validated `sample_weight`. Note that if `dtype` is not
    one of `float32`, `float64`, the output will be of dtype `float64`.
order : 'F', 'C' or None (default=None)
    Whether an array will be forced to be fortran or c-style.
'F' is equivalent to 'C' for a 1D array. Don't you think we can drop this parameter and always enforce 'C'? Letting a non-contiguous array pass is risky if it's passed to Cython, with pointer manipulation.
I agree.
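The contiguity facts behind this decision can be checked directly (a numpy sketch, not PR code):

```python
import numpy as np

# A contiguous 1D array is both C- and F-contiguous, so 'F' and 'C'
# order are indeed equivalent for 1D sample weights:
w = np.ones(5)
assert w.flags['C_CONTIGUOUS'] and w.flags['F_CONTIGUOUS']

# A strided 1D view is neither, which is exactly the kind of array that
# would be unsafe to hand to Cython code expecting contiguous memory:
w_strided = np.ones(10)[::2]
assert not w_strided.flags['C_CONTIGUOUS']
assert not w_strided.flags['F_CONTIGUOUS']
```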
Thanks @jeremiedbb. I addressed your comments above.
LGTM !
sklearn/linear_model/ridge.py
Outdated
@@ -428,8 +429,7 @@ def _ridge_regression(X, y, alpha, sample_weight=None, solver='auto',
                     " %d != %d" % (n_samples, n_samples_))

if has_sw:
    if np.atleast_1d(sample_weight).ndim > 1:
        raise ValueError("Sample weights must be 1D array or scalar")
sample_weight = _check_sample_weight(sample_weight, y)
it's probably equivalent but y -> X ?
""" | ||
n_samples = _num_samples(X) | ||
|
||
if dtype is not None and dtype not in [np.float32, np.float64]: |
Test that when dtype=np.float32, the sample_weight is np.float32.
Thanks @thomasjpfan, added the test.
@glemaitre, @thomasjpfan I think I have addressed all of your comments in this PR.
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Nitpicks + 1 small concern
Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>
Thanks for the review @NicolasHug! Any other comments? Or can this be merged?
2 nitpicks but LGTM, feel free to merge
# check numbers input
sample_weight = _check_sample_weight(2.0, X=np.ones((5, 2)))
assert_allclose(sample_weight, 2*np.ones(5))
Weird that we don't get a flake8 warning here
# int dtype will be converted to float64 instead
X = np.ones((5, 2), dtype=np.int)
sample_weight = _check_sample_weight(None, X, dtype=X.dtype)
assert sample_weight.dtype == np.float64
Maybe add a test for the shape difference error? (it's implicitly tested in kmeans)
Thanks @NicolasHug, I addressed your comments.
Thank you @rth!
This implements a helper function for sample weight validation, and uses it consistently across the code base when applicable.
Currently it's only applied to estimators (particularly in linear models and SVM) that already had some form of sample weight validation. In the future it might be worthwhile to add it to estimators that don't currently validate sample weights.
It is a pre-requisite for optionally enforcing positive sample weights (#12464)
Also somewhat related to #14286 and #14294
The main motivation is to avoid implementing this yet another time for GLMs in #14300
Any thoughts on this @glemaitre, since you were looking into related issues lately?
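Putting the pieces of this conversation together, here is a simplified sketch of what such a validation helper does. This is an illustrative stand-in, not the actual `_check_sample_weight` merged in the PR; the function name and exact behavior are assumptions based on the discussion above.

```python
import numbers
import numpy as np

def check_sample_weight_sketch(sample_weight, X, dtype=None):
    """Illustrative stand-in for the helper discussed in this PR.

    None and scalar inputs expand to constant weight arrays, a floating
    dtype is enforced (float32 is kept, anything else becomes float64),
    and the result must have shape (n_samples,).
    """
    n_samples = X.shape[0]
    if dtype is None or dtype not in (np.float32, np.float64):
        dtype = np.float64

    if sample_weight is None:
        return np.ones(n_samples, dtype=dtype)
    if isinstance(sample_weight, numbers.Number):
        return np.full(n_samples, sample_weight, dtype=dtype)

    sample_weight = np.asarray(sample_weight, dtype=dtype)
    if sample_weight.ndim != 1:
        raise ValueError("Sample weights must be 1D array or scalar")
    if sample_weight.shape != (n_samples,):
        raise ValueError("sample_weight.shape == {}, expected {}!".format(
            sample_weight.shape, (n_samples,)))
    return sample_weight

X = np.ones((5, 2))
assert np.array_equal(check_sample_weight_sketch(None, X), np.ones(5))
assert np.array_equal(check_sample_weight_sketch(2.0, X), 2 * np.ones(5))
assert check_sample_weight_sketch(None, X, dtype=np.int64).dtype == np.float64
```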