-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
ENH: refactored utils/validation._check_sample_weights() and added stronger sample_weights checks for all estimators #14653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: refactored utils/validation._check_sample_weights() and added stronger sample_weights checks for all estimators #14653
Conversation
Thank you for the PR! We can most likely use: scikit-learn/sklearn/utils/validation.py Lines 1003 to 1004 in 1a14920
which returns a validated sample weight. |
Please add a non-regression test that would fail at master but pass in this PR. |
Sure. Thank you
…On Wed, 14 Aug 2019, 15:56 Andreas Mueller, ***@***.***> wrote:
Please add a non-regression test
<https://en.wikipedia.org/wiki/Non-regression_testing> that would fail at
master but pass in this PR.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#14653?email_source=notifications&email_token=ADXZYMVLKMIIW7TCCD27L5TQERPPZA5CNFSM4ILV74F2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4J5QXY#issuecomment-521394271>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ADXZYMVDIIBI73QDZ4XHK4DQERPPZANCNFSM4ILV74FQ>
.
|
Do we really need a non-regression test for this @amueller ? The new @maxwell-aladago A simple non-regression test would be to pass a sample_weight array with a bad number of components I think. |
I'll add the test shortly. Thank you |
The regression tests now added @NicolasHug |
sklearn/utils/validation.py
Outdated
if sample_weight.shape != (n_samples,): | ||
raise ValueError("sample_weight.shape == {}, expected {}!" | ||
try: | ||
sample_weight = np.array(sample_weight, dtype=dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could avoid to make a copy (if possible) by adding copy=False
, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering what's the motivation for this refactoring of the _check_sample_weights
? Previously if sample_weight was None
, we would allocate an array of ones and return that. In this version, we make that allocation and than continue with the standard list of checks on this array. Granted they are fast, but still unnecessary.
sklearn/utils/validation.py
Outdated
sample_weight = np.array(sample_weight, dtype=dtype) | ||
except ValueError as e: | ||
e.args = (e.args[0] + ". sample weights must a scalar or " | ||
"a 1D array numeric types",) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not very enthusiastic about hacking of error messages in try... except..
, why do we need this? or the try except clause? Is it related to the array function protocol #14687 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you sure? I see it in the diff for this PR and if I make a local checkout of latest changes here.
@@ -93,6 +93,55 @@ def test_linear_regression_sample_weights(): | |||
assert_almost_equal(inter1, coefs2[0]) | |||
|
|||
|
|||
def test_sample_weights(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of this is already covered by check_sample_weights_invariance
in common tests (see sklearn/utils/estimator_checks.py
) and I would rather we added more checks there, rather than add estimator specific tests that would take effort to maintain in the long run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, will refactor the PR. Thanks
related to #14702 |
sklearn/utils/estimator_checks.py
Outdated
@@ -636,6 +637,16 @@ def check_sample_weights_invariance(name, estimator_orig): | |||
1, 1, 1, 1, 2, 2, 2, 2], dtype=np.dtype('int')) | |||
y = enforce_estimator_tags_y(estimator1, y) | |||
|
|||
# sample weights greater than 1D raises ValueError | |||
sample_weight = [[1, 2]] | |||
with pytest.raises(ValueError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry in advance about this :)
In the estimator_checks.py
, we are not using pytest
(it would force third party using this code to use pytest as well).
Therefore, you need to use assert_raises
instead of pytest.raises
(but this is the only place).
Also, it might be useful to use assert_raises_regex
and match a partial string. It would enforce consistency regarding the error message raised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, thank you. I realised there was a soft dependency on pytest
and the condo
had a problem installing it.
dtype = np.float64 | ||
|
||
if sample_weight is None or isinstance(sample_weight, numbers.Number): | ||
if sample_weight is None: | ||
sample_weight = np.ones(n_samples, dtype=dtype) | ||
else: | ||
elif isinstance(sample_weight, numbers.Number): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the else statement was fine, wasn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would even write
if sample_weigt is None:
sample_weight = np.ones(...)
elif isinstance(..., Number):
...
else:
return sample_weight
Basically remove the first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sample_weight
may have the wrong number of dimensions or elements if it's already an array. Thus, the further checks are necessary. We can only ignore the checks below if sample_weight
is created within the function (i.e, when sample_weight
is None
or it's an integer
@@ -1025,27 +1025,27 @@ def _check_sample_weight(sample_weight, X, dtype=None): | |||
""" | |||
n_samples = _num_samples(X) | |||
|
|||
if dtype is not None and dtype not in [np.float32, np.float64]: | |||
if hasattr(sample_weight, "dtype"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is useless I think. When this is an array we will return it directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be an array of strings, returning it immediately can lead to problems later.
Edit: or it could have the wrong number of elements or dimensions.
|
||
sample_weight = np.array(sample_weight, dtype=dtype) | ||
|
||
if sample_weight.ndim != 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already return sample_weight if it was an array. Shall make the check as well in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't return sample weight if it's an array. We need to check that it's dtype
is one of np.float32
or np.float64
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I probably worded my comment badly in #14653 (comment): checks can be indeed added to common tests however these will not pass until all estimators use _check_sample_weight
which is not yet the case.
To validate the refactoring of _check_sample_weights
we need non regression checks in sklearn/utils/tests/test_validation.py::test_check_sample_weight
. Right now the added tests pass on master I think? A few comments:
- this removes
check_array
in favor ofnp.array
. The former does a number of checks and tries to reduce copies. - the handling of the dtype parameter is re-written I'm not sure what's the motivation there.
- I agree we can probably simplify the nesting of if conditions as @glemaitre suggested
@rth, can you check the recent commit and let me know whether I am on the right path? Thank you |
@maxwell-aladago Can you explain the initial motivation for refactoring |
Reference Issues/PRs
What does this implement/fix? Explain your changes.
Any other comments?