-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
MNT Add estimator check for not calling __array_function__ #14702
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MNT Add estimator check for not calling __array_function__ #14702
Conversation
this is not a complete fix and is also breaking some things. |
hm this is actually nearly ok but I gotta run ;) |
now added a test for |
Should be good now? |
oh coverage fails because we don't test on new numpy I guess. Should we? I think waiting till it's on conda might be ok? |
|
||
def __array__(self, dtype=None): | ||
return self.data | ||
|
||
def __array_function__(self, func, types, args, kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning True
and raising TypeError
needs to be tested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean explicitly tested? Sure, i could do that. Thought that was a bit overkill for a test helper. They are obviously used in the tests, but on CI there's no __array_function__
protocol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be more onboard if NotAnArray
was private, which goes back to "public vs private utils" #6616 (comment)
On the other hand, we are depending on this raising when something is wrong. When everything is working, our tests do not run __array_function__
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. I'll add a test (that won't be run)
I'm not sure how many people install numpy from pypi, but my guess is "many". I'd rather have that, or conda-forge as the "test against the latest" in the CI, rather than conda. WDYT? |
@@ -118,6 +118,7 @@ def fit(self, X, y, sample_weight=None): | |||
self.sparse_output_ = sp.issparse(y) | |||
|
|||
if not self.sparse_output_: | |||
y = np.asarray(y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we not want to have a more complex atleast_1d
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean column_or_1d
? I honestly don't know why we would need atleast_1d
. We usually want scalars to be treated differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's also check_array(ensure_2d=False)
which might be more suitable here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, I tend to forget that we have both column_or_1d
and atleast_1d
. Ideally I think I'd rather have ensure_ndims=n
in check_array
maybe.
@@ -433,6 +434,8 @@ def fit(self, X, y, sample_weight=None): | |||
self.n_outputs_ = y.shape[1] | |||
|
|||
check_consistent_length(X, y, sample_weight) | |||
if sample_weight is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this belongs to check_array
too, doesn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure what you mean. Using check_array(sample_weights, ensure_2d=True)
instead of asarray? We don't have any tests for NaN in sample weights, do we? I'm not sure how much I want to make this PR about adding way more checks to sample weights
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have _check_sample_weight in validation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that does a lot more though, right?
The test failure is due to me using I'm ok with moving the "latest" CI to conda-forge |
# Conflicts: # sklearn/model_selection/tests/test_search.py
Never mind, #14381 doesn't fix that, I just won't use pytest in that file. |
any takers? Should I include the change to CI in this PR? |
I was gonna do the CI change, if you don't, I'll do in a separate PR. |
bc9df43
to
e0f239e
Compare
this is quite hard to keep in sync with master :-/ |
greeeeen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
We need a changelog entry (and maybe a section in the doc) to explain that sklearn estimators will not rely on __array_function__
by default and instead materialize fit parameters as numpy arrays internally.
@ogrisel like this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe mention NEP 18 explicitly to improve googl-ability.
doc/whats_new/v0.22.rst
Outdated
@@ -623,6 +627,10 @@ These changes mostly affect library developers. | |||
Such classifiers need to have the `binary_only=True` estimator tag. | |||
:pr:`13875` by `Trevor Stephens`_. | |||
|
|||
- Estimators are expected to convert input data (``X``, ``y``, | |||
``sample_weights``) to numpy ``ndarray`` and never call | |||
``__array_function__`` on the original datatype that is passed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
``__array_function__`` on the original datatype that is passed. | |
``__array_function__`` on the original datatype that is passed. This means that, | |
by default, estimators are not expected to support the array function dispatching | |
mechanism of `NEP 18`_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this wording is particularly clear. Estimators are forbidden from supporting the mechanism of NEP 18 with this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not want to phrase it in a way that this will always be the case in the future. But phrase it as you wish as long as NEP 18 is mentioned :)
Co-Authored-By: Olivier Grisel <olivier.grisel@ensta.org>
Co-Authored-By: Olivier Grisel <olivier.grisel@ensta.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
otherwise lgtm
doc/whats_new/v0.22.rst
Outdated
@@ -623,6 +628,11 @@ These changes mostly affect library developers. | |||
Such classifiers need to have the `binary_only=True` estimator tag. | |||
:pr:`13875` by `Trevor Stephens`_. | |||
|
|||
- Estimators are expected to convert input data (``X``, ``y``, | |||
``sample_weights``) to numpy ``ndarray`` and never call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use :class:`numpy.ndarray`
to get the cross-reference?
doc/whats_new/v0.22.rst
Outdated
@@ -601,6 +601,11 @@ Changelog | |||
Miscellaneous | |||
............. | |||
|
|||
- |API| Scikit-learn currently converts any input data structure implementing a duck array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean "now" instead of "currently"?
The only test is that estimators don't complain when you pass I think we should have @amueller I addressed Joel's comments and merged with master. I'm happy to address future comments if you need Regarding all the changes to sample_weight, I agree with @glemaitre that we could/should be using |
I think we should have _NotAnArray.__array_function__ raise an exception
to properly cover that?
Isn't that what this pr does?
|
Ah, indeed. I wasn't reading that well for some reason |
test for #14687.
I don't think we test against numpy 1.17 so this will only fail on cron not here?