
MNT Add estimator check for not calling __array_function__ #14702


Merged (31 commits) on Oct 24, 2019

Conversation

amueller
Member

Test for #14687.

I don't think we test against numpy 1.17, so this will only fail on the cron job, not here?

@amueller
Member Author

This is not a complete fix, and it's also breaking some things.
I feel like we might want to allow may_share_memory? If someone implements that wrongly and doesn't return True or False, they can figure that out themselves ;)

@amueller
Member Author

hm this is actually nearly ok but I gotta run ;)

@amueller
Member Author

Now added a test for sample_weight as well. That wasn't tested with NotAnArray before (so it didn't work), and I needed to add some logic to the slicing in GridSearchCV to make it work with NotAnArray fit_params.

@amueller
Member Author

Should be good now?

@amueller
Member Author

oh coverage fails because we don't test on new numpy I guess. Should we? I think waiting till it's on conda might be ok?


def __array__(self, dtype=None):
return self.data

def __array_function__(self, func, types, args, kwargs):
Member

Returning True and raising TypeError needs to be tested?

Member Author

You mean explicitly tested? Sure, I could do that. I thought that was a bit overkill for a test helper. They are obviously used in the tests, but on CI there's no __array_function__ protocol.

Member

I would be more onboard if NotAnArray was private, which goes back to "public vs private utils" #6616 (comment)

On the other hand, we are depending on this raising when something is wrong. When everything is working, our tests do not run __array_function__.

Member Author

That's a good point. I'll add a test (that won't be run)
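Such a test (which would only exercise the protocol on numpy >= 1.17) might look like the following sketch. The class body is modeled on the `__array__`/`__array_function__` snippet under review; the class name and error message here are illustrative, not the actual sklearn helper:

```python
import numpy as np


class NotAnArray:
    """Sketch of an array-like test helper: convertible via __array__,
    but failing loudly if anything dispatches through NEP 18."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array__(self, dtype=None, copy=None):
        # The only sanctioned conversion path for estimators.
        return self.data

    def __array_function__(self, func, types, args, kwargs):
        # Raise so any code relying on NEP 18 dispatch fails the test.
        raise TypeError(
            "Don't want to call __array_function__ for %s!" % func.__name__)


x = NotAnArray([1, 2, 3])

# Materializing via np.asarray uses __array__ and works fine:
assert np.asarray(x).sum() == 6

# A numpy function that dispatches via __array_function__ must raise:
try:
    np.concatenate([x, x])
except TypeError:
    pass
else:
    raise AssertionError("expected TypeError from __array_function__")
```

The key property is the asymmetry: `np.asarray` (the "materialize first" policy) succeeds, while any direct dispatching call such as `np.concatenate` errors out.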

@adrinjalali
Member

> oh coverage fails because we don't test on new numpy I guess. Should we? I think waiting till it's on conda might be ok?

I'm not sure how many people install numpy from pypi, but my guess is "many". I'd rather have that, or conda-forge as the "test against the latest" in the CI, rather than conda. WDYT?

@@ -118,6 +118,7 @@ def fit(self, X, y, sample_weight=None):
self.sparse_output_ = sp.issparse(y)

if not self.sparse_output_:
y = np.asarray(y)
Member

do we not want to have a more complex atleast_1d instead?

Member Author

you mean column_or_1d? I honestly don't know why we would need atleast_1d. We usually want scalars to be treated differently.

Member Author

@amueller commented Aug 22, 2019

there's also check_array(ensure_2d=False) which might be more suitable here?

Member

Yep, I tend to forget that we have both column_or_1d and atleast_1d. Ideally, I think I'd rather have ensure_ndims=n in check_array, maybe.
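The distinction being discussed can be illustrated with a simplified sketch. This is not sklearn's actual implementation of column_or_1d, just a minimal version of its contract: accept 1-d input or a single column, and reject everything else, including scalars, which `np.atleast_1d` would silently promote:

```python
import numpy as np


def column_or_1d_sketch(y):
    """Simplified sketch of column_or_1d's contract (not the real
    sklearn implementation): 1-d arrays pass through, single-column
    2-d arrays are flattened, anything else is an error."""
    y = np.asarray(y)
    if y.ndim == 1:
        return y
    if y.ndim == 2 and y.shape[1] == 1:
        return y.ravel()
    raise ValueError("bad input shape {}".format(y.shape))


# np.atleast_1d, by contrast, silently promotes scalars:
assert np.atleast_1d(5).shape == (1,)

# The sketch flattens a column vector but would reject a scalar:
assert column_or_1d_sketch([[1], [2]]).tolist() == [1, 2]
```

This is why "we usually want scalars to be treated differently": column_or_1d raises on a 0-d input, whereas atleast_1d hides the problem.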

@@ -433,6 +434,8 @@ def fit(self, X, y, sample_weight=None):
self.n_outputs_ = y.shape[1]

check_consistent_length(X, y, sample_weight)
if sample_weight is not None:
Member

I feel like this belongs to check_array too, doesn't it?

Member Author

Not sure what you mean. Using check_array(sample_weights, ensure_2d=True) instead of asarray? We don't have any tests for NaN in sample weights, do we? I'm not sure how much I want to make this PR about adding way more checks to sample weights.

Member

we have _check_sample_weight in validation

Member Author

that does a lot more though, right?
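For context, the core of what such a validator does can be sketched as below. This is a deliberately simplified stand-in, not sklearn's private _check_sample_weight, which indeed does more (e.g. scalar weights and dtype handling):

```python
import numpy as np


def check_sample_weight_sketch(sample_weight, X, dtype=np.float64):
    """Simplified sketch of a sample-weight validator: materialize
    the weights as a float ndarray of length n_samples."""
    n_samples = len(X)
    if sample_weight is None:
        # Default to uniform weights rather than passing None around.
        return np.ones(n_samples, dtype=dtype)
    # np.asarray goes through __array__, never __array_function__,
    # which is exactly the conversion policy this PR enforces.
    sample_weight = np.asarray(sample_weight, dtype=dtype)
    if sample_weight.ndim != 1 or sample_weight.shape[0] != n_samples:
        raise ValueError(
            "sample_weight has shape {}, expected ({},)".format(
                sample_weight.shape, n_samples))
    return sample_weight


X = np.zeros((4, 2))
assert check_sample_weight_sketch(None, X).tolist() == [1.0] * 4
assert check_sample_weight_sketch([1, 2, 3, 4], X).dtype == np.float64
```

Even this minimal version shows why routing sample_weight through a single checker is attractive: the NotAnArray case is handled for free by the asarray call.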

@amueller
Member Author

The test failure is due to me using pytest in test_estimator_checks :-/ That will be fixed in #14381 I think.

I'm ok with moving the "latest" CI to conda-forge

@amueller amueller added the High Priority High priority issues and pull requests label Aug 26, 2019
# Conflicts:
#	sklearn/model_selection/tests/test_search.py
@amueller
Member Author

Never mind, #14381 doesn't fix that, I just won't use pytest in that file.

@amueller
Member Author

amueller commented Sep 3, 2019

any takers? Should I include the change to CI in this PR?

@adrinjalali
Member

I was gonna do the CI change, if you don't, I'll do in a separate PR.

@amueller amueller added this to the 0.22 milestone Sep 27, 2019
@amueller
Member Author

this is quite hard to keep in sync with master :-/

@amueller
Member Author

greeeeen

Member

@ogrisel left a comment

LGTM.

We need a changelog entry (and maybe a section in the doc) to explain that sklearn estimators will not rely on __array_function__ by default and instead materialize fit parameters as numpy arrays internally.

@amueller
Member Author

amueller commented Oct 7, 2019

@ogrisel like this?

Member

@ogrisel left a comment

Maybe mention NEP 18 explicitly to improve google-ability.

@@ -623,6 +627,10 @@ These changes mostly affect library developers.
Such classifiers need to have the `binary_only=True` estimator tag.
:pr:`13875` by `Trevor Stephens`_.

- Estimators are expected to convert input data (``X``, ``y``,
``sample_weights``) to numpy ``ndarray`` and never call
``__array_function__`` on the original datatype that is passed.
Member

Suggested change:
- ``__array_function__`` on the original datatype that is passed.
+ ``__array_function__`` on the original datatype that is passed. This means that, by default, estimators are not expected to support the array function dispatching mechanism of `NEP 18`_.

Member Author

I don't think this wording is particularly clear. Estimators are forbidden from supporting the mechanism of NEP 18 with this PR.

Member

I did not want to phrase it in a way that this will always be the case in the future. But phrase it as you wish as long as NEP 18 is mentioned :)

amueller and others added 3 commits October 8, 2019 17:45
Co-Authored-By: Olivier Grisel <olivier.grisel@ensta.org>
Co-Authored-By: Olivier Grisel <olivier.grisel@ensta.org>
Member

@jnothman left a comment

otherwise lgtm

@@ -623,6 +628,11 @@ These changes mostly affect library developers.
Such classifiers need to have the `binary_only=True` estimator tag.
:pr:`13875` by `Trevor Stephens`_.

- Estimators are expected to convert input data (``X``, ``y``,
``sample_weights``) to numpy ``ndarray`` and never call
Member

why not use :class:`numpy.ndarray` to get the cross-reference?

@@ -601,6 +601,11 @@ Changelog
Miscellaneous
.............

- |API| Scikit-learn currently converts any input data structure implementing a duck array
Member

Do you mean "now" instead of "currently"?

@NicolasHug
Member

The only test is that estimators don't complain when you pass _NotAnArray, but it doesn't make sure __array_function__ isn't used, which is what we ultimately want, right? At least that's what the what's new says.

I think we should have _NotAnArray.__array_function__ raise an exception to properly cover that?

@amueller I addressed Joel's comments and merged with master. I'm happy to address future comments if you need


Regarding all the changes to sample_weight, I agree with @glemaitre that we could/should be using _check_sample_weight.
But I'm fine with doing that in another PR.

@jnothman
Member

jnothman commented Oct 24, 2019 via email

@NicolasHug
Member

Ah, indeed. I wasn't reading that well for some reason

7 participants