ENH check_is_fitted calls __is_fitted__ if available #20657


Merged: 18 commits merged into scikit-learn:main from adrinjalali:pipeline-check on Aug 20, 2021

Conversation

@adrinjalali (Member) commented Aug 2, 2021

This PR introduces __sk_is_fitted__, which is used by check_is_fitted. This fixes the issue of Pipeline not passing check_is_fitted(pipeline), as well as FunctionTransformer, which is stateless.

Related to #9741

cc @jnothman @thomasjpfan @glemaitre @romanlutz
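For context, a minimal sketch of the mechanism this PR adds (my own simplified illustration, not the actual scikit-learn source; the hook name eventually agreed on in this thread is __sklearn_is_fitted__):

from sklearn.exceptions import NotFittedError


def _is_fitted(estimator):
    # If the estimator defines the hook, let it decide for itself; e.g. a
    # Pipeline can check its last step and a stateless FunctionTransformer
    # can simply return True.
    if hasattr(estimator, "__sklearn_is_fitted__"):
        return estimator.__sklearn_is_fitted__()
    # Default heuristic: any learned attribute ending in "_" means "fitted".
    return any(
        attr for attr in vars(estimator)
        if attr.endswith("_") and not attr.startswith("__")
    )


def check_is_fitted(estimator):
    # Simplified stand-in for sklearn.utils.validation.check_is_fitted.
    if not _is_fitted(estimator):
        raise NotFittedError(
            f"This {type(estimator).__name__} instance is not fitted yet. "
            "Call 'fit' with appropriate arguments before using this estimator."
        )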

@thomasjpfan (Member) left a comment

Thank you for working on this! :)

@@ -657,6 +659,21 @@ def n_features_in_(self):
# delegate to first step (which will call _check_is_fitted)
return self.steps[0][1].n_features_in_

def __is_fitted__(self):
Member

Looking at this again, I think it is nicer to namespace this and pull it apart from Python "magic methods":

Suggested change
def __is_fitted__(self):
def _sk_is_fitted_(self):

Member Author

I'm happy to namespace it. But I'm less sure about a magic method vs a private one. This is something which can be used by third party estimators. The question is, do we want to have it as a part of our "developer API" (which we don't really have yet), or really a private API which can change anytime.

Member

I would prefer the __is_fitted__ but I don't have any argument in favor :)

Member

But I'm less sure about a magic method vs a private one.

I'm mostly thinking about double underscore vs. single underscore: double underscore == more "python magic method like".

The question is, do we want to have it as a part of our "developer API" (which we don't really have yet), or really a private API which can change anytime.

I think _sk_is_fitted_ is simple enough to make it a "developer API". I do not see it changing. (The only thing we need to decide on is the naming.)

Member Author

double underscore == more "python magic method like".

That's not how I think about it though. For instance, I know if I write something which should work with numpy, I can implement __array__. But I almost never end up having to implement a private, i.e. single underscore, method to work with a different framework.
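As an aside, a minimal illustration of the __array__ analogy above (my own toy example, not part of the PR):

import numpy as np


class Ones:
    """A toy object that knows how to present itself as a NumPy array."""

    def __init__(self, n):
        self.n = n

    def __array__(self, dtype=None):
        # np.asarray(obj) invokes this hook, much like check_is_fitted would
        # invoke the fitted-state hook discussed in this PR.
        return np.ones(self.n, dtype=dtype)


print(np.asarray(Ones(3)))  # [1. 1. 1.]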

@thomasjpfan (Member) Aug 2, 2021

Looking at all the libraries I am familiar with, most use double underscores. NumPy has all its __array_*__, and dask has __dask_*__.

The only exception is Jupyter, which uses _repr_html_ for its HTML display.

Member

Looking at all the libraries I am familiar with, most use double underscores. NumPy has all its __array_*__, and dask has __dask_*__.

This seems like a good argument to me to mimic them.

Member Author

Ok, then I'll make it __sk_is_fitted__

Member

I think I had a similar surprise to @thomasjpfan here. IPython uses a single-underscore _repr_html_, but perhaps that is the exception.

Member

Reopening this discussion: maybe we should expand sk to sklearn for namespace scoping, as we are not the only scikit.

@jnothman (Member) commented Aug 3, 2021

I am also a bit surprised by the choice of a method, not a property/attribute, for this

@adrinjalali (Member Author)

I am also a bit surprised by the choice of a method, not a property/attribute, for this

@jnothman partly because I don't know of many __blah__ class/instance attributes which are not a method. To me a __sk_is_fitted__ property seems odd, but no strong feelings there.

@jnothman (Member) commented Aug 5, 2021

I don't know of many __blah__ class/instance attributes which are not a method

__class__, __name__, __module__, __file__... but these are not often overridden.

@adrinjalali (Member Author)

Would you be happy otherwise, if I change it to be a property?

@@ -657,6 +659,21 @@ def n_features_in_(self):
# delegate to first step (which will call _check_is_fitted)
return self.steps[0][1].n_features_in_

def __sk_is_fitted__(self):
Member

It is true that the pattern inside the function looks typical of a property.

Member Author

Isn't that true for _more_tags?

Member

Yes it does as well :)

@adrinjalali (Member Author)

Changed to a property, but I had to change the test since available_if doesn't work on a property. Do we want to fix that, @jnothman?

@glemaitre (Member)

Did you try to call available_if inside the property similarly to https://github.com/scikit-learn/scikit-learn/pull/20685/files#diff-44602c6feb13bfed0cd07fbdb69462a92b7015c13e6b3fe966318cf24af89517R617-R619

As long as we raise AttributeError, hasattr will work as expected.
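For reference, a minimal sketch of that pattern (illustrative names only, not the code behind the link above): a property that converts NotFittedError into AttributeError, so that hasattr reports False until the underlying estimator is fitted.

from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


class Wrapper:
    def __init__(self, estimator):
        self.estimator = estimator

    @property
    def feature_importances_(self):
        try:
            check_is_fitted(self.estimator)
        except NotFittedError as exc:
            # Only AttributeError makes hasattr return False, so re-raise
            # the NotFittedError as an AttributeError.
            raise AttributeError(
                "feature_importances_ is only available once the underlying "
                "estimator is fitted."
            ) from exc
        return self.estimator.feature_importances_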

@adrinjalali (Member Author)

Interesting pattern. I tried raising an AttributeError in the method, but since hasattr doesn't actually call the method, it didn't work. Now that it's a property, it would work. It would still be nice to have available_if work on properties, I think. I'm happy with the test as is anyway; it's simpler when divided into two classes.

Now that I think about it, developers don't even need to have a property: they can just set __sk_is_fitted__ = True in their fit method and everything works. So I'm in favor of it being a property now too.
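A minimal sketch of what that could look like for a third-party estimator under the attribute/property design discussed at this point in the thread (the hook name later changed to __sklearn_is_fitted__; the attribute-style spelling here is only illustrative):

from sklearn.base import BaseEstimator, TransformerMixin


class MyPassthrough(TransformerMixin, BaseEstimator):
    """Toy stateless transformer."""

    def fit(self, X, y=None):
        # No state to learn; just record that fit was called.
        self.__sk_is_fitted__ = True
        return self

    def transform(self, X):
        return X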

-------
bool
True if the last step is fitted, false otherwise.
"""
Member

If you wish, you can make this a single-line summary since this is a property. We configured numpydoc for it (for the day that Pipeline passes the test).

except NotFittedError:
pass
estimator.fit(X, y)
check_is_fitted(estimator)
Member

Do we want to raise a more informative error here mentioning that the estimator fails check_is_fitted even after being fit?

@glemaitre (Member) left a comment

LGTM

@ogrisel (Member) left a comment

I am not a big fan of the leading + trailing dunder notation. I think this convention was intended by Python developers to leave a clean namespace for the user code. But here we are in user code and I think it would be perfectly fine to use _is_fitted or _sk_is_fitted if you want the extra safety of sklearn specific scoping.

Other than that, I like the general idea of the PR.

@adrinjalali (Member Author)

I am not a big fan of the leading + trailing dunder notation. I think this convention was intended by Python developers to leave a clean namespace for the user code.

It's kind of the same here though. There should be a difference between third-party sklearn developers and "users". We have the convention that private attributes are not considered part of the public API and therefore we may change them at any time. Dunder notation indicates that it's part of the public, but developer-facing, API.

@ogrisel (Member) left a comment

It's public in a sense that developers would use it

How would developers "use" it? They would only define this attribute in their own estimator class to have sklearn's check_estimator pass on those, but they would never write code that accesses this attribute from sklearn's own estimators, right?

But I am OK with keeping the double-dunder notation if everybody thinks that it's the best way to convey that this attribute is special and used by scikit-learn when users write their own class and compose it with scikit-learn classes.

I also added a comment to a previously "resolved" and outdated discussion that no longer shows up in the GitHub diff: #20657 (comment). I was suggesting expanding sk to sklearn as our project-wide namespace scope to avoid potential conflicts with other scikits (and to be respectful in acknowledging their existence ;).

"""Indicate whether pipeline has been fit."""
try:
# check if the last step of the pipeline is fitted
check_is_fitted(self.steps[-1][1])
@ogrisel (Member) Aug 6, 2021

Is the fact that we only check the last step of the pipeline meant to be nice to users who have a currently working pipeline with their own custom stateless transformers, which would fail check_is_fitted if we had this property call check_is_fitted on all steps?

Or is there another reason?

In either case it might be worth making that explicit in the inline comment.

Member Author

@glemaitre's script was checking the first step; I thought it makes sense to do it for the last step. I'm agnostic on how we do it. I'll add a comment.

@thomasjpfan (Member) commented Aug 6, 2021

I was suggesting to expand sk to sklearn as our project-wide namespace scope to avoid potential conflicts with other scikits

I agree.

Being a property, the feature does not look like other Python protocols, i.e. object.__len__(), object.__bool__(), etc.
As a property, the feature is starting to look like an estimator tag:

  1. tags["is_fitted"] is None -> use the definition implied by check_is_fitted (the default)
  2. tags["is_fitted"] is not None -> use the custom definition
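A rough sketch of how that tag-based alternative could look (hypothetical; this is not what was merged, and the "is_fitted" tag does not exist in scikit-learn):

from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


def check_is_fitted_via_tag(estimator):
    # Hypothetical: read an "is_fitted" entry from the estimator tags, using
    # the private _get_tags machinery provided by BaseEstimator if present.
    tags = getattr(estimator, "_get_tags", dict)()
    is_fitted = tags.get("is_fitted")
    if is_fitted is None:
        # No custom definition: fall back to the usual attribute heuristic.
        check_is_fitted(estimator)
    elif not is_fitted:
        raise NotFittedError(
            f"This {type(estimator).__name__} instance is not fitted yet."
        )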

@adrinjalali (Member Author)

Being a property, the feature does not look like other python protocols. i.e, object.__len__(), object.__bool__(), etc.
As a property, the feature is starting to look like an estimator tag:

That's why I'd rather have it as a method.

  2. tags["is_fitted"] is not None -> use the custom definition

Would tags['is_fitted'] then be a callable or a boolean?

@ogrisel (Member) commented Aug 9, 2021

Would then tags['is_fitted'] be a callable or a boolean?

It would be a boolean, but since _more_tags is an abstract method that is meant to be implemented in custom estimators to dynamically compute the list of tags of an instance, we could use it for that. But that requires third-party estimator developers to implement the full estimator tags machinery (or to inherit from BaseEstimator).

The proposed __sklearn_is_fitted__ attribute / property is much simpler to handle for this purpose.

@thomasjpfan (Member) commented Aug 9, 2021

The proposed __sklearn_is_fitted__ attribute / property is much simpler to handle for this purpose.

This interaction is very similar to _estimator_type + is_regressor() and friends. XREF: #16469

I am not a big fan of the leading + trailing dunder notation.

I still think using a property for this protocol-like behavior is less "pythonic" when compared to __array__ + asarray(), __len__ + len(), etc. But in the interest of moving forward, I am okay with keeping __sklearn_is_fitted__ a property.

@glemaitre (Member)

But that imposes third-party estimator developers to implement the full estimator tags machinery (or inherit from BaseEstimator).

Would it be easy to isolate the tags machinery outside of BaseEstimator?

@adrinjalali (Member Author)

But that imposes third-party estimator developers to implement the full estimator tags machinery (or inherit from BaseEstimator).

It's already pretty much impossible in many cases to pass estimator_checks while NOT inheriting from BaseEstimator, so I don't think this should be a factor in deciding how we develop the API. I'm happy not to do isinstance(...), but limiting ourselves on anything a bit more complicated than what we have in BaseEstimator makes the development of our API extremely hard, slows down development, and prevents us from using many new practices in many cases.

So I think the best solution here is to have __sklearn_is_fitted__ as a method, as it was before. I'll revert the last commit then.
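For reference, a rough sketch of the method form being reverted to, condensed into a toy stand-in class (the merged Pipeline code may differ slightly):

from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


class PipelineSketch:
    """Toy stand-in for Pipeline, just to show the fitted-state hook."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) tuples

    def __sklearn_is_fitted__(self):
        """Indicate whether the pipeline has been fitted."""
        try:
            # Check only the last step (see the discussion above about why
            # not every step is checked).
            check_is_fitted(self.steps[-1][1])
            return True
        except NotFittedError:
            return False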

@jnothman (Member) commented Aug 12, 2021 via email

@adrinjalali (Member Author)

I think we have a consensus here now, anything left?

@ogrisel (Member) left a comment

LGTM!

@ogrisel (Member) commented Aug 20, 2021

I think we have a consensus here now, anything left?

I don't see any points left to address. I think we can merge.

@ogrisel ogrisel merged commit 3e7c04f into scikit-learn:main Aug 20, 2021
@adrinjalali adrinjalali deleted the pipeline-check branch August 20, 2021 13:32
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
eddiebergman added a commit to automl/auto-sklearn that referenced this pull request Nov 15, 2022