-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
ENH check_is_fitted calls __is_fitted__ if available #20657
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this! :)
sklearn/pipeline.py
Outdated
@@ -657,6 +659,21 @@ def n_features_in_(self): | |||
# delegate to first step (which will call _check_is_fitted) | |||
return self.steps[0][1].n_features_in_ | |||
|
|||
def __is_fitted__(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this again, I think it is nicer to namespace this and pull it apart from Python "magic methods":
def __is_fitted__(self): | |
def _sk_is_fitted_(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to namespace it. But I'm less sure about a magic method vs a private one. This is something which can be used by third party estimators. The question is, do we want to have it as a part of our "developer API" (which we don't really have yet), or really a private API which can change anytime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer the __is_fitted__
but I don't have any argument in favor :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
. But I'm less sure about a magic method vs a private one.
I'm mostly thinking about double underscore or single underscore. double underscore == more "python magic method like".
The question is, do we want to have it as a part of our "developer API" (which we don't really have yet), or really a private API which can change anytime.
I think _sk_is_fitted_
is simple enough to make it a "developer API". I do not see it changing. (The only thing we need to decide on is the naming.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
double underscore == more "python magic method like".
That's not how I think about it though. For instance, I know if I write something which should work with numpy, I can implement __array__
. But I almost never end up having to implement a private, i.e. single underscore, method to work with a different framework.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at all the libraries I am faimliar with. Most use double underscore. Numpy has all its __array_*__
, and dask has __dask_*__
.
The only exception is jupyter using _repr_html_
for their HTML display.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at all the libraries I am faimliar with. Most use double underscore. Numpy has all its array*, and dask has dask*.
This seems to be a good argument to me to mimic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, then I'll make it __sk_is_fitted__
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I had similar surprise to @thomasjpfan here. Ipython uses single underscore _repr_html_
but perhaps that is the exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reopening this discussion: make we should expand sk
to sklearn
for namespace scoping as we are not the only scikit.
I am also a bit surprised by the choice of a method, not a property/attribute, for this |
@jnothman partly because I don't know of many |
|
Would you be happy otherwise, if I change it to be a property? |
sklearn/pipeline.py
Outdated
@@ -657,6 +659,21 @@ def n_features_in_(self): | |||
# delegate to first step (which will call _check_is_fitted) | |||
return self.steps[0][1].n_features_in_ | |||
|
|||
def __sk_is_fitted__(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is true that the pattern inside the function looks typical of a property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't that true for _more_tags
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it does as well :)
Changed to a property, but had to change the test since |
Did you try to call Until we raise |
Interesting pattern. I tried raising an Now that I think about it, developers don't even need to have a property, they can just set |
sklearn/pipeline.py
Outdated
------- | ||
bool | ||
True if the last step is fitted, false otherwise. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you which, you can make a single line summary since this is a property. We configure numpydoc for it (the day that pipeline pass the test).
sklearn/utils/estimator_checks.py
Outdated
except NotFittedError: | ||
pass | ||
estimator.fit(X, y) | ||
check_is_fitted(estimator) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to raise a more informative error here mentioning that the estimator fails check_is_fitted
even after being fit
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not a big fan of the leading + trailing dunder notation. I think this convention was intended by Python developers to leave a clean namespace for the user code. But here we are in user code and I think it would be perfectly fine to use _is_fitted
or _sk_is_fitted
if you want the extra safety of sklearn specific scoping.
Other than that, I like the general idea of the PR.
It's kind of the same here though. There should be a difference between third party sklearn developers and "users". We have the convention that private attribute are not considered as a part of the public API and therefore we would change them easily. Dunder notation indicates that it's a part of the public, but developer API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's public in a sense that developers would use it
How would developers "use" it? They would only define this attribute in their own estimator class to have sklearn's check_estimator
pass on those, but they would never write code that accesses this attribute from sklearn's own estimators right?
But ok with keeping the double dunder notation if everybody thinks that it's the best way to convey that this attribute is special and used by scikit-learn if users write their own class and compose it with scikit-learn classes.
I also added a comment to a previously "resolved" and outdated discussion that does not longer show-up in the github diff: #20657 (comment). I was suggesting to expand sk
to sklearn
as our project-wide namespace scope to avoid potential conflicts with other scikits (and be respectful in acknowledging their existence ;).
"""Indicate whether pipeline has been fit.""" | ||
try: | ||
# check if the last step of the pipeline is fitted | ||
check_is_fitted(self.steps[-1][1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that we only check the last step of the pipeline is to be nice with users that have a currently working pipeline with their own custom stateless transformers that would fail the check_is_fitted
check if we were to have this property call this check_is_fitted
on all steps?
Or is there another reason?
In both cases it might be worth it to make that explicit in the inline comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@glemaitre 's script was checking for the first step, I thought it makes sense to do it for the last step. I'm agnostic on how we do it. I'll add a comment.
I agree. Being a property, the feature does not look like other python protocols.
|
That's why I'd rather have it as a method.
Would then |
It would be a boolean, but since The proposed |
This interaction with very similar to
I still think using a property for this protocol-like behavior is less "pythonic" when compared to: |
Would it be easy to isolate the tags machinery outside of |
It's already pretty much impossible in many cases to pass So I think the best solution here is to have |
My suggestion of a property/scalar attribute was in part so that the
estimator could just set the attribute when fitting, rather than it being
calculated from some other attribute. But I'm fine with a method.
|
I think we have a consensus here now, anything left? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
I don't see any points to address left. I think we can merge. |
This PR introduces
__sk_is_fitted__
which is used bycheck_is_fitted
. This fixes the issue ofPipeline
not passingcheck_is_fitted(pipeline)
as well asFunctionTransformer
which is stateless.Related to #9741
cc @jnothman @thomasjpfan @glemaitre @romanlutz