ENH check_is_fitted calls __is_fitted__ if available #20657

adrinjalali · 2021-08-02T14:15:02Z

This PR introduces __sk_is_fitted__ which is used by check_is_fitted. This fixes the issue of Pipeline not passing check_is_fitted(pipeline) as well as FunctionTransformer which is stateless.
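To make the description concrete, here is a minimal, library-free sketch of the idea. It is not scikit-learn's implementation: `NotFittedErrorSketch`, `check_is_fitted_sketch`, `StatelessTransformer` and `MeanCenterer` are hypothetical names, the fallback rule (any attribute with a trailing underscore) is a simplifying assumption, and the hook uses the `__sk_is_fitted__` spelling from this description.

```python
class NotFittedErrorSketch(ValueError):
    """Hypothetical stand-in for the library's not-fitted exception."""


def check_is_fitted_sketch(estimator):
    """Raise NotFittedErrorSketch if `estimator` does not look fitted.

    Assumption: if the estimator defines the hook, its answer wins;
    otherwise fall back to looking for attributes ending in "_".
    """
    hook = getattr(estimator, "__sk_is_fitted__", None)
    if hook is not None:
        fitted = hook()
    else:
        fitted = any(
            name.endswith("_") and not name.startswith("__")
            for name in vars(estimator)
        )
    if not fitted:
        raise NotFittedErrorSketch(
            f"{type(estimator).__name__} is not fitted yet."
        )


class StatelessTransformer:
    """Learns nothing in fit, so it declares itself always fitted."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [2 * x for x in X]

    def __sk_is_fitted__(self):
        return True


class MeanCenterer:
    """Ordinary estimator: fit sets a trailing-underscore attribute."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self
```

With this sketch, `check_is_fitted_sketch(StatelessTransformer())` passes without calling `fit`, while an unfitted `MeanCenterer()` raises until `fit` has set `mean_`.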

Related to #9741

cc @jnothman @thomasjpfan @glemaitre @romanlutz

thomasjpfan

Thank you for working on this! :)

thomasjpfan · 2021-08-02T15:50:23Z

sklearn/pipeline.py

@@ -657,6 +659,21 @@ def n_features_in_(self):
        # delegate to first step (which will call _check_is_fitted)
        return self.steps[0][1].n_features_in_

+    def __is_fitted__(self):


Looking at this again, I think it is nicer to namespace this and pull it apart from Python "magic methods":

Suggested change

-    def __is_fitted__(self):
+    def _sk_is_fitted_(self):

I'm happy to namespace it. But I'm less sure about a magic method vs a private one. This is something which can be used by third party estimators. The question is, do we want to have it as a part of our "developer API" (which we don't really have yet), or really a private API which can change anytime.

I would prefer the __is_fitted__ but I don't have any argument in favor :)

But I'm less sure about a magic method vs a private one.

I'm mostly thinking about double underscore or single underscore. double underscore == more "python magic method like".

The question is, do we want to have it as a part of our "developer API" (which we don't really have yet), or really a private API which can change anytime.

I think _sk_is_fitted_ is simple enough to make it a "developer API". I do not see it changing. (The only thing we need to decide on is the naming.)

double underscore == more "python magic method like".

That's not how I think about it though. For instance, I know if I write something which should work with numpy, I can implement __array__. But I almost never end up having to implement a private, i.e. single underscore, method to work with a different framework.

Looking at all the libraries I am familiar with, most use double underscores. NumPy has all its __array_*__, and dask has __dask_*__.

The only exception is jupyter using _repr_html_ for their HTML display.

Looking at all the libraries I am familiar with, most use double underscores. NumPy has all its __array_*__, and dask has __dask_*__.

This seems to be a good argument to me to mimic.

Ok, then I'll make it __sk_is_fitted__

I think I had a similar surprise to @thomasjpfan here. IPython uses single underscore _repr_html_ but perhaps that is the exception.

Reopening this discussion: maybe we should expand sk to sklearn for namespace scoping as we are not the only scikit.

sklearn/utils/estimator_checks.py

jnothman · 2021-08-03T14:36:55Z

I am also a bit surprised by the choice of a method, not a property/attribute, for this

adrinjalali · 2021-08-04T10:41:09Z

I am also a bit surprised by the choice of a method, not a property/attribute, for this

@jnothman partly because I don't know of many __blah__ class/instance attributes which are not a method. To me a __sk_is_fitted__ property seems odd, but no strong feelings there.

jnothman · 2021-08-05T11:12:53Z

I don't know of many __blah__ class/instance attributes which are not a method

__class__, __name__, __module__, __file__... but these are not often overridden.

adrinjalali · 2021-08-05T11:29:37Z

Would you be happy otherwise, if I change it to be a property?

glemaitre · 2021-08-05T14:51:02Z

sklearn/pipeline.py

@@ -657,6 +659,21 @@ def n_features_in_(self):
        # delegate to first step (which will call _check_is_fitted)
        return self.steps[0][1].n_features_in_

+    def __sk_is_fitted__(self):


It is true that the pattern inside the function looks typical of a property.

isn't that true for _more_tags?

Yes it does as well :)

adrinjalali · 2021-08-06T12:20:56Z

Changed to a property, but had to change the test since available_if doesn't work on a property, do we wanna fix that @jnothman ?

glemaitre · 2021-08-06T12:28:29Z

Did you try to call available_if inside the property, similarly to https://github.com/scikit-learn/scikit-learn/pull/20685/files#diff-44602c6feb13bfed0cd07fbdb69462a92b7015c13e6b3fe966318cf24af89517R617-R619 ?

As long as we raise AttributeError, hasattr will work as expected.

adrinjalali · 2021-08-06T12:41:34Z

Interesting pattern. I tried raising an AttributeError in the method, but since hasattr doesn't actually call the method, it didn't work. But now that it's a property, it would work. It would still be nice to have available_if on properties I think. I'm happy with the test as is anyway, it's simpler when divided in two classes.

Now that I think about it, developers don't even need to have a property, they can just set __sk_is_fitted__ = True in their fit method and all works well. So I'm in favor of it being a property now too.
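The hasattr behaviour discussed above can be sketched in plain Python. The class names below are hypothetical and nothing here uses scikit-learn or available_if; it only illustrates that hasattr looks a method up without calling it, that it does evaluate a property, and that a plain attribute set in fit also works. The hook uses the `__sk_is_fitted__` spelling that the thread had settled on at this point.

```python
class MethodHook:
    def __sk_is_fitted__(self):
        # Raising here does not hide the attribute: hasattr only looks the
        # method up, it never calls it.
        raise AttributeError("not available")


class PropertyHook:
    def __init__(self, available):
        self._available = available

    @property
    def __sk_is_fitted__(self):
        # hasattr evaluates the property, so an AttributeError raised here
        # makes the attribute look absent.
        if not self._available:
            raise AttributeError("not available")
        return True


class PlainAttribute:
    def fit(self, X, y=None):
        # No property needed: just set the flag while fitting. Names that
        # end in two underscores are not name-mangled by Python.
        self.__sk_is_fitted__ = True
        return self


print(hasattr(MethodHook(), "__sk_is_fitted__"))         # True
print(hasattr(PropertyHook(False), "__sk_is_fitted__"))  # False
print(hasattr(PlainAttribute(), "__sk_is_fitted__"))     # False before fit
```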

glemaitre · 2021-08-06T12:50:05Z

sklearn/pipeline.py

+        -------
+        bool
+            True if the last step is fitted, false otherwise.
+        """


If you wish, you can make a single line summary since this is a property. We configure numpydoc for it (the day that pipeline passes the test).

glemaitre · 2021-08-06T13:15:00Z

sklearn/utils/estimator_checks.py

+        except NotFittedError:
+            pass
+    estimator.fit(X, y)
+    check_is_fitted(estimator)


Do we want to raise a more informative error here mentioning that the estimator fails check_is_fitted even after being fit?
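One possible shape for such a message, as a library-free sketch: every name below is hypothetical, and the "fitted" rule (any trailing-underscore attribute) is a simplifying assumption rather than what check_is_fitted really does.

```python
class NotFittedErrorSketch(ValueError):
    """Hypothetical stand-in for the library's not-fitted exception."""


def check_is_fitted_sketch(estimator):
    # Simplified assumption: an estimator is fitted once it has at least one
    # attribute with a trailing underscore.
    if not any(
        k.endswith("_") and not k.startswith("__") for k in vars(estimator)
    ):
        raise NotFittedErrorSketch(f"{type(estimator).__name__} is not fitted.")


def check_fit_marks_estimator_fitted(estimator, X, y):
    """Fit the estimator, then fail loudly if it still looks unfitted."""
    estimator.fit(X, y)
    try:
        check_is_fitted_sketch(estimator)
    except NotFittedErrorSketch as exc:
        raise AssertionError(
            f"{type(estimator).__name__} fails the is-fitted check even "
            "after fit was called."
        ) from exc


class ForgetfulEstimator:
    """Deliberately broken: fit stores nothing."""

    def fit(self, X, y=None):
        return self


class GoodEstimator:
    def fit(self, X, y=None):
        self.n_samples_ = len(X)
        return self
```

Chaining with `from exc` keeps the original not-fitted error in the traceback while the AssertionError names the estimator and the actual problem.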

glemaitre

LGTM

ogrisel

I am not a big fan of the leading + trailing dunder notation. I think this convention was intended by Python developers to leave a clean namespace for the user code. But here we are in user code and I think it would be perfectly fine to use _is_fitted or _sk_is_fitted if you want the extra safety of sklearn specific scoping.

Other than that, I like the general idea of the PR.

sklearn/preprocessing/_function_transformer.py

adrinjalali · 2021-08-06T14:00:46Z

I am not a big fan of the leading + trailing dunder notation. I think this convention was intended by Python developers to leave a clean namespace for the user code.

It's kind of the same here though. There should be a difference between third party sklearn developers and "users". We have the convention that private attributes are not considered a part of the public API and therefore we would change them easily. Dunder notation indicates that it's a part of the public, but developer, API.

ogrisel

It's public in a sense that developers would use it

How would developers "use" it? They would only define this attribute in their own estimator class to have sklearn's check_estimator pass on those, but they would never write code that accesses this attribute from sklearn's own estimators right?

But ok with keeping the double dunder notation if everybody thinks that it's the best way to convey that this attribute is special and used by scikit-learn if users write their own class and compose it with scikit-learn classes.

I also added a comment to a previously "resolved" and outdated discussion that no longer shows up in the GitHub diff: #20657 (comment). I was suggesting to expand sk to sklearn as our project-wide namespace scope to avoid potential conflicts with other scikits (and be respectful in acknowledging their existence ;).

ogrisel · 2021-08-06T15:11:49Z

sklearn/pipeline.py

+        """Indicate whether pipeline has been fit."""
+        try:
+            # check if the last step of the pipeline is fitted
+            check_is_fitted(self.steps[-1][1])


Do we only check the last step of the pipeline to be nice to users who currently have a working pipeline with their own custom stateless transformers, which would fail check_is_fitted if this property called it on all steps?

Or is there another reason?

In both cases it might be worth it to make that explicit in the inline comment.

@glemaitre 's script was checking for the first step, I thought it makes sense to do it for the last step. I'm agnostic on how we do it. I'll add a comment.
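The trade-off can be illustrated with a toy pipeline. This is a sketch only, not the code in sklearn/pipeline.py: `ToyPipeline`, `Doubler`, `MeanPredictor` and `looks_fitted` are hypothetical, and the fallback rule in `looks_fitted` is a simplifying assumption.

```python
def looks_fitted(estimator):
    """Hypothetical helper: use the hook if present, else look for any
    attribute with a trailing underscore."""
    hook = getattr(estimator, "__sk_is_fitted__", None)
    if hook is not None:
        return hook()
    return any(
        k.endswith("_") and not k.startswith("__") for k in vars(estimator)
    )


class Doubler:
    """User-written stateless step: no hook, no fitted attributes."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [2 * x for x in X]


class MeanPredictor:
    def fit(self, X, y=None):
        self.mean_ = sum(y) / len(y)
        return self


class ToyPipeline:
    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def fit(self, X, y=None):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def __sk_is_fitted__(self):
        # Only the last step is inspected, so stateless custom steps
        # earlier in the pipeline cannot make the check fail.
        return looks_fitted(self.steps[-1][1])
```

Here `Doubler` on its own never looks fitted, yet a fitted `ToyPipeline` containing it does, which is the leniency the question above is asking about.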

sklearn/utils/estimator_checks.py

thomasjpfan · 2021-08-06T18:35:23Z

I was suggesting to expand sk to sklearn as our project-wide namespace scope to avoid potential conflicts with other scikits

I agree.

Being a property, the feature does not look like other Python protocols, e.g. object.__len__(), object.__bool__(), etc.
As a property, the feature is starting to look like an estimator tag:

tags["is_fitted"] is None -> use the definition implied by check_is_fitted (the default)
tags["is_fitted"] is not None -> use custom definition
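A rough sketch of how that tag-based variant might look, assuming the tag is a plain boolean. No "is_fitted" tag exists in scikit-learn; `is_fitted_via_tags`, `default_is_fitted`, `StatelessStep` and `Scaler` are hypothetical, and only the `_more_tags` method name is taken from this thread.

```python
def default_is_fitted(estimator):
    """Simplified default: any attribute with a trailing underscore."""
    return any(
        k.endswith("_") and not k.startswith("__") for k in vars(estimator)
    )


def is_fitted_via_tags(estimator):
    # Hypothetical: read an "is_fitted" entry from the estimator's tags.
    more_tags = getattr(estimator, "_more_tags", None)
    tags = more_tags() if callable(more_tags) else {}
    value = tags.get("is_fitted")
    if value is None:
        # No custom definition: fall back to the default rule.
        return default_is_fitted(estimator)
    return bool(value)


class StatelessStep:
    def fit(self, X, y=None):
        return self

    def _more_tags(self):
        # Custom definition: this estimator is always considered fitted.
        return {"is_fitted": True}


class Scaler:
    def fit(self, X, y=None):
        self.scale_ = max(X) - min(X)
        return self
```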

adrinjalali · 2021-08-09T14:30:00Z

Being a property, the feature does not look like other Python protocols, e.g. object.__len__(), object.__bool__(), etc.
As a property, the feature is starting to look like an estimator tag:

That's why I'd rather have it as a method.

tags["is_fitted"] is not None -> use custom definition

Would then tags['is_fitted'] be a callable or a boolean?

ogrisel · 2021-08-09T14:33:29Z

Would then tags['is_fitted'] be a callable or a boolean?

It would be a boolean, but since _more_tags is an abstract method that is meant to be implemented in custom estimators to dynamically compute the list of tags of an instance, we could use it for that. But that forces third-party estimator developers to implement the full estimator tags machinery (or inherit from BaseEstimator).

The proposed __sklearn_is_fitted__ attribute / property is much simpler to handle for this purpose.

thomasjpfan · 2021-08-09T16:27:13Z

The proposed __sklearn_is_fitted__ attribute / property is much simpler to handle for this purpose.

This interaction is very similar to _estimator_type + is_regressor() and friends. XREF: #16469

I am not a big fan of the leading + trailing dunder notation.

I still think using a property for this protocol-like behavior is less "pythonic" when compared to: __array__ + asarray(), __len__ + len(), etc. But in the interest of moving forward, I am okay with keeping __sklearn_is_fitted__ a property.

glemaitre · 2021-08-09T16:48:46Z

But that forces third-party estimator developers to implement the full estimator tags machinery (or inherit from BaseEstimator).

Would it be easy to isolate the tags machinery outside of BaseEstimator?

adrinjalali · 2021-08-12T13:14:59Z

But that forces third-party estimator developers to implement the full estimator tags machinery (or inherit from BaseEstimator).

It's already pretty much impossible in many cases to pass estimator_checks while NOT inheriting from BaseEstimator. I don't think this should be a parameter for us to decide how to develop the API. I'm happy not to do isinstance(...), but limiting ourselves to things which are a bit more complicated than what we have in BaseEstimator makes the development of our API extremely hard and is slowing down the development and preventing us from using many new practices in many cases.

So I think the best solution here is to have __sklearn_is_fitted__ as a method as it was before. I'll revert the last commit then.

This reverts commit 9552776.

This reverts commit 282e23b.

jnothman · 2021-08-12T14:10:31Z

My suggestion of a property/scalar attribute was in part so that the estimator could just set the attribute when fitting, rather than it being calculated from some other attribute. But I'm fine with a method.

adrinjalali · 2021-08-17T11:39:54Z

I think we have a consensus here now, anything left?

ogrisel

LGTM!

ogrisel · 2021-08-20T10:37:11Z

I think we have a consensus here now, anything left?

I don't see any points to address left. I think we can merge.

adrinjalali added 3 commits August 2, 2021 16:05

ENH check_is_fitted calls __is_fitted__ if available

9a624fc

fix description

54e5efd

simplify validation

5076b69

github-actions bot added module:pipeline module:preprocessing module:utils labels Aug 2, 2021

adrinjalali added 3 commits August 2, 2021 16:18

add whats_new

544ec0f

fix changelog format

17853da

test pipeline and check_estimator

a93426f

thomasjpfan reviewed Aug 2, 2021

glemaitre reviewed Aug 2, 2021

adrinjalali added 2 commits August 3, 2021 11:59

rename to __sk_is_fitted__

64c6a96

use LogisticRegression in pipeline instead

552fec1

try a fake estimator

f59e95e

glemaitre reviewed Aug 5, 2021

adrinjalali added 2 commits August 6, 2021 14:16

change from method to property

282e23b

Merge remote-tracking branch 'upstream/main' into pipeline-check

1c4e0a1

glemaitre reviewed Aug 6, 2021

glemaitre approved these changes Aug 6, 2021

address Guillaume's comments

bb72f97

ogrisel reviewed Aug 6, 2021

ogrisel reviewed Aug 6, 2021

adrinjalali mentioned this pull request Aug 9, 2021

RFC Support for Some Developer Utilities #15801

Open

adrinjalali added 4 commits August 12, 2021 15:16

Revert "FunctionTransformer uses a class attribute"

c9384bf

This reverts commit 9552776.

Revert "change from method to property"

c0a84dd

This reverts commit 282e23b.

Olivier's comments

d30ba5d

rename to __sklearn_is_fitted__

89d41cb

fix import

1b08447

ogrisel approved these changes Aug 20, 2021

ogrisel merged commit 3e7c04f into scikit-learn:main Aug 20, 2021

ogrisel mentioned this pull request Aug 20, 2021

[MRG] run check_estimator on meta-estimators #9741

Closed

adrinjalali deleted the pipeline-check branch August 20, 2021 13:32

thomasjpfan mentioned this pull request Aug 21, 2021

Revisting the tags interface #20804

Closed

romanlutz mentioned this pull request Sep 6, 2021

DOC fix warnings in documentation build fairlearn/fairlearn#852

Closed

StrikerRUS mentioned this pull request Sep 30, 2021

[python][sklearn] add __sklearn_is_fitted__() method to be better compatible with scikit-learn API microsoft/LightGBM#4636

Merged

lorentzenchr mentioned this pull request Nov 27, 2021

RFC / API add option to fit/predict without input validation #21804

Open

samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021

ENH check_is_fitted calls __is_fitted__ if available (scikit-learn#20657)

dab1b6a

adrinjalali mentioned this pull request Dec 21, 2021

__sklearn_clone__ protocol proposal #21838

Closed

WenjieZ mentioned this pull request Feb 10, 2022

Some ideas for a developer-level API #22432

Closed

adrinjalali mentioned this pull request Feb 17, 2022

ENH Add "adversarial debiasing" fairlearn/fairlearn#973

Closed

eddiebergman mentioned this pull request Nov 15, 2022

Update scikit learn 1.2 automl/auto-sklearn#1611

Closed

54 tasks

eddiebergman added a commit to automl/auto-sklearn that referenced this pull request Nov 15, 2022

chore(Steps): scikit-learn/scikit-learn#20657

165afe5
