check_is_fitted has false positives on custom subclasses with private attributes #15845

Closed
alegonz opened this issue Dec 9, 2019 · 37 comments · Fixed by #15947

@alegonz

alegonz commented Dec 9, 2019

Description

check_is_fitted has false positives on custom subclasses with private attributes.

I believe check_is_fitted should not look at variables with a leading underscore because 1) that is Python's convention for private attributes, and 2) the scikit-learn API specifies fitted attributes as those with a trailing underscore (and there is no specification regarding leading underscores, correct?).

Backtracking from PR #14545 where the new check logic was introduced, I noticed that the check for leading underscore was added to cover these two modules:

  • sklearn/neighbors/base.py
  • sklearn/neighbors/lof.py

But perhaps these modules are actually not following the API specification?

Steps/Code to Reproduce

from sklearn.decomposition import PCA
from sklearn.utils.validation import check_is_fitted

class MyPCA(PCA):
    def __init__(self, ...):  # omitted arguments for brevity
        super().__init__(...)
        self._my_private_attr = 42

mypca = MyPCA()
check_is_fitted(mypca)  # does not raise NotFittedError

Expected Results

check_is_fitted raises NotFittedError even on custom subclasses that have private attributes following the Python convention of a leading underscore.

Actual Results

NotFittedError is not raised.

Versions

>>> import sklearn; sklearn.show_versions()

System:
    python: 3.7.3 (default, Aug  8 2019, 19:40:58)  [GCC 5.4.0 20160609]
executable: /media/ale/data/education+research/code/baikal/venv/bin/python
   machine: Linux-4.4.0-170-generic-x86_64-with-debian-stretch-sid

Python dependencies:
       pip: 19.3.1
setuptools: 40.8.0
   sklearn: 0.22
     numpy: 1.17.4
     scipy: 1.3.3
    Cython: None
    pandas: None
matplotlib: None
    joblib: 0.14.0

Built with OpenMP: True
@jnothman
Member

jnothman commented Dec 9, 2019

Ping @amueller

@alegonz alegonz changed the title check_is_fitted has false positives on custom classes with private attributes check_is_fitted has false positives on custom subclasses with private attributes Dec 10, 2019
@qinhanmin2014
Member

And this behaviour is even inconsistent with the doc of check_is_fitted
The doc

Checks if the estimator is fitted by verifying the presence of
fitted attributes (ending with a trailing underscore) and otherwise
raises a NotFittedError with the given message.

The code

attrs = [v for v in vars(estimator)
         if (v.endswith("_") or v.startswith("_"))
         and not v.startswith("__")]

(And why do we include __?)

@qinhanmin2014 qinhanmin2014 added this to the 0.22.1 milestone Dec 11, 2019
@alegonz
Author

alegonz commented Dec 11, 2019

(And why do we include __?)

I think it is just to match exactly one leading underscore.

I now realize that a similar thing should be done for the trailing underscore check:

`v.endswith("_") and not v.endswith("__")`

Though I'd favor a regex instead.
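For illustration, a regex along these lines could capture "ends with exactly one trailing underscore and does not start with one" (a sketch of the idea only, not a proposed patch):

import re

# Matches e.g. "components_" but not "_private", "__dict__" or "coef__".
FITTED_ATTR_RE = re.compile(r"^[^_](.*[^_])?_$")

assert FITTED_ATTR_RE.match("components_")
assert not FITTED_ATTR_RE.match("_my_private_attr")
assert not FITTED_ATTR_RE.match("__dict__")
assert not FITTED_ATTR_RE.match("coef__")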

@alegonz
Author

alegonz commented Dec 11, 2019

If you need someone for a fix PR, please let me know. I'd be glad to help.

@qinhanmin2014
Member

(1) We can't remove v.startswith("_") because some estimators do not create any public attributes in fit (e.g., TfidfTransformer).
(2) I'm not sure whether your example is reasonable, because we usually don't create a private attribute in init.
(3) We do not introduce a deprecation cycle because sklearn.utils is not guaranteed to be stable (see https://scikit-learn.org/stable/developers/utilities.html). Perhaps it's not good, since check_is_fitted is public.
(4) At least we should update the doc to mention attributes starting with an underscore.
ping the reviewers in #14545 @NicolasHug @thomasjpfan @jnothman @glemaitre

@NicolasHug
Member

NicolasHug commented Dec 12, 2019

There was a deprecation cycle, though it lasted 1 version instead of 2. We just removed the parameters in #15803.

I think the current check_is_fitted is correct, as long as the estimator follows our convention to not set anything in init except for self.param = param (where param should not have a trailing or leading underscore).

Since @alegonz 's estimator is not following that convention, I don't think this is a regression nor a blocker.

@jnothman
Member

jnothman commented Dec 12, 2019 via email

@jnothman
Member

jnothman commented Dec 12, 2019 via email

@alegonz
Author

alegonz commented Dec 13, 2019

Sorry for the late reply. Japan time here.

(2) I'm not sure whether your example is reasonable, because we usually don't create a private attribute in init.
(3) We do not introduce a deprecation cycle because sklearn.utils is not guaranteed to be stable (see https://scikit-learn.org/stable/developers/utilities.html). Perhaps it's not good, since check_is_fitted is public.

I realize my original description was misleading. Let me clarify.

I was not using check_is_fitted directly; that example was just to point out the root cause. I had code along the lines of:

from sklearn.decomposition import PCA
from sklearn.exceptions import NotFittedError

class MyPCA(PCA):
    def __init__(self, ...):
        ...
        self._my_private_attr = 42
...
# some more stuff
...
try:
    z = mypca.transform(X)
except NotFittedError:
    # it was supposed to be fitted but for some
    # reason it is not, so do something here about it
    pass

So I expected it to raise NotFittedError but it instead raised AttributeError because check_is_fitted did not raise the error prior to using the mean_ attribute.

The above just happens to set the private attribute in __init__, but of course this could have been done in any other method.

(4) At least we should update the doc to mention attributes starting with a underscore

I believe such changes to the API specification should not be taken lightly as it is not a matter of just updating the docs.

I think the current check_is_fitted is correct, as long as the estimator follows our convention to not set anything in init except for self.param = param (where param should have trailing or leading underscore).

Not allowing private attributes in classes sounds like bad API design. Client code should be able to define private attributes that hold extra data for its application. scikit-learn itself would most likely have problems in the future with such a policy.

Take TfidfVectorizer, for example. It sets a non-param private attribute in its __init__, and that is a reasonable thing to do.

Anyway, in my understanding, scikit-learn had no issues with private attributes until this change. Indeed, the API only requires that all your params are keyword arguments and are assigned to public attributes in __init__, so classes play well with GridSearchCV and friends. It certainly does not forbid other kinds of attributes, except those with a trailing underscore, which are reserved for parameters set during fitting.

So in my opinion, as far as check_is_fitted is concerned, it should only care about attributes with a single trailing underscore. Existing classes that had fitted params with a leading underscore should be fixed, and not the other way around.

But most importantly, don't fitted attributes use the trailing underscore precisely so that they do not collide with private attributes in the first place?

@qinhanmin2014
Member

qinhanmin2014 commented Dec 13, 2019

Take TfidfVectorizer, for example. It sets a non-param private attribute in its init, and that is a reasonable thing to do.

Thanks a lot! Yes, this is a serious problem, i.e., check_is_fitted won't work for TfidfVectorizer.

from sklearn.utils.validation import check_is_fitted
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
check_is_fitted(vectorizer)
# passes silently; NotFittedError is not raised

@qinhanmin2014
Member

And I think we should introduce a deprecation cycle if we want to change public functions in sklearn.utils

@NicolasHug
Member

Again, we've always done deprecation cycles in utils. Cf. the dozens of PRs related to #6616 (comment)

@thomasjpfan
Member

Our convention with underscores is fairly implicit. I would prefer an explicit class-level list that declares all the attributes that will be learnt during training, thus allowing private attributes to be defined in init.

class MyPCA:
    fitted_attributes = ['n_components_']

    def __init__(self, ...):
        self._private_attr = 42

    def fit(self, ...):
        self.n_components_ = 4
        self._learned_attr = 10

check_is_fitted will only check for fitted_attributes. (Yes this kind of goes back to the former version of check_is_fitted)
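A rough sketch of how a check could consume such a declaration (hypothetical helper, not the actual scikit-learn implementation):

from sklearn.exceptions import NotFittedError

def check_is_fitted_explicit(estimator):
    # Hypothetical: rely on an explicit class-level declaration of fitted
    # attributes instead of the underscore convention. Estimators that do
    # not declare fitted_attributes are out of scope for this sketch.
    attrs = getattr(estimator, "fitted_attributes", [])
    if not all(hasattr(estimator, attr) for attr in attrs):
        raise NotFittedError(
            f"This {type(estimator).__name__} instance is not fitted yet."
        )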

I have seen this issue come up in skorch as well, i.e. this change in check_is_fitted is kind of breaking third party estimators.

@alegonz
Author

alegonz commented Dec 13, 2019

@thomasjpfan
Indeed, it seems the surest way is for each class to be aware of its own fitted attributes via a special class attribute.

I was thinking of something similar to what you describe, but instead of check_is_fitted, I was imagining a fitted abstract property that would be implemented appropriately for each case. Though I'm guessing such an approach was already considered and rejected in the past.
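A minimal sketch of what such an abstract property could look like on a hypothetical base class (illustrative only; BaseEstimator does not define this today):

from abc import ABC, abstractmethod

class FittableMixin(ABC):
    """Hypothetical mixin requiring subclasses to report their fitted state."""

    @property
    @abstractmethod
    def fitted(self):
        """Return True if the estimator has been fitted."""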

@qinhanmin2014
Member

I prefer to revert before we come up with a better solution.

@jnothman
Member

How about we make it continue to be automagic, except where attributes is specified? I.e. we stop deprecating attributes.

@jnothman
Member

Now we also ideally need 0.22.1 to be backwards compatible, so a complete reversion isn't great.

@qinhanmin2014
Member

qinhanmin2014 commented Dec 16, 2019

Now we also ideally need 0.22.1 to be backwards compatible, so a complete reversion isn't great.

Do you have a better idea? I don't think it's reasonable to keep the "automagic" option because it's not correct even within the scope of scikit-learn.

from sklearn.utils.validation import check_is_fitted
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
check_is_fitted(vectorizer)
# passes silently; NotFittedError is not raised

@jnothman
Member

Just to be clear... is check_is_fitted used within the estimator or by meta-estimators? It was clearer in the 0.21 interface that it was for use within the estimator, not by externals like meta-estimators. If that remains the case (and it should be more clearly documented), then the fact that some estimators must call check_is_fitted with attributes passed is no problem.

If check_is_fitted is intended to be used by meta-estimators, then I agree, it's not working well enough. In the future, checking the presence of n_features_in_ should be sufficient.
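In that scenario, an external check by a meta-estimator could reduce to something like this sketch (assuming estimators set n_features_in_ during fit, which is only a planned convention here):

from sklearn.exceptions import NotFittedError

def ensure_fitted(estimator):
    # Hypothetical external check relying only on the presence of the
    # future n_features_in_ attribute mentioned above.
    if not hasattr(estimator, "n_features_in_"):
        raise NotFittedError(f"{type(estimator).__name__} has not been fitted yet.")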

@amueller
Member

TfidfVectorizer also doesn't adhere to sklearn conventions and doesn't pass the estimator tests (like, at all).

I think the solution put forward by @jnothman is good. We don't want to break code that people wrote for 0.22, so we can't really remove it. We could deprecate it but I'm not sure that's better than leaving it there.

@amueller
Member

One of the reasons why I changed check_is_fitted is because I wanted to be able to use it outside the estimator itself, say in a meta-estimator. That's actually a common use-case and signalling whether something has been fitted is definitely useful. I'm not sure what the behavior for stateless estimators should be, though, and probably didn't put enough thought into that.

@amueller
Member

@alegonz we haven't really considered that. We could also just add a boolean fitted_ attribute. But I think we didn't want to extend the API if we can do without it.

Our current mode of not declaring attributes in __init__ is certainly a bit odd for the rest of the world, though. So if we wanted to move towards declaring all attributes, we would need to add another mechanism.

@alegonz
Author

alegonz commented Dec 22, 2019

@jnothman @amueller @thomasjpfan
Sorry for the late reply; I've been pondering this for the past few days. What follows might be a long comment, but please bear with me.


I do not think @jnothman's solution implemented in PR #15947 is the right path to follow.

That or v.startswith("_") needs to be gone. It is a harmful behavior. I can't stress this enough. Let me elaborate.

I believe there is a conflation of the API specification and its implementation.

As I mentioned before, the scikit-learn specification (i.e. the guidelines as established here) does not, and never has, set restrictions regarding the use of attributes with leading underscores (whether they are set in __init__ or any other method). Indeed, it does not need this restriction in order to function, but more importantly, it should not impose it, for the simple reason that leading underscores are an essential convention for private attributes in Python.

The scikit-learn implementation (i.e. everything on this repository), on the other hand, might have an implicit convention of avoiding attributes starting with an underscore. And that is fine and should be preserved; I believe there are sensible historical reasons for that.

But that convention should be limited to this implementation: it should not be extended nor enforced upon custom third-party classes (including customized inheritors of scikit-learn classes), and it certainly should not be enforced indirectly via a utility function such as check_is_fitted. The fact that check_is_fitted is a function hard-coded inside predict / transform and not an abstract method further compounds the problem, since you cannot override it with any other logic (unless you do some ill-advised monkey-patching).

That or v.startswith("_") is effectively preventing client code from using private attributes for fear of breaking other behavior (e.g. not raising NotFittedError when it should).

What I think should be done instead

Short term:

  1. Remove or v.startswith("_") ASAP. I don't think anyone is relying on this because 1) it is a harmful behavior, and 2) people should be following the trailing underscore specification anyway. (And 3) it is unlikely anyone is relying on this after only 2~3 weeks since the release, but that's beside the point.)

  2. For classes that do not set an attribute with a trailing underscore in their fit method, add a self.fitted_ = True so it can be caught by check_is_fitted.
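    As a minimal sketch, a transformer whose fit learns nothing with a trailing-underscore name could simply do the following (hypothetical class, not an existing estimator):

    class MyStatelessTransformer:
        def fit(self, X, y=None):
            self._internal_state = "prepared"  # private, ignored by the check
            self.fitted_ = True                # explicit marker for check_is_fitted
            return self

        def transform(self, X):
            return X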

Also, any (meta-)estimator that doesn't adhere to the scikit-learn specification (e.g. TfidfVectorizer) should be fixed appropriately.

Medium-long term:

  1. Deprecate check_is_fitted entirely, or at least remove it from the public API.

    I don't think check_is_fitted should be automated: it is unwieldy and prone to bugs (such as this one) to try to cover all cases with a single piece of logic. Anyway, the fact that the previous implementation of check_is_fitted took the attributes explicitly tells me that the trailing underscore specification is more about readability than about automating these checks.

  2. Instead, BaseEstimator should have a fitted abstract property that returns a boolean (or perhaps a method that raises an error, if you favor EAFP) that would be implemented appropriately for each class.

    Meta-estimators should also implement this property (they are sub-classes of BaseEstimator after all). Conceptually, it is natural to expect that anything that implements a fit method also has a way to inform whether it has been fitted or not. For example, Pipeline could do this:

    @property
    def fitted(self):
        return all(step.fitted for _, step in self.steps)

    and stateless estimators could just do:

    @property
    def fitted(self):
        # since you can readily call predict/transform on them
        return True

    (It is odd that such estimators have a fit method in the first place, but I gather it is for Pipeline compatibility).


Thanks to everyone for following up on this issue and I look forward to your thoughts.

@jnothman
Member

jnothman commented Dec 22, 2019 via email

@jnothman
Member

jnothman commented Dec 22, 2019 via email

@glemaitre
Member

@alegonz Thanks for the thoughtful reply.

Regarding the short-term solution, I think that it makes sense.
Then for the long-term solution: up to now, I think scikit-learn has avoided putting abstract mechanisms on BaseEstimator. I would think that was done on purpose; @amueller, @jnothman, @GaelVaroquaux, @ogrisel could probably give some thoughts. Instead, check_estimator can be used to ensure that an estimator raises NotFittedError when it is supposed to do so.

@amueller
Member

amueller commented Dec 23, 2019

The point of the change in check_is_fitted was to allow downstream libraries to check whether an estimator was fitted without having to create data, as it might be hard to create the right kind of data to do a try: est.predict(X) just to see whether the estimator has been fitted.

I think generally having an attribute or a method is fine. I didn't go this way because we generally don't like extending the API. It would be much more explicit than a badly documented convention though.

I think one of the things we're coming up against here is again the distinction of third-party use and within-sklearn-use, both of the estimator checks and the utils. We really need to think more about this.

It's noteworthy that right now, check_estimator does not use check_is_fitted anywhere, so check_is_fitted should not be considered part of the API specification. So it does not enforce anything on third-party classes. I think your point is that it does enforce some restrictions on third-party estimators that inherit from existing estimators.

It might have been better to make the version that has no parameters private from the start, as it was geared towards internal use.
If we want a consistent way to check whether something was fitted for anything that's scikit-learn compatible, indeed check_is_fitted is not the right choice and an attribute or method would be better.

I'm happy with the short-term solution. The question for the medium-term solution is what the purpose of check_is_fitted is.
There are two implicit purposes: the original purpose was to make it easy to raise the right exception when predict is called on an estimator that has not been fitted.
I abused / extended that for calling it outside the estimator itself, to check the state of an estimator (which was not possible before).

Both of these are useful, but distinct functions, for third party libraries.
If we want the ability to check the state from the outside, a method is probably a good way to go. If we then also want to make it easy to raise the right exception,
we either need to provide a _check_is_fitted private method, or need to bake it into _validate_data (https://github.com/scikit-learn/enhancement_proposals/pull/22/files), both of which are more "frameworky" than what we usually like.

@ogrisel
Member

ogrisel commented Dec 23, 2019

I abused / extended that for calling it outside the estimator itself, to check the state of an estimator (which was not possible before).

Do you have a more precise description of specific use cases for this?

Maybe a combo of a generic public check_is_fitted utility whose behavior could be overridden with an optional _check_is_fitted method would work? That would make it less frameworky while still being explicit and customizable by 3rd parties.

Also maybe the name check_is_fitted is not always very appropriate. For instance, for stateless feature engineering transformers, we don't really care about check_is_fitted but would rather have a check_is_ready_to_predict/transform. But maybe the name check_is_fitted is good enough, and we could check for the presence of the "stateless" estimator tag and return True when its value is True.
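A rough sketch of that combination (hypothetical helper; the stateless-tag branch is only the idea floated above, not existing behaviour):

from sklearn.exceptions import NotFittedError

def check_is_fitted(estimator):
    # 1) Let an estimator override the check entirely via an optional
    #    private hook.
    custom_check = getattr(estimator, "_check_is_fitted", None)
    if custom_check is not None:
        custom_check()
        return
    # 2) Stateless estimators are always "ready to predict/transform".
    tags = getattr(estimator, "_get_tags", dict)()
    if tags.get("stateless", False):
        return
    # 3) Fall back to the trailing-underscore convention.
    attrs = [v for v in vars(estimator)
             if v.endswith("_") and not v.endswith("__")]
    if not attrs:
        raise NotFittedError(
            f"This {type(estimator).__name__} instance is not fitted yet."
        )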

@amueller
Member

Do you have a more precise description of specific use cases for this?

I needed it for one of the drafts of the feature_names implementation. But I can't find it right now, so I'm not sure it actually had a good reason.
I know that yellowbrick has a use case in that they want to call fit unless an estimator has been fitted before.

@ogrisel
Member

ogrisel commented Dec 24, 2019

I have updated #15947 to remove the check for the leading _ in check_is_fitted. I also updated the TF-IDF classes to use it properly. I think this is ready for merge for 0.22.1.

@alegonz
Author

alegonz commented Dec 24, 2019

Thanks everyone for your answers :)


@jnothman

We tend to proscribe setting anything in init aside from parameter copies because we don't want estimators to validate or set derived attributes of these parameters. Doing so can break set_params.

Yes, I am aware of that. My point here is that, while native scikit-learn classes might follow this convention, for 3rd party classes it should be left to their discretion.


@amueller

I think one of the things we're coming up against here is again the distinction of third-party use and within-sklearn-use, both of the estimator checks and the utils. We really need to think more about this.
... I think your point is that it does enforce some restrictions on third-party estimators that inherit from existing estimators.

Yes, that is my main point.

There is two implicit purposes: the original purpose was to make it easy to raise the right exception if an estimator calls predict without being fitted.
...
If we want the ability to check the state from the outside, a method is probably a good way to go. If we then also want to make it easy to raise the right exception, we either need to provide a _check_is_fitted private method, or need to bake it into _validate_data, both of which are more "frameworky" than what we usually like.

Why is it necessary to bake the error-raising logic into a function/private method? What are the downsides of doing... ?

from sklearn.decomposition._base import _BasePCA  # private module path as of 0.22
from sklearn.exceptions import NotFittedError

class PCA(_BasePCA):
    def transform(self, X):
        if not self.fitted:
            raise NotFittedError
        # ...

    @property
    def fitted(self):
        return all(hasattr(self, attr) for attr in ("mean_", "components_"))

And, just of out of curiosity, what exactly do you mean by "frameworky"?


@ogrisel
Thank you for following up on the PR!

@ogrisel
Member

ogrisel commented Dec 24, 2019

My point here is that, while native scikit-learn classes might follow this convention, for 3rd party classes it should be left at its discretion.

As @jnothman said earlier, I think it's fine to set private members in init as long as they are not dependent on constructor parameters.

It is actually very important that all parameters can be set/get with set_params/get_params so that the model selection tools (e.g. grid / random search) can work properly, including when the estimators are composed in pipelines, ensembles and other meta-estimators.

If 3rd party estimators do not follow scikit-learn conventions, they will break when composed with other scikit-learn tools and they should not be advertised as scikit-learn compatible (because they would not be).

And, just of out of curiosity, what exactly do you mean by "frameworky"?

Imposing on estimators that they derive from a rich hierarchy of base classes or implement complex API contracts (interfaces with many methods to implement). E.g. if you work with a web development framework such as Django (or even Zope), you have to master the Django concepts before being able to write your custom components, and those components feel like Django things rather than simple Python classes or functions. Historically we tried to keep the "frameworkness" of scikit-learn to a minimum, but I agree that we had to progressively add frameworkish aspects to scikit-learn to be able to:

  • ensure consistent behavior via automated testing (e.g. check_estimators);
  • make it possible to compose models (pipelines, column-transformers);
  • make it possible to use generic model selection and inspection tools on composed models.

@GaelVaroquaux
Member

GaelVaroquaux commented Dec 24, 2019 via email

@alegonz
Author

alegonz commented Dec 25, 2019

@ogrisel @GaelVaroquaux @jnothman
I had misunderstood what was said here:

We tend to proscribe setting anything in init aside from parameter copies because we don't want estimators to validate or set derived attributes of these parameters.

Upon re-reading, yes, I understand and agree with that. I was thinking about private attributes for things that are unrelated to parameters. Sorry about that.

@ogrisel
Thanks for the clarification on "frameworky". I think I get what you mean.

@GaelVaroquaux

We try to avoid properties. They make codebases harder to follow.

While I don't agree with that statement in general, in this case @property is not essential; that was more of a personal choice. It could also be a plain is_fitted() method. The important thing, in my opinion, is that it is implemented appropriately for each class (as opposed to a single function attempting to catch all cases). @thomasjpfan's suggestion above also achieves the same thing, but I think a method also gives the freedom to override it.

@alegonz
Author

alegonz commented Dec 27, 2019

Thank you everyone! 🎉

@amueller
Member

@alegonz

Why is it necessary to bake the error-raising logic into a function/private method? What are the downsides of doing... ?

It adds hundreds of lines of code to scikit-learn to add an if to all the transforms. So while it's not strictly necessary, it would be odd not to factor that into a method that's called anyway.

@alegonz
Author

alegonz commented Jan 3, 2020

@amueller

It adds hundreds of lines of code to scikit-learn to add an if to all the transforms. So while it's not strictly necessary, it would be odd not to factor that into a method that's called anyway.

I see.

In that case, I think it would be better to have both: a fitted public method/property that returns a boolean, and a _check_is_fitted private method/function that uses fitted to check and raise the error. I'd argue that raising an error should be part of the API of predict/transform and not of the fitted API.
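A minimal sketch of that split, using the names from this comment (hypothetical; not an actual scikit-learn API):

from sklearn.exceptions import NotFittedError

class MyEstimator:
    def fit(self, X, y=None):
        self.coef_ = [0.0] * len(X[0])
        return self

    @property
    def fitted(self):
        # Public: report the state without raising.
        return hasattr(self, "coef_")

    def _check_is_fitted(self):
        # Private: raise the right exception from predict/transform.
        if not self.fitted:
            raise NotFittedError(f"{type(self).__name__} is not fitted yet.")

    def predict(self, X):
        self._check_is_fitted()
        return [sum(c * x for c, x in zip(self.coef_, row)) for row in X]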
