check_is_fitted has false positives on custom subclasses with private attributes #15845
Comments
Ping @amueller
And this behaviour is even inconsistent with the doc of check_is_fitted.
The code currently also treats attributes with a leading underscore as evidence of being fitted. (And why do we include the exclusion for names starting with a double underscore?)
I think it is just to match exactly one leading underscore. I now realize that a similar thing should be done for the trailing underscore check.
Though I'd favor a regex instead.
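For context, a rough sketch of the kind of single-underscore matching and regex variant being discussed here; this is a hypothetical illustration, not the actual scikit-learn source, and `_looks_fitted` is an invented name:

```python
import re

# Hypothetical sketch of the heuristic under discussion: an estimator "looks
# fitted" if any instance attribute has exactly one leading underscore or
# exactly one trailing underscore (dunder-style names are excluded).
_SINGLE_LEADING = re.compile(r"^_[^_]")    # matches "_private" but not "__dunder"
_SINGLE_TRAILING = re.compile(r"[^_]_$")   # matches "coef_" but not "coef__"

def _looks_fitted(estimator):
    return any(
        _SINGLE_LEADING.match(name) or _SINGLE_TRAILING.search(name)
        for name in vars(estimator)
    )
```

With a heuristic like this, any `_private` attribute set in `__init__` makes an unfitted estimator look fitted, which is exactly the false positive reported in this issue.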
If you need someone for a fix PR, please let me know. I'd be glad to help.
(1) We can't remove `v.startswith("_")` because some estimators do not create any public attributes in `fit` (e.g., TfidfTransformer).
(2) There was a deprecation cycle, though it lasted 1 version instead of 2; we just removed the parameters in #15803, I think. The current convention is to avoid attributes with a leading underscore, and since @alegonz's estimator is not following that convention, I don't think this is a regression nor a blocker.
> We can't remove `v.startswith("_")` because some estimators do not create any public attributes in `fit` (e.g., TfidfTransformer).

After slep010 is implemented they will.
I can imagine reasons to set private attributes in `__init__`, and as long
as they are not derived from parameters they should technically be okay
with our APIs.
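As an illustration of that pattern (a minimal sketch with a hypothetical class name, not code from this thread): a private attribute set in `__init__` that is independent of the constructor parameters, next to the usual verbatim parameter copy and a trailing-underscore fitted attribute.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CachedTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical estimator with private state unrelated to its parameters."""

    def __init__(self, scale=1.0):
        self.scale = scale    # constructor parameter, copied verbatim
        self._cache = {}      # private state, not derived from any parameter

    def fit(self, X, y=None):
        self.mean_ = np.asarray(X).mean(axis=0)  # fitted attribute (trailing underscore)
        return self

    def transform(self, X):
        return (np.asarray(X) - self.mean_) * self.scale
```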
Sorry for the late reply. Japan time here.
I realize my original description was misleading. Let me clarify: I had not called `fit` on the estimator, so I expected it to raise `NotFittedError`. The above just happens to set the private attribute in `__init__`.
I believe such changes to the API specification should not be taken lightly, as they are not a matter of just updating the docs.
Not allowing private attributes in classes sounds like bad API design. Client code should be able to define private attributes that hold extra data for its application. scikit-learn itself would most likely have problems in the future with such a policy. Take TfidfVectorizer, for example: it sets a non-param private attribute in its `__init__`.

Anyway, in my understanding, scikit-learn had no issues with private attributes until this change. Indeed, the API only requires that all your params are keyword arguments and are assigned to public attributes in `__init__`. So, in my opinion, as far as the API specification is concerned, private attributes are allowed.

But most importantly, aren't fitted attributes using the trailing underscore precisely to allow and not collide with private attributes in the first place?
Thanks a lot! Yes, this is a serious problem, i.e., check_is_fitted won't work for TfidfVectorizer.
And I think we should introduce a deprecation cycle if we want to change public functions in sklearn.utils
Again, we've always done deprecation cycles in utils. Cf. the dozens of PRs related to #6616 (comment).
Our convention with underscores is fairly implicit. I would prefer a more explicit list that defines all the attributes that will be learnt during training, thus allowing private attributes to be defined in `__init__`:

```python
class MyPCA:
    fitted_attributes = ['n_components_']

    def __init__(self, ...):
        self._private_attr = 42

    def fit(self, ...):
        self.n_components_ = 4
        self._learned_attr = 10
```

I have seen this issue come up in …
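A minimal sketch of how a checker could consume such an explicitly declared list; `check_is_fitted_explicit` and the `fitted_attributes` class attribute are illustrative assumptions, not an existing scikit-learn API:

```python
from sklearn.exceptions import NotFittedError

def check_is_fitted_explicit(estimator):
    """Illustrative only: prefer an explicitly declared list of fitted
    attributes, otherwise fall back to the trailing-underscore convention."""
    declared = getattr(estimator, "fitted_attributes", None)
    if declared is not None:
        fitted = all(hasattr(estimator, attr) for attr in declared)
    else:
        fitted = any(
            name.endswith("_") and not name.startswith("__")
            for name in vars(estimator)
        )
    if not fitted:
        raise NotFittedError(
            f"This {type(estimator).__name__} instance is not fitted yet."
        )
```

An explicit declaration like this would let estimators keep arbitrary private attributes without confusing the check.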
@thomasjpfan I was thinking something similar to what you describe, but instead of …
I prefer to revert before we come up with a better solution.
How about we make it continue to be automagic, except where …
Now we also ideally need 0.22.1 to be backwards compatible, so a complete reversion isn't great.
Do you have a better idea? I don't think it's reasonable to keep the "automagic" option because it's not correct even within the scope of scikit-learn.
Just to be clear... is check_is_fitted used within the estimator, or by meta-estimators? It was clearer in the 0.21 interface that it was for use within the estimator, not by externals like meta-estimators. If that remains the case (and it should be more clearly documented), then the fact that some estimators must call it explicitly with attributes is not a big problem. If check_is_fitted is also meant for external use, …
I think the solution put forward by @jnothman is good. We don't want to break code that people wrote for 0.22, so we can't really remove it. We could deprecate it but I'm not sure that's better than leaving it there.
One of the reasons why I changed `check_is_fitted` …
@alegonz we haven't really considered that. We could also just add a boolean attribute. Our current mode of not declaring attributes in …
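As a sketch of the "boolean attribute" idea (hypothetical names, not an API decided in this thread): the estimator records whether `fit` has completed, and the checker trusts that flag instead of guessing from attribute names.

```python
class FlaggedEstimator:
    """Hypothetical estimator that tracks its own fitted state."""

    def __init__(self):
        self._is_fitted = False   # explicit flag, set before fit

    def fit(self, X, y=None):
        # ... learn whatever needs to be learned ...
        self._is_fitted = True
        return self

def is_fitted(estimator):
    # A checker could simply trust the flag when it is present.
    return getattr(estimator, "_is_fitted", False)
```

The obvious cost is that every estimator has to remember to maintain the flag, which is the kind of boilerplate the current automagic check avoids.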
@jnothman @amueller @thomasjpfan I do not think @jnothman's solution implemented in PR #15947 is the right path to follow. That …

I believe there is a conflation of the API specification and its implementation. As I mentioned before, the scikit-learn specification (i.e. the guidelines as established here) has not and does not set restrictions regarding the use of attributes with leading underscores (whether they are set in `__init__` or in `fit`).

The scikit-learn implementation (i.e. everything on this repository), on the other hand, might have an implicit convention of avoiding attributes starting with an underscore. And that is fine and should be preserved; I believe there are sensible historical reasons for that. But that convention should be limited to this implementation: it should not be extended nor enforced upon custom third-party classes (including customized inheritors of scikit-learn classes), and it certainly should not be enforced indirectly via a utility function such as check_is_fitted. That …

What I think should be done instead

Short term: …

Also, any (meta-)estimator that doesn't adhere to the scikit-learn specification (e.g. …) …

Medium-long term: …

Thanks to everyone for following up on this issue and I look forward to your thoughts.
Fwiw I am okay with removing the attr.startswith("_") condition. For the
sake of the future I think removing it is the better option. But because
the intention is for the estimator to use the function internally, it
doesn't matter too much.
But in general I think your suggestions are good.
We tend to proscribe setting anything in init aside from parameter copies
because we don't want estimators to validate or set derived attributes of
these parameters. Doing so can break set_params.
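A small sketch (hypothetical class names) of the failure mode referred to here: a value derived from a parameter in `__init__` goes stale after `set_params`, while deriving it at `fit` time does not.

```python
from sklearn.base import BaseEstimator

class Brittle(BaseEstimator):
    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self._half_alpha = alpha / 2       # derived in __init__: problematic

class Robust(BaseEstimator):
    def __init__(self, alpha=1.0):
        self.alpha = alpha                 # only copy parameters in __init__

    def fit(self, X, y=None):
        self._half_alpha = self.alpha / 2  # derive values at fit time instead
        return self

est = Brittle(alpha=1.0)
est.set_params(alpha=4.0)
print(est.alpha)        # 4.0
print(est._half_alpha)  # still 0.5: the derived value is now stale
```

`set_params` only reassigns the public parameter attributes; `__init__` is not re-run, so anything computed there is left inconsistent.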
@alegonz Thanks for the thoughtful reply. Regarding the short-term solution, I think that it makes sense.
The point of the change in check_is_fitted was … I think generally having an attribute or a method is fine. I didn't go this way because we generally don't like extending the API. It would be much more explicit than a badly documented convention, though.

I think one of the things we're coming up against here is again the distinction between third-party use and within-sklearn use, both of the estimator checks and of the utils. We really need to think more about this. It's noteworthy that right now, … It might have been better to make the version that has no parameters private from the start, as it was geared towards internal use.

I'm happy with the short-term solution. The question for the medium-term solution is what the purpose of check_is_fitted should be: … Both of these are useful, but distinct, functions for third-party libraries.
Do you have a more precise description of specific use cases for this? Maybe a combo of a generic public … Also maybe the name …
I needed it for one of the drafts of the feature_names implementation. But I can't find it right now, so I'm not sure it actually had a good reason.
I have updated #15947 to remove the check for the leading underscore.
Thanks everyone for your answers :)
Yes, I am aware of that. My point here is that, while native scikit-learn classes might follow this convention, for 3rd-party classes it should be left to their discretion.
Yes, that is my main point.
Why is it necessary to bake the error-raising logic into a function/private method? What are the downsides of doing the following?

```python
class PCA(_BasePCA):
    def transform(self, X):
        if not self.fitted:
            raise NotFittedError
        # ...

    @property
    def fitted(self):
        return all(hasattr(self, attr) for attr in ("mean_", "components_"))
```

And, just out of curiosity, what exactly do you mean by "frameworky"? @ogrisel
As @jnothman said earlier, I think it's fine to set private members in init as long as they are not dependent on constructor parameters. It is actually very important that all parameters can be set/get with set_params/get_params so that the model selection tools (e.g. grid / random search) can work properly, including when the estimators are composed in pipelines, ensembles and other meta-estimators. If 3rd party estimators do not follow scikit-learn conventions, they will break when composed with other scikit-learn tools and they should not be advertised as scikit-learn compatible (because they would not be).
Imposing estimators to derive from a rich hierarchy of base classes, or to implement complex API contracts (interfaces with many methods to implement). E.g. if you work with a web development framework such as django (or even Zope), you have to master the django concepts before being able to write your custom components, and those components feel like django things rather than simple Python classes or functions. Historically we tried to keep the "frameworkness" of scikit-learn to a minimum, but I agree that we had to progressively add frameworkish aspects to scikit-learn to be able to: …
> > We tend to proscribe setting anything in init aside from parameter copies because we don't want estimators to validate or set derived attributes of these parameters. Doing so can break set_params.
>
> Yes, I am aware of that. My point here is that, while native scikit-learn classes might follow this convention, for 3rd-party classes it should be left to their discretion.
Well, strictly speaking they are violating our assumptions and are not scikit-learn compatible. The rule that nothing happens in the init is there to make sure that a static representation of a non-fitted estimator is easy to build and has a clear contract. Not following it might lead to things breaking. We impose this rule to make sure that our utilities, such as our model-selection framework, work well.
> ```python
> @property
> def fitted(self):
>     return all(hasattr(self, attr) for attr in ("mean_", "components_"))
> ```

We try to avoid properties. They make codebases harder to follow.
@ogrisel @GaelVaroquaux @jnothman
Upon re-reading, yes, I understand and agree with that. I was thinking about private attributes for things that are unrelated to parameters. Sorry about that.

@ogrisel While I don't agree in general with that statement, in this case …
Thank you everyone! 🎉
It adds hundreds of lines of code to scikit-learn to add an …
I see. In that case, I think it would be better to have both: a …
Description
check_is_fitted has false positives on custom subclasses with private attributes. I believe check_is_fitted should not look at variables with a leading underscore because 1) that is Python's convention for private attributes, and 2) the scikit-learn API specifies fitted attributes as those with a trailing underscore (and there is no specification regarding leading underscores, correct?).

Backtracking from PR #14545, where the new check logic was introduced, I noticed that the check for the leading underscore was added to cover these two modules:

- sklearn/neighbors/base.py
- sklearn/neighbors/lof.py

But perhaps these modules are actually not following the API specification?
Steps/Code to Reproduce
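The reproduction snippet from the original report is not preserved here; the following is a sketch of the kind of code it describes, assuming scikit-learn 0.22 (where `check_is_fitted` no longer requires an explicit `attributes` argument) and a hypothetical subclass name:

```python
from sklearn.decomposition import PCA
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

class MyPCA(PCA):
    def __init__(self, n_components=2):
        super().__init__(n_components=n_components)
        self._some_private_state = {}   # private attribute, set before any fit

est = MyPCA()
try:
    check_is_fitted(est)                # expected: raises NotFittedError
except NotFittedError:
    print("not fitted (as expected)")
else:
    print("false positive: considered fitted")   # observed on 0.22.0
```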
Expected Results

check_is_fitted raises NotFittedError even on custom subclasses that have private attributes following the Python convention of a leading underscore.

Actual Results

NotFittedError is not raised.

Versions