-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
CLN Only check for n_features_in_ only when it exists #18011
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLN Only check for n_features_in_ only when it exists #18011
Conversation
Sorry I don't understand the relation with the code snippet you provided. I don't see where |
Currently the The alternative is to set |
(I've edited the snippet so that the call to This is a bit of a tricky case, but I'm not super comfortable with having less strict checks just so we can support cases where Are there other occurrences of such technique? |
Coming back to this, this can happen for stateless estimators like the |
Hm, since |
I think it should be fine to have people use |
I would argue that Regardless, I'm not super happy with the current change because it loosens the check in the general case. |
I just discovered this discussion a posteriori. As I just said in #18577 (review) I think we should really be tolerant and allow users to call transform without calling fit on transformers with the "stateless" estimator tag. For other estimators, I am a bit less certain. We do predict / transform without fit only in tests (as done for logistic regression in #18578 and PowerTransformer in #18577). But third party devs might have other valid use cases. So I in conclusion I think we should not check for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The patch coverage issue is expected. coverage of those lines will increase in the PRs linked in the above comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would be happier if we only did that for the stateless estimators.
For the rest, it should be reasonable to let developers either set n_features_in_
manually, or leave reset=True
to its default.
(Otherwise, the docstring of _check_n_features
needs an update.)
sklearn/base.py
Outdated
fitted_n_features_in = getattr(self, 'n_features_in_', None) | ||
if fitted_n_features_in is None: | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that hasattr
would be cleaner, especially since self.n_features_in_
is used later and fitted_n_features_in
isn't
I wonder if this kind of code: will not show up in third-party libraries and we will break them when upgrading for 0.24 for not good enough reasons. |
I think the combo of the loose "check of n_feaures_in_ if present" + the We can always make the input validation code stricter in the future (e.g. 0.25) once |
It makes me a bit uncomfortable that
IMHO, building an estimator from scratch without calling fit and without using any data is a grey area, akin to using the private API. But OK, I don't have much more to add if you still think we should merge. @thomasjpfan can you please update the docstring? LGTM then |
I am still +1 to merge as is (with the docstring update) to get this in with as few friction as possible, continue the work on upgrading all the modules to have them pass the common test and re-explore the possibility of this validation stricter the future. (also in light of the experience with stored feature names checks). @thomasjpfan alright with you? |
Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
Updated docstring at |
sklearn/base.py
Outdated
Else, the attribute must already exist and the function checks | ||
that it is equal to `X.shape[1]`. | ||
If False and the attribute exists, then check that it is equal to | ||
`X.shape[1]`. If False and the attribute does *not* exists, then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
`X.shape[1]`. If False and the attribute does *not* exists, then | |
`X.shape[1]`. If False and the attribute does *not* exist, then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
last bit
Merged! |
…8011) * CLN Checks n_features_in only if it exists * Update sklearn/base.py Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com> * DOC Update docstring * DOC Grammer * Grammar [ci skip] Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
…8011) * CLN Checks n_features_in only if it exists * Update sklearn/base.py Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com> * DOC Update docstring * DOC Grammer * Grammar [ci skip] Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
Reference Issues/PRs
Related to #18010
What does this implement/fix? Explain your changes.
We have code blocks like this:
scikit-learn/sklearn/linear_model/_logistic.py
Lines 974 to 1006 in f642ff7
that would fail this check,
_check_n_features
was being run forreset=False
.Any other comments?
I am +0.5 on doing this. I have been trying to get
_validate_data
to be inpredict
,transform
and friends which would make it easier to continue with column name consistency #18010CC @NicolasHug