[MRG] Add n_features_in_ attribute to BaseEstimator #13603
Conversation
I think this is heading in the right direction. Would we put pandas column verification in `validate_X` after storing a signature in `validate_X_y`? Do we want them to also validate other input such as `sample_weight`?
Yes, we could easily integrate #11607.

If it's easy, why not. For now I would suggest starting slow, though.
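For illustration, the column-signature idea mentioned above could look roughly like the sketch below. This is purely hypothetical and not part of this PR; `ColumnSignatureMixin`, `_store_signature`, and `_check_signature` are illustrative names.

```python
import pandas as pd


class ColumnSignatureMixin:
    """Hypothetical sketch: remember DataFrame columns at fit time
    and verify them at predict/transform time."""

    def _store_signature(self, X):
        # Remember column names when fitting on a DataFrame.
        if isinstance(X, pd.DataFrame):
            self._feature_signature = list(X.columns)

    def _check_signature(self, X):
        # Verify that predict/transform receives the same columns, in order.
        sig = getattr(self, '_feature_signature', None)
        if sig is not None and isinstance(X, pd.DataFrame):
            if list(X.columns) != sig:
                raise ValueError(
                    'Column mismatch: fitted on {!r}, got {!r}'.format(
                        sig, list(X.columns)))
```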
ping @GaelVaroquaux @ogrisel @jorisvandenbossche @qinhanmin2014, would you be OK with this? We need it to properly implement #13307.
> would you be OK with this?

From a purely conceptual standpoint, it feels like it's one level too high up: not every estimator has vectorial input (e.g. vectorizers do not), hence it does not seem to me that it applies to all estimators. We could consider an additional class below BaseEstimator.
@GaelVaroquaux so what you're saying is that the attribute doesn't belong on every estimator? We often do mixins, and we could do this with a mixin here as well.
I'm happy to see a VectorizerMixin class that checks its input is iterable and not a string and sets `n_features_in_ = None` :)
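A minimal sketch of what such a mixin could look like, under the assumption described above; `VectorizerMixin` behavior and the `_check_raw_documents` helper are illustrative only, not code from this PR:

```python
class VectorizerMixin:
    """Hypothetical mixin for estimators whose input is a collection of
    raw documents rather than a 2-D array of features."""

    def _check_raw_documents(self, raw_documents):
        # A vectorizer's input has no notion of "number of features",
        # so n_features_in_ is explicitly set to None.
        if isinstance(raw_documents, str):
            raise ValueError(
                'Iterable over raw text documents expected, got a string.')
        self.n_features_in_ = None
        return raw_documents
```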
I created a mixin as discussed. @GaelVaroquaux @jnothman please let me know what you think, and if it's OK I'll proceed to changing the rest of the estimators to use it.
I don't want to worry about this before release. Ping after?
ping @jnothman now that 0.21 has been released :)
Yup. Prioritising the PRs that you and Thomas have pulled together, apart from everyone else's, looks like a challenge!
sklearn/base.py (outdated):

        self._validate_n_features(X, check_n_features)
        return X

    def validate_X_y(self, X, y, check_n_features=False, **check_X_y_params):
I don't think it makes sense to have this, certainly not public, on unsupervised estimators.
@jnothman If you have thoughts about how to structure PRs and reviewing to bridge the full-time dev vs community gap, I'd be happy to chat. @NicolasHug and @thomasjpfan are trying to focus more on reviewing to ease the load a bit.
I've updated the PR with support for
I think this looks good.
Please add a common test and then add this everywhere, I guess?
@jnothman, @GaelVaroquaux, I think I addressed your previous comments. Meta-estimators delegate the attribute to their underlying estimator. Could you please take a look?
Looks pretty good now, thanks @NicolasHug
@@ -272,6 +272,8 @@ def _yield_all_checks(name, estimator):
    yield check_dict_unchanged
    yield check_dont_overwrite_parameters
    yield check_fit_idempotent
    if not tags["no_validation"]:
does it make sense to make sure the `n_features_in_` is `None` or doesn't exist if the `no_validation` tag is set?
I would say no, since you could have a meta estimator that delegates the validation to its estimators, and still have the attribute?
Ok.
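For context, the common test being requested here could look roughly like the following sketch. `check_n_features_in` is an illustrative name and the exact check added by the PR may differ:

```python
from sklearn.base import clone
from sklearn.datasets import make_blobs


def check_n_features_in(name, estimator_orig):
    # Sketch of a common test: after fit, the estimator should expose
    # n_features_in_ matching the width of the training data.
    X, y = make_blobs(n_samples=30, n_features=4, random_state=0)
    estimator = clone(estimator_orig)
    estimator.fit(X, y)
    assert estimator.n_features_in_ == X.shape[1]
```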
            .format(self.__class__.__name__)
        ) from nfe

        return self.best_estimator_.n_features_in_
we could set `n_features_in_` in `fit`, and not have the property.
Sure. I don't mind too much. I think for the searches we use properties a lot.
At least in its current state, `classes_` is the only other public property. Even `best_score_` seems not to be a property. It's just that setting it in `fit` is less code and kinda cleaner?
I was referring to all the `if_delegate_has_method` methods, but yeah, these aren't properties. OK, I'll set it in `fit`.
Thinking about it again, I'd rather not: the `best_estimator_` might just not have the attribute, and that would raise an error.
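The property-based delegation under discussion looks roughly like the sketch below, reconstructed from the diff fragments above; the class name and error message are assumptions, not the exact code from the PR:

```python
from sklearn.exceptions import NotFittedError


class BaseSearchCV:
    # ... rest of the (simplified) search class elided ...

    @property
    def n_features_in_(self):
        # Delegate to the fitted best_estimator_, turning the
        # pre-fit AttributeError into a more helpful NotFittedError.
        try:
            best = self.best_estimator_
        except AttributeError as nfe:
            raise NotFittedError(
                '{} object has no n_features_in_ attribute.'
                .format(self.__class__.__name__)
            ) from nfe
        return best.n_features_in_
```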
I don't want to be bureaucratic, but does this require a SLEP according to the agreed governance doc? I really want to see the pandas reordering issues fixed, and I think this is a step towards that. It's also a step towards feature names and more introspection capability for ColumnTransformer/FeatureUnion. I'd love to see all of those things progress. But this does, by definition, modify the API principles, altering the public API of, and placing a constraint on, literally every estimator lacking a 'no_validation' tag...?
In my mind I thought that if there's no objection, we can always get them in. But the governance doc does indeed require this to have a SLEP. On the other hand, if the estimators set the tag and don't implement these APIs, they'd still pass the estimator checks, so it's not mandatory for them to have it. Overall I think having a SLEP for this makes sense.
I was sort of hoping that my warning during the last meeting would avoid this. I'll raise the issue again on Monday.
Trying to solve the conflicts: what should `n_features_in_` be in this case?
It should be the number of samples.
You mean creating a new nsamples attribute?
No, the distances to the training samples *are* the features in a transformed space.
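To illustrate the point (an illustrative example, not code from this PR): a transformer that maps X to its distances to the n training samples produces an output with n columns, so a downstream estimator fitted on that output sees one input feature per training sample.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_train = rng.rand(20, 5)          # 20 training samples, 5 raw features
y_train = rng.randint(0, 2, 20)

# Transform into distances to the training samples: shape (20, 20).
X_dist = pairwise_distances(X_train, X_train)

clf = LogisticRegression().fit(X_dist, y_train)
# The downstream estimator's input space has 20 features, one per
# training sample: the distances *are* the features here.
assert X_dist.shape[1] == X_train.shape[0]
```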
Opened SLEP at scikit-learn/enhancement_proposals#22
Will address merge conflicts when the SLEP is accepted.
The SLEP is now accepted \o/
> The SLEP is now accepted \o/

Yey! Thank you to everybody involved.
yay! I think @NicolasHug is still on vacation but will follow up soon :)
Given the number of conflicts and the differences with the accepted solution, I'm closing this PR and will open another one soon.
This PR adds 2 methods, `self.validate_X` and `self.validate_X_y`, that wrap around `check_array` and `check_X_y`, and additionally set a `n_features_in_` attribute at `fit` time. An error is raised at prediction / transform time in case of a mismatch. Using this would require replacing most calls to `check_array` and `check_X_y`.

ping @amueller is this what you had in mind?

EDIT: more up to date summary: #13603 (comment)
Another one in #13603 (comment)
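A minimal sketch of the mechanism described above, assuming the helper names from this PR (`validate_X`, `validate_X_y`); the real implementation may differ in details such as the error message and how the `check_n_features` flag is handled:

```python
from sklearn.utils.validation import check_array, check_X_y


class ValidationMixin:
    """Sketch of the proposed validate_X / validate_X_y helpers."""

    def validate_X_y(self, X, y, **check_X_y_params):
        # Called in fit: validate and remember the number of features.
        X, y = check_X_y(X, y, **check_X_y_params)
        self.n_features_in_ = X.shape[1]
        return X, y

    def validate_X(self, X, check_n_features=False, **check_array_params):
        # Called in predict/transform: optionally verify the width of X
        # against what was seen during fit.
        X = check_array(X, **check_array_params)
        if check_n_features and X.shape[1] != self.n_features_in_:
            raise ValueError(
                'X has {} features, but this estimator was fitted with '
                '{} features.'.format(X.shape[1], self.n_features_in_))
        return X
```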