
[MRG] Add n_features_in_ attribute to BaseEstimator #13603

Closed
wants to merge 50 commits

Conversation

NicolasHug
Member

@NicolasHug NicolasHug commented Apr 9, 2019

This PR adds two methods, self.validate_X and self.validate_X_y, that wrap check_array and check_X_y and additionally set an n_features_in_ attribute at fit time. An error is raised at prediction / transform time in case of a mismatch.

Using this would require replacing most calls to check_array and check_X_y.
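The mechanism described above can be sketched roughly as follows. The method names follow the PR description, but the bodies are illustrative stand-ins (check_array here is a trivial stub, not scikit-learn's real helper):

```python
import numpy as np

def check_array(X, **kwargs):
    # stand-in for sklearn.utils.check_array; the real helper does much more
    return np.asarray(X)

class BaseEstimatorSketch:
    def _validate_n_features(self, X, check_n_features):
        if check_n_features:
            # predict / transform time: compare against what fit saw
            if X.shape[1] != self.n_features_in_:
                raise ValueError(
                    "X has %d features, but this estimator was fitted "
                    "with %d features." % (X.shape[1], self.n_features_in_)
                )
        else:
            # fit time: record the number of input features
            self.n_features_in_ = X.shape[1]

    def validate_X(self, X, check_n_features=False, **check_array_params):
        X = check_array(X, **check_array_params)
        self._validate_n_features(X, check_n_features)
        return X
```

A fit implementation would call validate_X(X) (recording n_features_in_), while predict or transform would call validate_X(X, check_n_features=True).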

ping @amueller is this what you had in mind?

EDIT: more up to date summary: #13603 (comment)

Another one in #13603 (comment)

Member

@jnothman jnothman left a comment


I think this is heading in the right direction. We would put pandas column verification in validate_X after storing a signature in validate_X_y?

@adrinjalali
Member

Do we want them to also validate other input such as sample_weights?

@NicolasHug
Member Author

We would put pandas column verification in validate_X after storing a signature in validate_X_y?

Yes we could easily integrate #11607

Do we want them to also validate other input such as sample_weights?

If it's easy, why not. For now I would suggest to start slow though

@NicolasHug
Member Author

ping @GaelVaroquaux @ogrisel @jorisvandenbossche @qinhanmin2014

would you be OK with this?

We need it to properly implement #13307

@GaelVaroquaux
Member

GaelVaroquaux commented Apr 10, 2019 via email

@amueller
Member

amueller commented Apr 17, 2019

@GaelVaroquaux so what you're saying is that the validate_X methods (and also the other ones) shouldn't be on BaseEstimator? Or only the _validate_n_features method shouldn't be?

We usually do this with mixins, and we could do it here with a RectangularDataMixin, but it seems simpler with a RectangularDataEstimator base class. That would require us to change a lot of code, and would require downstream packages to also change their code if they want the functionality (instead of getting it automatically).

I think having n_features_in_ = None on the Vectorizers would be useful, though, so I don't entirely see what we would get out of creating another class. We could just override the method on the vectorizers.

@jnothman
Member

jnothman commented Apr 18, 2019 via email

@NicolasHug
Member Author

I created a NonRectangularInputMixin mixin. (Turns out VectorizerMixin already exists in text.py, but we cannot use it since it's not used by DictVectorizer).

n_features_in_ is always None for NonRectangularInputMixin, but I'm happy to make it a property and raise an AttributeError or something.
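A rough sketch of what such a mixin could look like (NonRectangularInputMixin is the name used in this PR, but the body and the toy vectorizer class are hypothetical illustrations):

```python
class NonRectangularInputMixin:
    # Estimators whose input is not a rectangular (n_samples, n_features)
    # array -- e.g. dicts or raw text documents -- have no meaningful
    # feature count, so the attribute is always None.
    # The alternative mentioned above would be a property that raises
    # AttributeError instead of returning None.
    n_features_in_ = None

class MyDictVectorizer(NonRectangularInputMixin):
    # toy stand-in for a vectorizer, just to show the mixin in use
    def fit(self, X, y=None):
        return self
```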

@GaelVaroquaux @jnothman please let me know what you think, and if it's OK I'll proceed to changing the rest of the estimators to use validate_X() and validate_X_y(). Thanks!

@jnothman
Member

jnothman commented Apr 21, 2019 via email

@NicolasHug
Member Author

ping @jnothman now that 0.21 has been released :)

@jnothman
Member

ping @jnothman now that 0.21 has been released :)

Yup. Prioritising the PRs that you and Thomas have pulled together, apart from everyone else's, looks like a challenge!

sklearn/base.py Outdated
self._validate_n_features(X, check_n_features)
return X

def validate_X_y(self, X, y, check_n_features=False, **check_X_y_params):
Member


I don't think it makes sense to have this, certainly not public, on unsupervised estimators.

@amueller
Member

@jnothman If you have thoughts about how to structure PRs and reviewing to bridge the full-time dev vs community gap, I'd be happy to chat. @NicolasHug and @thomasjpfan are trying to focus more on reviewing to ease the load a bit.

@NicolasHug
Member Author

I've updated the PR with support for n_features_in_ in pipelines and grid search. Every meta estimator will need a custom way of dealing with this attribute.

Member

@amueller amueller left a comment


I think this looks good.
Please add a common test and then add this everywhere, I guess?

@NicolasHug
Member Author

@jnothman , @GaelVaroquaux , I think I addressed your previous comments.

Meta-estimators delegate the n_features_in_ attribute, and for vectorizers it is always None.

Could you please take a look at test_n_features_in_attribute() and provide feedback? Thanks!

@adrinjalali adrinjalali self-assigned this Sep 17, 2019
Member

@adrinjalali adrinjalali left a comment


Looks pretty good now, thanks @NicolasHug

@@ -272,6 +272,8 @@ def _yield_all_checks(name, estimator):
yield check_dict_unchanged
yield check_dont_overwrite_parameters
yield check_fit_idempotent
if not tags["no_validation"]:
Member


Does it make sense to make sure that n_features_in_ is None or doesn't exist if the no_validation tag is set?

Member Author


I would say no, since you could have a meta estimator that delegates the validation to its estimators, and still have the attribute?

Member


Ok.

.format(self.__class__.__name__)
) from nfe

return self.best_estimator_.n_features_in_
Member


we could set n_features_in_ in fit, and not have the property.

Member Author


Sure. I don't mind too much. I think for the searches we use properties a lot.

Member


at least in its current state, classes_ is the only other public property.

Even best_score_ seems not to be a property. It's just that setting it in fit is less code and kinda cleaner?

Member Author


I was referring to all the if_delegate_has_method methods but yeah these aren't properties.
OK I'll set it in fit

Member Author


Thinking about it again I'd rather not: the best_estimator_ might just not have the attribute and that would raise an error.
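For illustration, the delegation pattern from the snippet above could look roughly like this. GridSearchSketch and its trivial fit are stand-ins, not the actual scikit-learn code; the point is re-raising a missing attribute with a clearer message:

```python
class GridSearchSketch:
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        # the real search would clone, fit, and select; here we just
        # pretend the given estimator is the best one
        self.best_estimator_ = self.estimator
        return self

    @property
    def n_features_in_(self):
        try:
            return self.best_estimator_.n_features_in_
        except AttributeError as nfe:
            # either fit wasn't called, or best_estimator_ itself
            # doesn't expose the attribute
            raise AttributeError(
                "{} object has no n_features_in_ attribute.".format(
                    self.__class__.__name__
                )
            ) from nfe
```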

@jnothman
Member

I don't want to be bureaucratic, but does this require a SLEP according to the agreed governance doc? I really want to see the Pandas reordering issues fixed and I think this is a step towards that. It's also a step towards feature names and more introspection capability for ColumnTransformer/FeatureUnion. I'd love to see all of those things progressed.

But this does, by definition, modify the API principles, altering the public API of and placing a constraint on literally every estimator lacking a 'no_validation' tag...?

n_features_in_ does not appear in a docstring's Attributes sections yet, either... should it?

@adrinjalali
Member

In my mind I thought that if there's no objection, we can always get them in. But the governance doc does indeed require this to have a SLEP.

On the other hand, if the estimators set the tag and don't implement these APIs, they'd still pass the estimator checks, so it's not mandatory for them to have it.

Overall I think having a SLEP for n_features_in_ and n_features_out_ doesn't seem like an unreasonable requirement for the two of them, but I don't love that we have to require SLEPs for changes like these.

@NicolasHug
Member Author

I was sort of hoping that my warning during the last meeting would avoid this. I'll raise the issue again on Monday.

@NicolasHug
Member Author

Trying to solve the conflicts: what should n_features_in_ be when X is a precomputed distance matrix?

@amueller
Member

It should be n_samples_. It's the same for kernels, right?
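To illustrate the point: a precomputed distance (or kernel) matrix is square, so the second dimension that fit sees equals n_samples. A plain-numpy sketch (the final comment describes the behavior being discussed, not an implemented API):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(10, 4)                 # 10 samples, 4 features
# pairwise squared Euclidean distances via numpy broadcasting
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
assert D.shape == (10, 10)          # square: (n_samples, n_samples)
# an estimator fitted on D with a precomputed metric would therefore
# record a value equal to n_samples, i.e. 10 here
```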

@NicolasHug
Member Author

You mean creating a new n_samples_ attribute?

@jnothman
Member

jnothman commented Sep 18, 2019 via email

@NicolasHug
Member Author

Opened SLEP at scikit-learn/enhancement_proposals#22

@NicolasHug
Member Author

will address merge conflicts when the SLEP is accepted

@NicolasHug NicolasHug added this to the 0.22 milestone Oct 24, 2019
@NicolasHug NicolasHug added the High Priority High priority issues and pull requests label Oct 24, 2019
@adrinjalali adrinjalali modified the milestones: 0.22, 0.23 Oct 31, 2019
@adrinjalali
Member

The SLEP is now accepted \o/

@GaelVaroquaux
Member

GaelVaroquaux commented Jan 6, 2020 via email

@amueller
Member

amueller commented Jan 6, 2020

yay! I think @NicolasHug is still on vacation but will follow up soon :)

@NicolasHug
Member Author

Given the amount of conflicts and the differences with the accepted solution, I'm closing this PR and will open another one soon

7 participants