
[MRG] Add n_features_in_ attribute to BaseEstimator #13603

Closed
wants to merge 50 commits

Conversation

NicolasHug
Member

@NicolasHug NicolasHug commented Apr 9, 2019

This PR adds two methods, self.validate_X and self.validate_X_y, that wrap check_array and check_X_y and additionally set an n_features_in_ attribute at fit time. An error is raised at prediction / transform time in case of a mismatch.

Using this would require replacing most calls to check_array and check_X_y.
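The mechanism described above can be sketched roughly as follows. The method names follow the PR description, but the bodies are illustrative stand-ins (check_array here is a trivial stub, not scikit-learn's real helper):

```python
import numpy as np

def check_array(X, **kwargs):
    # stand-in for sklearn.utils.check_array; the real helper does much more
    return np.asarray(X)

class BaseEstimatorSketch:
    def _validate_n_features(self, X, check_n_features):
        if check_n_features:
            # predict / transform time: compare against what fit saw
            if X.shape[1] != self.n_features_in_:
                raise ValueError(
                    "X has %d features, but this estimator was fitted "
                    "with %d features." % (X.shape[1], self.n_features_in_)
                )
        else:
            # fit time: record the number of input features
            self.n_features_in_ = X.shape[1]

    def validate_X(self, X, check_n_features=False, **check_array_params):
        X = check_array(X, **check_array_params)
        self._validate_n_features(X, check_n_features)
        return X
```

A fit implementation would call validate_X(X) (recording n_features_in_), while predict or transform would call validate_X(X, check_n_features=True).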

ping @amueller is this what you had in mind?

EDIT: more up to date summary: #13603 (comment)

Another one in #13603 (comment)

Member

@jnothman jnothman left a comment


I think this is heading in the right direction. We would put pandas column verification in validate_X after storing a signature in validate_X_y?

@adrinjalali
Member

Do we want them to also validate other input such as sample_weights?

@NicolasHug
Member Author

We would put pandas column verification in validate_X after storing a signature in validate_X_y?

Yes we could easily integrate #11607

Do we want them to also validate other input such as sample_weights?

If it's easy, why not. For now I would suggest to start slow though

@NicolasHug
Member Author

ping @GaelVaroquaux @ogrisel @jorisvandenbossche @qinhanmin2014

would you be OK with this?

We need it to properly implement #13307

@GaelVaroquaux
Member

GaelVaroquaux commented Apr 10, 2019 via email

@amueller
Member

amueller commented Apr 17, 2019

@GaelVaroquaux so what you're saying is that the validate_X methods (and also the other ones) shouldn't be on BaseEstimator? Or only the _validate_n_features method shouldn't be?

We usually do this with mixins, and we could do it here with a RectangularDataMixin, but it seems simpler with a RectangularDataEstimator base class. That would require us to change a lot of code, and would require downstream packages to also change their code if they want the functionality (instead of getting it automatically).

I think having n_features_in_ = None on the Vectorizers would be useful, though, so I don't entirely see what we would get out of creating another class. We could just override the method on the vectorizers.

@jnothman
Member

jnothman commented Apr 18, 2019 via email

@NicolasHug
Member Author

I created a NonRectangularInputMixin mixin. (Turns out VectorizerMixin already exists in text.py, but we cannot use it since it's not used by DictVectorizer).

n_features_in_ is always None for NonRectangularInputMixin, but I'm happy to make it a property and raise an AttributeError or something.
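A rough sketch of what such a mixin could look like (NonRectangularInputMixin is the name used in this PR, but the body and the toy vectorizer class are hypothetical illustrations):

```python
class NonRectangularInputMixin:
    # Estimators whose input is not a rectangular (n_samples, n_features)
    # array -- e.g. dicts or raw text documents -- have no meaningful
    # feature count, so the attribute is always None.
    # The alternative mentioned above would be a property that raises
    # AttributeError instead of returning None.
    n_features_in_ = None

class MyDictVectorizer(NonRectangularInputMixin):
    # toy stand-in for a vectorizer, just to show the mixin in use
    def fit(self, X, y=None):
        return self
```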

@GaelVaroquaux @jnothman please let me know what you think, and if it's OK I'll proceed to changing the rest of the estimators to use validate_X() and validate_X_y(). Thanks!

@jnothman
Member

jnothman commented Apr 21, 2019 via email

@NicolasHug
Member Author

ping @jnothman now that 0.21 has been released :)

@jnothman
Member

ping @jnothman now that 0.21 has been released :)

Yup. Prioritising the PRs that you and Thomas have pulled together, apart from everyone else's, looks like a challenge!

sklearn/base.py Outdated
self._validate_n_features(X, check_n_features)
return X

def validate_X_y(self, X, y, check_n_features=False, **check_X_y_params):
Member


I don't think it makes sense to have this, certainly not public, on unsupervised estimators.

@amueller
Member

@jnothman If you have thoughts about how to structure PRs and reviewing to bridge the full-time dev vs community gap, I'd be happy to chat. @NicolasHug and @thomasjpfan are trying to focus more on reviewing to ease the load a bit.

@NicolasHug
Member Author

I've updated the PR with support for n_features_in_ in pipelines and grid search. Every meta estimator will need a custom way of dealing with this attribute.

Member

@amueller amueller left a comment


I think this looks good.
Please add a common test and then add this everywhere, I guess?

@NicolasHug
Member Author

@jnothman , @GaelVaroquaux , I think I addressed your previous comments.

Meta-estimators delegate the n_features_in_ attribute, and for vectorizers it is always None.

Could you please take a look at test_n_features_in_attribute() and provide feedback? Thanks!

@adrinjalali adrinjalali self-assigned this Sep 17, 2019
Member

@adrinjalali adrinjalali left a comment


Looks pretty good now, thanks @NicolasHug

@@ -272,6 +272,8 @@ def _yield_all_checks(name, estimator):
yield check_dict_unchanged
yield check_dont_overwrite_parameters
yield check_fit_idempotent
if not tags["no_validation"]:
Member


Does it make sense to make sure that n_features_in_ is None or doesn't exist if the no_validation tag is set?

Member Author


I would say no, since you could have a meta estimator that delegates the validation to its estimators, and still have the attribute?

Member


Ok.

.format(self.__class__.__name__)
) from nfe

return self.best_estimator_.n_features_in_
Member


we could set n_features_in_ in fit, and not have the property.

Member Author


Sure. I don't mind too much. I think for the searches we use properties a lot.

Member


at least in its current state, classes_ is the only other public property.

Even best_score_ seems not to be a property. It's just that setting it in fit is less code and kinda cleaner?

Member Author


I was referring to all the if_delegate_has_method methods but yeah these aren't properties.
OK I'll set it in fit

Member Author


Thinking about it again I'd rather not: the best_estimator_ might just not have the attribute and that would raise an error.
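For illustration, the delegation pattern from the snippet above could look roughly like this. GridSearchSketch and its trivial fit are stand-ins, not the actual scikit-learn code; the point is re-raising a missing attribute with a clearer message:

```python
class GridSearchSketch:
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        # the real search would clone, fit, and select; here we just
        # pretend the given estimator is the best one
        self.best_estimator_ = self.estimator
        return self

    @property
    def n_features_in_(self):
        try:
            return self.best_estimator_.n_features_in_
        except AttributeError as nfe:
            # either fit wasn't called, or best_estimator_ itself
            # doesn't expose the attribute
            raise AttributeError(
                "{} object has no n_features_in_ attribute.".format(
                    self.__class__.__name__
                )
            ) from nfe
```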

@jnothman
Member

I don't want to be bureaucratic, but does this require a SLEP according to the agreed governance doc? I really want to see the Pandas reordering issues fixed and I think this is a step towards that. It's also a step towards feature names and more introspection capability for ColumnTransformer/FeatureUnion. I'd love to see all of those things progressed.

But this does, by definition, modify the API principles, altering the public API of and placing a constraint on literally every estimator lacking a 'no_validation' tag...?

n_features_in_ does not appear in a docstring's Attributes sections yet, either... should it?

@adrinjalali
Member

In my mind I thought that if there's no objection, we can always get them in. But the governance doc does indeed require this to have a SLEP.

On the other hand, if the estimators set the tag and don't implement these APIs, they'd still pass the estimator checks, so it's not mandatory for them to have it.

Overall I think having a SLEP for n_features_in_ and n_features_out_ doesn't seem like an unreasonable requirement for the two of them, but I don't love that we have to require SLEPs for changes like these.

@NicolasHug
Member Author

I was sort of hoping that my warning during the last meeting would avoid this. I'll raise the issue again on Monday.

@NicolasHug
Member Author

Trying to solve the conflicts: what should n_features_in_ be when X is a precomputed distance matrix?

@amueller
Member

It should be n_samples_. It's the same for kernels, right?
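To illustrate the point: a precomputed distance (or kernel) matrix is square, so the second dimension that fit sees equals n_samples. A plain-numpy sketch (the final comment describes the behavior being discussed, not an implemented API):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(10, 4)                 # 10 samples, 4 features
# pairwise squared Euclidean distances via numpy broadcasting
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
assert D.shape == (10, 10)          # square: (n_samples, n_samples)
# an estimator fitted on D with a precomputed metric would therefore
# record a value equal to n_samples, i.e. 10 here
```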

@NicolasHug
Member Author

You mean creating a new n_samples_ attribute?

@jnothman
Member

jnothman commented Sep 18, 2019 via email

@NicolasHug
Member Author

Opened SLEP at scikit-learn/enhancement_proposals#22

@NicolasHug
Member Author

will address merge conflicts when the SLEP is accepted

@NicolasHug NicolasHug added this to the 0.22 milestone Oct 24, 2019
@NicolasHug NicolasHug added the High Priority High priority issues and pull requests label Oct 24, 2019
@adrinjalali adrinjalali modified the milestones: 0.22, 0.23 Oct 31, 2019
@adrinjalali
Member

The SLEP is now accepted \o/

@GaelVaroquaux
Member

GaelVaroquaux commented Jan 6, 2020 via email

@amueller
Member

amueller commented Jan 6, 2020

yay! I think @NicolasHug is still on vacation but will follow up soon :)

@NicolasHug
Member Author

Given the amount of conflicts and the differences with the accepted solution, I'm closing this PR and will open another one soon

7 participants