CLN Only check for n_features_in_ when it exists #18011


Merged: 7 commits into scikit-learn:master on Oct 13, 2020

Conversation

thomasjpfan
Member

@thomasjpfan thomasjpfan commented Jul 27, 2020

Reference Issues/PRs

Related to #18010

What does this implement/fix? Explain your changes.

We have code blocks like this:

log_reg = LogisticRegression(solver=solver, multi_class=multi_class)
# The score method of Logistic Regression has a classes_ attribute.
if multi_class == 'ovr':
    log_reg.classes_ = np.array([-1, 1])
elif multi_class == 'multinomial':
    log_reg.classes_ = np.unique(y_train)
else:
    raise ValueError("multi_class should be either multinomial or ovr, "
                     "got %s" % multi_class)
if pos_class is not None:
    mask = (y_test == pos_class)
    y_test = np.ones(y_test.shape, dtype=np.float64)
    y_test[~mask] = -1.
scores = list()
scoring = get_scorer(scoring)
for w in coefs:
    if multi_class == 'ovr':
        w = w[np.newaxis, :]
    if fit_intercept:
        log_reg.coef_ = w[:, :-1]
        log_reg.intercept_ = w[:, -1]
    else:
        log_reg.coef_ = w
        log_reg.intercept_ = 0.
    if scoring is None:
        scores.append(log_reg.score(X_test, y_test))
    else:
        scores.append(scoring(log_reg, X_test, y_test))

that would fail this check: _check_n_features was being run with reset=False on an estimator that was never fitted, so n_features_in_ was never set.
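For context, here is a minimal sketch of the behavior this PR proposes, not the actual sklearn/base.py implementation (the class and attribute handling below are illustrative): when reset=False and the estimator has no n_features_in_ attribute, skip the consistency check instead of raising.

```python
class EstimatorSketch:
    """Hypothetical stand-in for an estimator using _check_n_features."""

    def _check_n_features(self, n_features, reset):
        if reset:
            # fit path: record the number of features seen
            self.n_features_in_ = n_features
            return
        if not hasattr(self, "n_features_in_"):
            # never fitted (e.g. attributes were set by hand, as in the
            # snippet above): skip the check instead of raising
            return
        if n_features != self.n_features_in_:
            raise ValueError(
                f"X has {n_features} features, but this estimator was "
                f"fitted with {self.n_features_in_} features."
            )


est = EstimatorSketch()
est._check_n_features(5, reset=False)  # no error: attribute missing, check skipped
est._check_n_features(5, reset=True)   # fit path: records 5
est._check_n_features(5, reset=False)  # no error: consistent
try:
    est._check_n_features(3, reset=False)  # inconsistent: raises
except ValueError as exc:
    print(exc)
```

Under this sketch, only estimators that actually went through fit (or that set n_features_in_ manually) get the strict consistency check.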

Any other comments?

I am +0.5 on doing this. I have been trying to get _validate_data to be called in predict, transform, and friends, which would make it easier to continue with column name consistency (#18010).

CC @NicolasHug

@NicolasHug
Member

Sorry I don't understand the relation with the code snippet you provided. I don't see where _check_n_features is called here. I don't see _validate_data either?

@thomasjpfan
Member Author

Currently the predict, transform, etc. methods do not use _validate_data. Once they start to, enforcing n_features_in_ will break cases like the snippet above when calling score (which calls predict).

The alternative is to set n_features_in_ in these snippets as well.

@NicolasHug
Member

(I've edited the snippet so that the call to score is actually shown ;) )

This is a bit of a tricky case, but I'm not super comfortable with having less strict checks just so we can support cases where fit wasn't called. It seems that what the snippet is doing is basically a hack relying on some semi-private mechanism. Which makes me think that it should probably indeed set n_features_in_ manually there.

Are there other occurrences of such technique?

@thomasjpfan
Member Author

Coming back to this: the same situation can happen for stateless estimators like the Normalizer. I guess we can also skip this check for stateless estimators?
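To illustrate the stateless case under discussion, here is a hypothetical Normalizer-like class (a sketch, not scikit-learn's actual Normalizer): its transform depends only on X, so calling it without fit is natural, and n_features_in_ is never set.

```python
class NormalizerSketch:
    """Hypothetical stateless transformer: transform needs no fitted state."""

    def transform(self, X):
        # L2-normalize each row; rows of zeros are left unchanged
        out = []
        for row in X:
            norm = sum(v * v for v in row) ** 0.5 or 1.0
            out.append([v / norm for v in row])
        return out


X = [[3.0, 4.0], [0.0, 0.0]]
result = NormalizerSketch().transform(X)  # works without ever calling fit
print(result)
```

Because transform never consults fitted attributes, a strict n_features_in_ check would reject a usage pattern that is otherwise perfectly well defined for such estimators.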

@NicolasHug
Member

Hm, since _validate_data is called in Normalizer.fit, I think we instead need to update our docs and encourage people to call fit on it.

@amueller
Member

I think it should be fine to have people use Normalizer without calling fit though it makes the logic a bit more tricky. If you call transform twice on differently shaped data, should it error? I guess there's no reason to?

@NicolasHug
Member

I would argue that fit should always be called if the estimator is used, and normalize can be used otherwise... but yeah we don't enforce that, sadly.

Regardless, I'm not super happy with the current change because it loosens the check in the general case.

@ogrisel
Member

ogrisel commented Oct 9, 2020

I just discovered this discussion a posteriori. As I just said in #18577 (review) I think we should really be tolerant and allow users to call transform without calling fit on transformers with the "stateless" estimator tag.

For other estimators, I am a bit less certain. We do predict / transform without fit only in tests (as done for logistic regression in #18578 and PowerTransformer in #18577). But third-party devs might have other valid use cases. So, in conclusion, I think we should not check for n_features_in_ consistency when reset=False whenever the attribute is missing, for any reason.

Member

@ogrisel ogrisel left a comment


The patch coverage issue is expected. Coverage of those lines will increase in the PRs linked in the comment above.

Member

@NicolasHug NicolasHug left a comment


I think I would be happier if we only did that for the stateless estimators.
For the rest, it should be reasonable to let developers either set n_features_in_ manually, or leave reset=True to its default.

(Otherwise, the docstring of _check_n_features needs an update.)

sklearn/base.py Outdated
Comment on lines 375 to 377
fitted_n_features_in = getattr(self, 'n_features_in_', None)
if fitted_n_features_in is None:
    return
Member


I think that hasattr would be cleaner, especially since self.n_features_in_ is used later and fitted_n_features_in isn't
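The suggestion can be sketched as a hypothetical before/after (paraphrased, not the real sklearn/base.py code): the getattr form binds a name that is never used again, while hasattr states the early-return intent directly.

```python
class Before:
    """getattr variant: binds fitted_n_features_in, then never reuses it."""

    def _check_n_features(self):
        fitted_n_features_in = getattr(self, "n_features_in_", None)
        if fitted_n_features_in is None:
            return "skipped"
        return "checked"


class After:
    """hasattr variant: expresses the same early return without the binding."""

    def _check_n_features(self):
        if not hasattr(self, "n_features_in_"):
            return "skipped"
        return "checked"


# Both skip when the attribute is absent; both check when it is present.
print(Before()._check_n_features(), After()._check_n_features())
```

One subtlety worth noting: the two forms differ only if n_features_in_ could legitimately be set to None, in which case getattr treats it as absent while hasattr does not.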

@ogrisel
Member

ogrisel commented Oct 12, 2020

I think I would be happier if we only did that for the stateless estimators.
For the rest, it should be reasonable to let developers either set n_features_in_ manually, or leave reset=True to its default.

I wonder if this kind of code:

https://github.com/scikit-learn/scikit-learn/pull/18578/files/61ba6d5b2a80f2fbbc8c34d093e36ee6dad6abd5#diff-3594349d15254765b299928caf118517

will show up in third-party libraries, and whether we will break them when upgrading to 0.24 for not good enough reasons.

@ogrisel
Member

ogrisel commented Oct 12, 2020

I think the combination of the loose "check n_features_in_ if present" behavior and the test_check_n_features_in_after_fitting common test is enough to assert that scikit-learn estimators check consistency with informative error messages.

We can always make the input validation code stricter in the future (e.g. 0.25) once check_n_features_in_after_fitting is officially part of the public suite of estimator checks.

@NicolasHug
Member

It makes me a bit uncomfortable that _validate_data(reset=False) might actually not check anything, because changing reset from its default (True) is supposed to express the intention of validating something. At least that's how we designed it.

will not show up in third-party libraries and we will break them when upgrading for 0.24 for not good enough reasons.

IMHO, building an estimator from scratch without calling fit and without using any data is a grey area, akin to using the private API.

But OK, I don't have much more to add if you still think we should merge.

@thomasjpfan can you please update the docstring? LGTM then

@ogrisel
Member

ogrisel commented Oct 12, 2020

But OK, I don't have much more to add if you still think we should merge.

I am still +1 to merge as is (with the docstring update) to get this in with as little friction as possible, continue the work on upgrading all the modules to have them pass the common test, and re-explore making this validation stricter in the future (also in light of the experience with stored feature name checks).

@thomasjpfan alright with you?

thomasjpfan and others added 2 commits October 12, 2020 17:08
Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
@thomasjpfan
Member Author

Updated docstring at edb6029 (#18011)

sklearn/base.py Outdated
Else, the attribute must already exist and the function checks
that it is equal to `X.shape[1]`.
If False and the attribute exists, then check that it is equal to
`X.shape[1]`. If False and the attribute does *not* exists, then
Member


Suggested change
`X.shape[1]`. If False and the attribute does *not* exists, then
`X.shape[1]`. If False and the attribute does *not* exist, then

Member

@NicolasHug NicolasHug left a comment


last bit

@ogrisel ogrisel merged commit e8ffa31 into scikit-learn:master Oct 13, 2020
@ogrisel
Member

ogrisel commented Oct 13, 2020

Merged!

amrcode pushed a commit to amrcode/scikit-learn that referenced this pull request Oct 19, 2020
…8011)

* CLN Checks n_features_in only if it exists

* Update sklearn/base.py

Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>

* DOC Update docstring

* DOC Grammer

* Grammar [ci skip]

Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020