CLN Only check for n_features_in_ when it exists #18011


Merged: 7 commits into scikit-learn:master on Oct 13, 2020

Conversation

thomasjpfan
Member

@thomasjpfan thomasjpfan commented Jul 27, 2020

Reference Issues/PRs

Related to #18010

What does this implement/fix? Explain your changes.

We have code blocks like this:

log_reg = LogisticRegression(solver=solver, multi_class=multi_class)
# The score method of Logistic Regression has a classes_ attribute.
if multi_class == 'ovr':
    log_reg.classes_ = np.array([-1, 1])
elif multi_class == 'multinomial':
    log_reg.classes_ = np.unique(y_train)
else:
    raise ValueError("multi_class should be either multinomial or ovr, "
                     "got %s" % multi_class)
if pos_class is not None:
    mask = (y_test == pos_class)
    y_test = np.ones(y_test.shape, dtype=np.float64)
    y_test[~mask] = -1.
scores = list()
scoring = get_scorer(scoring)
for w in coefs:
    if multi_class == 'ovr':
        w = w[np.newaxis, :]
    if fit_intercept:
        log_reg.coef_ = w[:, :-1]
        log_reg.intercept_ = w[:, -1]
    else:
        log_reg.coef_ = w
        log_reg.intercept_ = 0.
    if scoring is None:
        scores.append(log_reg.score(X_test, y_test))
    else:
        scores.append(scoring(log_reg, X_test, y_test))

that would fail this check: _check_n_features was being run with reset=False on an estimator that was never fitted, so n_features_in_ was never set.
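For context, here is a minimal sketch of the behavior this PR proposes, not the actual sklearn/base.py implementation (the class and attribute handling below are illustrative): when reset=False and the estimator has no n_features_in_ attribute, skip the consistency check instead of raising.

```python
class EstimatorSketch:
    """Hypothetical stand-in for an estimator using _check_n_features."""

    def _check_n_features(self, n_features, reset):
        if reset:
            # fit path: record the number of features seen
            self.n_features_in_ = n_features
            return
        if not hasattr(self, "n_features_in_"):
            # never fitted (e.g. attributes were set by hand, as in the
            # snippet above): skip the check instead of raising
            return
        if n_features != self.n_features_in_:
            raise ValueError(
                f"X has {n_features} features, but this estimator was "
                f"fitted with {self.n_features_in_} features."
            )


est = EstimatorSketch()
est._check_n_features(5, reset=False)  # no error: attribute missing, check skipped
est._check_n_features(5, reset=True)   # fit path: records 5
est._check_n_features(5, reset=False)  # no error: consistent
try:
    est._check_n_features(3, reset=False)  # inconsistent: raises
except ValueError as exc:
    print(exc)
```

Under this sketch, only estimators that actually went through fit (or that set n_features_in_ manually) get the strict consistency check.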

Any other comments?

I am +0.5 on doing this. I have been trying to get _validate_data to be called in predict, transform, and friends, which would make it easier to continue with column name consistency (#18010).

CC @NicolasHug

@NicolasHug
Member

Sorry I don't understand the relation with the code snippet you provided. I don't see where _check_n_features is called here. I don't see _validate_data either?

@thomasjpfan
Member Author

Currently the predict, transform, etc. methods do not use _validate_data. Once they start to, enforcing n_features_in_ will break cases like the snippet above when calling score (which calls predict).

The alternative is to set n_features_in_ in these snippets as well.

@NicolasHug
Member

(I've edited the snippet so that the call to score is actually shown ;) )

This is a bit of a tricky case, but I'm not super comfortable with having less strict checks just so we can support cases where fit wasn't called. It seems that what the snippet is doing is basically a hack relying on some semi-private mechanism. Which makes me think that it should probably indeed set n_features_in_ manually there.

Are there other occurrences of such technique?

@thomasjpfan
Member Author

Coming back to this: the same situation can happen for stateless estimators like the Normalizer. I guess we can also skip this check for stateless estimators?
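To illustrate the stateless case under discussion, here is a hypothetical Normalizer-like class (a sketch, not scikit-learn's actual Normalizer): its transform depends only on X, so calling it without fit is natural, and n_features_in_ is never set.

```python
class NormalizerSketch:
    """Hypothetical stateless transformer: transform needs no fitted state."""

    def transform(self, X):
        # L2-normalize each row; rows of zeros are left unchanged
        out = []
        for row in X:
            norm = sum(v * v for v in row) ** 0.5 or 1.0
            out.append([v / norm for v in row])
        return out


X = [[3.0, 4.0], [0.0, 0.0]]
result = NormalizerSketch().transform(X)  # works without ever calling fit
print(result)
```

Because transform never consults fitted attributes, a strict n_features_in_ check would reject a usage pattern that is otherwise perfectly well defined for such estimators.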

@NicolasHug
Member

Hm, since _validate_data is called in Normalizer.fit, I think we instead need to update our docs and encourage people to call fit on it.

@amueller
Member

I think it should be fine to have people use Normalizer without calling fit though it makes the logic a bit more tricky. If you call transform twice on differently shaped data, should it error? I guess there's no reason to?

@NicolasHug
Member

I would argue that fit should always be called if the estimator is used, and normalize can be used otherwise... but yeah we don't enforce that, sadly.

Regardless, I'm not super happy with the current change because it loosens the check in the general case.

@ogrisel
Member

ogrisel commented Oct 9, 2020

I just discovered this discussion a posteriori. As I just said in #18577 (review) I think we should really be tolerant and allow users to call transform without calling fit on transformers with the "stateless" estimator tag.

For other estimators, I am a bit less certain. We do predict / transform without fit only in tests (as done for logistic regression in #18578 and PowerTransformer in #18577). But third-party devs might have other valid use cases. So, in conclusion, I think we should not check for n_features_in_ consistency when reset=False whenever the attribute is missing, for any reason.

Member

@ogrisel ogrisel left a comment


The patch coverage issue is expected. Coverage of those lines will increase in the PRs linked in the comment above.

Member

@NicolasHug NicolasHug left a comment


I think I would be happier if we only did that for the stateless estimators.
For the rest, it should be reasonable to let developers either set n_features_in_ manually, or leave reset=True to its default.

(Otherwise, the docstring of _check_n_features needs an update.)

sklearn/base.py Outdated
Comment on lines 375 to 377
fitted_n_features_in = getattr(self, 'n_features_in_', None)
if fitted_n_features_in is None:
    return
Member


I think that hasattr would be cleaner, especially since self.n_features_in_ is used later and fitted_n_features_in isn't
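The suggestion can be sketched as a hypothetical before/after (paraphrased, not the real sklearn/base.py code): the getattr form binds a name that is never used again, while hasattr states the early-return intent directly.

```python
class Before:
    """getattr variant: binds fitted_n_features_in, then never reuses it."""

    def _check_n_features(self):
        fitted_n_features_in = getattr(self, "n_features_in_", None)
        if fitted_n_features_in is None:
            return "skipped"
        return "checked"


class After:
    """hasattr variant: expresses the same early return without the binding."""

    def _check_n_features(self):
        if not hasattr(self, "n_features_in_"):
            return "skipped"
        return "checked"


# Both skip when the attribute is absent; both check when it is present.
print(Before()._check_n_features(), After()._check_n_features())
```

One subtlety worth noting: the two forms differ only if n_features_in_ could legitimately be set to None, in which case getattr treats it as absent while hasattr does not.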

@ogrisel
Member

ogrisel commented Oct 12, 2020

I think I would be happier if we only did that for the stateless estimators.
For the rest, it should be reasonable to let developers either set n_features_in_ manually, or leave reset=True to its default.

I wonder if this kind of code:

https://github.com/scikit-learn/scikit-learn/pull/18578/files/61ba6d5b2a80f2fbbc8c34d093e36ee6dad6abd5#diff-3594349d15254765b299928caf118517

will show up in third-party libraries, and whether we will break them when upgrading to 0.24 for not good enough reasons.

@ogrisel
Member

ogrisel commented Oct 12, 2020

I think the combination of the loose "check n_features_in_ if present" behavior and the test_check_n_features_in_after_fitting common test is enough to assert that scikit-learn estimators check consistency with informative error messages.

We can always make the input validation code stricter in the future (e.g. 0.25) once check_n_features_in_after_fitting is officially part of the public suite of estimator checks.

@NicolasHug
Member

It makes me a bit uncomfortable that _validate_data(reset=False) might actually not check anything, because changing reset from its default (True) is supposed to express the intention of validating something. At least that's how we designed it.

will not show up in third-party libraries and we will break them when upgrading for 0.24 for not good enough reasons.

IMHO, building an estimator from scratch without calling fit and without using any data is a grey area, akin to using the private API.

But OK, I don't have much more to add if you still think we should merge.

@thomasjpfan can you please update the docstring? LGTM then

@ogrisel
Member

ogrisel commented Oct 12, 2020

But OK, I don't have much more to add if you still think we should merge.

I am still +1 to merge as is (with the docstring update) to get this in with as little friction as possible, continue the work on upgrading all the modules to have them pass the common test, and re-explore making this validation stricter in the future (also in light of the experience with stored feature name checks).

@thomasjpfan alright with you?

thomasjpfan and others added 2 commits October 12, 2020 17:08
Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
@thomasjpfan
Member Author

Updated docstring at edb6029 (#18011)

sklearn/base.py Outdated
Else, the attribute must already exist and the function checks
that it is equal to `X.shape[1]`.
If False and the attribute exists, then check that it is equal to
`X.shape[1]`. If False and the attribute does *not* exists, then
Member


Suggested change
`X.shape[1]`. If False and the attribute does *not* exists, then
`X.shape[1]`. If False and the attribute does *not* exist, then

Member

@NicolasHug NicolasHug left a comment


last bit

@ogrisel ogrisel merged commit e8ffa31 into scikit-learn:master Oct 13, 2020
@ogrisel
Member

ogrisel commented Oct 13, 2020

Merged!

amrcode pushed a commit to amrcode/scikit-learn that referenced this pull request Oct 19, 2020
…8011)

* CLN Checks n_features_in only if it exists

* Update sklearn/base.py

Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>

* DOC Update docstring

* DOC Grammer

* Grammar [ci skip]

Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020