ENH Uses _validate_data in other methods in the neural_network module #18514
Conversation
interesting choice of a first module to fix :D tests failing :)
sklearn/base.py
Outdated
```diff
@@ -360,6 +360,10 @@ def _check_n_features(self, X, reset):
             If True, the `n_features_in_` attribute is set to `X.shape[1]`.
             Else, the attribute must already exist and the function checks
             that it is equal to `X.shape[1]`.
+
+            .. note::
+               It is recommended to call reset=True in `fit` and in the first
+               call to `partial_fit`. All other methods that validates `X`
```
```diff
-               call to `partial_fit`. All other methods that validates `X`
+               call to `partial_fit`. All other methods that validate `X`
```
same below
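To make the documented convention concrete, here is a minimal sketch of the calling pattern the note describes (the estimator and method bodies are hypothetical, not from this diff):

```python
from sklearn.base import BaseEstimator


class MyEstimator(BaseEstimator):
    def fit(self, X, y):
        # fit always resets: n_features_in_ is (re)established here
        X, y = self._validate_data(X, y, reset=True)
        ...
        return self

    def partial_fit(self, X, y):
        # reset only on the first call, before the estimator is fitted
        first_call = not hasattr(self, "n_features_in_")
        X, y = self._validate_data(X, y, reset=first_call)
        ...
        return self

    def predict(self, X):
        # non-fit methods validate X against the fitted n_features_in_
        X = self._validate_data(X, reset=False)
        ...
```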
```diff
@@ -3121,6 +3121,66 @@ def check_requires_y_none(name, estimator_orig, strict_mode=True):
         warnings.warn(warning_msg, FutureWarning)


+def check_n_features_in_after_fitting(name, estimator_orig, strict_mode=True):
```
Once all modules are supported, should this be merged with the already-existing `check_n_features_in` check?

We also have another check that specifically checks for an error when the number of features is inconsistent (I don't remember the name). Should this one be removed then? (If yes, let's document it here and next to N_FEATURES_IN_AFTER_FIT_MODULES_TO_IGNORE please.)
Yea we should merge it into `check_n_features_in`.

> We also have another check that specifically checks for an error when the number of features is inconsistent (I don't remember the name). Should this one be removed then?

I think you are referring to `check_estimators_partial_fit_n_features`. This new check adds two new requirements on top of `check_estimators_partial_fit_n_features`:

- `n_features_in_` is set during the first call to `partial_fit`.
- It is more strict when it comes to the error message.

I updated the comment with the above message.
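For illustration, the two extra requirements could be exercised roughly like this (a hedged sketch, not the actual body of `check_n_features_in_after_fitting`; `sketch_of_new_check` is a made-up name):

```python
import numpy as np
from sklearn.base import clone, is_classifier


def sketch_of_new_check(estimator_orig, X, y):
    estimator = clone(estimator_orig)
    if is_classifier(estimator):
        estimator.partial_fit(X, y, classes=np.unique(y))
    else:
        estimator.partial_fit(X, y)

    # requirement 1: n_features_in_ is set by the *first* partial_fit call
    assert estimator.n_features_in_ == X.shape[1]

    # requirement 2: non-fit methods must fail on a feature-count mismatch
    # with a specific message (stricter than the older check)
    X_bad = np.hstack([X, X])
    try:
        estimator.predict(X_bad)
        raise AssertionError("expected a ValueError on mismatched features")
    except ValueError as exc:
        assert "features" in str(exc)
```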
I was more referring to e.g. `check_classifiers_train`:

```python
msg = ("The classifier {} does not raise an error when the number of "
       "features in {} is different from the number of features in "
       "fit.")
```

but we can keep it as-is and remove it later (or not, as long as it passes).
sklearn/utils/estimator_checks.py
Outdated
```python
    estimator = clone(estimator_orig)

    has_classes = 'classes' in signature(estimator.partial_fit).parameters
```
why is this needed?
I was using this to detect if `partial_fit` accepted `classes`. It turns out all classifiers have this in `partial_fit`, so I updated it to just use `is_classifier`.
sklearn/utils/estimator_checks.py
Outdated
```python
        func = getattr(estimator, method, None)
        if func is None:
            continue
```
Nit, but I find that using `hasattr` is cleaner.
sklearn/base.py
Outdated
```diff
@@ -410,7 +418,7 @@ def _validate_data(self, X, y=None, reset=True,
         """

         if y is None:
-            if self._get_tags()['requires_y']:
+            if reset and self._get_tags()['requires_y']:
```
this means that if `y` isn't passed the second time `partial_fit` is called, then the error won't be properly raised. Do we want that?
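Concretely, the concern is the pattern below (`Demo` is a hypothetical estimator whose `partial_fit` defaults `y` to None; under the `reset and requires_y` gate in the diff above, the second call would not raise, whereas with the ungated check it does):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class Demo(ClassifierMixin, BaseEstimator):
    # hypothetical estimator whose partial_fit defaults y to None
    def partial_fit(self, X, y=None, classes=None):
        first_call = not hasattr(self, "n_features_in_")
        # with `reset and requires_y`, the tag is only enforced on the
        # first call (reset=True)
        self._validate_data(X, y, reset=first_call)
        return self


X, y = np.ones((4, 2)), np.array([0, 1, 0, 1])
est = Demo().partial_fit(X, y)  # first call: y is present, tag satisfied
est.partial_fit(X)              # later call without y: the gated check
                                # silently skips the requires_y error
```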
This kind of makes this inconvenient for calling in non-fit methods. For `predict`, we would set `reset=False` and not provide a `y`. This if statement almost assumes that `_validate_data` is only called in `fit` or `partial_fit`.
We spoke about it here: #18001
A solution would be to add a `requires_y` kwarg to `_validate_data`:

```python
def _validate_data(self, X, y=None, reset=True, requires_y='auto',
                   validate_separately=False, **check_params):
```

where `requires_y='auto'` means using the tag; it can also be a bool.
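A sketch of how the `'auto'` resolution might look inside the method (this kwarg was only a proposal, and the body here is assumed):

```python
def _validate_data(self, X, y=None, reset=True, requires_y='auto',
                   validate_separately=False, **check_params):
    if requires_y == 'auto':
        # fall back to the estimator tag
        requires_y = self._get_tags()['requires_y']
    if y is None and requires_y:
        raise ValueError(
            f"This {self.__class__.__name__} estimator "
            f"requires y to be passed, but the target y is None."
        )
    # ... remainder of the existing validation logic
```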
IIRC our chat with @amueller a few months ago, the cleanest solution we had agreed on was to call `_validate_data` only in `fit` and in the first `partial_fit`, and to define another method for the rest (`predict`, `transform`, subsequent `partial_fit`, etc.). This might require changing `_validate_data` a little, which is fine since it's private.
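For concreteness, a sketch of that split; the `_check_input` name and its body are hypothetical, not an agreed API:

```python
from sklearn.base import BaseEstimator
from sklearn.utils.validation import check_array


class SomeEstimator(BaseEstimator):
    def _check_input(self, X, **check_params):
        # hypothetical: used by predict/transform/subsequent partial_fit;
        # validates X only and checks it against the fitted n_features_in_
        X = check_array(X, **check_params)
        self._check_n_features(X, reset=False)
        return X

    def predict(self, X):
        X = self._check_input(X)
        ...
```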
> This means all fit calls must always explicitly pass y.

You mean that `y` must be explicitly passed to `fit`, or that `y` must be explicitly passed to `_validate_data` in `fit`? Either way, I'm not sure I understand the reason.
I updated the PR to use the `__NO_OP` placeholder. This means that it will be the caller's responsibility to call this with `y` to get the `requires_y` check:

```python
self._validate_data(X, None)  # will check that `y` is consistent with `requires_y` tag
self._validate_data(X)        # will ignore `requires_y` tag
```

The alternative is to recommend setting `y` (in `_validate_data`) everywhere in `fit`. In other words:

```python
def fit(self, X, y=None):
    # note that `y` is set explicitly to `None`, because the docstring
    # says `y` is ignored
    X = self._validate_data(X, None)
```

This way the `requires_y` tag will always be checked with `y`.
> This means that it will be the caller's responsibility to call this with y to get the requires_y check.

I think that's fine, because the reason we introduced the tag was to give a decent error message when `y` should be passed (e.g. for supervised estimators) but the user left it as None. Without the tag, they would get a weird tuple unpacking error with `X, y = _validate_data(X, y)`. In other words, the tag is only useful when we pass both X and y, so that's fine to ignore it when we just pass X.
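A tiny standalone reproduction of that confusing failure (a simplified stand-in, not the real `_validate_data`):

```python
import numpy as np


def _validate_data(X, y=None):
    # simplified stand-in: returns only X when y is None
    if y is None:
        return np.asarray(X)
    return np.asarray(X), np.asarray(y)


# caller forgot y on a supervised estimator:
X, y = _validate_data(np.ones((5, 2)))
# ValueError: too many values to unpack (expected 2)
# -- nothing in the message hints that y is missing
```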
Could you remove the use of the `requires_y` parameter? I still see it in a few places ;)
I think I am okay with delegating the responsibility of passing `y` to `_validate_data`'s caller:

- If the caller passes `y`, then `_validate_data` will validate `y`.
- If the caller does not pass `y`, then do nothing (in terms of `y`).

Edit: I wrote this before GitHub updated my UI with your above message.
> Could you remove the use of the requires_y parameter? I still see it in a few places ;)

All gone now!
It was a module with a transformer + regressor + classifier and something with
LGTM @thomasjpfan! The new check is nice.
Thanks @thomasjpfan , I think this LGTM.
The approach is a bit unconventional, so maybe we should wait for more eyeballs. In particular I think that the approval from @ogrisel was for a different implementation.
sklearn/base.py
Outdated
```diff
-        y : array-like of shape (n_samples,), default=None
-            The targets. If None, `check_array` is called on `X` and
-            `check_X_y` is called otherwise.
+        y : array-like of shape (n_samples,), default=__NO_Y
```
lol I think this might be the only case where it would make sense to use `optional` instead of indicating `default=...`. But we can leave it as-is since it's still private.
sklearn/base.py
Outdated
```
              requires_y tag is ignored. This is a default placeholder and is
              never meant to be explicitly set.
            - Otherwise, both `X` and `y` are checked with either `check_array`
              or `check_X_y`.
```
```diff
-              or `check_X_y`.
+              or `check_X_y` depending on `validate_separately`
```
Using
What do you think of implementing the following slight variation to make this solution clearer? Other than that, LGTM.
sklearn/base.py
Outdated
```diff
@@ -156,6 +156,8 @@ class BaseEstimator:
     at the class level in their ``__init__`` as explicit keyword
     arguments (no ``*args`` or ``**kwargs``).
     """
+    # used by _validate_data when `y` is not validated
+    __NO_Y = '__NO_Y'
```
If you really want to keep a module level constant as a default placeholder, I would rather use single underscore. It's enough to tell it's private. But I think the code would be clearer without defining a module level constant.
sklearn/base.py
Outdated
```diff
@@ -378,7 +384,7 @@ def _check_n_features(self, X, reset):
                     self.n_features_in_)
             )

-    def _validate_data(self, X, y=None, reset=True,
+    def _validate_data(self, X, y=__NO_Y, reset=True,
```
I would just set `y="no_validation"` instead of defining a module-level constant as a default placeholder with a rather implicit name.
That would also make the docstring easier to understand.
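A sketch of the string-sentinel variant (`validate_separately` omitted for brevity; details beyond the default value are assumed):

```python
from sklearn.utils.validation import check_array, check_X_y


def _validate_data(self, X, y="no_validation", reset=True, **check_params):
    if isinstance(y, str) and y == "no_validation":
        # caller opted out of y validation entirely
        out = X = check_array(X, **check_params)
    elif y is None:
        if self._get_tags()["requires_y"]:
            raise ValueError(
                f"This {self.__class__.__name__} estimator "
                f"requires y to be passed, but the target y is None."
            )
        out = X = check_array(X, **check_params)
    else:
        X, y = check_X_y(X, y, **check_params)
        out = X, y
    self._check_n_features(X, reset=reset)
    return out
```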
Thanks @thomasjpfan! Merging.
So I guess now we have to open as many PRs as there are modules to update :)
Reference Issues/PRs

Related to #18010

What does this implement/fix? Explain your changes.

This PR adds the use of `_validate_data` to non-fit methods. `N_FEATURES_IN_AFTER_FIT_MODULES_TO_IGNORE` is used to ignore modules so that we can work on this issue with smaller PRs. Currently, this PR only adds `_validate_data` to the `neural_network` module.

Other comments?

While we work on the details of #18010, we can add `_validate_data` in non-fit methods concurrently. #18010 will be "free" if we can have `_validate_data` in non-fit methods.

CC @NicolasHug @adrinjalali @amueller
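To illustrate the opt-out mechanism, a hedged sketch of how the module list could gate the new check; the gating helper and the set's entries are assumptions, only the constant's name and purpose come from this PR:

```python
# Modules whose estimators have not yet been migrated; the set shrinks
# as follow-up PRs land (entries below are purely illustrative).
N_FEATURES_IN_AFTER_FIT_MODULES_TO_IGNORE = {
    "cluster",
    "ensemble",
    "linear_model",
    # ... everything except neural_network, for now
}


def should_run_new_check(estimator):
    """Hypothetical helper: skip estimators from not-yet-migrated modules."""
    module = type(estimator).__module__.split(".")[1]
    return module not in N_FEATURES_IN_AFTER_FIT_MODULES_TO_IGNORE
```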