ENH Uses _validate_data in other methods in the neural_network module #18514
Conversation
interesting choice of a first module to fix :D tests failing :)
sklearn/base.py
Outdated
```diff
@@ -360,6 +360,10 @@ def _check_n_features(self, X, reset):
             If True, the `n_features_in_` attribute is set to `X.shape[1]`.
             Else, the attribute must already exist and the function checks
             that it is equal to `X.shape[1]`.
+
+            .. note::
+               It is recommended to call reset=True in `fit` and in the first
+               call to `partial_fit`. All other methods that validates `X`
```
```diff
-               call to `partial_fit`. All other methods that validates `X`
+               call to `partial_fit`. All other methods that validate `X`
```
same below
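To make the documented convention concrete, here is a minimal sketch of the calling pattern the note describes (the estimator and method bodies are hypothetical, not from this diff):

```python
from sklearn.base import BaseEstimator


class MyEstimator(BaseEstimator):
    def fit(self, X, y):
        # fit always resets: n_features_in_ is (re)established here
        X, y = self._validate_data(X, y, reset=True)
        ...
        return self

    def partial_fit(self, X, y):
        # reset only on the first call, before the estimator is fitted
        first_call = not hasattr(self, "n_features_in_")
        X, y = self._validate_data(X, y, reset=first_call)
        ...
        return self

    def predict(self, X):
        # non-fit methods validate X against the fitted n_features_in_
        X = self._validate_data(X, reset=False)
        ...
```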
```diff
@@ -3121,6 +3121,66 @@ def check_requires_y_none(name, estimator_orig, strict_mode=True):
         warnings.warn(warning_msg, FutureWarning)


+def check_n_features_in_after_fitting(name, estimator_orig, strict_mode=True):
```
Once all modules are supported, should this be merged with the already-existing `check_n_features_in` check?

We also have another check that specifically checks for an error when the number of features is inconsistent (I don't remember the name). Should this one be removed then? (If yes, let's document it here and next to N_FEATURES_IN_AFTER_FIT_MODULES_TO_IGNORE please.)
Yea we should merge it into `check_n_features_in`.

> We also have another check that specifically checks for an error when the number of features is inconsistent (I don't remember the name). Should this one be removed then?

I think you are referring to `check_estimators_partial_fit_n_features`. This new check adds two new requirements on top of `check_estimators_partial_fit_n_features`:

- `n_features_in_` is set during the first call to `partial_fit`.
- It is more strict when it comes to the error message.

I updated the comment with the above message.
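For illustration, the two extra requirements could be exercised roughly like this (a hedged sketch, not the actual body of `check_n_features_in_after_fitting`; `sketch_of_new_check` is a made-up name):

```python
import numpy as np
from sklearn.base import clone, is_classifier


def sketch_of_new_check(estimator_orig, X, y):
    estimator = clone(estimator_orig)
    if is_classifier(estimator):
        estimator.partial_fit(X, y, classes=np.unique(y))
    else:
        estimator.partial_fit(X, y)

    # requirement 1: n_features_in_ is set by the *first* partial_fit call
    assert estimator.n_features_in_ == X.shape[1]

    # requirement 2: non-fit methods must fail on a feature-count mismatch
    # with a specific message (stricter than the older check)
    X_bad = np.hstack([X, X])
    try:
        estimator.predict(X_bad)
        raise AssertionError("expected a ValueError on mismatched features")
    except ValueError as exc:
        assert "features" in str(exc)
```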
I was more referring to e.g. `check_classifiers_train`:

```python
msg = ("The classifier {} does not raise an error when the number of "
       "features in {} is different from the number of features in "
       "fit.")
```

but we can keep it as-is and remove it later (or not, as long as it passes).
sklearn/utils/estimator_checks.py
Outdated
```python
    estimator = clone(estimator_orig)

    has_classes = 'classes' in signature(estimator.partial_fit).parameters
```
why is this needed?
I was using this to detect if `partial_fit` accepted `classes`. It turns out all classifiers have this in `partial_fit`, so I updated it to just use `is_classifier`.
sklearn/utils/estimator_checks.py
Outdated
```python
        func = getattr(estimator, method, None)
        if func is None:
            continue
```
Nit, but I find that using `hasattr` is cleaner.
sklearn/base.py
Outdated
```diff
@@ -410,7 +418,7 @@ def _validate_data(self, X, y=None, reset=True,
         """

         if y is None:
-            if self._get_tags()['requires_y']:
+            if reset and self._get_tags()['requires_y']:
```
this means that if `y` isn't passed the second time `partial_fit` is called, then the error won't be properly raised. Do we want that?
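Concretely, the concern is the pattern below (`Demo` is a hypothetical estimator whose `partial_fit` defaults `y` to None; under the `reset and requires_y` gate in the diff above, the second call would not raise, whereas with the ungated check it does):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class Demo(ClassifierMixin, BaseEstimator):
    # hypothetical estimator whose partial_fit defaults y to None
    def partial_fit(self, X, y=None, classes=None):
        first_call = not hasattr(self, "n_features_in_")
        # with `reset and requires_y`, the tag is only enforced on the
        # first call (reset=True)
        self._validate_data(X, y, reset=first_call)
        return self


X, y = np.ones((4, 2)), np.array([0, 1, 0, 1])
est = Demo().partial_fit(X, y)  # first call: y is present, tag satisfied
est.partial_fit(X)              # later call without y: the gated check
                                # silently skips the requires_y error
```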
This kind of makes this inconvenient for calling in non-fit methods. For `predict`, we would set `reset=False` and not provide a `y`. This if statement almost assumes that `_validate_data` is only called in `fit` or `partial_fit`.
We spoke about it here: #18001
A solution would be to add a `requires_y` kwarg to `_validate_data`:

```python
def _validate_data(self, X, y=None, reset=True, requires_y='auto',
                   validate_separately=False, **check_params):
```

where `requires_y='auto'` means using the tag; it can also be a bool.
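A sketch of how the `'auto'` resolution might look inside the method (this kwarg was only a proposal, and the body here is assumed):

```python
def _validate_data(self, X, y=None, reset=True, requires_y='auto',
                   validate_separately=False, **check_params):
    if requires_y == 'auto':
        # fall back to the estimator tag
        requires_y = self._get_tags()['requires_y']
    if y is None and requires_y:
        raise ValueError(
            f"This {self.__class__.__name__} estimator "
            f"requires y to be passed, but the target y is None."
        )
    # ... remainder of the existing validation logic
```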
IIRC our chat with @amueller a few months ago, the cleanest solution we had agreed on was to call `_validate_data` only in `fit` and in the first `partial_fit`, and to define another method for the rest (`predict`, `transform`, subsequent `partial_fit`, etc.). This might require changing `_validate_data` a little, which is fine since it's private.
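For concreteness, a sketch of that split; the `_check_input` name and its body are hypothetical, not an agreed API:

```python
from sklearn.base import BaseEstimator
from sklearn.utils.validation import check_array


class SomeEstimator(BaseEstimator):
    def _check_input(self, X, **check_params):
        # hypothetical: used by predict/transform/subsequent partial_fit;
        # validates X only and checks it against the fitted n_features_in_
        X = check_array(X, **check_params)
        self._check_n_features(X, reset=False)
        return X

    def predict(self, X):
        X = self._check_input(X)
        ...
```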
> This means all fit calls must always explicitly pass y.

You mean that `y` must be explicitly passed to `fit`, or that `y` must be explicitly passed to `_validate_data` in `fit`? Either way, I'm not sure I understand the reason.
I updated the PR to use the `__NO_OP` placeholder. This means that it will be the caller's responsibility to call this with `y` to get the `requires_y` check:

```python
self._validate_data(X, None)  # will check that `y` is consistent with `requires_y` tag
self._validate_data(X)        # will ignore `requires_y` tag
```

The alternative is to recommend setting `y` (in `_validate_data`) everywhere in `fit`. In other words:

```python
def fit(self, X, y=None):
    # note that `y` is set explicitly to `None`, because the docstring
    # says `y` is ignored
    X = self._validate_data(X, None)
```

This way the `requires_y` tag will always be checked with `y`.
> This means that it will be the caller's responsibility to call this with y to get the requires_y check.

I think that's fine, because the reason we introduced the tag was to give a decent error message when `y` should be passed (e.g. for supervised estimators) but the user left it as None. Without the tag, they would get a weird tuple unpacking error with `X, y = _validate_data(X, y)`. In other words, the tag is only useful when we pass both X and y, so that's fine to ignore it when we just pass X.
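A tiny standalone reproduction of that confusing failure (a simplified stand-in, not the real `_validate_data`):

```python
import numpy as np


def _validate_data(X, y=None):
    # simplified stand-in: returns only X when y is None
    if y is None:
        return np.asarray(X)
    return np.asarray(X), np.asarray(y)


# caller forgot y on a supervised estimator:
X, y = _validate_data(np.ones((5, 2)))
# ValueError: too many values to unpack (expected 2)
# -- nothing in the message hints that y is missing
```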
Could you remove the use of the `requires_y` parameter? I still see it in a few places ;)
I think I am okay with delegating the responsibility of passing `y` to `_validate_data`'s caller:

- If the caller passes `y`, then `_validate_data` will validate `y`.
- If the caller does not pass `y`, then do nothing (in terms of `y`).

Edit: I wrote this before GitHub updated my UI with your above message.
> Could you remove the use of the requires_y parameter? I still see it in a few places ;)

All gone now!
It was a module with a transformer + regressor + classifier and something with
LGTM @thomasjpfan! The new check is nice.
Thanks @thomasjpfan , I think this LGTM.
The approach is a bit unconventional, so maybe we should wait for more eyeballs. In particular I think that the approval from @ogrisel was for a different implementation.
sklearn/base.py
Outdated
```diff
-        y : array-like of shape (n_samples,), default=None
-            The targets. If None, `check_array` is called on `X` and
-            `check_X_y` is called otherwise.
+        y : array-like of shape (n_samples,), default=__NO_Y
```
lol I think this might be the only case where it would make sense to use `optional` instead of indicating `default=...`. But we can leave it as-is since it's still private.
sklearn/base.py
Outdated
```
              requires_y tag is ignored. This is a default placeholder and is
              never meant to be explicitly set.
            - Otherwise, both `X` and `y` are checked with either `check_array`
              or `check_X_y`.
```
```diff
-              or `check_X_y`.
+              or `check_X_y` depending on `validate_separately`
```
Using
What do you think of implementing the following slight variation to make this solution clearer? Other than that, LGTM.
sklearn/base.py
Outdated
```diff
@@ -156,6 +156,8 @@ class BaseEstimator:
     at the class level in their ``__init__`` as explicit keyword
     arguments (no ``*args`` or ``**kwargs``).
     """
+    # used by _validate_data when `y` is not validated
+    __NO_Y = '__NO_Y'
```
If you really want to keep a module level constant as a default placeholder, I would rather use single underscore. It's enough to tell it's private. But I think the code would be clearer without defining a module level constant.
sklearn/base.py
Outdated
```diff
@@ -378,7 +384,7 @@ def _check_n_features(self, X, reset):
                     self.n_features_in_)
             )

-    def _validate_data(self, X, y=None, reset=True,
+    def _validate_data(self, X, y=__NO_Y, reset=True,
```
I would just set `y="no_validation"` instead of defining a module-level constant as a default placeholder with a rather implicit name.
That would also make the docstring easier to understand.
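A sketch of the string-sentinel variant (`validate_separately` omitted for brevity; details beyond the default value are assumed):

```python
from sklearn.utils.validation import check_array, check_X_y


def _validate_data(self, X, y="no_validation", reset=True, **check_params):
    if isinstance(y, str) and y == "no_validation":
        # caller opted out of y validation entirely
        out = X = check_array(X, **check_params)
    elif y is None:
        if self._get_tags()["requires_y"]:
            raise ValueError(
                f"This {self.__class__.__name__} estimator "
                f"requires y to be passed, but the target y is None."
            )
        out = X = check_array(X, **check_params)
    else:
        X, y = check_X_y(X, y, **check_params)
        out = X, y
    self._check_n_features(X, reset=reset)
    return out
```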
Thanks @thomasjpfan! Merging.
So I guess now we have to open as many PRs as there are modules to update :)
Reference Issues/PRs

Related to #18010

What does this implement/fix? Explain your changes.

This PR adds the use of `_validate_data` to non-fit methods. `N_FEATURES_IN_AFTER_FIT_MODULES_TO_IGNORE` is used to ignore modules so that we can work on this issue with smaller PRs. Currently, this PR only adds `_validate_data` to the `neural_network` module.

Other comments?

While we work on the details of #18010, we can add `_validate_data` in non-fit methods concurrently. #18010 will be "free" if we can have `_validate_data` in non-fit methods.

CC @NicolasHug @adrinjalali @amueller
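To illustrate the opt-out mechanism, a hedged sketch of how the module list could gate the new check; the gating helper and the set's entries are assumptions, only the constant's name and purpose come from this PR:

```python
# Modules whose estimators have not yet been migrated; the set shrinks
# as follow-up PRs land (entries below are purely illustrative).
N_FEATURES_IN_AFTER_FIT_MODULES_TO_IGNORE = {
    "cluster",
    "ensemble",
    "linear_model",
    # ... everything except neural_network, for now
}


def should_run_new_check(estimator):
    """Hypothetical helper: skip estimators from not-yet-migrated modules."""
    module = type(estimator).__module__.split(".")[1]
    return module not in N_FEATURES_IN_AFTER_FIT_MODULES_TO_IGNORE
```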