-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
MNT Introduction of n_features_in_ attr with _validate_data mtd #16112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
7d9dcc4
f117745
95b330c
e56592b
3bdcb5c
8ecc690
60e4cea
ff19f22
a44318b
42249fb
abdc94e
a50e76f
62fc42e
6845788
3246436
ee2598b
6a14e4b
3f2d44f
25fda0f
9bdfb65
be76ef4
b464f86
70dc4ed
988f9c4
fd9b72c
4f3d6ff
08f7192
5a41275
f0e7b41
193fda1
5b20a4c
a49e5ea
968fbff
e4faf13
908aea6
a88a4c5
f3fb539
4b7b758
53027d3
c5dfbbd
a1aea70
9ecc396
e9c3104
9292c84
e11b0bb
6846bea
60c5108
615140e
fe052e6
9a205dd
da6a15f
84e62ce
0e81156
d4d92bc
40cd141
b3251fe
7e73a24
2f448aa
9e25594
a6a344d
9e0c3d7
edc6545
d7963e7
d344f04
d6f0451
b917d72
08a8d72
4fb756e
d62394b
511c395
c270884
39d8371
9effdbf
4f8ca86
a1045e0
2c70b1b
a101d2d
dd983c4
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,6 +18,8 @@ | |
|
||
from . import __version__ | ||
from .utils import _IS_32BIT | ||
from .utils.validation import check_X_y | ||
from .utils.validation import check_array | ||
|
||
_DEFAULT_TAGS = { | ||
'non_deterministic': False, | ||
|
@@ -343,6 +345,72 @@ def _get_tags(self): | |
collected_tags.update(more_tags) | ||
return collected_tags | ||
|
||
def _check_n_features(self, X, reset): | ||
"""Set the `n_features_in_` attribute, or check against it. | ||
|
||
Parameters | ||
---------- | ||
X : {ndarray, sparse matrix} of shape (n_samples, n_features) | ||
The input samples. | ||
reset : bool | ||
If True, the `n_features_in_` attribute is set to `X.shape[1]`. | ||
Else, the attribute must already exist and the function checks | ||
that it is equal to `X.shape[1]`. | ||
""" | ||
n_features = X.shape[1] | ||
|
||
if reset: | ||
self.n_features_in_ = n_features | ||
else: | ||
if not hasattr(self, 'n_features_in_'): | ||
raise RuntimeError( | ||
ogrisel marked this conversation as resolved.
Show resolved
Hide resolved
|
||
"The reset parameter is False but there is no " | ||
"n_features_in_ attribute. Is this estimator fitted?" | ||
) | ||
if n_features != self.n_features_in_: | ||
raise ValueError( | ||
'X has {} features, but this {} is expecting {} features ' | ||
'as input.'.format(n_features, self.__class__.__name__, | ||
self.n_features_in_) | ||
) | ||
|
||
def _validate_data(self, X, y=None, reset=True, **check_params): | ||
"""Validate input data and set or check the `n_features_in_` attribute. | ||
|
||
Parameters | ||
---------- | ||
X : {array-like, sparse matrix, dataframe} of shape \ | ||
(n_samples, n_features) | ||
The input samples. | ||
y : array-like of shape (n_samples,), default=None | ||
The targets. If None, `check_array` is called on `X` and | ||
`check_X_y` is called otherwise. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Previously, check_X_y would enforce that y was not None in many estimators. What happens if the user passes y=None now (e.g. through a Pipeline where y is not a required parameter)? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the implication is that now this will break in tuple unpacking. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I don't understand this. Tuple unpacking of which function call and for what kind of input? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think you're correct Joel. Calls to Calling So instead of the nicer error in I'm not sure what to do here, apart e.g. adding a new parameter and replacing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1 overall but rather than The "force" terminology makes me think that the validation method can automatically convert There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Inheritance then? How does forcing the caller to pass y help us check when the user has indirectly provided null y? Do you rather mean that _validate_data should always return an X, y pair? That would be good in some ways. It would produce a better error message than tuple unpacking There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Indeed my solution is useless on its own, we would still need the I guess inheritance could almost work, i.e. using our current Mixins, but how do we deal with meta-estimators? I feel like a tag could work here. We can define a For the meta-estimators, we can have adhoc handling, e.g. for pipelines that would be something like That's not a trivial change though and it should be done in another PR which should be merged first. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I don't think the proposal above would change the internal validation. With my proposal above, we can add a common check that makes sure a proper error is raised when requires_y is True but None is passed. My main concerns for using estimator tags outside of estimator checks was that it was impossible to override them in child classes more than once (fixed now), and that it required deprecation cycle / warnings to give time to third party developer to adapt (#13565 (comment)). But that's a non-issue there: we're already introducing a common check with warnings for 2 cycles, so we might as well have 2 of them There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I think it would be good if there was a programmatic way to figure out if an estimator requires y (cc @thomasjpfan). I think it's a valid question whether we want to use estimator tags for that. While it would be the first time we use estimator_tags in an estimator, I think it's a good use. Right now, we actually use Whether we want to make this decision a prereq for this PR is another question. I opened #16468 for tracking adding a tag on requiring y for fitting. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The only concern I have with returning both |
||
reset : bool, default=True | ||
Whether to reset the `n_features_in_` attribute. | ||
If False, the input will be checked for consistency with data | ||
provided when reset was last True. | ||
**check_params : kwargs | ||
Parameters passed to :func:`sklearn.utils.check_array` or | ||
:func:`sklearn.utils.check_X_y`. | ||
|
||
Returns | ||
------- | ||
out : {ndarray, sparse matrix} or tuple of these | ||
The validated input. A tuple is returned if `y` is not None. | ||
""" | ||
|
||
if y is None: | ||
X = check_array(X, **check_params) | ||
out = X | ||
else: | ||
X, y = check_X_y(X, y, **check_params) | ||
out = X, y | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Couldn't we change There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. At first glance, I'm not sure of the benefits? It would just be a less obvious way to call There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The main benefit would be to have an explicit and standard There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Wouldn't it be better directly in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why not both to avoid duplication? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure I follow @ogrisel. Currently, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think we need to change check_X_y in this PR. There is enough to review here |
||
|
||
if check_params.get('ensure_2d', True): | ||
self._check_n_features(X, reset=reset) | ||
|
||
return out | ||
|
||
|
||
class ClassifierMixin: | ||
"""Mixin class for all classifiers in scikit-learn.""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I always wonder if I should write the way you have it here, or try to access the attribute and catch the AttributeError exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I try to avoid
try/except
when I can but that's personal pref.I guess
hasattr
exists for a reason so we might as well just use it ^^It might also be faster since it's written in C
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I always get the same question. FYI, I pretty much like the discussion here: https://codecalamity.com/is-it-easier-to-ask-forgiveness-in-python/#when-to-use-what