-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] MNT requires_y tag with y=None validation #16622
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ranch 'master' of github.com:scikit-learn/scikit-learn into n_features_in
It can work out, but making it neat would take more work than I could put
in at the time. It frustrated me that some classes inherit from
RegressorMixin despite being optionally unsupervised. Admittedly the whole
idea of being optionally supervised in cross decomposition invalidates the
point of this tag.
|
Do you feel the same about a |
No, that more explicitly matches what the tag is used to ensure. |
The tag is now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @NicolasHug
Co-Authored-By: Adrin Jalali <adrin.jalali@gmail.com>
…n into is_supervised_tag
Hi @scikit-learn/core-devs, time to merge this one? |
+1 on my side
maybe @ogrisel or @GaelVaroquaux want to have a look?
… |
Wonderful. More precise/explicit, and I think also a shorter PR. Thanks so much for your efforts, @NicolasHug. Only remaining issue: should the new tag be mentioned in what's new? |
Thanks @NicolasHug |
thanks heaps @NicolasHug |
Follow up to #16112
EDIT: the tag is now
requires_y
This PR adds a
is_supervised
tag and uses it to raise an error in_validate_data()
if:a non-supervised estimator is passed y!=NoneThis PR is mostly motivated by the fact that since #16112, supervised estimators that are passed y=None will fail with a tuple unpacking error instead of a nicer error message.
This comes with a bunch of complications:
This forces supervised estimators to call
_validate_data(X, y)
, instead of validating X and y separately. Since callingcheck_array(X, ...)
and thencheck_array(y, ...)
isn't in general equivalent to callingcheck_X_y(X, y, ...)
, I had to introduce a way to check X and y separately when calling_validate_data(X, y, ...)
. This is really ugly. It will also definitely not work for third-party estimators that inherit from ours.Some estimators like OneClassSVM, IsolationForest, and RandomTreesEmbedding are unsupervised, but they generate a randomy
that is passed down to their base classes and validated there with_validate_data(X, y)
. This makes_validate_data
fail because of the check mentioned above. So we need custom checks in the validation of the base classes in each of these. This is really, really ugly.Since
_validate_data
assumes the tag exists, this means that all estimators usingvalidate_data
must also support the tag. IMO, that's fine.