FIX Check and correct the input_tags.sparse flag #30187
Conversation
Thanks for the PR. Any reason for introducing a new check_estimator_sparse_tag
estimator check instead of making the existing check_estimator_sparse_array
tag-aware?
Since estimator checks are already costly to run, I would rather avoid proliferation and redundancy when possible.
I tried to make […]. Actually, the new test is done with the assumption that the […]. Perhaps a cleaner alternative, as suggested by @glemaitre, would be that […].
I am a bit afraid about the breaking change induced by entirely changing the meaning of the […]. +1 for updating your PR to implement what you suggest above and see what it practically entails.
This is a first round of review, looking only at the test. I'll now check the changes in the different estimators to see whether anything surprises me.
doc/whats_new/upcoming_changes/sklearn.utils/30187.enhancement.rst
OK, I think I made a pass on all estimators. I think that sometimes we can move the definition to the base class.
Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>
@antoinebaker do not hesitate to ping me when you want another review.
I followed @glemaitre's suggestion for setting the […]. We now define […]. As a lesser of two evils, the new definition will have false negatives (Pipelines accepting sparse data wrongly tagged […]).
There is a small conflict to solve.
Otherwise, it looks good to me. I'm fine with the current policy for the Pipeline here.
@@ -2654,6 +2656,7 @@ def fit(self, X, y):

     def __sklearn_tags__(self):
         tags = super().__sklearn_tags__()
+        tags.input_tags.sparse = False
I assume that it's because you set it to True on the base class. I wonder if we should try to always set this only to non-default values. It would mean that we'd have to leave it at the default value on the base class and set it to True on most of its child classes here. What do you think @glemaitre?
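The two policies being debated can be made concrete with a small standalone sketch (a hypothetical minimal tag system, not scikit-learn's actual code): leave the conservative default on the base class and have sparse-capable child classes opt in explicitly.

```python
from dataclasses import dataclass, field

@dataclass
class InputTags:
    sparse: bool = False  # conservative default: estimator not known to accept sparse

@dataclass
class Tags:
    input_tags: InputTags = field(default_factory=InputTags)

class BaseEstimator:
    def __sklearn_tags__(self):
        # Base class keeps the default value (sparse=False)
        return Tags()

class SparseCapableEstimator(BaseEstimator):
    def __sklearn_tags__(self):
        tags = super().__sklearn_tags__()
        tags.input_tags.sparse = True  # child class opts in explicitly
        return tags
```

With this layout, only deviations from the default are spelled out in subclasses, which is the policy suggested above.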
# WARNING: the sparse tag can be incorrect.
# Some Pipelines accepting sparse data are wrongly tagged sparse=False.
# For example Pipeline([PCA(), estimator]) accepts sparse data
# even if the estimator doesn't as PCA outputs a dense array.
tags.input_tags.sparse = all(
    get_tags(step).input_tags.sparse
    for name, step in self.steps
    if step != "passthrough"
)
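The false negative the warning describes can be shown with a self-contained sketch of the AND rule (tags here are plain dicts standing in for scikit-learn's tag objects; the names are illustrative only):

```python
def pipeline_sparse_tag(steps):
    """AND of the steps' sparse tags, skipping 'passthrough' steps.

    steps: list of (name, step) where step is a dict like {"sparse": bool}
    or the string "passthrough".
    """
    return all(
        step["sparse"]
        for name, step in steps
        if step != "passthrough"
    )

pca_like = {"sparse": True}     # accepts sparse input but outputs a dense array
dense_only = {"sparse": False}  # rejects sparse input

# The AND rule reports False even though the first step would densify the
# data before it reaches dense_only, so the pipeline actually accepts sparse.
tag = pipeline_sparse_tag([("pca", pca_like), ("clf", dense_only)])
```

This is exactly the "lesser of two evils" trade-off: the rule never claims sparse support that isn't there, at the cost of under-reporting it.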
This makes me wonder if we aren't expecting too much from the sparse tag. Maybe we should narrow its scope to the estimator level, regardless of whether or not its sub-estimator(s), if any, do support sparse input. In that case sparse means "this estimator can handle sparse input but its inner estimator(s) (if it's a meta-estimator) might not in which case it's undefined behavior". What do you think @glemaitre and @adrinjalali ?
As discussed IRL with Guillaume, another option is to change sparse to accept True, False and Maybe (or Undefined, name TBD).
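The ternary option could look something like the following sketch, using an Enum; as the comment itself says the names are tentative, and the combination rule below is only one possible choice, not a settled design:

```python
from enum import Enum

class SparseSupport(Enum):
    TRUE = "true"    # estimator is known to accept sparse input
    FALSE = "false"  # estimator is known to reject sparse input
    MAYBE = "maybe"  # undetermined, e.g. depends on inner estimators

def combine(tags):
    """Combine sub-estimator tags for a meta-estimator.

    Only an all-TRUE composition is definitely TRUE; anything else is
    MAYBE, since without inspecting the compute graph we cannot tell
    whether data gets densified before reaching a FALSE step.
    """
    tags = list(tags)
    if all(t is SparseSupport.TRUE for t in tags):
        return SparseSupport.TRUE
    return SparseSupport.MAYBE
```

A Maybe value would let Pipeline and FeatureUnion report honest uncertainty instead of a possibly-wrong False.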
def __sklearn_tags__(self):
    tags = super().__sklearn_tags__()
    try:
        tags.input_tags.sparse = all(
            get_tags(trans).input_tags.sparse
            for name, trans in self.transformer_list
            if trans not in {"passthrough", "drop"}
        )
    except Exception:
        # If `transformer_list` does not comply with our API (list of tuples)
        # then it will fail. In this case, we assume that `sparse` is False
        # but the parameter validation will raise an error during `fit`.
        pass  # pragma: no cover
    return tags
My previous comment is also motivated by this pattern, which shows up a few times. We're trying to inspect the full compute graph ahead of time, which is something scikit-learn is really not good at by design.
Also, I think we should include it in 1.6.1 because it fixes tags that we just made public.
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
LGTM.
I think there are still things to improve with the sparse tag but it can be done in a separate PR. I'm in favor of merging this one as is to not delay the 1.6.1 release further.
Yep, it is indeed better than what we had. We can later improve it and discuss having a ternary option instead of the binary one.
Thanks @antoinebaker. It was a big one ;)
Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai> Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Any help or guidance on this issue is very much appreciated.
@@ -689,7 +689,7 @@ def rmatvec(b):

     def __sklearn_tags__(self):
         tags = super().__sklearn_tags__()
-        tags.input_tags.sparse = True
+        tags.input_tags.sparse = not self.positive
My team switched to scikit-learn v1.6.1 this week; we had v1.5.1 before. Our code crashes at this exact line with the error "Unexpected <class 'AttributeError'>. 'LinearRegression' object has no attribute 'positive'".
We cannot deploy to production because of this. I am desperate enough to come here to ask for help. I do not understand why it complains that the attribute does not exist, given that we were using v1.5.1 before and the attribute has existed for 4 years now.
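One plausible cause of this kind of failure (an assumption, not a confirmed diagnosis of this report) is unpickling an estimator saved under an older version: pickle restores the instance's stored attribute dict without calling __init__, so code added later that reads self.positive finds nothing. A minimal stdlib-only reproduction, with illustrative class names rather than scikit-learn's:

```python
import pickle

class Model:
    # "old" version of the class: no `positive` attribute
    pass

blob = pickle.dumps(Model())  # simulate a model saved under the old version

class Model:
    # "new" version: __init__ sets `positive`, and a method reads it
    def __init__(self):
        self.positive = False

    def sparse_tag(self):
        return not self.positive

old = pickle.loads(blob)  # restored with the OLD (empty) attribute dict
try:
    old.sparse_tag()
    err = ""
except AttributeError as exc:
    err = str(exc)  # 'Model' object has no attribute 'positive'
```

This is why pickles are generally not guaranteed to be compatible across library versions: the class code is taken from the currently installed version, while the instance state is whatever was saved.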
@ItsIronOxide feel free to open a new issue with a reproducer, and ping me. Happy to have a look and help out.
Reference Issues/PRs

Fixes #30139

What does this implement/fix? Explain your changes.

The test check_estimator_sparse_tag is added. It checks that the input_tags.sparse tag of an estimator is consistent:

- if input_tags.sparse=True, the estimator can be fitted on sparse data
- if input_tags.sparse=False, the estimator must raise an error when fitted on sparse data

The input_tags.sparse flag was edited for a lot of estimators.
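The consistency rule described above can be sketched in a few lines. This is a hypothetical illustration of the idea, not scikit-learn's actual check_estimator_sparse_tag implementation; the function and exception names are made up:

```python
class SparseTagError(AssertionError):
    """Raised when the declared sparse tag and observed behavior disagree."""

def check_sparse_tag_consistency(estimator, declared_sparse, X_sparse, y=None):
    """Fit on sparse data and compare the outcome with the declared tag."""
    try:
        estimator.fit(X_sparse, y)
        accepted = True
    except (TypeError, ValueError):
        accepted = False
    if declared_sparse and not accepted:
        raise SparseTagError("sparse=True but fitting on sparse data failed")
    if not declared_sparse and accepted:
        raise SparseTagError("sparse=False but sparse data was accepted")

# Toy estimators standing in for real ones:
class Accepts:
    def fit(self, X, y=None):
        return self

class Rejects:
    def fit(self, X, y=None):
        raise TypeError("sparse input not supported")
```

Both directions are enforced: an estimator tagged sparse=False that silently accepts sparse data fails the check just as one tagged sparse=True that rejects it.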