Skip to content

FIX Check and correct the input_tags.sparse flag #30187

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 35 commits into from
Jan 2, 2025

Conversation

antoinebaker
Copy link
Contributor

Reference Issues/PRs

Fixes #30139

What does this implement/fix? Explain your changes.

The test check_estimator_sparse_tag is added.
It checks that the input_tags.sparse tag of an estimator is consistent:

  • if input_tags.sparse=True the estimator can be fitted on sparse data
  • if input_tags.sparse=False the estimator must raise a error when fitted on sparse data

The input_tags.sparse flag was edited for a lot of estimators.

Copy link

github-actions bot commented Oct 31, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: b54a408. Link to the linter CI: here

@antoinebaker antoinebaker changed the title add input_tags.sparse and test FIX Check and correct the input_tags.sparse flag Oct 31, 2024
@antoinebaker antoinebaker marked this pull request as ready for review November 4, 2024 12:57
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Any reason to decide to introduce a new check_estimator_sparse_tagestimator check instead of making the existing check_estimator_sparse_array tag-aware?

Since estimator checks are already costly to run, I would rather avoid proliferation and redundancy when possible.

@antoinebaker
Copy link
Contributor Author

antoinebaker commented Nov 4, 2024

Thanks for the PR. Any reason to decide to introduce a new check_estimator_sparse_tagestimator check instead of making the existing check_estimator_sparse_array tag-aware?

Since estimator checks are already costly to run, I would rather avoid proliferation and redundancy when possible.

I tried to make check_estimator_sparse_array tag-aware, but didn't manage to combine the old behavior (testing on various sparse formats, and if the estimator doesn't support the sparse format checking that it fails gracefully with the appropriate error) and what the new test is doing (checking that sparse tag is equivalent to raising or not the error).

Actually the new test is done with the assumption that the sparse flag is equivalent to accept csr input. It seems that the csr format is kind of a default (if the estimator can accept sparse input it always include csr, but it may reject other sparse formats), I didn't encounter a counterexample in the estimators.

Perhaps a cleaner alternative as suggested by @glemaitre would be that

  1. the sparse tag is the list of supported sparse formats instead of a boolean
  2. make validate_data rely on the sparse estimator tag instead of the accept_sparse argument.

@ogrisel
Copy link
Member

ogrisel commented Nov 5, 2024

Perhaps a cleaner alternative as suggested by @glemaitre would be that
the sparse tag is the list of supported sparse formats instead of a boolean
make validate_data rely on the sparse estimator tag instead of the accept_sparse argument.

I am a bit afraid about the breaking change induced by entirely changing the meaning of the .input_tags.sparse tag but if that many estimators needed a fix in the first place, it probably means that nobody was actually relying on that tag anywhere within our outside the scikit-learn code base anyway so we are free to make the changes we want.

+1 for updating your PR to implement what you suggest above and see what it practically entails.

@glemaitre glemaitre self-requested a review November 15, 2024 14:03
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This a first round of review just looking at the test. I'll check now the changes in the different estimator to know if I'm getting surprise by what I'll see.

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I think that I made a pass on all estimators. I think that sometimes, we can move the definition in the base class.

@glemaitre
Copy link
Member

@antoinebaker do not hesitate to ping me when you want another review.

@antoinebaker
Copy link
Contributor Author

antoinebaker commented Nov 26, 2024

I followed @glemaitre suggestion for setting the sparse tag in Pipeline.

We now definesparse = all steps accept sparse instead of the previous one sparse = first non passthrough step accepts sparse. Both are incorrect, but determining the correct flag seems a nightmare in general as some transformers may change the sparse nature of the data.

As a lesser of two evils the new definition will have false negatives (Pipelines accepting sparse data wrongly tagged sparse=False) while the old one had false positives (Pipelines rejecting sparse data wrongly tagged sparse=True).

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There a small conflict to solve.

Otherwise, it looks good to me. I'm fine with the current policy for the Pipeline here.

@@ -2654,6 +2656,7 @@ def fit(self, X, y):

def __sklearn_tags__(self):
tags = super().__sklearn_tags__()
tags.input_tags.sparse = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that it's because you set it to True on the base class. I wonder if we should try to always set this only to non default values. It would mean that we'd have to leave it to the default value on the base class and set it to True on most of its child classes here. What do you think @glemaitre ?

Comment on lines +1229 to +1237
# WARNING: the sparse tag can be incorrect.
# Some Pipelines accepting sparse data are wrongly tagged sparse=False.
# For example Pipeline([PCA(), estimator]) accepts sparse data
# even if the estimator doesn't as PCA outputs a dense array.
tags.input_tags.sparse = all(
get_tags(step).input_tags.sparse
for name, step in self.steps
if step != "passthrough"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me wonder if we aren't expecting too much from the sparse tag. Maybe we should narrow its scope to the estimator level, regardless of whether or not its sub-estimator(s), if any, do support sparse input. In that case sparse means "this estimator can handle sparse input but its inner estimator(s) (if it's a meta-estimator) might not in which case it's undefined behavior". What do you think @glemaitre and @adrinjalali ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed irl with Guillaume, an other option is to change sparse to accept True, False and Maybe (or Undefined, name tbd).

Comment on lines +2127 to +2140
def __sklearn_tags__(self):
tags = super().__sklearn_tags__()
try:
tags.input_tags.sparse = all(
get_tags(trans).input_tags.sparse
for name, trans in self.transformer_list
if trans not in {"passthrough", "drop"}
)
except Exception:
# If `transformer_list` does not comply with our API (list of tuples)
# then it will fail. In this case, we assume that `sparse` is False
# but the parameter validation will raise an error during `fit`.
pass # pragma: no cover
return tags
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my previous comment is also motivated by this pattern that shows-up a few times. We're trying to inspect the full compute graph ahead of time which is something scikit-learn is really not good at by design.

@jeremiedbb
Copy link
Member

Also, I think we should include it in 1.6.1 because it fixes tags that we just made public

@jeremiedbb jeremiedbb added this to the 1.6.1 milestone Dec 12, 2024
Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

I think there are still things to improve with the sparse tag but it can be done in a separate PR. I'm in favor of merging this one as is to not delay the 1.6.1 release further.

@glemaitre
Copy link
Member

I think there are still things to improve with the sparse tag but it can be done in a separate PR. I'm in favor of merging this one as is to not delay the 1.6.1 release further.

Yep it is indeed better than what we had. We can later improve and discussed having a ternary option instead of the binary one.

@glemaitre glemaitre merged commit 446adff into scikit-learn:main Jan 2, 2025
30 checks passed
@glemaitre
Copy link
Member

Thanks @antoinebaker. It was a big one ;)

jeremiedbb added a commit to jeremiedbb/scikit-learn that referenced this pull request Jan 8, 2025
Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
jeremiedbb added a commit that referenced this pull request Jan 9, 2025
Co-authored-by: Guillaume Lemaitre <guillaume@probabl.ai>
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Copy link

@ItsIronOxide ItsIronOxide left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any help or guidance on this issue is very much appreciated.

@@ -689,7 +689,7 @@ def rmatvec(b):

def __sklearn_tags__(self):
tags = super().__sklearn_tags__()
tags.input_tags.sparse = True
tags.input_tags.sparse = not self.positive

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My team changed to scikit-learn v1.6.1 this week. We had v1.5.1 before. Our code crashes in this exact line with the error "Unexpected <class 'AttributeError'>. 'LinearRegression' object has no attribute 'positive'".

We cannot deploy in production because of this. I am desperate enough to come here to ask for help. I do not understand why it would complain that the attribute does not exist given that we were using v1.5.1 before and the attribute has existed for 4 years now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ItsIronOxide feel free to open a new issue with a reproducer, and ping me. Happy to have a look and help out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The input_tags.sparse flag is often incorrect
6 participants