Skip to content

Check sample weight equivalence on sparse data #30137

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

antoinebaker
Copy link
Contributor

@antoinebaker antoinebaker commented Oct 23, 2024

Reference Issues/PRs

Related to issues #30131 and #16298
Continuation of PR #29818

What does this implement/fix? Explain your changes.

Following #30040 (comment) we would like to add a new check_sample_weight_equivalence_on_sparse_data in the common estimator checks, similar to check_sample_weight_equivalence but on sparse data, as several estimators (for example LinearRegression) handle differently sparse or dense data.

TODO

  • filter on estimators accepting sparse inputs by using the tags system ('tags.input_tags.sparse=True`) instead of catching the error will be done in a follow-up PR.

Copy link

github-actions bot commented Oct 23, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 74462f6. Link to the linter CI: here

@antoinebaker
Copy link
Contributor Author

The proposed change (add sparse_container as an argument and catch ValueError/TypeError if the estimator does not support the sparse array) is not satisfying.

Ideally @ogrisel what test API would we like to have ? I'm currently thinking of:

  • keep check_sample_weight_equivalence(name, estimator) as is (only dense data)
  • add check_sample_weight_equivalence_on_sparse_data(name, estimator) that would iterate on the sparse containers supported by the estimator

Is there a proper way to list the sparse containers supported by an estimator ?

@@ -1086,7 +1090,7 @@ def check_sample_weights_shape(name, estimator_orig):


@ignore_warnings(category=FutureWarning)
def check_sample_weight_equivalence(name, estimator_orig):
def _check_sample_weight_equivalence(name, estimator_orig, sparse_container):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _check_sample_weight_equivalence(name, estimator_orig, sparse_container):
def _check_sample_weight_equivalence(name, estimator_orig, sparse_container=None):

Copy link
Member

@ogrisel ogrisel Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it wouldn't be better to have two functions for the dense and sparse inputs:

check_sample_weight_equivalence_on_dense_data (or just check_sample_weight_equivalence) and check_sample_weight_equivalence_on_sparse_data.

This way, we could configure different hyperparameter configs in PER_ESTIMATOR_CHECK_PARAMS if needed.

On the other hand, this might introduce many redundant entries in that dict.

Any opinion @adrinjalali?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah we had another case where a parameter would change the behavior of the test and it made sense to create two tests for them. I think it makes sense to try to keep the signature of the tests to name, estimator_orig and the rest should be defined by the test itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the two distinct check_sample_weight_equivalence_on_dense_data and check_sample_weight_equivalence_on_sparse_data names would be more convenient for filtering with the -k flag in pytest (we can then easily do only dense, only sparse, or both).

@ogrisel ogrisel added the Developer API Third party developer API related label Oct 23, 2024
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @antoinebaker this PR is still marked as "draft" but I do not see unaddressed TODO item anymore.

@antoinebaker
Copy link
Contributor Author

antoinebaker commented Oct 30, 2024

Sorry @ogrisel I forgot to add the TODO item. I still need to yield the tests only for estimators accepting sparse inputs (by filtering on the tag.input_tags.sparse) instead of catching the error if the estimator does not accept sparse inputs. But for this I first need to fix #30139 (currently working on a draft PR for it)

@antoinebaker
Copy link
Contributor Author

@lesteve I have one merged PR #29818 that changes the function check_sample_weights_invariance and rename it check_sample_weight_equivalence, and a follow up one (actually this one) which then replaces check_sample_weight_equivalence by two functions check_sample_weight_equivalence_on_dense_data and check_sample_weight_equivalence_on_sparse_data.

What is the recommended pratice in this case for the new changelog system ?
For example should I have two entries:

PR1.api.rst: func A was renamed B
PR2.api.rst: func B was replaced by B1 and B2

or delete the previous changelog and consolidate in the new one:

PR2.api.rst: func A was replaced by B1 and B2

@ogrisel
Copy link
Member

ogrisel commented Nov 4, 2024

or delete the previous changelog and consolidate in the new one.

If I recall what @lesteve said elsewhere correctly, it's probably best to keep per-PR incremental log entries.

Maybe we could add a TODO note in the new entry with a reference to the PR number of the older change to remember to do this consolidation of the changelog at the time of the release.

@ogrisel
Copy link
Member

ogrisel commented Nov 4, 2024

Even better would be to update the file with the changelog of entry from #29818 in this PR to use the new names and same contents as the contents of a new file for the changelog entry of #30137.

When the towncrier command is run, the consolidation of the two files will happen automatically if they have the same contents.

@ogrisel
Copy link
Member

ogrisel commented Nov 5, 2024

Since the sparse tag fix is going to take longer than anticipated, I would rather decouple the two PRs and consider a review (remove the draft flag) while keeping the except TypeError: branch in the new estimator check function for the time being.

I think TypeError is specific enough and in particular will not collide with the actual AssertionError that we expect to be raised most of the time, the thing we are actually testing in this check is not working the way it should.

@antoinebaker antoinebaker marked this pull request as ready for review November 6, 2024 08:38
@ogrisel
Copy link
Member

ogrisel commented Nov 6, 2024

I clicked the "Update branch" button to merge main and hopefully get #30169 to fix the erring macOS CI.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I am +1 for merging as such.

This PR does cause some redundancy in the tags._xfail_checks declaration, though. But I don't see any better way. Since some estimators (in particular linear models) use very different code-path for sparse and dense data, I think this is still worth it.

Any second opinion @adrinjalali @jeremiedbb @snath-xoc?

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I just have one question below

@adrinjalali
Copy link
Member

@ogrisel this LGTM. But do you mind merging the xfail checks PR first? I think you can merge conflicts with that PR here much easier than I can do there.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM.

integer (including zero) weights.
- :func:`utils.estimator_checks.check_sample_weights_invariance`
replaced by
:func:`check_sample_weight_equivalence_on_dense_data`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be more like

Suggested change
:func:`check_sample_weight_equivalence_on_dense_data`
:func:`~utils.estimator_checks.check_sample_weight_equivalence_on_dense_data`

but we don't really need to annotate them (or even have a changelog for them) since they're not really public and don't get rendered in the API pages. So this would only raise a warning in sphinx build.

That said, I'm in favor of adding nice docstrings to tests and adding them to the API pages (not in this PR). However, this changelog needs to remove the :func: directives, or the mentions of them completely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understood correctly the sphinx doc, the tilde suppresses the cross-reference, and I should therefore add it for all the functions of this PR as they are not rendered on the API pages ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tilde only changes the rendered version from the full qual path to only rendering the name of the function, but still trying to resolve the hyperlink, which raises a warning in this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification ! I remove the :func: directives completely.

Comment on lines +167 to +170
# FIXME: filter on tags.input_tags.sparse
# (estimator accepts sparse arrays)
# once issue #30139 is fixed.
yield check_sample_weight_equivalence_on_sparse_data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't the check already be there? I'm not sure if I understand why we don't have it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet, the PR #30187 fixing the tags.input_tags.sparse needs to be merged.

@ogrisel ogrisel merged commit c713ff4 into scikit-learn:main Nov 22, 2024
30 checks passed
jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Dec 4, 2024
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
jeremiedbb pushed a commit that referenced this pull request Dec 6, 2024
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
virchan pushed a commit to virchan/scikit-learn that referenced this pull request Dec 9, 2024
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Developer API Third party developer API related module:utils
Projects
Development

Successfully merging this pull request may close these issues.

4 participants