Check sample weight equivalence on sparse data #30137

antoinebaker · 2024-10-23T10:16:54Z

Reference Issues/PRs

Related to issues #30131 and #16298
Continuation of PR #29818

What does this implement/fix? Explain your changes.

Following #30040 (comment), we would like to add a new check_sample_weight_equivalence_on_sparse_data to the common estimator checks. It is similar to check_sample_weight_equivalence but runs on sparse data, because several estimators (for example LinearRegression) handle sparse and dense data differently.
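For readers unfamiliar with the property under test, here is a minimal sketch of what "sample weight equivalence" means: fitting with integer weights (including zero) should give the same result as fitting on a dataset where each sample is repeated that many times. The `fit_weighted_line` helper is a hypothetical pure-Python toy, not scikit-learn code, and it works on dense lists only; the new check would apply the same comparison to sparse inputs.

```python
import math


def fit_weighted_line(xs, ys, weights):
    """Weighted least-squares fit of y = slope * x + intercept."""
    w_sum = sum(weights)
    x_mean = sum(w * x for w, x in zip(weights, xs)) / w_sum
    y_mean = sum(w * y for w, y in zip(weights, ys)) / w_sum
    cov = sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(weights, xs, ys))
    var = sum(w * (x - x_mean) ** 2 for w, x in zip(weights, xs))
    slope = cov / var
    return slope, y_mean - slope * x_mean


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 2.5, 2.0, 5.0]
weights = [2, 0, 1, 3]  # integer weights, including a zero

# Fit once with the integer weights.
weighted = fit_weighted_line(xs, ys, weights)

# Fit again on the dataset where each sample is repeated `weight` times.
xs_rep = [x for x, w in zip(xs, weights) for _ in range(w)]
ys_rep = [y for y, w in zip(ys, weights) for _ in range(w)]
repeated = fit_weighted_line(xs_rep, ys_rep, [1] * len(xs_rep))

# Both fits should agree up to floating point error.
equivalent = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
    for a, b in zip(weighted, repeated)
)
```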

TODO

~~filter on estimators accepting sparse inputs by using the tags system (`tags.input_tags.sparse=True`) instead of catching the error~~ will be done in a follow-up PR.

github-actions · 2024-10-23T10:18:14Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 74462f6.

antoinebaker · 2024-10-23T10:46:59Z

The proposed change (adding sparse_container as an argument and catching ValueError/TypeError if the estimator does not support the sparse array) is not satisfactory.

@ogrisel, what test API would we ideally like to have? I'm currently thinking of:

- keep check_sample_weight_equivalence(name, estimator) as is (dense data only)
- add check_sample_weight_equivalence_on_sparse_data(name, estimator), which would iterate over the sparse containers supported by the estimator

Is there a proper way to list the sparse containers supported by an estimator?

ogrisel · 2024-10-23T14:12:13Z

sklearn/utils/estimator_checks.py

@@ -1086,7 +1090,7 @@ def check_sample_weights_shape(name, estimator_orig):


 @ignore_warnings(category=FutureWarning)
-def check_sample_weight_equivalence(name, estimator_orig):
+def _check_sample_weight_equivalence(name, estimator_orig, sparse_container):


Suggested change

-def _check_sample_weight_equivalence(name, estimator_orig, sparse_container):
+def _check_sample_weight_equivalence(name, estimator_orig, sparse_container=None):

I wonder if it wouldn't be better to have two functions for the dense and sparse inputs:

check_sample_weight_equivalence_on_dense_data (or just check_sample_weight_equivalence) and check_sample_weight_equivalence_on_sparse_data.

This way, we could configure different hyperparameter configs in PER_ESTIMATOR_CHECK_PARAMS if needed.

On the other hand, this might introduce many redundant entries in that dict.

Any opinion @adrinjalali?

adrinjalali · 2024-10-23

Yeah, we had another case where a parameter would change the behavior of the test, and it made sense to create two tests for it. I think it makes sense to keep the signature of the tests to name, estimator_orig; the rest should be defined by the test itself.

antoinebaker · 2024-10-23

I think the two distinct names check_sample_weight_equivalence_on_dense_data and check_sample_weight_equivalence_on_sparse_data would be more convenient for filtering with the -k flag in pytest (we can then easily run only dense, only sparse, or both).

sklearn/utils/estimator_checks.py

sklearn/utils/_test_common/instance_generator.py


ogrisel

LGTM. @antoinebaker, this PR is still marked as "draft" but I do not see any unaddressed TODO items anymore.

antoinebaker · 2024-10-30T15:56:44Z

Sorry @ogrisel, I forgot to add the TODO item. I still need to yield the tests only for estimators accepting sparse inputs (by filtering on tags.input_tags.sparse) instead of catching the error if the estimator does not accept sparse inputs. But for this I first need to fix #30139 (I am currently working on a draft PR for it).

antoinebaker · 2024-11-04T10:19:55Z

@lesteve I have one merged PR, #29818, that changes the function check_sample_weights_invariance and renames it to check_sample_weight_equivalence, and a follow-up one (this one) that then replaces check_sample_weight_equivalence with two functions, check_sample_weight_equivalence_on_dense_data and check_sample_weight_equivalence_on_sparse_data.

What is the recommended practice in this case for the new changelog system?
For example, should I have two entries:

- PR1.api.rst: func A was renamed B
- PR2.api.rst: func B was replaced by B1 and B2

or delete the previous changelog and consolidate in the new one:

- PR2.api.rst: func A was replaced by B1 and B2

ogrisel · 2024-11-04T13:49:39Z

> or delete the previous changelog and consolidate in the new one.

If I correctly recall what @lesteve said elsewhere, it's probably best to keep per-PR incremental log entries.

Maybe we could add a TODO note in the new entry with a reference to the PR number of the older change to remember to do this consolidation of the changelog at the time of the release.

ogrisel · 2024-11-04T15:08:17Z

Even better would be to update, in this PR, the changelog entry file from #29818 so that it uses the new names and has exactly the same contents as the new changelog entry file for #30137.

When the towncrier command is run, the consolidation of the two files will happen automatically if they have the same contents.
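The consolidation described above can be pictured with a small sketch. This is only a toy model of the idea (fragments with identical text collapse into one entry that lists every PR number), not towncrier's actual implementation or output format; the `consolidate_fragments` helper and the fragment texts are hypothetical.

```python
from collections import defaultdict


def consolidate_fragments(fragments):
    """Group changelog fragments with identical text under one entry.

    `fragments` maps a fragment filename such as "29818.api.rst"
    to its text content.
    """
    by_content = defaultdict(list)
    for filename, content in fragments.items():
        pr_number = filename.split(".")[0]
        by_content[content.strip()].append(pr_number)
    return [
        f"{content} ({', '.join('#' + pr for pr in sorted(prs))})"
        for content, prs in by_content.items()
    ]


# Two fragment files with the same contents, as suggested in the comment above.
fragments = {
    "29818.api.rst": "check A was replaced by checks B1 and B2.",
    "30137.api.rst": "check A was replaced by checks B1 and B2.",
}
entries = consolidate_fragments(fragments)
```

With identical contents, the two files yield a single entry that references both PR numbers; if the texts differed, two separate entries would remain.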

ogrisel · 2024-11-05T15:16:22Z

Since the sparse tag fix is going to take longer than anticipated, I would rather decouple the two PRs and consider a review (remove the draft flag) while keeping the except TypeError: branch in the new estimator check function for the time being.

I think TypeError is specific enough. In particular, it will not collide with the AssertionError that we expect to be raised most of the time when the thing we are actually testing in this check is not working the way it should.
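A minimal sketch of that reasoning, using hypothetical stand-ins rather than scikit-learn or scipy objects: catching only TypeError lets the check skip estimators that reject sparse input, while a genuine equivalence failure still surfaces as an AssertionError. `SparseLike`, both toy estimators, and `check_weights_on_sparse` are assumptions made up for this illustration.

```python
class SparseLike:
    """Hypothetical stand-in for a sparse container (not scipy.sparse)."""

    def __init__(self, rows):
        self.rows = rows


class DenseOnlyEstimator:
    """Toy estimator that rejects sparse input with a TypeError."""

    def fit(self, X, y, sample_weight=None):
        if isinstance(X, SparseLike):
            raise TypeError("sparse input is not supported")
        return self


class BrokenWeightsEstimator:
    """Toy estimator that accepts sparse input but ignores sample_weight."""

    def fit(self, X, y, sample_weight=None):
        self.total_ = float(len(y))  # bug: sample_weight is never used
        return self


def check_weights_on_sparse(estimator_cls):
    """Return "skipped" or "passed"; an AssertionError propagates to the caller."""
    X, y = SparseLike([[0.0], [1.0]]), [0, 1]
    try:
        weighted = estimator_cls().fit(X, y, sample_weight=[2, 1])
    except TypeError:
        # The estimator does not support sparse input: nothing to test.
        return "skipped"
    # Same data with the first sample repeated twice instead of weighted.
    X_rep, y_rep = SparseLike([[0.0], [0.0], [1.0]]), [0, 0, 1]
    repeated = estimator_cls().fit(X_rep, y_rep)
    assert weighted.total_ == repeated.total_, "weights are not equivalent to repetition"
    return "passed"
```

Here `check_weights_on_sparse(DenseOnlyEstimator)` is skipped, while `check_weights_on_sparse(BrokenWeightsEstimator)` raises AssertionError, so the two outcomes stay distinguishable.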

ogrisel · 2024-11-06T15:02:41Z

I clicked the "Update branch" button to merge main and hopefully pick up #30169 to fix the failing macOS CI.

ogrisel

I think I am +1 for merging as such.

This PR does cause some redundancy in the tags._xfail_checks declarations, though, and I don't see a better way. Since some estimators (in particular linear models) use very different code paths for sparse and dense data, I think this is still worth it.

Any second opinion @adrinjalali @jeremiedbb @snath-xoc?

jeremiedbb

LGTM. I just have one question below

sklearn/utils/estimator_checks.py

adrinjalali · 2024-11-07T09:04:49Z

@ogrisel this LGTM. But do you mind merging the xfail checks PR first? I think you can resolve the merge conflicts with that PR here much more easily than I can there.


adrinjalali

Otherwise LGTM.

adrinjalali · 2024-11-16T10:19:19Z

doc/whats_new/upcoming_changes/sklearn.utils/29818.api.rst

-  integer (including zero) weights.
+- :func:`utils.estimator_checks.check_sample_weights_invariance`
+  replaced by
+  :func:`check_sample_weight_equivalence_on_dense_data`


It would be more like

Suggested change

-  :func:`check_sample_weight_equivalence_on_dense_data`
+  :func:`~utils.estimator_checks.check_sample_weight_equivalence_on_dense_data`

but we don't really need to annotate them (or even have a changelog for them) since they're not really public and don't get rendered in the API pages. So this would only raise a warning in sphinx build.

That said, I'm in favor of adding nice docstrings to tests and adding them to the API pages (not in this PR). However, this changelog needs to either drop the :func: directives or drop the mentions of these functions completely.

antoinebaker · 2024-11-18

If I understood the Sphinx docs correctly, the tilde suppresses the cross-reference, so I should add it for all the functions of this PR since they are not rendered on the API pages?

adrinjalali · 2024-11-19

The tilde only changes the rendered text from the fully qualified path to just the name of the function. Sphinx still tries to resolve the hyperlink, which raises a warning in this case.

antoinebaker · 2024-11-20

Thanks for the clarification! I removed the :func: directives completely.
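To make the tilde behavior concrete, here is a toy model of the explanation given above. It is not Sphinx code, and it ignores details such as the parentheses Sphinx may append to function names; the `split_role_target` helper is hypothetical.

```python
def split_role_target(role_target):
    """Return (target_to_resolve, displayed_text) for a role target.

    A leading "~" shortens the displayed text to the last dotted
    component, but the target that must be resolved is unchanged.
    """
    if role_target.startswith("~"):
        target = role_target[1:]
        return target, target.rsplit(".", 1)[-1]
    return role_target, role_target


target, text = split_role_target(
    "~utils.estimator_checks.check_sample_weight_equivalence_on_dense_data"
)
```

The displayed text becomes the bare function name, while the full dotted target is still looked up, which is why an undocumented function produces a warning with or without the tilde.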

doc/whats_new/upcoming_changes/sklearn.utils/30137.api.rst

adrinjalali · 2024-11-16T10:22:12Z

sklearn/utils/estimator_checks.py

+            # FIXME: filter on tags.input_tags.sparse
+            # (estimator accepts sparse arrays)
+            # once issue #30139 is fixed.
+            yield check_sample_weight_equivalence_on_sparse_data


Shouldn't the check already be there? I'm not sure I understand why we don't have it here.

antoinebaker · 2024-11-18

Not yet; PR #30187, which fixes tags.input_tags.sparse, needs to be merged first.
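The filtering that the FIXME comment defers could look roughly like the sketch below. The `Tags` and `InputTags` dataclasses are simplified stand-ins written for this example, not scikit-learn's own tag classes, and `yield_sample_weight_checks` is a hypothetical generator; the actual follow-up implementation may differ.

```python
from dataclasses import dataclass, field


@dataclass
class InputTags:
    sparse: bool = False  # whether the estimator accepts sparse arrays


@dataclass
class Tags:
    input_tags: InputTags = field(default_factory=InputTags)


def check_sample_weight_equivalence_on_dense_data(name, estimator):
    """Placeholder for the dense check."""


def check_sample_weight_equivalence_on_sparse_data(name, estimator):
    """Placeholder for the sparse check."""


def yield_sample_weight_checks(tags):
    """Yield the dense check always, the sparse check only when supported."""
    yield check_sample_weight_equivalence_on_dense_data
    if tags.input_tags.sparse:
        yield check_sample_weight_equivalence_on_sparse_data


dense_only = [c.__name__ for c in yield_sample_weight_checks(Tags())]
with_sparse = [
    c.__name__ for c in yield_sample_weight_checks(Tags(InputTags(sparse=True)))
]
```

Filtering at yield time would make the `except TypeError` branch unnecessary, provided the sparse tag is reliable for every estimator.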

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

antoinebaker added 2 commits October 23, 2024 10:28

fix typo

316816c

add sparse_container argument

ec513ea

github-actions bot added the module:utils label Oct 23, 2024

use CSR_CONTAINERS

eb7e21a

ogrisel reviewed Oct 23, 2024

ogrisel added the Developer API label Oct 23, 2024

split test in dense and sparse

e559033

ogrisel reviewed Oct 25, 2024

sklearn/utils/_test_common/instance_generator.py (outdated, resolved)

antoinebaker added 4 commits October 29, 2024 17:53

add fixme comment

4446fc7

remove redundant config

227a5ca

Merge remote-tracking branch 'upstream/main' into check_sample_weight_equivalence_on_sparse_data

fd822f7

remove xfail tag

97b3518

ogrisel approved these changes Oct 30, 2024

Merge branch 'main' into check_sample_weight_equivalence_on_sparse_data

6156532

update changelog

460b70e

antoinebaker marked this pull request as ready for review November 6, 2024 08:38

Merge branch 'main' into check_sample_weight_equivalence_on_sparse_data

f6eaff6

ogrisel approved these changes Nov 6, 2024

jeremiedbb approved these changes Nov 6, 2024

sklearn/utils/estimator_checks.py (outdated, resolved)

use csr array or matrix

0aed708

Merge remote-tracking branch 'upstream/main' into check_sample_weight_equivalence_on_sparse_data

2cd4389

add xfails

e54e782

adrinjalali reviewed Nov 16, 2024

antoinebaker and others added 3 commits November 18, 2024 09:44

Merge branch 'main' into check_sample_weight_equivalence_on_sparse_data

ee7bc44

fix changelog

d7b832a

changelog without func directives

74462f6

ogrisel merged commit c713ff4 into scikit-learn:main Nov 22, 2024
30 checks passed

jameslamb mentioned this pull request Nov 29, 2024

[ci] [python-package] [R-package] adapt to scikit-learn check_sample_weight_equivalence changes, stop testing against R 3.6 on Linux microsoft/LightGBM#6733

Merged

jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Dec 4, 2024

Check sample weight equivalence on sparse data (scikit-learn#30137)

151e706

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

jeremiedbb pushed a commit that referenced this pull request Dec 6, 2024

Check sample weight equivalence on sparse data (#30137)

0f334d4

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

virchan pushed a commit to virchan/scikit-learn that referenced this pull request Dec 9, 2024

Check sample weight equivalence on sparse data (scikit-learn#30137)

5691658

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

antoinebaker mentioned this pull request Jan 8, 2025

Filtering on the sparse tag to yield checks #30608

Merged
