-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
Check sample weight equivalence on sparse data #30137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check sample weight equivalence on sparse data #30137
Conversation
The proposed change (add Ideally @ogrisel what test API would we like to have ? I'm currently thinking of:
Is there a proper way to list the sparse containers supported by an estimator ? |
@@ -1086,7 +1090,7 @@ def check_sample_weights_shape(name, estimator_orig): | |||
|
|||
|
|||
@ignore_warnings(category=FutureWarning) | |||
def check_sample_weight_equivalence(name, estimator_orig): | |||
def _check_sample_weight_equivalence(name, estimator_orig, sparse_container): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def _check_sample_weight_equivalence(name, estimator_orig, sparse_container): | |
def _check_sample_weight_equivalence(name, estimator_orig, sparse_container=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it wouldn't be better to have two functions for the dense and sparse inputs:
check_sample_weight_equivalence_on_dense_data
(or just check_sample_weight_equivalence
) and check_sample_weight_equivalence_on_sparse_data
.
This way, we could configure different hyperparameter configs in PER_ESTIMATOR_CHECK_PARAMS
if needed.
On the other hand, this might introduce many redundant entries in that dict.
Any opinion @adrinjalali?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah we had another case where a parameter would change the behavior of the test and it made sense to create two tests for them. I think it makes sense to try to keep the signature of the tests to name, estimator_orig
and the rest should be defined by the test itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the two distinct check_sample_weight_equivalence_on_dense_data
and check_sample_weight_equivalence_on_sparse_data
names would be more convenient for filtering with the -k flag in pytest (we can then easily do only dense, only sparse, or both).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. @antoinebaker this PR is still marked as "draft" but I do not see unaddressed TODO item anymore.
Sorry @ogrisel I forgot to add the TODO item. I still need to yield the tests only for estimators accepting sparse inputs (by filtering on the |
@lesteve I have one merged PR #29818 that changes the function What is the recommended pratice in this case for the new changelog system ?
or delete the previous changelog and consolidate in the new one:
|
If I recall what @lesteve said elsewhere correctly, it's probably best to keep per-PR incremental log entries. Maybe we could add a TODO note in the new entry with a reference to the PR number of the older change to remember to do this consolidation of the changelog at the time of the release. |
Even better would be to update the file with the changelog of entry from #29818 in this PR to use the new names and same contents as the contents of a new file for the changelog entry of #30137. When the |
Since the sparse tag fix is going to take longer than anticipated, I would rather decouple the two PRs and consider a review (remove the draft flag) while keeping the I think |
I clicked the "Update branch" button to merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I am +1 for merging as such.
This PR does cause some redundancy in the tags._xfail_checks
declaration, though. But I don't see any better way. Since some estimators (in particular linear models) use very different code-path for sparse and dense data, I think this is still worth it.
Any second opinion @adrinjalali @jeremiedbb @snath-xoc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I just have one question below
@ogrisel this LGTM. But do you mind merging the xfail checks PR first? I think you can merge conflicts with that PR here much easier than I can do there. |
…_equivalence_on_sparse_data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise LGTM.
integer (including zero) weights. | ||
- :func:`utils.estimator_checks.check_sample_weights_invariance` | ||
replaced by | ||
:func:`check_sample_weight_equivalence_on_dense_data` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be more like
:func:`check_sample_weight_equivalence_on_dense_data` | |
:func:`~utils.estimator_checks.check_sample_weight_equivalence_on_dense_data` |
but we don't really need to annotate them (or even have a changelog for them) since they're not really public and don't get rendered in the API pages. So this would only raise a warning in sphinx build.
That said, I'm in favor of adding nice docstrings to tests and adding them to the API pages (not in this PR). However, this changelog needs to remove the :func:
directives, or the mentions of them completely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understood correctly the sphinx doc, the tilde suppresses the cross-reference, and I should therefore add it for all the functions of this PR as they are not rendered on the API pages ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tilde only changes the rendered version from the full qual path to only rendering the name of the function, but still trying to resolve the hyperlink, which raises a warning in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification ! I remove the :func:
directives completely.
# FIXME: filter on tags.input_tags.sparse | ||
# (estimator accepts sparse arrays) | ||
# once issue #30139 is fixed. | ||
yield check_sample_weight_equivalence_on_sparse_data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't the check already be there? I'm not sure if I understand why we don't have it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet, the PR #30187 fixing the tags.input_tags.sparse
needs to be merged.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Reference Issues/PRs
Related to issues #30131 and #16298
Continuation of PR #29818
What does this implement/fix? Explain your changes.
Following #30040 (comment) we would like to add a new
check_sample_weight_equivalence_on_sparse_data
in the common estimator checks, similar tocheck_sample_weight_equivalence
but on sparse data, as several estimators (for exampleLinearRegression
) handle differently sparse or dense data.TODO
filter on estimators accepting sparse inputs by using the tags system ('tags.input_tags.sparse=True`) instead of catching the errorwill be done in a follow-up PR.