Skip to content

TST Add sample order invariance to estimator_checks #17598

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 16 commits into from

Conversation

anhqngo
Copy link

@anhqngo anhqngo commented Jun 15, 2020

Reference Issues/PRs

Fixes #8695

What does this implement/fix? Explain your changes.

Added check_methods_sample_order_variance function, which checks for method invariance under sample order (i.e. results should not change when sample before or after applying the method).

For example, if we randomize the indices, the results should still be the same:

idx = np.random.permutation(X.shape[0])
assert_allclose_dense_sparse({method}(X)[idx], {method}(X[idx]))

Any other comments?

The original issue mentions using assert_array_equal to check for invariance. While this method works for predict, decision_function, score_samples, and predict_proba, it fails for transform. Therefore, I have opted for assert_allclose_dense_sparse with atol=1e-9 instead of assert_array_equal, which passes the tests for all methods.

Furthermore, the original issue uses random sampling, but since we already have check_methods_subset_invariance already, I just shuffled the indices instead of random sampling.

@anhqngo anhqngo changed the title Add sample order invariance to estimator_checks [WIP] Add sample order invariance to estimator_checks Jun 18, 2020
Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ngojason9, this is very useful! For the failing estimaotr,

  • we can skip RadiusNeighborsTransformer as it returns a (n_samples, n_samples) matrix and so won't verify this property. You can do that with _xfail_checks estimator tag: https://scikit-learn.org/stable/developers/develop.html#estimator-tags
  • For stochastic models including BernoulliRBM, maybe defaults, the dataset or tolerances would need to be adjusted.

@anhqngo
Copy link
Author

anhqngo commented Jun 19, 2020

Thanks @rth for the suggestion. I added _xfail tags for the transform method in RadiusNeighborsTransformer and KNeighborsTransformer classes, as well as score_samples method in BernoulliRBM since they are non-deterministic. Let me know what you think!

@anhqngo anhqngo requested a review from rth June 19, 2020 18:52
@anhqngo anhqngo changed the title [WIP] Add sample order invariance to estimator_checks Add sample order invariance to estimator_checks Jun 19, 2020
Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ngojason9 ! A few last comments otherwise LGTM.

if hasattr(estimator, "n_components"):
estimator.n_components = 1
if hasattr(estimator, "n_clusters"):
estimator.n_clusters = 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to have 1 clusters for this test though? The result is always going to be the same cluster, component so we won't be testing much?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm I'm not entirely sure what n_clusters should be instead. Are you suggesting that if the estimator has 2 clusters, then we should test each cluster individually? Apologies if I totally misunderstood your suggestion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No mean that with n_clusters=1, for instance KMeans.predict whether there is sample invariance or not will predict that one cluster, since there is no alternative values. Could we try changing this to 2 and see if tests still pass?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rth I tried changing to 2 and the tests still pass. I also tried deleting that block of code altogether and the tests also pass. How would you like me to proceed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please set it to 2.

anhqngo and others added 2 commits June 22, 2020 10:10
Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>
Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good. Could you add an entry in whats new since it will be run by the check_estimator for third-party libraries?

@glemaitre glemaitre changed the title Add sample order invariance to estimator_checks TST Add sample order invariance to estimator_checks Jun 23, 2020
@glemaitre
Copy link
Member

@rth could you have a final look at this one.

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting estimator.n_clusters = 2 would be preferable I think otherwise LGTM

if hasattr(estimator, "n_components"):
estimator.n_components = 1
if hasattr(estimator, "n_clusters"):
estimator.n_clusters = 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please set it to 2.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we have decided that we can break check_estimator backwards compatibility since new checks can be disabled by downstream libraries?

@flosincapite
Copy link

Hi @ngojason9 ! It looks like the CI is failing but this PR is otherwise okay. @rth, is that correct? If so, Jason, please make the necessary fixes to ensure CI passes.

@rth
Copy link
Member

rth commented Jul 17, 2020

I assume we have decided that we can break check_estimator backwards compatibility since new checks can be disabled by downstream libraries?

That and we should ideally merge some version of #17361 before the next release.

please make the necessary fixes to ensure CI passes.

Merging master in might be enough. Once we fix #17943 that is.

@rth
Copy link
Member

rth commented Aug 7, 2020

@ngojason9 Could you please merge upstream/master to resolve CI issues?

@glemaitre glemaitre self-assigned this Aug 24, 2020
@glemaitre glemaitre removed their assignment Aug 24, 2020
@glemaitre
Copy link
Member

@ngojason9 we finally introduce #17361 in master. So if you could merge master into your branch and make the above change, then we will be able to merge this PR.

Copy link
Contributor

@cmarmo cmarmo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ngojason9, three approvals already! Once the lint fixed this PR is ready for merging!

@@ -1168,6 +1169,41 @@ def check_methods_subset_invariance(name, estimator_orig, strict_mode=True):
atol=1e-7, err_msg=msg)


@ignore_warnings(category=FutureWarning)
def check_methods_sample_order_invariance(name, estimator_orig, strict_mode=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def check_methods_sample_order_invariance(name, estimator_orig, strict_mode=True):
def check_methods_sample_order_invariance(name, estimator_orig,
strict_mode=True):

@glemaitre
Copy link
Member

I fixed the linting issue in #18570 because I could not push directly in this branch.
Everything was green so I merged it.

Thanks @ngojason9 for the work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tests for sample order invariance in estimator_checks
6 participants