
[WIP] Enforce positiveness in tests using enforce_estimator_tags_X #14705


Closed

Conversation

@wdevazelhes
Contributor Author

I'll also include a test for enforce_estimator_tags_X. However, I noticed that enforce_estimator_tags_y does not have any test; should I include one too?

@wdevazelhes
Contributor Author

I think I'll need to change the message in the Chi2Samplers from "Entries of X must be non-negative." to "Negative values in data passed to", to pass check_fit_non_negative for those.
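For context, the common test in question looks roughly like this (a sketch modeled on check_fit_non_negative in sklearn/utils/estimator_checks.py, with pytest.raises standing in for the project's own assertion helper; details may differ):

import numpy as np
import pytest
from sklearn.base import clone

def check_fit_non_negative(name, estimator_orig):
    # Estimators tagged requires_positive_X must reject negative input,
    # and the check matches on this exact message prefix.
    X = np.array([[-1.0, 1.0], [-1.0, 1.0]])
    y = np.array([1, 2])
    with pytest.raises(ValueError, match="Negative values in data passed to"):
        clone(estimator_orig).fit(X, y)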

William de Vazelhes added 2 commits August 21, 2019 11:55
@wdevazelhes
Contributor Author

wdevazelhes commented Aug 21, 2019

I'm having trouble figuring out how to fix the tests. The problem comes from AdditiveChi2Sampler and SkewedChi2Sampler: they should work on kernel matrices (so positive, if I understand correctly) at fit and transform. EDIT: they work on positive datasets X. However, fit is actually stateless (it only uses n_features), and as such does not forbid negative X. Should it? If it should not, then we don't need the requires_positive_X flag, but then there's a problem in check_estimators_pickle (according to this CI report: https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=6740): the test will not make X positive, and it therefore fails at the transform step (because X has negative elements). Should we separate the tag into requires_positive_X_fit and requires_positive_X_transform?

@wdevazelhes
Contributor Author

I guess a solution is to not flag them as requires_positive_X (because at fit they don't strictly enforce this), but to still make the tests use pairwise_estimator_convert_X?
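For anyone following along, pairwise_estimator_convert_X is the common-test helper that turns raw X into a kernel or distance matrix for pairwise estimators. Roughly, from memory of sklearn/utils/estimator_checks.py around that time (exact helper names may differ):

from sklearn.metrics.pairwise import linear_kernel, pairwise_distances

def _is_pairwise_metric(estimator):
    # Sketch: true when the estimator expects a precomputed distance matrix.
    return getattr(estimator, "metric", None) == "precomputed"

def _is_pairwise(estimator):
    # Sketch: true when the estimator expects a precomputed kernel matrix.
    return getattr(estimator, "_pairwise", False)

def pairwise_estimator_convert_X(X, estimator, kernel=linear_kernel):
    # Pairwise estimators can't take raw feature matrices, so the common
    # tests convert X into a distance or kernel matrix first.
    if _is_pairwise_metric(estimator):
        return pairwise_distances(X, metric="euclidean")
    if _is_pairwise(estimator):
        return kernel(X, X)
    return X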

@wdevazelhes
Contributor Author

Ah, my bad: pairwise_estimator_convert_X is already applied in check_estimators_pickle, so there's something I don't understand.

@wdevazelhes
Contributor Author

wdevazelhes commented Aug 21, 2019

Ah, my bad again: it turns out SkewedChi2Sampler and AdditiveChi2Sampler don't actually take kernel matrices but actual matrices of points...

@wdevazelhes
Contributor Author

wdevazelhes commented Aug 21, 2019

I'm back to the following reformulated question then: #14705 (comment)

> I'm having trouble figuring out how to fix the tests. The problem comes from AdditiveChi2Sampler and SkewedChi2Sampler: they should work on kernel matrices (so positive, if I understand correctly) at fit and transform. EDIT: they work on positive datasets X. However, fit is actually stateless (it only uses n_features), and as such does not forbid negative X. Should it? If it should not, then we don't need the requires_positive_X flag, but then there's a problem in check_estimators_pickle (according to this CI report: https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=6740): the test will not make X positive, and it therefore fails at the transform step (because X has negative elements). Should we separate the tag into requires_positive_X_fit and requires_positive_X_transform?

@wdevazelhes
Contributor Author

wdevazelhes commented Aug 21, 2019

In the end, since the chi-squared kernel is defined for positive points (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.chi2_kernel.html#sklearn.metrics.pairwise.chi2_kernel), I guess it makes sense to forbid passing negative data to fit, even if only n_features is used?
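For reference, a minimal illustration of why positivity matters here (the formula is the one from the linked chi2_kernel docs; the toy data is made up):

import numpy as np
from sklearn.metrics.pairwise import chi2_kernel

# k(x, y) = exp(-gamma * sum_i (x_i - y_i)**2 / (x_i + y_i)):
# a negative entry can make a denominator x_i + y_i zero or negative,
# so the kernel is only well defined on non-negative data.
rng = np.random.RandomState(0)
X = np.abs(rng.randn(3, 4))  # non-negative toy data
K = chi2_kernel(X, gamma=1.0)
assert np.isfinite(K).all()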

@wdevazelhes
Contributor Author

I just added a modification to ensure positivity at fit time too.
It made me think: do you think a check_transform_non_negative test should also be added, in addition to check_fit_non_negative?
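Something like this, say (purely hypothetical: this check does not exist, and the name and structure are just modeled on check_fit_non_negative):

import numpy as np
import pytest
from sklearn.base import clone

def check_transform_non_negative(name, estimator_orig):
    # Fit on valid positive data, then check that transform rejects
    # negative X with the same message prefix as check_fit_non_negative.
    X_fit = np.array([[1.0, 2.0], [3.0, 4.0]])
    X_neg = np.array([[-1.0, 1.0], [-1.0, 1.0]])
    y = np.array([1, 2])
    estimator = clone(estimator_orig).fit(X_fit, y)
    with pytest.raises(ValueError, match="Negative values in data passed to"):
        estimator.transform(X_neg)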

@wdevazelhes
Contributor Author

Actually, I saw that SkewedChi2Sampler allows negative values, so what I said above only applies to AdditiveChi2Sampler.

@wdevazelhes
Contributor Author

However, tests still fail for SkewedChi2Sampler because X has values lower than -c (SkewedChi2Sampler only accepts values higher than -c, where c is the skewedness constant). I'm not sure how to deal with that; should I create a new estimator tag for this case?
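For reference, the constraint comes from the feature map itself. A simplified sketch of the relevant input check in SkewedChi2Sampler.transform, from my reading of the source (details elided):

import numpy as np

def skewed_chi2_input_check(X, skewedness):
    # The skewed chi-squared feature map shifts X by the skewedness
    # constant c and takes a log, so any entry <= -c is invalid.
    X = np.asarray(X, dtype=float) + skewedness
    if (X <= 0).any():
        raise ValueError("X may not contain entries smaller than"
                         " -skewedness.")
    return np.log(X)  # the random projection then acts on log(X + c)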

@amueller
Member

Hm, I don't think it's worth creating a tag for that. And it's kind of a weird requirement. Not sure if anyone actually uses this and whether we should just deprecate it...

@amueller
Member

Actually, it's usually applied to positive data, so maybe just add the tag asking for positivity; that should be fine.

@wdevazelhes
Contributor Author

> Actually, it's usually applied to positive data, so maybe just add the tag asking for positivity; that should be fine.

Alright, done (see commit 851887d).

Otherwise, since we're talking about X, I was also thinking that an X_types estimator tag value like higher_than_c could be used (maybe less heavyweight than a whole new estimator tag)?
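To illustrate what I mean (purely hypothetical: higher_than_c is made up, and X_types normally holds values like "2darray" or "sparse"):

from sklearn.base import BaseEstimator, TransformerMixin

class MySampler(BaseEstimator, TransformerMixin):
    # Hypothetical sketch: piggy-back on the existing X_types tag instead
    # of adding a new boolean tag; the common tests would then look for
    # "higher_than_c" when generating test data.
    def _more_tags(self):
        return {"X_types": ["2darray", "higher_than_c"]}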

@@ -112,11 +113,6 @@ def test_skewed_chi2_sampler():
     assert np.isfinite(kernel_approx).all(), \
         'NaNs found in the approximate Gram matrix'

-    # test error is raised on when inputs contains values smaller than -c
Contributor Author

We should be able to remove this, since we now test for positivity and it's already covered by check_fit_non_negative? Although, now that I think about it, check_fit_non_negative only tests the fit part, whereas the code below tests the transform part, so I'll leave it here but adapt it to check plain positivity (and not > -c).
But that reminds me: should we also have a check_transform_non_negative in the common tests?

Contributor Author

Done

Member

It's not supposed to raise an error on negative values, only on values smaller than -c.

-        raise ValueError("X may not contain entries smaller than"
-                         " -skewedness.")
+        check_non_negative(X, "SkewedChi2Sampler (input X)")
Member

Why did you replace the test? Now it will fail with things that previously worked.

@wdevazelhes (Contributor Author) commented Aug 23, 2019

Sorry, I thought you meant to actually change the behavior of the estimator (and indeed make it more restrictive) here:

> Actually, it's usually applied to positive data, so maybe just add the tag asking for positivity; that should be fine.

Because the thing is, AFAIU I can't just add a requires_positive_X tag but still let the estimator accept negative values (> -c): check_fit_non_negative would then fail (it's a common test checking that any estimator with the requires_positive_X tag raises an error when given negative X). Strictly speaking it fails because of the error message, but even if I change the message to something like "Negative values in data passed to" (which I could tweak into something like "Negative values in data passed to SkewedChi2Sampler after adding the constant c", which would be more exact), the test could still fail in principle, because it could happen to test only negative values that are > -c. (It doesn't, because the negative values are random with a variance that is apparently high enough, but logically it could.)

Member

I agree with @amueller: I would not change the behavior of the estimator to make the common tests pass. It just means that requires_positive_X is not the right tag here.

Contributor Author

I agree, so should I create a new safe tag here? Something like requires_X_higher_than_constant, which I could enforce in enforce_estimator_tags_X? Because we need a way for tests whose data is negative by default to translate it into values > -c for SkewedChi2Sampler.
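Concretely, something along these lines (the tag name and the shift are made up, modeled on the requires_positive_X handling in this PR, and assuming the _safe_tags helper from the common tests is in scope):

def _enforce_estimator_tags_X(estimator, X):
    # Hypothetical: shift the test data so every entry exceeds -c, where
    # c is the estimator's skewedness constant (the default here is a guess).
    if _safe_tags(estimator, "requires_X_higher_than_constant"):
        c = getattr(estimator, "skewedness", 1.0)
        X = X - X.min() - c + 1e-5  # new minimum is -c + 1e-5 > -c
    return X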

Contributor Author

Or maybe an X_types kind of tag?

    Y[0, 0] = -c / 2.

    # approximate kernel mapping
    transform = SkewedChi2Sampler(skewedness=c, n_components=1000,
Member

It's a bad sign if you have to change any tests, because that probably means that you changed behavior and are breaking someone's code.

Member

+1 don't do this.

Contributor Author

I agree, I forgot about those comments... I just put the PR into WIP until there's a better solution (see comment above)

    if _safe_tags(estimator, "requires_positive_X"):
        # Create strictly positive X. The minimal increment above 0 is 1, as
        # X could be of integer dtype.
        X += 1 + abs(X.min())
Member

Why not X -= X.min() - 1e-5 or something like that? I don't know why you would do abs. If your data had a minimum of 2, after this it has a minimum of 5.

Contributor Author

I agree, X -= X.min() - 1e-5 seems better; I had just taken this line from enforce_estimator_tags_y above.
Will do.

Member

Ahh, the comment was about dtype int? Hm, not sure to what extent that applies to X here. But if tests are passing then that's fine I guess ;) Is this ready? If so, please rename from WIP to MRG.

Contributor Author

Ah yes, I didn't see that comment. I guess it means that if X is integer at some point, we want to keep it integer? I don't know to what extent that applies here, but maybe we should keep the int just in case? What about X -= X.min() - 1? That would still be OK with respect to your comment https://github.com/scikit-learn/scikit-learn/pull/14705/files/c6751b454bdf84699954e3f6bfa7000db2d42d48#r316775147 while keeping the integer dtype.
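For what it's worth, a quick toy comparison of the two shifts (both make X strictly positive, but the abs variant over-shifts data that is already positive, while the min-based shift is minimal and keeps the integer dtype):

import numpy as np

X = np.array([[2, 5], [3, 7]])       # already positive, integer, min = 2
print(X + 1 + abs(X.min()))          # min jumps from 2 to 5: over-shifted
print(X - (X.min() - 1))             # min becomes 1: minimal shift, still int

X_neg = np.array([[-3, 0], [4, 2]])  # min = -3
print(X_neg - (X_neg.min() - 1))     # min becomes 1: strictly positive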

    # Estimators with a `requires_positive_X` tag only accept strictly positive
    # inputs
    if _safe_tags(estimator, "requires_positive_X"):
        # Create strictly positive X. The minimal increment above 0 is 1, as
Member

The comment and the code don't really agree, right?

Contributor Author

That's right, I forgot to update the comment :p (see #14705 (comment))

@wdevazelhes
Contributor Author

I think if the tests pass we can merge.

@wdevazelhes changed the title from "[WIP] Enforce positiveness in tests using enforce_estimator_tags_X" to "[MRG] Enforce positiveness in tests using enforce_estimator_tags_X" on Sep 22, 2019
@wdevazelhes changed the title from "[MRG] Enforce positiveness in tests using enforce_estimator_tags_X" to "[WIP] Enforce positiveness in tests using enforce_estimator_tags_X" on Sep 22, 2019
@agramfort
Member

@wdevazelhes it does not pass (yet)

@@ -1168,8 +1167,7 @@ def _check_X_y(self, X, y):
         #                  force_all_finite=True)
         X, y = check_X_y(X, y, accept_sparse=False, force_all_finite=True)
         X, y = check_X_y(X, y, dtype='int')
-        if np.any(X < 0):
-            raise ValueError("X must not contain negative values.")
+        check_non_negative(X, "CategoricalNB (input X)")
Contributor Author

By the way, check_non_negative finds the min and checks whether it's negative; wouldn't it be faster to use np.any? (In theory it would only need to iterate until it finds one negative element, but I don't know whether that's what np.any does.)

Member

%timeit is your friend

Contributor Author

That's right, with timeit np.min() is faster. I forgot that with np.any we would still need the elementwise comparison pass (X < 0), which amounts to one full pass over the data anyway, so it wouldn't be faster... I didn't find any numpy function that avoids this full pass, so I guess np.min is good:

In [1]: import numpy as np

In [2]: a = np.random.randn(10000, 10000)

In [3]: %timeit np.any(a)
124 ms ± 808 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit np.min(a)
56.5 ms ± 69.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
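(One caveat on the benchmark above: the np.any timing doesn't include the comparison itself. A fairer comparison would be the following sketch, though np.min should still win, since a < 0 already costs a full pass over the data.)

# Not run here; timings will vary by machine:
#   %timeit np.any(a < 0)    # full pass to build a temporary bool array, then a scan
#   %timeit np.min(a) < 0    # single reduction pass, no temporary array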

@wdevazelhes
Contributor Author

Tests are green now. (They were failing in test_rbf_sampler because I was updating a shared global dataset X and y in place, and the RBF sampler test didn't like getting a dataset modified by another test, so I now update a copy.)
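The fix pattern, roughly (X here is a stand-in name for the shared module-level dataset):

import numpy as np

X = np.random.RandomState(0).randn(5, 3)  # stand-in for the shared dataset

# Before (leaky): `X += 1 + abs(X.min())` mutates the shared array in place,
# so later tests see the shifted data. After: shift a local copy instead.
X_local = X.copy()
X_local += 1 + abs(X_local.min())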

@wdevazelhes changed the title from "[WIP] Enforce positiveness in tests using enforce_estimator_tags_X" to "[MRG] Enforce positiveness in tests using enforce_estimator_tags_X" on Sep 26, 2019
@wdevazelhes changed the title from "[MRG] Enforce positiveness in tests using enforce_estimator_tags_X" to "[WIP] Enforce positiveness in tests using enforce_estimator_tags_X" on Sep 28, 2019
@wdevazelhes
Contributor Author

Any update on #14705 (comment)?

@agramfort
Member

@wdevazelhes you need to revert the API change in SkewedChi2Sampler. Please also fix the merge conflicts.

@agramfort
Member

This is now made irrelevant by #16286.

@agramfort closed this Mar 4, 2020