[WIP] Enforce positiveness in tests using enforce_estimator_tags_X #14705
Conversation
I'll also include a test for
I think I'll need to change the message in the Chi2Samplers from "Entries of X must be non-negative." into "Negative values in data passed to", to pass
I'm having trouble figuring out what to do to fix the tests: the problem comes from
I guess a solution is to not flag them as
Ah my bad,
Ah my bad again, turns out
I'm back to the following reformulated question then: #14705 (comment)
In the end, since the Chi squared kernel works for positive points (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.chi2_kernel.html#sklearn.metrics.pairwise.chi2_kernel), I guess it makes sense to forbid passing negative data to
I just added a modification to ensure positivity at fit time too
Actually I saw that
However tests now still fail for
Hm, I don't think it's worth creating a tag for that. And it's kind of a weird requirement. Not sure if anyone actually uses this and whether we should just deprecate it...
Actually it's usually applied to positive data, so maybe just add the tag asking for positivity; that should be fine.
Alright, done (see commit 851887d). Otherwise I was also thinking, since we're talking about
@@ -112,11 +113,6 @@ def test_skewed_chi2_sampler():
    assert np.isfinite(kernel_approx).all(), \
        'NaNs found in the approximate Gram matrix'

    # test error is raised when input contains values smaller than -c
We should be able to remove this since we now test for positivity and it's already tested by `check_fit_non_negative`? Although now that I think about it, `check_fit_non_negative` would only test the `fit` part, whereas the test below covers the `transform` part, so I'll leave it here but adapt it to check just positivity (and not > -c). But then it reminds me: should we have a `check_transform_non_negative` as well in common tests?
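Since `check_transform_non_negative` does not exist yet, here is a minimal self-contained sketch of what such a common test might look like; `DummySampler` and the check itself are illustrative stand-ins, not scikit-learn code:

```python
import numpy as np


def check_transform_non_negative(name, transformer):
    """Hypothetical common test: a transformer that requires positive X
    should also reject negative X at transform time, not only at fit time."""
    rng = np.random.RandomState(42)
    X_pos = np.abs(rng.randn(10, 3))   # valid, non-negative data
    X_neg = X_pos.copy()
    X_neg[0, 0] = -1.0                 # inject a single negative entry
    transformer.fit(X_pos)
    try:
        transformer.transform(X_neg)
    except ValueError:
        return  # the expected rejection
    raise AssertionError(
        f"{name}.transform did not raise ValueError on negative input")


class DummySampler:
    """Toy stand-in for a SkewedChi2Sampler-like transformer."""

    def fit(self, X):
        if (X < 0).any():
            raise ValueError("Negative values in data passed to DummySampler")
        return self

    def transform(self, X):
        if (X < 0).any():
            raise ValueError("Negative values in data passed to DummySampler")
        return np.sqrt(X)


# Passes silently: DummySampler rejects negative input at transform time.
check_transform_non_negative("DummySampler", DummySampler())
```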
Done
It's not supposed to raise an error on negative values, only on values smaller than `-c`.
raise ValueError("X may not contain entries smaller than"
                 " -skewedness.")

check_non_negative(X, "SkewedChi2Sampler (input X)")
Why did you replace the test? Now it will fail with things that previously worked.
Sorry, I thought you meant to actually change the behavior of the estimator (and indeed make it more restrictive) here:
Actually it's usually applied to positive data so maybe just add the tag asking for positivity, that should be fine.
Because the thing is, AFAIU, I can't just add a `requires_positive_X` tag but still let the estimator accept negative values (> -c), because `check_fit_non_negative` (a common test that checks that any estimator with a `requires_positive_X` tag throws an error when given negative X) will then fail. Actually it fails because of the error message, but even if I change the error message into something like "Negative values in data passed to" (which I could tweak into something like "Negative values in data passed to SkewedChi2Sampler after adding the constant c", which would be more exact), the test could still fail in principle, because it could test only on negative values that are > -c. (It doesn't, because the negative values are random with an apparently high enough variance, but logically it could.)
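To make the mismatch concrete: `SkewedChi2Sampler` only needs entries above `-skewedness`, while a `requires_positive_X`-style check rejects anything negative. A toy numpy illustration (the helper functions here are hypothetical, not the scikit-learn implementations):

```python
import numpy as np

c = 1.0  # hypothetical skewedness value


def skewed_chi2_valid(X, skewedness):
    # The estimator's actual requirement: entries strictly above -skewedness.
    return bool((X > -skewedness).all())


def non_negative(X):
    # What a generic requires_positive_X check enforces.
    return bool((X >= 0).all())


X = np.array([[-0.5, 0.2], [0.3, 0.1]])  # contains a value in (-c, 0)
print(skewed_chi2_valid(X, c))  # True: accepted by the estimator
print(non_negative(X))          # False: rejected by the generic check
```

So any data with entries in the open interval (-c, 0) is exactly where the estimator and the generic tag check disagree.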
I agree with @amueller, I would not change the behavior of the estimator to make the common tests pass. It just means that `requires_positive_X` is not suited here.
I agree, so should I create a new safe tag here? Something like `requires_X_higher_than_constant`, that I could enforce in `enforce_estimator_tags_X`? Because we need to find a way for tests that by default have negative values to translate them into values > -c for `SkewedChi2Sampler`.
Or maybe an `X_types` kind of tag?
Y[0, 0] = -c / 2.

# approximate kernel mapping
transform = SkewedChi2Sampler(skewedness=c, n_components=1000,
It's a bad sign if you have to change any tests, because that probably means that you changed behavior and are breaking someone's code.
+1 don't do this.
I agree, I forgot about those comments... I just put the PR into WIP until there's a better solution (see comment above)
sklearn/utils/estimator_checks.py
Outdated
if _safe_tags(estimator, "requires_positive_X"):
    # Create strictly positive X. The minimal increment above 0 is 1, as
    # X could be of integer dtype.
    X += 1 + abs(X.min())
Why not `X -= X.min() - 1e-5` or something like that? I don't know why you would do abs. If your data had a minimum of 2, after this it has a minimum of 5.
I agree, `X -= X.min() - 1e-5` seems better; I had just taken this line from `enforce_estimator_tags_y` above. Will do.
Ahh, the comment was about dtype int? Hm, not sure how far that applies to X here. But if tests are passing then that's fine I guess ;) Is this ready? If so, please rename from WIP to MRG.
Ah yes, I didn't see that comment. I guess this means that if X is integer at some point, we want to keep it integer? I don't know to what extent that applies indeed, but maybe we can keep the int just in case? What about `X -= X.min() - 1`? So that it'll still be ok wrt your comment https://github.com/scikit-learn/scikit-learn/pull/14705/files/c6751b454bdf84699954e3f6bfa7000db2d42d48#r316775147 but it keeps the integer dtype?
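A quick numpy check of the trade-off being discussed, assuming the goal is to preserve an integer dtype: the `- 1` shift keeps the dtype with a minimum of exactly 1, while the in-place `- 1e-5` float shift fails on integer arrays (a sketch of the two variants, not the PR's code):

```python
import numpy as np

X_int = np.array([[2, 5], [7, 3]])  # integer data, minimum is 2

# Variant 1: integer shift keeps the dtype, new minimum is exactly 1.
X1 = X_int.copy()
X1 -= X1.min() - 1
print(X1.dtype, X1.min())  # integer dtype (name varies by platform), min 1

# Variant 2: an in-place float shift cannot be cast back to the int array.
X2 = X_int.copy()
try:
    X2 -= X2.min() - 1e-5  # float result, same_kind casting fails
except TypeError:
    print("in-place float shift raises TypeError on int arrays")
```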
# Estimators with a `requires_positive_X` tag only accept strictly positive
# inputs
if _safe_tags(estimator, "requires_positive_X"):
    # Create strictly positive X. The minimal increment above 0 is 1, as
The comment and the code don't really agree, right?
That's right, I forgot the comment :p (see #14705 (comment))
I think if tests pass we can merge
@wdevazelhes it does not pass (yet)
@@ -1168,8 +1167,7 @@ def _check_X_y(self, X, y):
#                      force_all_finite=True)
X, y = check_X_y(X, y, accept_sparse=False, force_all_finite=True)
X, y = check_X_y(X, y, dtype='int')
if np.any(X < 0):
    raise ValueError("X must not contain negative values.")
check_non_negative(X, "CategoricalNB (input X)")
By the way, `check_non_negative` finds the min and checks if it's negative; wouldn't it be faster to use `np.any`? (In theory it would just need to iterate until finding one negative element, but I don't know if that's what `any` does.)
%timeit is your friend
That's right, with timeit `np.min()` is faster. I forgot that with `np.any` we still need to compute the `X < 0` mask, which amounts to one pass over the whole data; that's why it'll not be faster. I didn't find any numpy function that would avoid this whole pass, so I guess `np.min` is good:
```python
In [1]: import numpy as np
In [2]: a = np.random.randn(10000, 10000)
In [3]: %timeit np.any(a)
124 ms ± 808 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit np.min(a)
56.5 ms ± 69.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
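For completeness, the two detection strategies agree on the result; the point above is that `np.any` still needs the full `X < 0` mask pass first. A small sketch of the equivalence (not the `check_non_negative` implementation):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 10)

has_neg_via_min = X.min() < 0    # one pass, no temporary boolean array
has_neg_via_any = np.any(X < 0)  # builds the X < 0 mask, then scans it

# Both strategies detect negative entries identically.
assert bool(has_neg_via_min) == bool(has_neg_via_any)
print(bool(has_neg_via_min))  # True for standard-normal data
```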
Tests are green now (they were failing in
Any update on #14705 (comment)?
@wdevazelhes you need to revert the API change in SkewedChi2Sampler. Also see how to fix the merge conflicts.
This is now made irrelevant after #16286.
See #14680 (comment)