[WIP] Enforce positiveness in tests using enforce_estimator_tags_X #14705
Conversation
I'll also include a test for
I think I'll need to change the message in the Chi2Samplers from "Entries of X must be non-negative." into "Negative values in data passed to", to pass
I'm having trouble figuring out what to do to fix the tests: the problem comes from
I guess a solution is to not flag them as
Ah my bad,
Ah my bad again, turns out
I'm back to the following reformulated question then: #14705 (comment)
In the end, since the Chi squared kernel works for positive points (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.chi2_kernel.html#sklearn.metrics.pairwise.chi2_kernel), I guess it makes sense to forbid passing negative data to
I just added a modification to ensure positivity at fit time too
Actually I saw that
However tests now still fail for
Hm, I don't think it's worth creating a tag for that. And it's kind of a weird requirement. Not sure if anyone actually uses this and whether we should just deprecate it...
Actually it's usually applied to positive data, so maybe just add the tag asking for positivity; that should be fine.
Alright, done (see commit 851887d). Otherwise I was also thinking, since we're talking about
@@ -112,11 +113,6 @@ def test_skewed_chi2_sampler():
    assert np.isfinite(kernel_approx).all(), \
        'NaNs found in the approximate Gram matrix'

    # test error is raised when input contains values smaller than -c
We should be able to remove this since we now test for positivity and it's already tested by `check_fit_non_negative`? Although now that I think about it, `check_fit_non_negative` would only test the `fit` part, whereas the test below covers the `transform` part, so I'll leave it here but adapt it to check just positivity (and not > -c). But then it reminds me: should we have a `check_transform_non_negative` as well in common tests?
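Since `check_transform_non_negative` does not exist yet, here is a minimal self-contained sketch of what such a common test might look like; `DummySampler` and the check itself are illustrative stand-ins, not scikit-learn code:

```python
import numpy as np


def check_transform_non_negative(name, transformer):
    """Hypothetical common test: a transformer that requires positive X
    should also reject negative X at transform time, not only at fit time."""
    rng = np.random.RandomState(42)
    X_pos = np.abs(rng.randn(10, 3))   # valid, non-negative data
    X_neg = X_pos.copy()
    X_neg[0, 0] = -1.0                 # inject a single negative entry
    transformer.fit(X_pos)
    try:
        transformer.transform(X_neg)
    except ValueError:
        return  # the expected rejection
    raise AssertionError(
        f"{name}.transform did not raise ValueError on negative input")


class DummySampler:
    """Toy stand-in for a SkewedChi2Sampler-like transformer."""

    def fit(self, X):
        if (X < 0).any():
            raise ValueError("Negative values in data passed to DummySampler")
        return self

    def transform(self, X):
        if (X < 0).any():
            raise ValueError("Negative values in data passed to DummySampler")
        return np.sqrt(X)


# Passes silently: DummySampler rejects negative input at transform time.
check_transform_non_negative("DummySampler", DummySampler())
```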
Done
It's not supposed to raise an error on negative values, only on values smaller than `-c`.
raise ValueError("X may not contain entries smaller than"
                 " -skewedness.")

check_non_negative(X, "SkewedChi2Sampler (input X)")
Why did you replace the test? Now it will fail with things that previously worked.
Sorry, I thought you meant to actually change the behavior of the estimator (and indeed make it more restrictive) here:
Actually it's usually applied to positive data so maybe just add the tag asking for positivity, that should be fine.
Because the thing is, AFAIU, I can't just add a `requires_positive_X` tag but still let the estimator accept negative values (> -c), because `check_fit_non_negative` (a common test that checks that any estimator with a `requires_positive_X` tag throws an error when given negative X) will then fail. Actually it fails because of the error message, but even if I change the error message into something like "Negative values in data passed to" (which I could tweak into something like "Negative values in data passed to SkewedChi2Sampler after adding the constant c", which would be more exact), the test could still fail in principle, because it could test only on negative values that are > -c. (It doesn't, because the negative values are random with an apparently high enough variance, but logically it could.)
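To make the mismatch concrete: `SkewedChi2Sampler` only needs entries above `-skewedness`, while a `requires_positive_X`-style check rejects anything negative. A toy numpy illustration (the helper functions here are hypothetical, not the scikit-learn implementations):

```python
import numpy as np

c = 1.0  # hypothetical skewedness value


def skewed_chi2_valid(X, skewedness):
    # The estimator's actual requirement: entries strictly above -skewedness.
    return bool((X > -skewedness).all())


def non_negative(X):
    # What a generic requires_positive_X check enforces.
    return bool((X >= 0).all())


X = np.array([[-0.5, 0.2], [0.3, 0.1]])  # contains a value in (-c, 0)
print(skewed_chi2_valid(X, c))  # True: accepted by the estimator
print(non_negative(X))          # False: rejected by the generic check
```

So any data with entries in the open interval (-c, 0) is exactly where the estimator and the generic tag check disagree.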
I agree with @amueller, I would not change the behavior of the estimator to make the common tests pass. It just means that `requires_positive_X` is not suited here.
I agree, so should I create a new safe tag here? Something like `requires_X_higher_than_constant`, that I could enforce in `enforce_estimator_tags_X`? Because we need to find a way for tests that by default have negative values to translate them into values > -c for `SkewedChi2Sampler`.
Or maybe an `X_types` kind of tag?
Y[0, 0] = -c / 2.

# approximate kernel mapping
transform = SkewedChi2Sampler(skewedness=c, n_components=1000,
It's a bad sign if you have to change any tests, because that probably means that you changed behavior and are breaking someone's code.
+1 don't do this.
I agree, I forgot about those comments... I just put the PR into WIP until there's a better solution (see comment above)
sklearn/utils/estimator_checks.py
Outdated
if _safe_tags(estimator, "requires_positive_X"):
    # Create strictly positive X. The minimal increment above 0 is 1, as
    # X could be of integer dtype.
    X += 1 + abs(X.min())
Why not `X -= X.min() - 1e-5` or something like that? I don't know why you would do abs. If your data had a minimum of 2, after this it has a minimum of 5.
I agree, `X -= X.min() - 1e-5` seems better; I had just taken this line from `enforce_estimator_tags_y` above. Will do.
Ahh, the comment was about dtype int? Hm, not sure how far that applies to X here. But if tests are passing then that's fine I guess ;) Is this ready? If so, please rename from WIP to MRG.
Ah yes, I didn't see that comment. I guess this means that if X is integer at some point, we want to keep it integer? I don't know to what extent that applies indeed, but maybe we can keep the int just in case? What about `X -= X.min() - 1`? So that it'll still be ok wrt your comment https://github.com/scikit-learn/scikit-learn/pull/14705/files/c6751b454bdf84699954e3f6bfa7000db2d42d48#r316775147 but it keeps the integer dtype?
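A quick numpy check of the trade-off being discussed, assuming the goal is to preserve an integer dtype: the `- 1` shift keeps the dtype with a minimum of exactly 1, while the in-place `- 1e-5` float shift fails on integer arrays (a sketch of the two variants, not the PR's code):

```python
import numpy as np

X_int = np.array([[2, 5], [7, 3]])  # integer data, minimum is 2

# Variant 1: integer shift keeps the dtype, new minimum is exactly 1.
X1 = X_int.copy()
X1 -= X1.min() - 1
print(X1.dtype, X1.min())  # integer dtype (name varies by platform), min 1

# Variant 2: an in-place float shift cannot be cast back to the int array.
X2 = X_int.copy()
try:
    X2 -= X2.min() - 1e-5  # float result, same_kind casting fails
except TypeError:
    print("in-place float shift raises TypeError on int arrays")
```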
# Estimators with a `requires_positive_X` tag only accept strictly positive
# inputs
if _safe_tags(estimator, "requires_positive_X"):
    # Create strictly positive X. The minimal increment above 0 is 1, as
The comment and the code don't really agree, right?
That's right, I forgot the comment :p (see #14705 (comment))
I think if tests pass we can merge
@wdevazelhes it does not pass (yet)
@@ -1168,8 +1167,7 @@ def _check_X_y(self, X, y):
#                      force_all_finite=True)
X, y = check_X_y(X, y, accept_sparse=False, force_all_finite=True)
X, y = check_X_y(X, y, dtype='int')
if np.any(X < 0):
    raise ValueError("X must not contain negative values.")
check_non_negative(X, "CategoricalNB (input X)")
By the way, `check_non_negative` finds the min and checks if it's negative; wouldn't it be faster to use `np.any`? (In theory it would just need to iterate until finding one negative element, but I don't know if that's what `any` does.)
%timeit is your friend
That's right, with timeit `np.min()` is faster. I forgot that with `np.any` we still need to compute the `X < 0` mask, which amounts to one pass over the whole data; that's why it'll not be faster. I didn't find any numpy function that would avoid this whole pass, so I guess `np.min` is good:
```python
In [1]: import numpy as np
In [2]: a = np.random.randn(10000, 10000)
In [3]: %timeit np.any(a)
124 ms ± 808 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit np.min(a)
56.5 ms ± 69.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
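For completeness, the two detection strategies agree on the result; the point above is that `np.any` still needs the full `X < 0` mask pass first. A small sketch of the equivalence (not the `check_non_negative` implementation):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 10)

has_neg_via_min = X.min() < 0    # one pass, no temporary boolean array
has_neg_via_any = np.any(X < 0)  # builds the X < 0 mask, then scans it

# Both strategies detect negative entries identically.
assert bool(has_neg_via_min) == bool(has_neg_via_any)
print(bool(has_neg_via_min))  # True for standard-normal data
```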
Tests are green now (they were failing in
Any update on #14705 (comment)?
@wdevazelhes you need to revert the API change in SkewedChi2Sampler. Also see how to fix the merge conflicts.
This is now made irrelevant after #16286.
See #14680 (comment)