[MRG] MNT Api only mode in check_estimator #18582
Conversation
At some point, I think we need to have docstrings for at least the API-only checks, so we can include them in the docs. That way there is a human-readable description of "what is the API".
'check_n_features_in',
# set of checks that do not check API-compatibility. They are ignored when
# api_only is True.
_NON_API_CHECKS = set([
I think we should hard-code the complement of this set, _API_CHECKS, so it is easier to tell "what is the API". In the docs, we can point to _API_CHECKS to be "what the API is".
I don't think we can immediately do that, because there are 3 kinds of checks, not 2:
- the ones that don't check the API at all (in this set)
- the ones that only check the API
- the ones that check the API in some parts, and do some other kinds of checks in other parts.
1 and 2 don't use the api_only parameter. The checks in 3 are the only ones to use it (we agreed in previous PRs that this was the easiest way to go).
Separating 2 and 3 would be a lot of work, probably worth it I agree, but out of scope here.
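To make the third kind concrete, here is a minimal sketch (not actual scikit-learn code) of a check that validates the API unconditionally and only runs its stricter, non-API part when api_only is False; the check name and assertions are made up:

```python
import numpy as np
from sklearn.base import clone

def check_fit_api_and_behaviour(name, estimator_orig, api_only=False):
    # Hypothetical "kind 3" check: part API, part behaviour.
    rng = np.random.RandomState(0)
    X = rng.rand(20, 3)
    y = rng.randint(0, 2, size=20)
    estimator = clone(estimator_orig)

    # API part (always run): fit must return the estimator itself.
    assert estimator.fit(X, y) is estimator

    if not api_only:
        # Non-API part (skipped in api_only mode): stricter conventions,
        # e.g. checking the exact error message raised on invalid input.
        pass
```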
separating 2 and 3 would be a lot of work, probably worth it I agree, but out of scope here.
I see defining the API as being mainly about communication. Having 3 kinds makes it harder to communicate what the API is.
I guess as a first step, we can add this api_only=True and state "Your estimator is API compliant if you pass check_estimator(..., api_only=True)". In a future PR, we can separate 2 and 3, and then we can firmly say "These are the checks that test the API".
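As a usage sketch of that statement (api_only is the flag proposed in this PR, not a parameter of released scikit-learn):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.utils.estimator_checks import check_estimator

# Run only the API-compatibility checks; the stricter behavioural
# parts are skipped (proposed behaviour of this PR).
check_estimator(LogisticRegression(), api_only=True)
```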
Agreed. Merging this PR isn't the end of the road, it's only the beginning.
Which is why I think we should keep things as simple as possible here, and improve incrementally in the future.
Totally agree, the checks need much better docs. Most of them were completely undocumented. In this PR I tried to document them at least a little bit to ease the review.
I like this redefinition of "strict". I wonder whether I would have designed it as mode="compatible" vs mode="strict". This is looking good to me, but I've not had time to review in full that the implementation matches the designation of what is "API only".
Thanks @NicolasHug. I haven't gone through all the tests yet, but I guess I'd include quite a few more tests in the API tests than what we have now. And there are a bunch of tests which have api_only=False defaults but also have if not api_only sections; not sure if that makes sense.
# When api_only is True, check_fit2d_1sample is skipped along
# with the rest of the xfail_checks
with pytest.warns(SkipTestWarning, match=skip_match):
    check_estimator(NuSVC(), api_only=True)
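A self-contained reading of the snippet above; skip_match is assumed here to be a regex matching the skip reason, and api_only is the parameter proposed by this PR, so this only runs against this branch:

```python
import pytest
from sklearn.svm import NuSVC
from sklearn.exceptions import SkipTestWarning
from sklearn.utils.estimator_checks import check_estimator

# Placeholder pattern; the real test builds it from the expected skip reason.
skip_match = "check_fit2d_1sample"

# The warning must be raised: the non-API check is skipped, not silently dropped.
with pytest.warns(SkipTestWarning, match=skip_match):
    check_estimator(NuSVC(), api_only=True)
```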
If we're calling this with api_only=True, then why is there a need to ignore a "check_fit2d_1sample is not an API check" warning?
Sorry, I don't understand the question. We're not ignoring a warning here, we're making sure it's raised.
The SkipTestWarning is raised by check_fit2d_1sample because it is a non-API check and we have set api_only=True. There are other warnings that are raised by check_estimator; we just assert that at least this one is raised.
What I mean here is: as a user, if I call check_estimator(..., api_only=True), I'm explicitly saying I only want the api_only tests. Then why would I get a warning which says "check_fit2d_1sample is not an API check"? Should I just not be getting that warning?
We're treating these checks as if they were in the _xfail_checks tag, and for this tag we decided to ignore with a warning. It makes things much simpler to re-use that logic rather than implement a new check-ignoring logic (I believe there are discussions in previous PRs like #17361).
I personally don't have a strong opinion on whether the ignored checks should raise a warning or not. Note that for parametrize_with_checks, which is sort of the sister of check_estimator, those checks would be marked as XFAIL, so pytest would still report them in stdout, saying something like "this was xfailed" or "this was xpassed".
Maybe @rth @thomasjpfan can remind us why we decided to ignore with a warning instead of just ignoring?
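For context, a sketch of the existing _xfail_checks tag mechanism the comment refers to; the check name and reason below are just examples:

```python
from sklearn.base import BaseEstimator

class MyEstimator(BaseEstimator):
    def _more_tags(self):
        # Checks listed here are expected to fail; check_estimator skips them
        # with a SkipTestWarning, and parametrize_with_checks marks them XFAIL.
        return {
            "_xfail_checks": {
                "check_fit2d_1sample": "known limitation, see the linked issue",
            }
        }
```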
I think it is better to raise a skip warning. It allows us to know which tests were not run and for which reason.
# this case, check_fit_non_negative() will not check the exact error
# message. (We still assert that the warning from
# check_fit2d_1sample is raised)
with pytest.warns(SkipTestWarning, match=skip_match):
same question here
same answer ;)
Thanks @adrinjalali for the review, I replied in the comments.
I'd include quite a few more tests in API tests than what we have now
I believe most of these already are API checks. Any check that isn't in NON_API_CHECKS will (at least partially) check the API. The value of the api_only param determines which parts of these checks are part of the API. Some checks are actually full API checks, in which case api_only isn't even used.
And there are a bunch of tests which have api_only=False defaults, but also have if not api_only sections, not sure if that makes sense
We want the default of api_only to be False because we want the checks to be strict by default. All checks need to have this default (whether they use it or not) so that calls to checks (with partials, etc.) don't get too complicated. The value of the default is not an indication of whether the check is part of the API checks or not. I agree this is not completely ideal, but this is the solution we went for after quite a lot of alternatives (#17361 (review)). We also need to think about backward compatibility here.
In general the whole check suite could use a big re-design. Even the "name" parameters mostly don't make sense and are usually not used. I'd love to see that happen, but that's not in scope here.
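A small self-contained sketch of the point above about uniform defaults: because every check accepts api_only (whether or not it uses it), the dispatch code can forward the flag with functools.partial without special-casing; all names here are illustrative, not actual scikit-learn code:

```python
from functools import partial

def check_uses_flag(name, estimator, api_only=False):
    # "Kind 3" check: behaves differently depending on api_only.
    pass

def check_ignores_flag(name, estimator, api_only=False):
    # Pure API (or pure non-API) check: the flag is accepted but unused.
    pass

ALL_CHECKS = [check_uses_flag, check_ignores_flag]

def yield_checks(estimator, api_only=False):
    # The uniform signature keeps the dispatch trivial.
    for check in ALL_CHECKS:
        yield partial(check, api_only=api_only)
```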
The split between API/non-API seems OK to me. I think that we should split the tests into these 2 groups later on. It would allow having nicer documentation that we can expose publicly, as pointed out by @thomasjpfan.
I am wondering if it would not be a good solution. It might allow grouping the tests in a more fine-grained way.
@@ -816,7 +816,7 @@ def check_estimator_sparse_data(name, estimator_orig, strict_mode=True):

 @ignore_warnings(category=FutureWarning)
-def check_sample_weights_pandas_series(name, estimator_orig, strict_mode=True):
+def check_sample_weights_pandas_series(name, estimator_orig, api_only=False):
My latest comment regarding several modes would be nice here. One mode would be the interoperability of array-like inputs. One might want to be API and array-like compliant while not being strict on the error messages raised. In this case maybe mode is not a good name, but I have in mind something like mode=["api", "interoperability"].
I think I am fine with the change. I would maybe prefer to introduce a list of str instead of a single keyword though. WDYT @NicolasHug?
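A toy, self-contained sketch of the "list of str" alternative being discussed; the mode names and the grouping below are purely hypothetical, not part of this PR or of scikit-learn:

```python
def check_api(estimator):
    # API-surface checks: methods and parameters exist and behave consistently.
    assert hasattr(estimator, "fit") and hasattr(estimator, "get_params")

def check_array_like_support(estimator):
    # Interoperability checks: accept lists, pandas objects, etc.
    pass

_CHECKS_BY_MODE = {
    "api": [check_api],
    "interoperability": [check_array_like_support],
}

def check_estimator_sketch(estimator, modes=("api", "interoperability")):
    for mode in modes:
        for check in _CHECKS_BY_MODE[mode]:
            check(estimator)
```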
No strong opinion, but I'm not sure having lots of check levels is something we want to get into.
I am fine with this step forward.
I just tested the approach in LightGBM. Once microsoft/LightGBM#3533 is merged (changes that were required after announcing some deprecations in 0.23), the api_only mode works. I will try to do the same in xgboost.
Did we announce it in 0.22 or 0.23?
Maybe also hdbscan and UMAP?
We announced in 0.23 for changing in 0.24.
I tested check_estimator on cuml estimators and they all fail because they don't have estimator tags (they do not even inherit from BaseEstimator). I wonder if having estimator tags is mandatory to be a compatible estimator. Do you think check_estimator could somehow work on estimators without tags, maybe by assuming default tags in this case (seems brittle though...)?
Uhm, this is weird regarding the estimator tags. We have default tags, so the tests will probably be run in this case. If they break, I would say that this is fine: it forces the use of tags to bypass the tests.
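A rough sketch of the "assume default tags" idea for estimators that don't inherit from BaseEstimator; note that _DEFAULT_TAGS in sklearn.utils._tags is a private detail and its location may differ across versions:

```python
def get_tags_with_fallback(estimator):
    # Use the estimator's own tags when available, otherwise assume defaults.
    try:
        return estimator._get_tags()
    except AttributeError:
        from sklearn.utils._tags import _DEFAULT_TAGS  # private API
        return dict(_DEFAULT_TAGS)
```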
We just had a meeting with @glemaitre @ogrisel @jeremiedbb @amueller @adrinjalali @thomasjpfan (please edit if I missed anything). With the approaching release and the potential for a cleaner/more modular implementation (#18750), we decided on the following (among other things):
Closes #13969
Towards #16241
This PR introduces the possibility to only run API-compatibility checks in check_estimator and parametrize_with_checks, as discussed in the previous monthly meeting. This is a follow-up to #17361, though I had to rename strict_mode=True into api_only=False.
I had to make a bunch of decisions on what we consider to be an API check. In general, my line of thought was: __init__, set/get_params(), attributes, etc. However I don't feel strongly about any of these, so please feel free to directly push to this PR if you disagree ;)
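To make that line of thought concrete, a small illustrative example of what an API-level property looks like (this is not one of the actual checks):

```python
import inspect
from sklearn.linear_model import LogisticRegression

def init_params_round_trip(estimator):
    # Constructor parameters are exposed through get_params and can be fed
    # back through set_params, which returns the estimator itself.
    init_params = set(
        inspect.signature(type(estimator).__init__).parameters
    ) - {"self"}
    params = estimator.get_params(deep=False)
    assert init_params <= set(params)
    assert estimator.set_params(**params) is estimator

init_params_round_trip(LogisticRegression())
```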
CC @jnothman @rth @thomasjpfan @ogrisel @adrinjalali @glemaitre