Skip to content

[MRG] ENH Disassemble check estimator #14381

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 47 commits into from
Aug 28, 2019

Conversation

thomasjpfan
Copy link
Member

Reference Issues/PRs

Fixes #11622
Alternative to #13843

What does this implement/fix? Explain your changes.

Libraries will be able to write the following to run their tests:

from itertools import chain

@pytest.mark.parameterize('estimator, check',
    chain.from_iterable(check_estimator(est, generate_only=False)
                        for est in estimators))
def test_sklearn_compatible_estimator(estimator, check):
    check(estimator)

Any other comments?

The name argument of the checks are set by check_estimator.

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Please add a what's new entry as well.

generate_only : bool, optional (default=True)
When `True`, checks are evaluated when `check_estimator` is called.
When `False`, `check_estimator` generates the checks and the estimator.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add versionadded

@@ -98,17 +100,15 @@ def _rename_partial(val):

@pytest.mark.parametrize(
"estimator, check",
_generate_checks_per_estimator(_yield_all_checks,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this function now?

@pytest.mark.parameterize(
'estimator, check',
chain.from_iterable(check_estimator(est, generate_only=False)
for est in estimators))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issues with this is that because estimator is an instance and not a class it's not going to be rendered nicely in the list of tests. That's why we use id=_rename_partial. I guess there is no easy way of fixing this for users...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without our _rename_partial, pytest names everything estimator213-check41. It could be slightly better if we yield the name as well from check_estimator, which will result in ARDRegression-estimator211-check21.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should really try to have the tests have sensible names. This is super useless otherwise I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not fix the fixme in _rename_partial and move it to estimator_checks and make it public and use it here? That solves the problem, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using print_changed_only adds newlines sometimes, which leads to pytest names with newlines in them.

Making _rename_partial is a good idea, it would need another name, maybe nicer_check_estimator_id? 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about using print_changed_only and remove the newlines? Also: if you're testing stuff that's that deeply nested there's no way to give it a good name.

@thomasjpfan thomasjpfan changed the title [MRG] ENH Disassemble check estimator [WIP] ENH Disassemble check estimator Jul 16, 2019
@thomasjpfan thomasjpfan changed the title [WIP] ENH Disassemble check estimator [MRG] ENH Disassemble check estimator Jul 16, 2019
@thomasjpfan
Copy link
Member Author

We can not use yield in check_estimator because python will automatically conver the function to a generator. This happens even if the yield is guarded by an if.

This PR was updated to return a list of checks, rather than a generator.

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @thomasjpfan !


from itertools import chain

@pytest.mark.parameterize(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@pytest.mark.parameterize(
@pytest.mark.parametrize(

@amueller
Copy link
Member

looks good a part from typos and setting the ids.

@@ -265,7 +265,31 @@ def _yield_all_checks(name, estimator):
yield check_fit_idempotent


def check_estimator(Estimator):
def readable_check_estimator_ids(val):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just set_check_estimator_ids or make_check_estimator_ids?

@@ -265,7 +265,31 @@ def _yield_all_checks(name, estimator):
yield check_fit_idempotent


def check_estimator(Estimator):
def readable_check_estimator_ids(val):
"""Create readable pytest ids for `check_estimator` when
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also used internally without reference to check_estimator.

How about

"""Create pytest ids for checks.

Returns string representations for pytest tests, to be used as id.
Use together with ``check_estimator(..., generate_only=True)``
"""

Though that makes documenting val possibly more difficult. But on the other hand you could just directly document what val needs to be instead of defining it implicitly.

Also: missing Returns section.

Copy link
Member

@amueller amueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if you actually tested the code that you recommend others will use ;)


from itertools import chain
import pytest
from sklearn.utils.estimator_check import check_estimator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from sklearn.utils.estimator_check import check_estimator
from sklearn.utils.estimator_checks import check_estimator

from itertools import chain
import pytest
from sklearn.utils.estimator_check import check_estimator
from sklearn.utils.estimator_check import readable_check_estimator_ids
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from sklearn.utils.estimator_check import readable_check_estimator_ids
from sklearn.utils.estimator_checks import readable_check_estimator_ids

pass


@pytest.mark.parametrize("val, expected", [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems to contradict the statement below ;)

@amueller
Copy link
Member

amueller commented Jul 22, 2019

Looks good apart from the pytest soft dependency issue and the name of the function.
I would replace "readable", or just remove it, because it doesn't really add any information.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT of adding this syntactic sugar?

def parametrize_with_checks(estimators):
    if hasattr(estimators, 'fit'):
        estimators = [estimators]
    return pytest.mark.parametrize(
        ['estimator', 'check'],
        check_estimator(estimator, generate_only=True),
        ids=set_check_estimator_ids)

Then

from sklearn.utils.estimator_checks import parametrize_with_checks
from sklearn.linear_model import LogisticRegression

@parametrize_with_checks(LogisticRegression)
def test_sklearn_compatible_estimator(estimator, check):
    check(estimator)

Then we could make set_check_estimator_ids private.

@rth
Copy link
Member

rth commented Aug 22, 2019

WDYT of adding this syntactic sugar?

+1 though maybe with a itertools.chain somewhere in parametrize_with_checks as check_estimator still doesn't support lists of estimators I think?

@jnothman
Copy link
Member

jnothman commented Aug 22, 2019 via email

@jnothman
Copy link
Member

jnothman commented Aug 22, 2019 via email

@rth
Copy link
Member

rth commented Aug 22, 2019

Having this interface actually means that we probably don't need the
awkward thing where check_estimator generates the estimator as well as the
checks.

parametrize_with_checks is highly interlinked with pytest (and depends on pytest), while check_estimator(estimator, generate_only=True), is much more general and could used for other applications.

For instance to programmatically generate a list of checks for a given application, manually skip a few selected ones and run the rest with pytest.parametrize. That's an important application for contrib projects. I think it would be worth keeping it.

@jnothman
Copy link
Member

jnothman commented Aug 22, 2019 via email

@thomasjpfan
Copy link
Member Author

Currently we still need the estimator passed back because of how we do set_checking_parameters 😅

@jnothman
Copy link
Member

Currently we still need the estimator passed back because of how we do set_checking_parameters 😅

I don't actually think we mean the same thing...

But yes, it's easier to have check_estimator generate pairs also to be unambiguous in the case of classes vs instances, etc.

@amueller amueller added the High Priority High priority issues and pull requests label Aug 26, 2019
@thomasjpfan
Copy link
Member Author

thomasjpfan commented Aug 26, 2019

Overall I like having the syntactic sugar, but the hard dependency on pytest is a little unsettling.

My comment was about whether it should generate (estimator, check) pairs or
just check.

@jnothman Can you expand on this?

@jnothman
Copy link
Member

jnothman commented Aug 26, 2019 via email

@thomasjpfan
Copy link
Member Author

Thank you for the clarification! As you said, the concern with only generating the checks is that it becomes harder to distinguish the class checks and the instance checks.

I will move forward with the parametrize_with_checks decorator, we would need to document how it names the tests since set_check_estimator_ids will become private.

@jnothman
Copy link
Member

jnothman commented Aug 26, 2019 via email

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to mention the new decorator in the developer guide.

@jnothman
Copy link
Member

@rth let's merge?

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @thomasjpfan !

@rth rth merged commit d9a12aa into scikit-learn:master Aug 28, 2019
@jnothman
Copy link
Member

jnothman commented Aug 28, 2019 via email

@amueller
Copy link
Member

amueller commented Sep 3, 2019

Awesome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
High Priority High priority issues and pull requests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow disassembled use of check_estimator
5 participants