
[MRG] Clone estimator between fits in sklearn/utils/estimator_checks.py::check_supervised_y_2d #10978


Closed

Conversation

@omtinez commented Apr 14, 2018

Reference Issues/PRs

Fixes #10977

What does this implement/fix? Explain your changes.

The test function check_supervised_y_2d refits an already fitted estimator, so state left over from the first fit can leak into the second. This PR clones the estimator between fits so that such side effects cannot affect the result of the test.

Any other comments?

Please see #10977 for a sample estimator that would fail this test, even though it correctly handles what check_supervised_y_2d should be testing for.

@amueller (Member)

I don't think this is necessary as we clone explicitly in every test. This helper function is only used in some tests, so I don't see why we should do the cloning in this one.

@omtinez (Author) commented Apr 30, 2018

During the test, the order of (relevant) events is:

  1. set_random_state()
  2. fit()
  3. set_random_state()
  4. fit()

The problem is that, if the estimator is not cloned between 2 and 3, then the test will fail for estimators that store and reuse a random state. This is because set_random_state() simply does estimator.set_params(random_state=random_state)[1]. If an estimator has already been initialized and fitted, we should not expect that changing the initial random state leads to re-initializing every single internal variable, including the estimator's RandomState object (used as per the guidelines for developers[2]).

It is not unreasonable for an estimator to check if an internal random_number_generator_ object has been initialized and, only in the case it has not, re-initialize it with the random_state seed. Otherwise, estimators would have to also store a variable holding the previous random_state seed in order to compare it on every call of fit() to see if it has changed...
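
To make the failure mode concrete, here is a self-contained sketch of such an estimator; the class, its attributes, and the dummy "model" are made up purely for illustration:

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils import check_random_state

class CachedRNGRegressor(BaseEstimator, RegressorMixin):
    """Hypothetical estimator that initializes its RNG only once."""

    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit(self, X, y):
        # reuse the generator if it already exists (the pattern described above)
        if not hasattr(self, 'random_number_generator_'):
            self.random_number_generator_ = check_random_state(self.random_state)
        # dummy "model": coefficients drawn from the (possibly advanced) generator
        self.coef_ = self.random_number_generator_.uniform(size=np.asarray(X).shape[1])
        return self

    def predict(self, X):
        return np.asarray(X) @ self.coef_

X, y = np.ones((4, 2)), np.zeros(4)
est = CachedRNGRegressor()
est.set_params(random_state=0)            # what set_random_state() boils down to
first = est.fit(X, y).predict(X)
est.set_params(random_state=0)            # seed unchanged, generator already advanced
second = est.fit(X, y).predict(X)
print(np.allclose(first, second))         # False: the two fits are not equivalent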

This is important for estimators that might implement partial_fit() and have lots of optimizations built-in to avoid re-doing a lot of work.

All in all, adding this single line of code should not affect the result of what this test is supposed to be checking for, and it has no side effects significant enough to justify leaving it out.

[1] https://github.com/scikit-learn/scikit-learn/blob/96a02f3/sklearn/utils/testing.py#L667
[2] http://scikit-learn.org/stable/developers/utilities.html#validation-tools

@amueller (Member)

What test is that? It should probably be cloned unless it's testing that consecutive fits lead to identical results.

If an estimator has already been initialized and fitted, we should not expect that changing the initial random state leads to re-initializing every single internal variable, including the estimator's RandomState object (used as per the guidelines for developers[2]).

No, that's not true. We should expect that entirely. That is what the dev guide says. partial_fit is a different story.

@amueller (Member)

Yes, it has side effects: it clones the estimator in a helper function that is unrelated to cloning an estimator.

@omtinez (Author) commented Apr 30, 2018

What test is that? It should probably be cloned unless it's testing that consecutive fits lead to identical results.

The test is check_supervised_y_2d. My interpretation of the test is that it is mostly checking for dimensionality of inputs/outputs and data conversion warnings.

We should expect that entirely

I think that it's not a reasonable expectation for estimators to completely re-initialize everything upon a call to fit(), especially for things related to random state, and considering the lack of coverage for partial_fit() scenarios. I think a much more reasonable expectation is for an estimator to require very few things in fit() in addition to what may be done in partial_fit(). Personally, I think that set_random_state(estimator) should also clone the estimator, but the scope of that change is probably larger, and I have not seen any issues in any of the other testing functions besides the one in question.

yes it has side-effects

I respectfully disagree. clone(estimator) is widely used across many of the functions not related to cloning, including this same function just a few lines before my suggested change, and cloning at this particular step should not affect the result of this test for any existing estimator that already passes it.

@amueller (Member)

I think that it's not a reasonable expectation for estimators to completely re-initialize everything upon a call to fit()

Why? This is not the approach scikit-learn is taking. As you can see in the development doc, fit should completely reset an estimator.

@amueller (Member)

Sorry, I misunderstood the proposal; this is actually fine. I thought you wanted to modify the helper multioutput_estimator_convert_y_2d, not the test. This indeed has no side effect. But it also shouldn't be necessary.

@amueller (Member)

See http://scikit-learn.org/stable/developers/contributing.html#random-numbers, which says:

The reason for this setup is reproducibility: when an estimator is fit twice to the same data, it should produce an identical model both times, hence the validation in fit, not init.
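
As a small aside, the validation referred to is typically done with check_random_state, which the developer guidelines recommend calling in fit; deriving a fresh generator from the stored seed on every fit (rather than caching it) is what makes two fits on the same data identical. A minimal illustration:

import numpy as np
from sklearn.utils import check_random_state

# check_random_state turns the stored seed into a fresh RandomState each time
# it is called, so deriving the generator inside fit() (instead of caching it)
# makes a second fit on the same data reproduce the first one exactly
rng_first_fit = check_random_state(42)
rng_second_fit = check_random_state(42)
print(np.allclose(rng_first_fit.rand(5), rng_second_fit.rand(5)))  # True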

@jnothman (Member) commented Apr 30, 2018 via email

@omtinez (Author) commented May 1, 2018

Thanks for the pointers, @amueller. I understand the technical reasons behind the instantiation model (even though it makes building custom estimators a bit painful), and I think that some of those conceptual models break down when you introduce partial_fit() into the picture.

I simply think that having to check, on every call to fit(), whether each parameter that can be passed to set_params() has changed makes life unnecessarily hard for developers of custom estimators. And, even though I don't agree with it, if that is decided to be the way forward, that behavior should not be tested as part of this function; rather, a separate function should test specifically for it.

@amueller (Member) commented May 1, 2018

@omtinez why does it make creating models painful? And why would it break if you introduce partial_fit? We have very well-defined semantics.

It's relatively simple if you adhere to the following rules: don't do any work in __init__ beyond storing the parameters, and don't modify those parameters in fit.
It's really not that hard. You never actually need to check whether anything has changed. And that is definitely the way it has been for years, and I don't see why we should change these semantics now.

@jnothman I thought we did. If this is the only test that breaks, then maybe we don't? I can't see a test doing that.

@omtinez can you maybe provide your estimator's code? It's quite strange that it breaks only in this place; it probably violates the sklearn API in several ways.

@omtinez (Author) commented May 3, 2018

@amueller the complaint about making things painful is, I think, pretty difficult to address, because many of those decisions stem from the technical choices made around set_params. To be a little less vague: it goes against common use of Python as a language, and some IDEs and linters will even warn you when you do things that sklearn actually requires, such as not initializing (or even declaring) in __init__() any of the variables set during fit(). But that's a much longer discussion and not as low-hanging fruit as this PR, so let's take it one step at a time :-)

To illustrate the point I'm trying to make, the following code would fail this test:

def __init__(self, random_state=0):
    self.random_state = random_state

def fit(self, X, y):
    if not hasattr(self, 'random_number_generator_'):
        # I can't use util `check_random_state` here because reasons
        self.random_number_generator_ = ...
    ...

Instead, if I understand the problem correctly, this is what actually needs to happen:

def __init__(self, random_state=0):
    self.random_state = random_state
    self._prev_random_state = random_state

def fit(self, X, y):
    if self._prev_random_state != self.random_state:
        delattr(self, 'random_number_generator_')
    if not hasattr(self, 'random_number_generator_'):
        # I can't use util `check_random_state` here because reasons
        self.random_number_generator_ = ...
        self._prev_random_state = self.random_state
    ...

All that said, I think the biggest issue here is that the reason the test fails has nothing to do with any of this. This test is supposed to check whether or not a specific warning is thrown. My fix is a single line of code that does not affect the result of what the test was actually designed to check, and it should not make any previously passing estimator fail. I surely hope not every single line of code comes under this much scrutiny, otherwise timely progress is doomed :-)

@omtinez (Author) commented May 3, 2018

For a more complete example of an estimator that should be able to pass this test (although it fails quite a few others, which are ignored in order to keep the code and the idea easier to understand), please see the sample estimator in #10977.

@jnothman (Member) commented May 3, 2018 via email

@omtinez (Author) commented May 3, 2018

Mainly performance reasons. This makes more sense in the context of partial_fit(), but I can also imagine how an estimator would try to prevent e.g. loading a very large file multiple times regardless of partial_fit().

I think an estimator should be expected to return the same predictions after clone() followed by fit() for reproducibility reasons. I don't think that we should expect estimators to be identical if there's no clone() between fits. It makes writing estimators that try to reuse code between fit() and partial_fit() unnecessarily complex, and adds no real value since that's what clone() is for. I am currently developing a handful of estimators and this is the only test that fails when I use the exact same function for fit() and partial_fit(), which is very unfortunate.

Even if you still disagree with all of the above, it should not be tested in this function. It is bad unit-testing practice, and it took me some time to track down where the issue was coming from as an estimator developer.

Edited: accidentally said fit_predict instead of partial_fit above

@amueller (Member) commented May 20, 2018

and adds no real value since that's what clone() is for

No, not really. Users are never really expected to use clone.

The code example you gave should just be

def fit(self, X, y):
    self.random_number_generator_ = ...
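
Spelled out as a complete estimator (a hedged counterpart to the earlier caching sketch; the class and attribute names are illustrative), that approach makes consecutive fits reproducible without any hasattr check:

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils import check_random_state

class FreshRNGRegressor(BaseEstimator, RegressorMixin):
    """Hypothetical estimator that re-derives its RNG on every fit."""

    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit(self, X, y):
        # every fit starts from the stored seed, so no state leaks between fits
        self.random_number_generator_ = check_random_state(self.random_state)
        self.coef_ = self.random_number_generator_.uniform(size=np.asarray(X).shape[1])
        return self

    def predict(self, X):
        return np.asarray(X) @ self.coef_

X, y = np.ones((4, 2)), np.zeros(4)
est = FreshRNGRegressor(random_state=0)
first = est.fit(X, y).predict(X)
second = est.fit(X, y).predict(X)
print(np.allclose(first, second))  # True: refitting reproduces the same model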

There should never be a reason to do hasattr in fit. Can you give a concrete example of where our current API makes things tricky?

I'd take a PR that ensures this is tested somewhere else and then removes the test here.

@omtinez (Author) commented May 20, 2018

The current API makes things tricky for two reasons:

  1. Not initializing or even declaring anything in __init__ goes against common Python usage and triggers several warnings in linters and IDEs.
  2. Using hasattr in partial_fit is perfectly reasonable. Saying that there is no use for hasattr in fit means that all estimators must have separate fit and partial_fit implementations.

I think the ML community will find, more and more frequently, the need to work with estimators that can take batches of data, i.e. estimators that implement partial_fit-like methods. This is easy to do in TensorFlow and PyTorch. Forcing estimators to have separate fit and partial_fit implementations for scikit-learn adds friction in exchange for, in my opinion, very little value.

The tests provided by sklearn.utils.estimator_checks.check_estimator are very useful, and every single one of the tests can be used for estimators that implement partial_fit except for check_supervised_y_2d (which, had it been correctly written, would be testing only one thing and not fail in this scenario). Fixing this but adding another test that prevents estimators from sharing a common fit/partial_fit implementation makes check_estimator not useful again.

Not making things easier for developers of estimators results in many libraries having scikit-like APIs but non-compliant implementations (as per check_estimator). If we could be a little more flexible, then different libraries could all be tested against the same standard and we would see a lot more "scikit-compliant estimators". I'm more than happy to contribute towards that and make this testing suite more robust and inclusive with PRs like this one and, if you are open to it, more to come. I'd love to see scikit-learn able to provide the plumbing for all other ML libraries out there, so everyone benefits from one common API.

@amueller (Member)

I think there is little merit in discussing 1) at this point.

I don't understand 2). What do you mean by sharing an implementation of fit and partial_fit? That makes no sense, as they have very different semantics. You can call fit on one dataset, and then fit on another dataset, and the second fit will be completely independent of the first. If you do the same with partial_fit, it will raise an error.
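
For what it's worth, the difference is easy to see with any estimator that implements both methods; a quick sketch using SGDClassifier (the data is made up):

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X3, X5 = rng.rand(10, 3), rng.rand(10, 5)     # datasets with different n_features
y = np.array([0, 1] * 5)

est = SGDClassifier(random_state=0)
est.fit(X3, y)
est.fit(X5, y)                                # fine: the second fit is independent

est = SGDClassifier(random_state=0)
est.partial_fit(X3, y, classes=[0, 1])
try:
    est.partial_fit(X5, y)                    # partial_fit keeps state between calls...
except ValueError as exc:
    print(exc)                                # ...so the mismatched feature count is rejected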

Fixing this but adding another test that prevents estimators from sharing a common fit/partial_fit implementation makes check_estimator not useful again.

There are two separate issues here: the organization of the common tests, and the API conventions. The API conventions right now are very clear. You argue they are not good; we can talk about that. But they are very clear right now, and the common tests enforce these conventions.

Right now this test tests two things. I agree, that's not great.
The reason the other tests pass is that it's very hard to write a test that guarantees that consecutive calls to fit are independent. I'm happy to improve the tests to enforce the current API rules more clearly. But that's a completely separate discussion from changing those rules.

@amueller (Member) commented May 20, 2018

And with respect to "friction": what we are talking about here is having a method, call it initialize_estimator, that is called conditionally in partial_fit and unconditionally in fit.
The standard way to do this is:

def _fit(self, X, y):
    # actual fitting code
    pass

def _initialize_estimator(self):
    # initialize attributes
    pass

def fit(self, X, y):
    self._initialize_estimator()
    self._fit(X, y)
    return self

def partial_fit(self, X, y, classes):
    if self._partial_fit_first_call():   # placeholder first-call check
        self._initialize_estimator()
    self._fit(X, y)
    return self

If this is a very common pattern for you, create a base class with that code; done.
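
For example, such a base class could look roughly like this (a sketch under the assumptions above; the hook names mirror the snippet and are not part of the sklearn API, and the classes argument is omitted for simplicity):

class SharedFitMixin:
    """Hypothetical mixin so subclasses only implement _initialize_estimator and _fit."""

    def fit(self, X, y):
        # fit always starts from a clean slate
        self._initialize_estimator()
        self._fit(X, y)
        return self

    def partial_fit(self, X, y):
        # initialize only on the first call; later calls keep the accumulated state
        if not getattr(self, '_initialized', False):
            self._initialize_estimator()
            self._initialized = True
        self._fit(X, y)
        return self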

@amueller (Member) commented Aug 5, 2019

I think this can be closed as we merged #12328. If you disagree, feel free to comment and reopen.

@amueller closed this Aug 5, 2019
Linked issue: Test - sklearn/utils/estimator_checks.py::check_supervised_y_2d does not clone estimator