check_estimator is stricter than what is stated in the Estimator API doc #16241

ogrisel opened this issue Jan 27, 2020 · 30 comments · Fixed by #17361
@ogrisel
Member

ogrisel commented Jan 27, 2020

This issue was raised in a discussion regarding the LightGBM scikit-learn compatible estimators: microsoft/LightGBM#2628 (comment)

The problem is that check_estimator complains about private attributes set in the __init__ of a scikit-learn compatible estimator, while our documentation only states the following (and does not explicitly prohibit setting private attributes in __init__):

The arguments accepted by __init__ should all be keyword arguments with a default value. In other words, a user should be able to instantiate an estimator without passing any arguments to it. The arguments should all correspond to hyperparameters describing the model or the optimisation problem the estimator tries to solve. These initial arguments (or parameters) are always remembered by the estimator. Also note that they should not be documented under the “Attributes” section, but rather under the “Parameters” section for that estimator.

In addition, every keyword argument accepted by __init__ should correspond to an attribute on the instance. Scikit-learn relies on this to find the relevant attributes to set on an estimator when doing model selection.

from: https://scikit-learn.org/0.22/developers/develop.html#instantiation
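The instantiation contract quoted above can be sketched with a minimal estimator. The class and parameter names below are invented for illustration and are not part of scikit-learn:

```python
class MinimalEstimator:
    def __init__(self, alpha=1.0, max_iter=100):
        # Each keyword argument is stored verbatim under the same name;
        # no validation and no derived state in __init__.
        self.alpha = alpha
        self.max_iter = max_iter

    def get_params(self, deep=True):
        # get_params can simply read back the stored attributes.
        return {"alpha": self.alpha, "max_iter": self.max_iter}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

est = MinimalEstimator()     # instantiable with no arguments
est.set_params(alpha=0.5)    # model selection relies on this mapping
```

Because every `__init__` argument maps to an attribute of the same name, meta-estimators like GridSearchCV can introspect and clone the estimator generically.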

The strict check is defined in sklearn.utils.estimator_checks.check_no_attributes_set_in_init.

@ogrisel
Member Author

ogrisel commented Jan 27, 2020

I think that setting private attributes in __init__ is fine as long as the public attributes defined as constructor params (that is, the model hyperparameters) can be read and set at any moment, either via getattr / setattr or via get_params / set_params, and subsequent calls to fit / predict / transform automatically take those changes into account.
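A minimal sketch of what this comment describes, with hypothetical names: a private attribute is set in __init__, while fit reads the public hyperparameter at call time, so changes made via setattr between calls are honoured:

```python
class CachedEstimator:
    def __init__(self, alpha=1.0):
        self.alpha = alpha   # public hyperparameter, stored verbatim
        self._cache = {}     # private attribute set in __init__

    def fit(self, X):
        # Reads self.alpha at call time, so changes made via setattr or
        # set_params between calls are automatically taken into account.
        self._cache["scaled"] = [self.alpha * x for x in X]
        return self

est = CachedEstimator()
est.fit([1, 2])
est.alpha = 2.0   # update the hyperparameter after construction
est.fit([1, 2])   # a refit picks up the new value
```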

@ogrisel ogrisel changed the title check_estimator is stricter that what is stated on Estimator API doc check_estimator is stricter that what is stated in the Estimator API doc Jan 27, 2020
@ogrisel
Member Author

ogrisel commented Jan 27, 2020

Arguably it's hard to write an automated tool that would check what I described above in #16241 (comment), but maybe it's fine if check_estimator does not check every single detail of what we write in the API contract document.

@jnothman
Member

jnothman commented Jan 27, 2020 via email

@thomasjpfan
Member

While we figure out a solution to this, should we remove check_no_attributes_set_in_init from check_estimator and only run it for our estimators?

@jnothman
Member

jnothman commented Jan 30, 2020 via email

@StrikerRUS
Contributor

Hello guys!

Is there any progress on this? Will it be possible to resolve this issue by next release?

@segatrade

Thanks for preparing 0.22+ support :)

@jnothman
Member

jnothman commented Mar 4, 2020 via email

@h-vetinari

+1 to prioritizing this. Fixing unintentional breakage of downstream packages should rank pretty high on the list of todos, IMHO.

@rth
Member

rth commented Mar 26, 2020

@h-vetinari As mentioned above, a PR for the strict option for check_estimator would be very welcome. I would be happy to review it if someone makes it. There was some related work in #14929 that is now stalled.

@h-vetinari

@rth
Thanks for the quick response. While I understand how open source development works, and how swamped everyone is, I think that breaking downstream users should be considered a regression and receive higher priority than "we'll accept PRs".

I do not know the code-base, so I don't know where to start or what would be necessary. However, I'd be willing to try with some pointers of what needs to be done.

@rth
Member

rth commented Mar 26, 2020

Thanks for the quick response. While I understand how open source development works, and how swamped everyone is, I think that breaking downstream users should be considered a regression and receive higher priority than "we'll accept PRs".

Can you explain how this is breaking workflows for downstream users?

In my understanding, the issue is that running check_estimator on third-party estimators can produce false positives (i.e. incorrect errors). Running check_estimator is certainly encouraged by maintainers but in no way required to use those third-party libraries. If a third-party estimator breaks in some way when used in combination with scikit-learn meta-estimators (e.g. GridSearchCV), that means it likely needs fixing, independently of this issue. So for me this is certainly an annoying issue, but in no way critical, nor a regression. Unless I'm missing something.

I do not know the code-base, so I don't know where to start or what would be necessary. However, I'd be willing to try with some pointers of what needs to be done.

The general idea is to add a strict=False parameter to the check_estimator function. You then run check_estimator(estimator, strict=False) on your estimator, see where the errors occur (those you consider incorrect), then pass this flag down to the relevant check (in the issue description it is check_no_attributes_set_in_init), and only run the check (or the parts of the check) in question if strict=True (or adjust the check in general to work for your estimator when applicable). See also the discussion in #13969.
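A rough sketch of how such a strict flag could be threaded down to an individual check. The function bodies here are simplified stand-ins, not the real scikit-learn implementations:

```python
import inspect

def check_no_attributes_set_in_init(name, estimator, strict=True):
    if not strict:
        return  # lenient mode: skip this check entirely
    init_params = set(
        inspect.signature(type(estimator).__init__).parameters
    ) - {"self"}
    # Private attributes (leading underscore) are tolerated either way.
    extra = {a for a in vars(estimator) if not a.startswith("_")} - init_params
    assert not extra, f"{name} sets attributes not in __init__: {extra}"

def check_estimator(estimator, strict=True):
    # The real function runs many checks; one suffices for the sketch.
    for check in (check_no_attributes_set_in_init,):
        check(type(estimator).__name__, estimator, strict=strict)

class LooseEstimator:
    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.fitted = False  # extra public attribute: fails the strict check

check_estimator(LooseEstimator(), strict=False)  # passes in lenient mode
```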

@rth
Member

rth commented Mar 26, 2020

Ahh, I see the discussion is from conda-forge/lightgbm-feedstock#21. You can just skip the failing test, IMO.
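Skipping a single failing check could look roughly like this if a project drives the checks itself. The check functions and the SKIPPED set below are stand-ins for illustration, not scikit-learn API:

```python
# Names of common checks this project chooses to opt out of.
SKIPPED = {"check_no_attributes_set_in_init"}

def check_fit_returns_self(estimator):
    assert estimator.fit([1, 2]) is estimator

def check_no_attributes_set_in_init(estimator):
    # Stand-in for the check that is known to fail for this estimator.
    raise AssertionError("pretend this check fails for our estimator")

def run_checks(estimator, checks):
    results = {}
    for check in checks:
        if check.__name__ in SKIPPED:
            results[check.__name__] = "skipped"
            continue
        check(estimator)
        results[check.__name__] = "passed"
    return results

class Est:
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X):
        return self

results = run_checks(Est(), [check_fit_returns_self,
                             check_no_attributes_set_in_init])
```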

@h-vetinari

@rth: Can you explain how this is breaking workflows for downstream users?

I'm just a middleman here (or rather an end-user, trying to minimize the risk of silent breakage in my projects). From what I can make out of this comment from @StrikerRUS in microsoft/LightGBM#2628, I think it's that setting private attributes in __init__ is relied upon by lightgbm and is no longer compatible with 0.22 (and this limitation is not explicitly documented either).

I can't judge the extent of the breakage, but it seems the lightgbm-folks feel strongly enough about it that they felt the need to preemptively state in their docs:

The last supported version of scikit-learn is 0.21.3. Our estimators are incompatible with newer versions.

I'm just trying to highlight this issue again.

@h-vetinari

PS. Thanks for taking a look at it again, it's not taken for granted. I cannot promise anything regarding a PR, but I'll try to look into it.

@adrinjalali
Member

Thanks for the quick response. While I understand how open source development works, and how swamped everyone is, I think that breaking downstream users should be considered a regression and receive higher priority than "we'll accept PRs".

We have tried to be accommodating to the capacity we have, regarding that specific lightgbm issue. Things take time and hopefully soon they'll be fixed. But as you say, it's an open source community with people volunteering their time to work on it. If people really need to use sklearn AND lightgbm, they can use the previous version. And if people really think such regressions "should" be considered high priority, they or their employer could support its development in one way or another, other than "why aren't you fixing it". I'd really appreciate if you could avoid such a tone here.

@rth
Member

rth commented Mar 26, 2020

No, it's fine @adrinjalali; there has indeed been some miscommunication between this issue and microsoft/LightGBM#2628 which made it sound like scikit-learn 0.22 introduced a regression that prevents it from being used with lightgbm. There was a regression, but it was fixed; this issue is benign as far as I can tell, and lightgbm should just ignore the failing common test as I explained in microsoft/LightGBM#2628 (comment). (As a side note, I finally understand why so many users wanted this obscure and non-critical change applied.)

Thanks for pinging us on this @h-vetinari!

@h-vetinari

@adrinjalali: I'd really appreciate if you could avoid such a tone here.

Tone unfortunately doesn't transmit well in a written medium. I'm sorry if it came across as demanding, I was in fact trying to negate this with the caveat (i.e. just calmly saying that the perceived priority should IMO be higher, regardless of how quickly that translates into available resources). And I did offer to look into it myself, although I'd be starting from scratch (aside from the kind pointers by @rth). Hope that clears the air a bit, so to speak.

@lorentzenchr lorentzenchr changed the title check_estimator is stricter that what is stated in the Estimator API doc check_estimator is stricter than what is stated in the Estimator API doc Oct 26, 2020
@ogrisel ogrisel modified the milestones: 0.24, 1.0 Dec 2, 2020
@ogrisel
Member Author

ogrisel commented Dec 2, 2020

Moving milestone to the next major version (which is currently named 1.0).

@adrinjalali
Member

Moving to 1.1, but happy to have this in if @glemaitre 's PR gets in fast.

@jeremiedbb
Member

Very unlikely to be part of 1.1, moving to 1.2.

@betatim
Member

betatim commented May 7, 2024

A lot of the discussion in this issue is around setting private attributes in __init__. However, the current version of check_no_attributes_set_in_init already filters out attributes whose names start with an underscore (_).

So maybe we need an update regarding the problem that needs solving.
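To illustrate the filtering mentioned above (a simplified stand-in for the real check, not its actual code): attribute names with a leading underscore are excluded before comparison, so a private attribute set in __init__ is not flagged:

```python
def public_attrs(obj):
    # Simplified stand-in for the filtering in the real check: drop
    # attribute names with a leading underscore before comparing them
    # against the __init__ parameter names.
    return {a for a in vars(obj) if not a.startswith("_")}

class Est:
    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self._cache = None  # private: filtered out, hence not flagged

attrs = public_attrs(Est())
```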
