Implications of FrozenEstimator on our API #29893

Open
thomasjpfan opened this issue Sep 19, 2024 · 15 comments
thomasjpfan (Member) commented Sep 19, 2024

With #29705, we have a simple way to freeze estimators, which means there is no need for cv="prefit". This also opens the door for #8350 to make Pipeline and FeatureUnion follow our conventions. This issue is to discuss the API implications of introducing FrozenEstimator. Here are the two I had in mind:

cv="prefit"

For the cv case, users pass a frozen estimator directly into cv:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.frozen import FrozenEstimator

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
frozen_rf = FrozenEstimator(rf)

calibration = CalibratedClassifierCV(frozen_rf)
calibration.fit(X_calib, y_calib)

Making this change lets us simplify the codebase by deprecating and eventually removing the cv="prefit" option.

compose.Pipeline

We introduce a new compose.Pipeline which follows our conventions with clone. (The current pipeline.Pipeline does not clone.)

from sklearn.compose import ColumnTransformer, Pipeline  # compose.Pipeline is the proposed class
from sklearn.frozen import FrozenEstimator
from sklearn.linear_model import LogisticRegression

prep = ColumnTransformer(...)
prep.fit(X_train, y_train)
frozen_prep = FrozenEstimator(prep)

pipe = Pipeline([frozen_prep, LogisticRegression()])

pipe.fit(X_another, y_another)

In both cases, I prefer the semantics of FrozenEstimator.

@github-actions github-actions bot added the Needs Triage Issue requires triage label Sep 19, 2024
@thomasjpfan thomasjpfan added API RFC and removed Needs Triage Issue requires triage labels Sep 19, 2024
adrinjalali (Member) commented

Was just talking with @glemaitre about this and we also very much agree with these points (or at least I thought Guillaume agrees 😅 ). I'd like to have the FrozenEstimator PR merged, and then open follow up PRs to deal with these implications.

This also deserves a @scikit-learn/core-devs ping.

glemaitre (Member) commented

In addition to cv="prefit", we also have the prefit=True option for SelectFromModel.

So in general, I think that using FrozenEstimator is the right way because:

  • it is explicit which estimator is frozen
  • it is less brittle with nested objects (e.g. a pipeline in a column transformer inside a cross-validation). We never really got this right in the past because it is easy to miss a clone somewhere.
  • it removes a parameter option that is probably only of interest to "advanced" users, reducing complexity a bit
  • with cv="prefit", the semantics of the parameter and value are not straightforward without reading the documentation (a CV is not really prefitted)
  • it simplifies the code internally because cloning is handled by FrozenEstimator and no longer by each meta-estimator.
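The cloning point can be illustrated with a minimal, self-contained sketch of the freezing idea (this is not sklearn's actual implementation; all names here are illustrative): a wrapper whose fit is a no-op and which delegates everything else to the already-fitted inner estimator, so meta-estimators can call fit freely without touching it.

```python
class FrozenSketch:
    """Illustrative stand-in for FrozenEstimator, not sklearn's code."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X=None, y=None):
        # Deliberately do nothing: the inner, already-fitted estimator
        # stays exactly as-is no matter who calls fit on the wrapper.
        return self

    def __getattr__(self, name):
        # Delegate predict, transform, fitted attributes, etc.
        # to the wrapped estimator.
        return getattr(self.estimator, name)


class AlreadyFitted:
    """Stand-in for an estimator that was fitted beforehand."""

    def predict(self, X):
        return [0 for _ in X]


frozen = FrozenSketch(AlreadyFitted())
frozen.fit([[1.0]], [1])               # no-op, inner estimator untouched
print(frozen.predict([[1.0], [2.0]]))  # delegated to the inner estimator
```

Because the wrapper owns the no-op fit, no meta-estimator needs its own special-case code for prefitted sub-estimators.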

To me there is one downside:

  • we require nesting an estimator inside a new meta-estimator. However, given that this is an advanced use case, I think the added complexity is acceptable for the kind of user who will reach for this pattern.

So given the pros and cons I could come up with, I'm very much in favor of introducing FrozenEstimator and deprecating cv="prefit" and prefit=True/False.

adam2392 (Member) commented

> To me there is one downside:
>
>   • we require nesting an estimator inside a new meta-estimator. However, given that this is an advanced use case, I think the added complexity is acceptable for the kind of user who will reach for this pattern.
>
> So given the pros and cons I could come up with, I'm very much in favor of introducing FrozenEstimator and deprecating cv="prefit" and prefit=True/False.

Would it be weird to add a freeze() / get_frozen() method to BaseEstimator, so that all estimators can effectively return a FrozenEstimator without having to import FrozenEstimator and do the nesting?

Perhaps it adds too much complexity and the user is confused because then there are multiple ways of "getting a frozen estimator"?

adrinjalali (Member) commented

I wouldn't mind a def get_frozen(self): return FrozenEstimator(self) indeed. But I think we can move forward with the rest, and decide if we want to add this nicety later. Good suggestion though.
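As a hypothetical, runnable sketch of that one-liner suggestion (get_frozen() is not an adopted sklearn API, and FrozenStub here stands in for sklearn's FrozenEstimator):

```python
class FrozenStub:
    """Stand-in for sklearn's FrozenEstimator in this sketch."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X=None, y=None):
        return self  # no-op: the wrapped estimator stays as-is


class EstimatorWithGetFrozen:
    """What the proposed one-liner on BaseEstimator would look like."""

    def get_frozen(self):
        return FrozenStub(self)


est = EstimatorWithGetFrozen()
frozen = est.get_frozen()
print(frozen.estimator is est)  # True: the frozen wrapper holds the estimator
```

The appeal is that users never import a freezing helper themselves; the method returns the wrapper directly.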

adrinjalali (Member) commented

@ogrisel you might like the get_frozen() idea here if you don't want users to have to do an extra import.

ogrisel (Member) commented Oct 30, 2024

> I wouldn't mind a def get_frozen(self): return FrozenEstimator(self) indeed.

No strong opinion on whether we need it or not, but if we introduce it, I would vote in favor of a top level sklearn.freeze to be consistent with our existing sklearn.clone.

ogrisel (Member) commented Oct 30, 2024

> @ogrisel you might like the get_frozen() idea here if you don't want users to have to do an extra import.

You still need to import it (unless you already imported the top level sklearn for other things).

EDIT: this comment is wrong.

ogrisel (Member) commented Oct 30, 2024

Let's finalize the reviews of #30171 and #30172 first, and then let's open a PR to explore how sklearn.freeze (or sklearn.frozen.get_frozen) would feel in the examples.

ogrisel (Member) commented Oct 30, 2024

Oh, I misread the proposals. I did not realize the proposal above was about introducing a method instead of a function... CalibratedClassifierCV(model.freeze()).fit(X_cal, y_cal) would feel right to me. But it expands the common public API for scikit-learn estimators, possibly also for third-party estimator developers.

adrinjalali (Member) commented

For third-party estimators that inherit from BaseEstimator (which should be all of them at this point given our new requirements), they don't really need to do anything.
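The inheritance argument can be sketched in a few lines (illustrative names only, not sklearn's actual classes; the tuple stands in for wrapping with FrozenEstimator):

```python
class Base:
    """Stand-in for BaseEstimator if it grew a freezing method."""

    def get_frozen(self):
        # Subclasses inherit this automatically; the tuple stands in
        # for returning FrozenEstimator(self).
        return ("frozen", self)


class ThirdPartyEstimator(Base):
    pass  # no extra code needed by the third-party author


tag, inner = ThirdPartyEstimator().get_frozen()
print(tag)  # frozen
```

Any estimator down the inheritance chain picks up the method for free, which is why third-party authors would not have to change anything.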

glemaitre (Member) commented

I feel that a function rather than a method is more natural given the current state of our code base. It would stand nicely next to clone.

I would find it strange if cloning and freezing were not aligned in this regard (i.e. function vs. method).

adrinjalali (Member) commented

I would be more in favor of adding a clone to BaseEstimator rather than putting freeze beside clone 😁 .

I'd be somewhat against a sklearn.base.freeze as opposed to BaseEstimator.freeze, since the former requires the user to do an extra import anyway. And once we require an extra import, it could just be FrozenEstimator itself.

betatim (Member) commented Oct 30, 2024

My intuition is also that a standalone function would be nicer than adding a method.

adrinjalali (Member) commented Oct 30, 2024

If we do a function, in effect, we're changing

(1)

from sklearn.frozen import FrozenEstimator
MetaEstimator(estimator=FrozenEstimator(subestimator))

or

import sklearn
MetaEstimator(estimator=sklearn.frozen.FrozenEstimator(subestimator))

to

(2)

from sklearn import freeze
MetaEstimator(estimator=freeze(subestimator))

or

import sklearn
MetaEstimator(estimator=sklearn.freeze(subestimator))

while having a method would mean

(3)

MetaEstimator(estimator=subestimator.freeze())

To me, (1) and (2) are not different enough to warrant the function, while (3) is a nicer enough experience that it's worth doing.

glemaitre (Member) commented

> I would be more in favor of adding a clone to BaseEstimator rather than putting freeze beside clone 😁.

Since this alternative avoids an import, and looking at #30171 from a usability standpoint, I would agree that a method might be nicer to use.
