FEA Callbacks base infrastructure + progress bars #27663
base: callbacks

Conversation
LGTM on my side. @rth, please ignore all the lock file changes; they don't require any attention.
Very sorry for the response time @jeremiedbb. Hopefully I have fixed my email filters to see mentions from scikit-learn.

I read it through once, and the main thing I don't understand is how you compute the `levels` dict. I would have expected `build_computation_tree` to build the computation tree, but it actually takes `levels` as a parameter, which is already some form of that tree.

Say you have a ColumnTransformer with 2 pipelines inside, the first with 2 steps, the other with 3 steps. How do you compute the computation tree in that case?
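For concreteness, here is a sketch of the kind of `levels` input I mean, using the `descr`/`max_iter` keys documented later in the diff (the values are made up for illustration, not taken from this PR):

```python
# Hypothetical levels description for a single estimator: an outer fit
# loop of 10 iterations whose inner level is a leaf (max_iter=None).
# This is roughly the shape build_computation_tree seems to expect.
levels = [
    {"descr": "fit", "max_iter": 10},
    {"descr": "iteration", "max_iter": None},
]
```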
sklearn/base.py
Outdated
@@ -115,6 +116,10 @@ def _clone_parametrized(estimator, *, safe=True):

    params_set = new_object.get_params(deep=False)

    # attach callbacks to the new estimator
    if hasattr(estimator, "_callbacks"):
BTW, this might break something for https://github.com/microsoft/FLAML/blob/0415638dd1e1d3149fb17fb8760520af975d16f6/flaml/automl/model.py#L1587, which also adds this attribute to scikit-learn's base estimator in their library. But there are no reserved private namespaces, so they could probably adapt if it does.
This can also break quite easily if the callback object keeps references to attributes of the old estimator. Why aren't we creating a copy here?
> BTW, this might break something for https://github.com/microsoft/FLAML/blob/0415638dd1e1d3149fb17fb8760520af975d16f6/flaml/automl/model.py#L1587 which also adds this attribute to scikit-learn's base estimator in their library

I can change the name to `_sk_callbacks` or any derivative of that.
> Why aren't we creating a copy here?

The motivation is a use case like this:

```python
monitoring = Monitoring()
lr = LogisticRegression()._set_callbacks(monitoring)
GridSearchCV(lr, param_grid).fit(X, y)
monitoring.plot()
```

The monitoring callback will gather information across all copies of the logistic regression made in the grid search. If we made a copy of the callback in `clone`, then we couldn't retrieve any information once the grid search is finished.
The object itself can disable copying by implementing `__copy__` and `__deepcopy__`; then they'd be in the space of "we know what we're doing" and we don't need to worry about it.

I think the kind of state you're referring to here is something which can live outside the callback object, like a file / a database / an external singleton object: the callback method just writes into that storage, and at the end one can use that data to plot/investigate/etc.
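A minimal sketch of that opt-out, with a hypothetical callback class (the names here are assumptions for illustration, not code from this PR):

```python
import copy

class SharedStateCallback:
    # Hypothetical callback that opts out of copying, so that every
    # clone of the estimator keeps a reference to the same instance.
    def __init__(self):
        self.records = []

    def __copy__(self):
        return self

    def __deepcopy__(self, memo):
        return self

cb = SharedStateCallback()
assert copy.deepcopy(cb) is cb  # the "copy" is the same object
```

Note this only helps for in-process copies; clones made in subprocesses go through pickling and would still end up as separate objects.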
> I think the kind of state you're referring to here is something which can live outside the callback object, like a file / a database / an external singleton object

Good point, and actually in my latest design of the callbacks I don't rely on a shared state, so you can ignore my previous comment :) We can't get around copies anyway, since the clones can happen in subprocesses (in a grid search for instance). I updated the PR to clone the callbacks like any other param.
sklearn/base.py
Outdated
    """
    if hasattr(sub_estimator, "_callbacks") and any(
        callback.auto_propagate for callback in sub_estimator._callbacks
    ):
What happens if you call this method twice? Wouldn't this be raised on the second run? If so, it feels a bit fragile. What's the harm of not checking this?
It should never be called twice: the meta-estimator only propagates its callbacks once to its sub-estimator(s).

This check is important to give an informative error message if a user tries something like:

```python
lr = LogisticRegression()._set_callbacks(ProgressBar)
GridSearchCV(lr, param_grid)
```

It would crash without telling the user the right way to do it.
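Presumably the intended usage is then to set auto-propagated callbacks on the meta-estimator itself, along these lines (a sketch based on the `_set_callbacks` API shown above, not a confirmed snippet from this PR):

```python
# Set the callback on the meta-estimator; it is propagated once to the
# sub-estimators during fit.
search = GridSearchCV(LogisticRegression(), param_grid)._set_callbacks(ProgressBar())
search.fit(X, y)
```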
- descr: str
    A description of the level
- max_iter: int or None
    The number of its children. None means it's a leaf.
Wait, so `max_iter` has nothing to do with the number of iterations of an algorithm. Why is it called that way then?

Also, if it's expected to have only a few keys, typing-wise it might have been better to make it a dataclass or something rather than a dict, but no strong opinion.
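For illustration, the suggested typed alternative could look like this (a sketch; `ComputationLevel` is a hypothetical name, not code from this PR):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ComputationLevel:
    descr: str                      # description of the level
    max_iter: Optional[int] = None  # number of children; None means leaf
```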
A very shallow review.
    try:
        return fit_method(estimator, *args, **kwargs)
    finally:
        estimator._eval_callbacks_on_fit_end()
This is becoming larger than just a validation wrapper. We could simplify the debugging and the magic by having a `BaseEstimator.fit` which calls `self._fit(...)` and does all the common stuff before and after. That seems a lot easier to understand and debug.
The motivation when I introduced `_fit_context` was not only validation but having a generic context manager to handle everything we need to do before and after fit. That's why I gave it this generic name.

Although having a consistent framework where we'd have a `BaseEstimator.fit` and every child estimator implements `_fit` is appealing, I think it goes far beyond the scope of this PR and would require rewriting a lot of estimators.
Btw, do you know why `BaseEstimator` does not implement `fit` in the first place?

Also, note that `_fit_context` also handles `partial_fit`, but I don't think we want `BaseEstimator` to implement `partial_fit`.
`BaseEstimator` doesn't implement `fit` because we don't generally have methods which raise `NotImplementedError`; they're simply not there. But now that we have all this work, we can certainly have it in `BaseEstimator`, and have children only implement a `__sklearn_fit__` kind of method instead.
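A rough sketch of that proposed pattern, to make the idea concrete (apart from `_eval_callbacks_on_fit_end`, which appears in this PR, the hook names are assumptions for illustration, not the actual implementation):

```python
class BaseEstimator:
    def _eval_callbacks_on_fit_begin(self):
        ...  # hypothetical pre-fit hook

    def _eval_callbacks_on_fit_end(self):
        ...  # post-fit hook, as used by _fit_context in this PR

    def fit(self, X, y=None):
        # Common pre-fit work (validation, callback setup, ...) lives here.
        self._eval_callbacks_on_fit_begin()
        try:
            # Children implement only the estimator-specific part.
            return self.__sklearn_fit__(X, y)
        finally:
            # Common post-fit work, mirroring what _fit_context does today.
            self._eval_callbacks_on_fit_end()

class MyEstimator(BaseEstimator):
    def __sklearn_fit__(self, X, y=None):
        ...  # estimator-specific fitting logic
        return self
```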
I still think it's outside the scope of this PR. Using the existing context manager is just a 1-line addition, whereas implementing `__sklearn_fit__` means countless PRs :)
+1 to merge this. I personally find it would be more valuable to mark it as experimental (it's private anyway so far), let users use this for a version or two, aggregate their feedback, and iterate on the details in future versions if needed. Overall the API sounds reasonable to me. That seems better than approving a SLEP on this, only to realize afterwards that users would have preferred something else, or that there are some weird edge cases for some estimator that need special handling.
Note that this PR is targeting the `callbacks` branch.

I agree, and we already discussed that with @glemaitre. I plan to do that in a follow-up PR.
I still need to figure out the issue with the CI though.

Any news on this being merged?

We are currently working on the design and need an agreement among core developers.

Any news on this? It would be amazing if this were added.
Extracted from #22000
This PR implements a smaller portion of #22000, with only the base infrastructure for callbacks and the implementation of a single callback (progress bars). It targets the `callbacks` branch, and the goal is to have the full callback implementation done in several smaller PRs merged into this branch before we merge it into main.

This PR proposes significant changes compared to #22000 that should imo greatly improve its chances of getting merged 😄 The main improvement is that it no longer requires writing to disk, but instead relies on `multiprocessing.Manager`s and queues, which simplifies the code a lot.

In #22000 I adapted some estimators to work with the callbacks, which I did not include here to keep the PR as light as possible. You can however experiment with the callbacks on the estimators that I wrote for testing purposes. You can add `sleep` calls in these testing estimators to simulate how it changes when the computations take longer.

The plan is then to have several PRs to implement the other callbacks, adapt a few estimators to work with the callbacks, add some documentation and examples, add more tests...