[WIP] Estimator.iter_fits for efficient parameter search #2000
Conversation
Also, univariate_selection knows which parameters don't require refit
I would like to stress the need to avoid method proliferation. Every […] More sophistication can appear to hide sophistication from users, by […] For all the reasons above, I would like to raise the question: can the […] Also, it is important to weigh the gains with the costs. A reduction by […] Finally, the way I like to work is usually to work as hard as I can with […] I hate sending a long email like this: you have been incredibly active, […]
Thanks for your comments, Gaël. As far as this PR goes, I mostly wanted to record some thoughts related to the implementation I hacked up, without suggesting it was the best solution.

You are worried about adding another method. I don't think this method needs to be public, nor does it frequently need a specialised implementation.

You are probably right that a lot of this can be done by clever memoization, with some additional benefits. But it does not handle warm starts that require a particular ordering. In any case, it requires classes to annotate the parameters that should be used as keys, much as in this model. Still, I concur, it would be a much cleaner, simpler interface if […]

Finally, I appreciate -- and did so before your comment -- that I have tended dangerously towards complexity in a few instances, and I hope to learn to temper it. I have been interested in playing with some of the problems in the parameter search space, because they are broadly applicable to learning work; however, they don't fit neatly inside the Estimator API.
PS: I don't think this would add sophistication for the end-user; none for the implementer of a new model, and marginally more (the addition of an attribute) if they wanted an estimator that didn't need to refit for some parameter changes.
I completely agree that parameter search is critically important. I just […] We are doing the same in my lab, but we have always found that designing […]
But it would make the GridSearchCV code even more complex, and thus […]
Actually, the SearchCV code is hardly affected as long as no fussy parallelisation is needed. Alternative search strategies will end up treating a search as a sequence of smaller searches. In that case, memoization is a far superior approach. To this extent, I think you're absolutely right that it's the approach that should be taken to this problem, and I'm tempted to close this PR as a false direction.
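For concreteness, the memoization route could look something like the sketch below, using `joblib.Memory` to cache fitted models keyed on their inputs. This is an illustration, not an agreed design; the cache location and the `fit_model` helper are arbitrary choices of mine:

```python
from joblib import Memory
from sklearn.linear_model import ElasticNet

# Any persistent directory works; '/tmp/sklearn_cache' is arbitrary.
memory = Memory('/tmp/sklearn_cache', verbose=0)

@memory.cache
def fit_model(params, X, y):
    # A repeated call with previously seen (params, X, y) loads the
    # fitted model from disk rather than refitting it.
    return ElasticNet(**params).fit(X, y)
```

As noted earlier in the discussion, this avoids recomputation but does not capture warm starts that depend on visiting parameter settings in a particular order.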
(And I should point out that the main contribution I've made out of a real need in my work was my first, which still hasn't been merged... perhaps because it's too complex.)
I don't think that it should be closed, but maybe tagged with a 'CLOSED' […]
But for that reason it is related to Andy's issue on the subject. It's always possible to come back to, and to reopen. Thanks again for a great alternative idea.
This implements an approach to generalised cross validation (#1626) by allowing each estimator to control its fitting over a sequence of parameter settings.
`estimator.iter_fits` returns an iterator (usually a generator) over parameters and models fit with those parameters. This PR is intended to open discussion, rather than to finalize the implementation in the near term.
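To make the proposed contract concrete, here is a minimal sketch; `iter_fits_default` is a hypothetical free-function rendering of the default behaviour, not the PR's actual code:

```python
from sklearn.base import clone


def iter_fits_default(estimator, param_settings, X, y=None):
    """Naive reference behaviour: refit once per setting.

    Yields (params, fitted_estimator) pairs, mirroring the proposed
    Estimator.iter_fits contract; specialised estimators would avoid
    redundant work between consecutive settings.
    """
    for params in param_settings:
        est = clone(estimator).set_params(**params)
        yield params, est.fit(X, y)
```

A search procedure would then consume `for params, model in estimator.iter_fits(...)` and score each model as it is yielded.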
The default implementation allows an estimator to specify parameters which, when changed, do not require `fit` to be called again (a sketch below illustrates this). Estimators where it is possible to warm-start from the model attributes learnt with the previous parameters may utilise that fact; in particular, this approach emphasises that the same data is being used, where multiple calls to `fit` do not. A `Pipeline` resolves its ordering through a depth-first traversal of the parameters affecting each pipeline step (ordered by that step's estimator), and the use of generators means the output of `transform` from higher levels is retained on the stack, which adds a memory cost (it may be possible to optionally reduce this cost later) but cuts a lot of repeated work. The model-iterator approach means regularization paths can be easily incorporated if rewritten as generators.

Grid search (cv=3, n_jobs=1) over a simple pipeline of (StandardScaler, SelectKBest, ElasticNet) ran in 67% of the baseline time (repeated calls to `Pipeline.fit`) using this implementation, with call counts measured for the following methods:

- `StandardScaler.fit`
- `StandardScaler.transform`
- `SelectKBest.fit`
- `SelectKBest.transform`
- `ElasticNet.fit`

(Note: many calls to `transform` are at predict time, which is not affected by this change.)
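As an illustration of the non-refit annotation (the `univariate_selection` case mentioned in the discussion), consider a hypothetical SelectKBest-like transformer: computing per-feature scores is the expensive step, after which any value of `k` can be served without refitting. The `_no_refit_params` attribute and the `iter_fits` body below are a sketch under that assumption, not the PR's implementation:

```python
import numpy as np
from sklearn.base import BaseEstimator, clone


class KBestLike(BaseEstimator):
    # Hypothetical annotation: changing these parameters leaves the
    # fitted state (the per-feature scores) valid.
    _no_refit_params = {'k'}

    def __init__(self, score_func, k=10):
        self.score_func = score_func  # assumed to return one score per feature
        self.k = k

    def fit(self, X, y):
        self.scores_ = self.score_func(X, y)  # the expensive step
        return self

    def transform(self, X):
        keep = np.sort(np.argsort(self.scores_)[-self.k:])
        return X[:, keep]

    def iter_fits(self, param_settings, X, y=None):
        fitted, last_key = None, None
        for params in param_settings:
            # Only parameters outside _no_refit_params invalidate the fit.
            key = {name: value for name, value in params.items()
                   if name not in self._no_refit_params}
            if fitted is None or key != last_key:
                fitted = clone(self).set_params(**params).fit(X, y)
                last_key = key
            else:
                fitted.set_params(**params)  # e.g. only k changed: no refit
            yield params, fitted
```

Note that, as in the first caveat below, consecutive iterations here may yield the same object with mutated parameters.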
Caveats and comments:

- `iter_fits` will often overwrite attributes of the same object in each iteration. It is therefore necessary to copy any data off the models, or call any of their methods, while iterating. This may make it a dangerous method for public use, and it may be worthwhile to require that each iteration yield a clone of the estimator, at some added expense and complication.
- The default `iter_fits` implementation is to group the list of parameter settings by the values of some "higher-order" parameters (those requiring substantial work to be done at each change), within which groups the lower-order parameters are iterated (see the sketch after this list). Thus: […] parameter values may be unhashable, unorderable or mutable (`ndarray` is all of these), making the grouping operation and other generic parameter equality evaluations not trivial to code.
- Since candidates are reordered at `fit` time, each fold of a cv-search process may operate in parallel. To recombine the outputs, it is easiest if the reordering is deterministic, and therefore the same across all folds. Also, because the reordering is coupled with fitting, parallelisation beyond one-process-per-fold may require arbitrary splits in the candidate sequence.
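The grouping idea in the second caveat might be rendered as below; `group_settings` is a hypothetical helper, and keying on `repr` of the values is one workaround for the hashability problem, not the PR's solution:

```python
from itertools import groupby


def group_settings(param_settings, higher_order):
    """Yield groups of settings sharing their higher-order values.

    Within a group only lower-order parameters vary, so fitted state
    can be reused or warm-started.  Sorting first makes the reordering
    deterministic, as the parallelism caveat above requires.
    """
    def key(params):
        # repr() sidesteps unhashable/unorderable values such as
        # ndarray, at the cost of fragile equality semantics.
        return repr([(name, params.get(name)) for name in higher_order])

    for _, group in groupby(sorted(param_settings, key=key), key=key):
        yield list(group)


settings = [{'C': 1, 'k': 5}, {'C': 10, 'k': 5}, {'C': 1, 'k': 10}]
for group in group_settings(settings, higher_order=['C']):
    print(group)  # both C=1 settings arrive together; C=10 separately
```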