-
-
Notifications
You must be signed in to change notification settings - Fork 26.2k
Description
Description
I use GridSearchCV to optimize the hyperparameters of a pipeline. I set the param grid by inputing transformers or estimators at different steps of the pipeline, following the Pipeline documentation:
A step’s estimator may be replaced entirely by setting the parameter with its name to another estimator, or a transformer removed by setting to None.
I couldn't figure why dumping cv_results_ would take so much memory on disk. It happens that cv_results_['params'] and all cv_results_['param_*'] objects contains fitted estimators, as much as there are points on my grid.
This bug should happen only when n_jobs = 1 (which is my usecase).
I don't think this is intended (else the arguments and attributes refit and best_estimator wouldn't be used).
My guess is that during the grid search, those estimator's aren't cloned before use (which could be a problem if using the same grid search several times, because estimators passed in the next grid would be fitted...).