
Python hangs while exploring a GridSearchCV with n_jobs=-1 #17543

@fuster-10

Description


Dear all,

I have been encountering the issue mentioned in the title for a while, and since issue #10533, where it is discussed, does not present a clear solution for Windows, I decided to post it here.

Firstly, the piece of code that runs into the issue is the following:
qrf = RandomForestQuantileRegressor(n_jobs=-1)

parameters = {'n_estimators' : [30,40], # More estimators will always give a better result
              'criterion': ['mae'],
              'min_samples_split': [5,10,15],
              'max_features' : [Xtrain.shape[1]//3], # For regression.
              'verbose' : [50],
              'random_state' : [0] # For reproducibility of results.
             }

custom_cv = custom_cv_2folds(ntrain=Xtrain.shape[0],
                             nvalid=Xvalid.shape[0])

qrf_grid = GridSearchCV(qrf,
                        param_grid=parameters,
                        cv=custom_cv,
                        n_jobs=-1)

qrf_grid.fit(Xtrainvalid, ytrainvalid)
qrf_best = qrf_grid.best_estimator_ # We save the best estimator.

where custom_cv is the following cross-validation split:
def custom_cv_2folds(ntrain, nvalid):
    # Indices for the training and validation partitions are yielded.
    idx_train = np.arange(0, ntrain, dtype=int)
    idx_valid = np.arange(ntrain, ntrain + nvalid, dtype=int)
    yield idx_train, idx_valid
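As a quick standalone sanity check (with made-up sizes, since the real data is not needed), the generator yields exactly one disjoint train/validation split; the function is reproduced here so the snippet runs on its own:

```python
import numpy as np

def custom_cv_2folds(ntrain, nvalid):
    # Yield a single (train, validation) split: the first ntrain rows
    # are used for training, the following nvalid rows for validation.
    idx_train = np.arange(0, ntrain, dtype=int)
    idx_valid = np.arange(ntrain, ntrain + nvalid, dtype=int)
    yield idx_train, idx_valid

# GridSearchCV accepts any iterable of (train_idx, valid_idx) pairs as cv=.
splits = list(custom_cv_2folds(ntrain=100, nvalid=20))
print(len(splits))                            # a single split
print(set(splits[0][0]) & set(splits[0][1]))  # no overlap: set()
```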

In addition, I think there is no need to provide the data used, since there are other issues similar to this one and they do not appear to be data-related.

As the issue has been going on for a while, the way it manifests has changed slightly, so I will summarize its behaviour at each stage of its evolution:

  1. At the beginning, I was just running the code provided above. The cell in the Jupyter notebook I was executing hung forever. Shutting down the kernel and restarting it manually did not help, since the Task Manager showed 100% CPU usage even after doing so, and the only way to bring my computer back to normal was to reboot it. On top of that, the code above would sometimes run if the grid was smaller. An example of a GridSearchCV that does run successfully:
    parameters = {'n_estimators' : list(range(10, 40, 10)),
                  'criterion': ['mae'],
                  'min_samples_split': list(range(2, 11, 4)),
                  'max_features' : [Xtrain.shape[1]//3],
                  'verbose' : [50],
                  'random_state' : [0]
                 }
    I think the n_estimators hyper-parameter plays a major role in this.

  2. In this second stage, I was reading about the issue and came across issue #10533, in which it was recommended to install cloudpickle and add the following to my code:
    %env LOKY_PICKLER='cloudpickle'
    import multiprocessing
    multiprocessing.set_start_method('forkserver')
    which I did. However, as I am on Windows rather than Linux, there is no way for me to use that method and set the forkserver.
    What I did instead was try running my code again, and it turned out that the verbose output now reported Using LokyBackend and the following message showed up:
    FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: ValueError: buffer source array is read-only
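For reference (a standard-library check, not specific to scikit-learn): fork and forkserver are POSIX-only start methods, and spawn is the only one available on Windows, which is why setting forkserver there cannot work:

```python
import multiprocessing

# Lists the process start methods the current platform supports.
# On Windows this is ['spawn']; on Linux it also includes
# 'fork' and 'forkserver'.
print(multiprocessing.get_all_start_methods())
```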

I started looking into it and tried different approaches using parallel_backend.

Now, my code runs with parallel_backend('threading'). However, the execution of the trees turned out to be interleaved, unlike in the runs where the GridSearchCV executed successfully. For instance:
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
building tree 1 of 30
building tree 1 of 40
building tree 2 of 30
building tree 2 of 40
building tree 3 of 30
building tree 3 of 40
building tree 4 of 30
building tree 4 of 40
building tree 5 of 40
building tree 5 of 30
building tree 1 of 30
building tree 6 of 40
building tree 6 of 30
building tree 2 of 30
building tree 7 of 40
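For completeness, the threading workaround uses joblib's parallel_backend context manager; a minimal sketch, where the square function is a stand-in for the real per-candidate fitting work (not part of the original code):

```python
from joblib import Parallel, delayed, parallel_backend

def square(x):
    # Stand-in for the per-candidate fitting work.
    return x * x

# Forcing the thread-based backend sidesteps the process-based loky
# backend, which is where the pickling / read-only buffer errors appeared.
with parallel_backend('threading'):
    results = Parallel(n_jobs=-1)(delayed(square)(i) for i in range(4))

print(results)  # [0, 1, 4, 9]
```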

I cannot yet say whether the fit has finished successfully, since the training process takes quite a long time, but I will add that information as soon as it finishes.

My questions now are:

  1. Is there a reason why this is happening?
  2. Is there a workaround for Windows users?
  3. I am rather concerned about the interleaved execution of the trees during the training process. Do you know why this is happening? Do you think it will be detrimental to the model results?

Thanks a lot in advance, and sorry for the long post, but I have included all the information I considered relevant so that the investigation does not lack anything.

In case you need me to provide any additional details, just let me know.

Óscar
