
Python hangs while exploring a GridSearchCV with n_jobs=-1 #17543

@fuster-10

Description


Dear all,

I have been encountering the issue mentioned in the title for a while, and since issue #10533, where it is discussed, does not present a clear solution for Windows, I decided to post it here.

Firstly, the piece of code that runs into the issue is the following:
qrf = RandomForestQuantileRegressor(n_jobs=-1)

parameters = {'n_estimators' : [30,40], # More estimators will always give a better result
              'criterion': ['mae'],
              'min_samples_split': [5,10,15],
              'max_features' : [Xtrain.shape[1]//3], # For regression.
              'verbose' : [50],
              'random_state' : [0] # For reproducibility of results.
             }

custom_cv = custom_cv_2folds(ntrain=Xtrain.shape[0],
                             nvalid=Xvalid.shape[0])

qrf_grid = GridSearchCV(qrf,
                        param_grid=parameters,
                        cv=custom_cv,
                        n_jobs=-1)

qrf_grid.fit(Xtrainvalid, ytrainvalid)
qrf_best = qrf_grid.best_estimator_ # We save the best estimator.

where custom_cv is the following cross-validation split:
def custom_cv_2folds(ntrain, nvalid):
    # Indices for the training and validation partitions are yielded.
    idx_train = np.arange(0, ntrain, dtype=int)
    idx_valid = np.arange(ntrain, ntrain + nvalid, dtype=int)
    yield idx_train, idx_valid
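As a quick standalone sanity check (with made-up sizes, since the real data is not needed), the generator yields exactly one disjoint train/validation split; the function is reproduced here so the snippet runs on its own:

```python
import numpy as np

def custom_cv_2folds(ntrain, nvalid):
    # Yield a single (train, validation) split: the first ntrain rows
    # are used for training, the following nvalid rows for validation.
    idx_train = np.arange(0, ntrain, dtype=int)
    idx_valid = np.arange(ntrain, ntrain + nvalid, dtype=int)
    yield idx_train, idx_valid

# GridSearchCV accepts any iterable of (train_idx, valid_idx) pairs as cv=.
splits = list(custom_cv_2folds(ntrain=100, nvalid=20))
print(len(splits))                            # a single split
print(set(splits[0][0]) & set(splits[0][1]))  # no overlap: set()
```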

In addition, I think there is no need to provide the data used, since there are other issues similar to this one and they do not appear to be data-related.

As the issue has been going on for a while, the way it manifests has changed slightly, so I will summarize its behaviour at each stage of its evolution:

  1. At the beginning, I was just running the code provided above. The cell in the Jupyter notebook I was executing hung forever. Shutting down the kernel and restarting it manually did not help, since the Task Manager showed 100% CPU usage even after doing so, and the only way to bring my computer back to normal was to reboot it. On top of that, the code above would sometimes run if the grid was smaller. An example of a GridSearchCV that does run successfully:
    parameters = {'n_estimators' : list(range(10, 40, 10)),
                  'criterion': ['mae'],
                  'min_samples_split': list(range(2, 11, 4)),
                  'max_features' : [Xtrain.shape[1]//3],
                  'verbose' : [50],
                  'random_state' : [0]
                 }
    I think the n_estimators hyper-parameter plays a major role in this.

  2. In this second stage, I was reading about the issue and came across issue #10533, in which it was recommended to install cloudpickle and add the following to my code:
    %env LOKY_PICKLER='cloudpickle'
    import multiprocessing
    multiprocessing.set_start_method('forkserver')
    which I did. However, as I am on Windows rather than Linux, there is no way for me to use that method and set the forkserver.
    What I did instead was try running my code again, and it turned out that the verbose output now reported Using LokyBackend and the following message showed up:
    FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: ValueError: buffer source array is read-only
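For reference (a standard-library check, not specific to scikit-learn): fork and forkserver are POSIX-only start methods, and spawn is the only one available on Windows, which is why setting forkserver there cannot work:

```python
import multiprocessing

# Lists the process start methods the current platform supports.
# On Windows this is ['spawn']; on Linux it also includes
# 'fork' and 'forkserver'.
print(multiprocessing.get_all_start_methods())
```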

I started looking into it and tried different approaches using parallel_backend.

Now, my code runs with parallel_backend('threading'). However, the execution of the trees turned out to be interleaved, unlike in the runs where the GridSearchCV executed successfully. For instance:
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
building tree 1 of 30
building tree 1 of 40
building tree 2 of 30
building tree 2 of 40
building tree 3 of 30
building tree 3 of 40
building tree 4 of 30
building tree 4 of 40
building tree 5 of 40
building tree 5 of 30
building tree 1 of 30
building tree 6 of 40
building tree 6 of 30
building tree 2 of 30
building tree 7 of 40
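For completeness, the threading workaround uses joblib's parallel_backend context manager; a minimal sketch, where the square function is a stand-in for the real per-candidate fitting work (not part of the original code):

```python
from joblib import Parallel, delayed, parallel_backend

def square(x):
    # Stand-in for the per-candidate fitting work.
    return x * x

# Forcing the thread-based backend sidesteps the process-based loky
# backend, which is where the pickling / read-only buffer errors appeared.
with parallel_backend('threading'):
    results = Parallel(n_jobs=-1)(delayed(square)(i) for i in range(4))

print(results)  # [0, 1, 4, 9]
```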

I cannot yet say whether the fit has finished successfully, since the training process takes quite a long time, but I will add that information as soon as it finishes.

My questions now are:

  1. Is there a reason why this is happening?
  2. Is there a workaround for Windows users?
  3. I am rather concerned about the interleaved execution of the trees during the training process. Do you know why this is happening? Do you think it will be detrimental to the model results?

Thanks a lot in advance, and sorry for the long post, but I have included all the information I considered relevant so that the investigation does not lack anything.

In case you need me to provide any additional details, just let me know.

Óscar
