Description
When running tests with pytest-xdist on a machine with 12 physical CPUs, the use of OpenMP in HistGradientBoosting seems to lead to significant over-subscription. For me,

```
pytest sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py -v
```

takes 0.85s. This runs 2 doctests, training a GBDT classifier and a regressor on the iris and boston datasets respectively.
- Running this on 2 parallel processes (`-n 2`) takes 56s (and 50 threads are created).
- Running with 2 processes and `OMP_NUM_THREADS=2` takes 0.52s.
While I understand the case of catastrophic oversubscription when `N_CPU_THREADS**2` threads are created on a machine with many cores, here we create only `2*N_CPU_THREADS` as compared to `1*N_CPU_THREADS`, and get a 10x slowdown.
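To make the thread arithmetic above concrete, here is an illustrative sketch (the 12-core count and 2-worker count are taken from this report; the totals count only OpenMP threads, not the ~50 total threads observed):

```python
# Assumed values for illustration: 12 physical cores (this machine),
# 2 pytest-xdist worker processes (`-n 2`).
n_cores = 12
n_workers = 2

# By default, each worker process gets an OpenMP pool sized to the machine,
# so the total OpenMP thread count is n_workers * n_cores (2 * N_CPU_THREADS):
total_default = n_workers * n_cores

# With OMP_NUM_THREADS=2, each worker's pool is capped at 2 threads:
total_capped = n_workers * 2

print(f"default: {total_default} OpenMP threads, capped: {total_capped}")
```

So the slow case runs with only 24 OpenMP threads on 12 cores, a modest 2x oversubscription, which makes the 10x slowdown surprising.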
Can someone reproduce it? I'm using scikit-learn master, in a conda env on Linux with the latest `numpy`, `scipy`, `nomkl`, `python=3.7`.
Because pytest-xdist uses its own parallelism system (I'm not sure what it does exactly), I guess this won't be addressed by threadpoolctl (#14979)?
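As a stopgap (not a fix in scikit-learn itself), the per-worker OpenMP pool can be capped by setting `OMP_NUM_THREADS` before any OpenMP-backed extension is imported; a minimal sketch, assuming it runs early enough (e.g. at the top of a `conftest.py`) — the helper name here is hypothetical:

```python
import os

def cap_omp_threads(n: int) -> None:
    # OMP_NUM_THREADS is read when the OpenMP runtime initializes, so this
    # must run before importing sklearn / other OpenMP-backed extensions.
    os.environ["OMP_NUM_THREADS"] = str(n)

cap_omp_threads(2)
```

Alternatively, threadpoolctl's `threadpool_limits` context manager can limit already-initialized pools at runtime, though whether that helps under pytest-xdist's process model is exactly the open question here.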
Edit: Originally reported in https://github.com/tomMoral/loky/issues/224