Parallelization limited to 2 CPUs when importing torch before joblib on a slurm cluster #1420
Thanks for the report, this is indeed quite unexpected. I am not sure I can reproduce it on my local machine. EDIT: I had not read "On my laptop, both work as expected.". Can you please try the following:

```python
import torch
from joblib import Parallel, delayed
import os


def heavy_func():
    # Busy loop so each task keeps a CPU occupied for a little while.
    for _ in range(10000):
        [i for i in range(10000)]
    return os.getpid()


n_jobs = 40
results = Parallel(n_jobs=n_jobs)(delayed(heavy_func)() for _ in range(2 * n_jobs))
# Number of distinct worker processes that actually ran tasks.
print(len(set(results)))
```
This gave me different numbers each time I ran the script.

When torch is imported first, on 5 runs I got:

When joblib is imported first, on 5 runs I got:

The example in my first message was run on the Margaret cluster. Maybe you can try to reproduce the result on a Margaret node?
Hi, I encounter the same issue on an Ubuntu 20.04 machine:

The issue occurs with my own code, and I can also reproduce it with @aperezlebel's code.
Hi, same here.

- python 3.8.17
- torch 1.12.1
- joblib 1.3.2

Importing joblib before torch can resolve it.
I think this is an issue where importing torch modifies the CPU affinity. It seems to be related to llvm-openmp>=16 (pytorch/pytorch#101850 (comment)); could you check if you have this installed? If so, could you install an earlier version and check that it works properly?
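A quick way to check this on a Linux node (just a sketch, not commands taken from this thread) is to compare the process's CPU affinity mask before and after the torch import:

```python
import os

# Affinity mask before torch (and the OpenMP runtime it loads) is imported.
before = os.sched_getaffinity(0)

import torch  # noqa: E402 -- imported here on purpose to observe the side effect

# Affinity mask after the import; on affected setups it shrinks to ~2 CPUs.
after = os.sched_getaffinity(0)

print(f"CPUs before torch import: {len(before)}, after: {len(after)}")
```

If the mask shrinks, pinning llvm-openmp to a version earlier than 16 (or importing joblib first) should bring the full CPU count back.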
@tomMoral Thank you for this ref. It is useful.
Let's close this; it looks like there is a work-around for it.
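For reference, a minimal sketch of the work-around discussed above (an illustration assuming a Linux host, not code from the thread): either import joblib before torch, or record the affinity mask before importing torch and restore it afterwards:

```python
import os

# Remember which CPUs the job is allowed to use before torch can modify the mask.
allowed_cpus = os.sched_getaffinity(0)

import torch  # noqa: E402

# Restore the original mask so joblib workers can again spread over all allocated CPUs.
os.sched_setaffinity(0, allowed_cpus)

from joblib import Parallel, delayed  # noqa: E402
```

Alternatively, pinning llvm-openmp to a version earlier than 16 avoids the affinity change altogether.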
Hello, I encountered an issue when importing torch before joblib on a Slurm cluster. Instead of using all available CPUs, it only uses two (regardless of the total CPU count). On my laptop, both import orders work as expected.

The problem occurs with all backends. With the "threading" backend, the reduced load is spread across the CPUs, each sitting at a few percent of utilization instead of 100%.
1. Normal behavior (joblib imported before torch): all requested CPUs are used.
2. Problematic behavior (torch imported before joblib): only 2 CPUs are used; see the sketch below.
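One way to observe the difference from Python (a sketch, not taken from the original report): after the imports, check how many CPUs the process is still allowed to run on. When torch shrinks the affinity mask, all joblib workers end up competing for those few CPUs.

```python
import torch  # torch imported first, as in the problematic case
import os
import joblib

# On Linux, the affinity mask is what ultimately limits where workers may run;
# if it has been reduced to 2 CPUs, the 40 worker processes share just those 2.
print("affinity mask size:", len(os.sched_getaffinity(0)))
print("CPUs reported by joblib:", joblib.cpu_count())
```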
Versions