Parallelization limited to 2 CPUs when importing torch before joblib on a slurm cluster #1420

Closed
aperezlebel opened this issue Apr 14, 2023 · 7 comments

Comments

aperezlebel commented Apr 14, 2023

Hello, I encountered an issue when importing torch before joblib on a Slurm cluster: instead of using all available CPUs, joblib only uses two, regardless of the total CPU count. On my laptop, both import orders work as expected.

The problem occurs with all backends. With the "threading" backend, the reduced load is spread across the CPUs, each sitting at a few percent of utilization instead of 100%.

1. Normal behavior:

from joblib import Parallel, delayed
import torch

def heavy_func():
    for _ in range(10000):
        [i for i in range(10000)]


n_jobs = 40
Parallel(n_jobs=n_jobs)(delayed(heavy_func)() for _ in range(2*n_jobs))

(Screenshot: CPU usage with all cores fully loaded)

2. Problematic behavior:

import torch
from joblib import Parallel, delayed


def heavy_func():
    for _ in range(10000):
        [i for i in range(10000)]


n_jobs = 40
Parallel(n_jobs=n_jobs)(delayed(heavy_func)() for _ in range(2*n_jobs))

(Screenshot: CPU usage with only two cores loaded)
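
For reference, a minimal diagnostic sketch (an editorial addition, not part of the original report), assuming the limitation comes from the parent process's CPU affinity being narrowed when torch initializes its OpenMP runtime, which is the explanation that emerges later in this thread. On Linux, the affinity set should shrink right after the import:

import os

# Number of CPUs the process is allowed to run on before torch is imported.
print("CPUs before importing torch:", len(os.sched_getaffinity(0)))

import torch  # noqa: E402  (imported here on purpose, after the first check)

# If the OpenMP runtime narrowed the affinity mask, this number drops.
print("CPUs after importing torch:", len(os.sched_getaffinity(0)))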

Versions

python                    3.10.10         he550d4f_0_cpython    conda-forge
joblib                    1.2.0              pyhd8ed1ab_0    conda-forge
pytorch                   2.0.0              py3.10_cpu_0    pytorch
slurm                     20.11.9
ogrisel (Contributor) commented Apr 18, 2023

Thanks for the report, this is indeed quite unexpected.

I am not sure I can reproduce this on my local machine. EDIT: I had missed "On my laptop, both work as expected."

Can you please try the following:

import torch
from joblib import Parallel, delayed
import os


def heavy_func():
    for _ in range(10000):
        [i for i in range(10000)]
    return os.getpid()


n_jobs = 40
results = Parallel(n_jobs=n_jobs)(delayed(heavy_func)() for _ in range(2*n_jobs))
print(len(set(results)))
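
(The printed count is the number of distinct worker processes that executed at least one task; with 40 busy worker processes it should be close to 40.)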

aperezlebel (Author) commented:

This gave me different numbers each time I ran the script.

When torch is imported first, over 5 runs I got: 27, 23, 20, 17, 16.

When joblib is imported first, over 5 runs I got: 4, 2, 4, 3, 4.

> I am not sure I can reproduce on my local machine

The example in my first message was run on the Margaret cluster. Maybe you can try to reproduce the result on a Margaret node?

wondey-sh commented:

Hi, I encountered the same issue on an Ubuntu 20.04 machine:

  • python 3.9.16 h2782a2a_0_cpython conda-forge
  • pytorch 1.13.1 py3.9_cuda11.7_cudnn8.5.0_0 pytorch
  • joblib 1.3.0 pyhd8ed1ab_1 conda-forge

The issue occurs with my own code, and I can also reproduce it with @aperezlebel's code.

wurining commented Nov 9, 2023

Hi, same here.

  • python 3.8.17
  • torch 1.12.1
  • joblib 1.3.2

Importing joblib before torch works around it.

tomMoral (Contributor) commented Nov 9, 2023 via email

wurining commented Nov 9, 2023

@tomMoral Thank you for the reference, it is useful.

My llvm-openmp version is 16.0.6, and setting KMP_AFFINITY=disabled also makes it run correctly.

pytorch/pytorch#99625 (comment)
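
A minimal sketch of that workaround (an editorial addition, not taken verbatim from the thread): the variable should be set before torch initializes its OpenMP runtime, either in the shell (KMP_AFFINITY=disabled python script.py) or at the very top of the script, before importing torch.

import os

# Must be set before torch (and its bundled OpenMP runtime) is imported,
# because the runtime reads KMP_AFFINITY once at initialization.
os.environ["KMP_AFFINITY"] = "disabled"

import torch  # noqa: E402
from joblib import Parallel, delayed  # noqa: E402


def heavy_func():
    for _ in range(10000):
        [i for i in range(10000)]


n_jobs = 40
Parallel(n_jobs=n_jobs)(delayed(heavy_func)() for _ in range(2 * n_jobs))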

lesteve closed this as completed Feb 19, 2025
lesteve (Member) commented Feb 19, 2025

Let's close this, as it looks like there is a workaround.
