
Unexpected slowness of code execution in the JupyterHub deployment (OpenMP oversubscription) #586

@ogrisel

As reported on the forum, the execution of some cells is significantly slower than expected (40s or more instead of ~2s):

https://mooc-forums.inria.fr/moocsl/t/cross-validation-accuracy-reproducibility/7379/3

discussing the cross-validation of an HGB Classifier model in Exercise M1.05:

https://lms.fun-mooc.fr/courses/course-v1:inria+41026+session02/jump_to_id/2081c92e3a4d4cc3b14db6e7e4220d58

I found the following problem in the configuration of the jupyterhub server:

  • There are 4 cores on the machine according to threadpoolctl.threadpool_info()
  • However, the CFS quota (/sys/fs/cgroup/cpu/cpu.cfs_quota_us) is set to 1x the CFS period (/sys/fs/cgroup/cpu/cpu.cfs_period_us), which means that only 1 CPU is usable per container.
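
To make the mismatch easy to reproduce, here is a minimal sketch that compares the core count visible to Python with the CPU budget implied by the two cgroup files above. It assumes the cgroup v1 layout quoted in this issue; the helper names `cpus_from_cfs` and `read_cfs_limit` are made up for illustration and are not part of any library.

```python
import os
from pathlib import Path

# cgroup v1 location, as quoted in this issue (assumption: other hosts may differ).
CGROUP_CPU_DIR = Path("/sys/fs/cgroup/cpu")


def cpus_from_cfs(quota_us, period_us):
    """Return the number of CPUs allowed by a CFS quota, or None if unlimited."""
    # A quota of -1 means that no CFS limit is configured.
    if quota_us <= 0 or period_us <= 0:
        return None
    return quota_us / period_us


def read_cfs_limit(cgroup_dir=CGROUP_CPU_DIR):
    """Read quota and period from the cgroup files; None if they are unavailable."""
    try:
        quota = int((cgroup_dir / "cpu.cfs_quota_us").read_text())
        period = int((cgroup_dir / "cpu.cfs_period_us").read_text())
    except (OSError, ValueError):
        return None
    return cpus_from_cfs(quota, period)


print("cores visible to Python:", os.cpu_count())
print("CPUs allowed by the CFS quota:", read_cfs_limit())
```

On the deployment described here, the first number would be 4 while the second would be 1.0, which is the situation that leads to oversubscription.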

I think we should allow for at least 2 CPUs per container in the kubernetes CFS config (or even 4), even though we know they will be underused most of the time. And we should set the following environment variables accordingly:

OMP_NUM_THREADS=2
OPENBLAS_NUM_THREADS=2
LOKY_CPU_COUNT=2

But if cfs_quota_us is left at 1 x cfs_period_us, then we should instead set:

OMP_NUM_THREADS=1
OPENBLAS_NUM_THREADS=1
LOKY_CPU_COUNT=1

in the environment config to avoid any potential oversubscription problem.
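
Until the server-side environment is fixed, the same limits can be applied from the first cell of a notebook. This is only a sketch of a per-notebook workaround, not the deployment fix itself: the helper `thread_env` is hypothetical, and the approach assumes the cell runs before numpy or scikit-learn are imported, since the threading runtimes typically read these variables when they are first loaded.

```python
import os


def thread_env(n_cpus):
    """Build the thread-limiting variables for a container allowed n_cpus CPUs."""
    n = str(max(1, int(n_cpus)))
    return {
        "OMP_NUM_THREADS": n,
        "OPENBLAS_NUM_THREADS": n,
        "LOKY_CPU_COUNT": n,
    }


# Run this before importing numpy / scikit-learn in the notebook.
# The value 1 matches the current quota of 1 x cfs_period_us.
os.environ.update(thread_env(1))
```

If the quota is raised to 2 CPUs per container, `thread_env(2)` yields the first set of values listed above.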

I have also observed that the anti-oversubscription protection for HGB Classifier implemented in scikit-learn/scikit-learn#20477 and released as part of scikit-learn 1.0 is not working as expected: setting OMP_NUM_THREADS=1 at the beginning of the notebook or using threadpoolctl.threadpool_limits(limits=1) can change the duration from ~40s to ~6s in my tests. So OpenMP oversubscription is the main culprit here.

I would not have expected this because sklearn.utils._openmp_helpers._openmp_effective_n_threads() returns 1 (as expected) and I have checked that _openmp_effective_n_threads is called where appropriate in HistGradientBoostingClassifier.fit in the source code of the version of scikit-learn deployed on jupyterhub...
