From 91411ea990bcac65d187054f7e0fef8895bef413 Mon Sep 17 00:00:00 2001 From: Julien Jerphanion Date: Fri, 18 Nov 2022 15:06:49 +0100 Subject: [PATCH 1/8] DOC Update "Parallelism, resource management, and configuration" section --- doc/computing/parallelism.rst | 112 +++++++++++++++++----------------- 1 file changed, 56 insertions(+), 56 deletions(-) diff --git a/doc/computing/parallelism.rst b/doc/computing/parallelism.rst index 382fa8938b5ca..bd72e6fee823a 100644 --- a/doc/computing/parallelism.rst +++ b/doc/computing/parallelism.rst @@ -10,22 +10,25 @@ Parallelism, resource management, and configuration Parallelism ----------- -Some scikit-learn estimators and utilities can parallelize costly operations -using multiple CPU cores, thanks to the following components: +Some scikit-learn estimators and utilities parallelize costly operations +using multiple CPU cores. -- via the `joblib `_ library. In - this case the number of threads or processes can be controlled with the - ``n_jobs`` parameter. -- via OpenMP, used in C or Cython code. +This can be either done: -In addition, some of the numpy routines that are used internally by -scikit-learn may also be parallelized if numpy is installed with specific -numerical libraries such as MKL, OpenBLAS, or BLIS. +- with higher-level parallelism via `joblib `_. + In this case, the number of threads or processes can be controlled with the + ``n_jobs`` parameter. +- with lower-level parallelism via OpenMP, used in C or Cython code. + In this case, parallelism is always done using threads and specifying + ``n_jobs`` *has no effect*. Implementations relying on this parallelism are generally + more performant by several orders of magnitude. +- with lower-level parallelism via BLAS, used by NumPy and SciPy for generic operations + on arrays. -We describe these 3 scenarios in the following subsections. +We describe these 3 types of parallelism in the following subsections in more details. -Joblib-based parallelism -........................ +Higher-level parallelism with joblib +.................................... When the underlying implementation uses joblib, the number of workers (threads or processes) that are spawned in parallel can be controlled via the @@ -33,15 +36,16 @@ When the underlying implementation uses joblib, the number of workers .. note:: - Where (and how) parallelization happens in the estimators is currently - poorly documented. Please help us by improving our docs and tackle `issue - 14228 `_! + Where (and how) parallelization happens in the estimators using joblib by + specifying `n_jobs` is currently poorly documented. + Please help us by improving our docs and tackle `issue 14228 + `_! Joblib is able to support both multi-processing and multi-threading. Whether joblib chooses to spawn a thread or a process depends on the **backend** that it's using. -Scikit-learn generally relies on the ``loky`` backend, which is joblib's +scikit-learn generally relies on the ``loky`` backend, which is joblib's default backend. Loky is a multi-processing backend. When doing multi-processing, in order to avoid duplicating the memory in each process (which isn't reasonable with big datasets), joblib will create a `memmap @@ -70,40 +74,46 @@ that increasing the number of workers is always a good thing. In some cases it can be highly detrimental to performance to run multiple copies of some estimators or functions in parallel (see oversubscription below). -OpenMP-based parallelism -........................ 
+Lower-level parallelism with OpenMP +................................... OpenMP is used to parallelize code written in Cython or C, relying on -multi-threading exclusively. By default (and unless joblib is trying to -avoid oversubscription), the implementation will use as many threads as -possible. +multi-threading exclusively. By default, the implementations using OpenMP +will use as many threads as possible, i.e. as many threads as logical cores. -You can control the exact number of threads that are used via the -``OMP_NUM_THREADS`` environment variable: +You can control the exact number of threads that are used either: -.. prompt:: bash $ + - via the ``OMP_NUM_THREADS`` environment variable, e.g. for instance when: + running a python script: + + .. prompt:: bash $ - OMP_NUM_THREADS=4 python my_script.py + OMP_NUM_THREADS=4 python my_script.py -Parallel Numpy routines from numerical libraries -................................................ + - or via `threadpoolctl` as explained by `this piece of documentation + `_. -Scikit-learn relies heavily on NumPy and SciPy, which internally call -multi-threaded linear algebra routines implemented in libraries such as MKL, -OpenBLAS or BLIS. +Parallel NumPy and SciPy routines from numerical libraries +.......................................................... -The number of threads used by the OpenBLAS, MKL or BLIS libraries can be set -via the ``MKL_NUM_THREADS``, ``OPENBLAS_NUM_THREADS``, and -``BLIS_NUM_THREADS`` environment variables. +scikit-learn relies heavily on NumPy and SciPy, which internally call +multi-threaded linear algebra routines of BLAS implemented in libraries +such as MKL, OpenBLAS or BLIS. -Please note that scikit-learn has no direct control over these -implementations. Scikit-learn solely relies on Numpy and Scipy. +You can control the exact number of threads used by BLAS for each library +using environment variables, namely: + - ``MKL_NUM_THREADS`` sets the number of threads MKL uses, + - ``OPENBLAS_NUM_THREADS`` sets the number of threads OpenBLAS uses + - ``BLIS_NUM_THREADS`` sets the number of threads BLIS uses .. note:: - At the time of writing (2019), NumPy and SciPy packages distributed on - pypi.org (used by ``pip``) and on the conda-forge channel are linked - with OpenBLAS, while conda packages shipped on the "defaults" channel - from anaconda.org are linked by default with MKL. + At the time of writing (2022), NumPy and SciPy packages which are + distributed on pypi.org (i.e. the ones installed via ``pip install``) + and on the conda-forge channel (i.e. the ones installed via + ``conda install --channel conda-forge``) are linked with OpenBLAS, while + NumPy and SciPy packages shipped on the ``defaults`` conda + channel from Anaconda.org (i.e. the ones installed via ``conda install``) + are linked by default with MKL. Oversubscription: spawning too many threads @@ -120,8 +130,8 @@ with ``n_jobs=8`` over a OpenMP). Each instance of :class:`~sklearn.ensemble.HistGradientBoostingClassifier` will spawn 8 threads (since you have 8 CPUs). That's a total of ``8 * 8 = 64`` threads, which -leads to oversubscription of physical CPU resources and to scheduling -overhead. +leads to oversubscription of threads for physical CPU resources and thus +to scheduling overhead. Oversubscription can arise in the exact same fashion with parallelized routines from MKL, OpenBLAS or BLIS that are nested in joblib calls. @@ -146,14 +156,15 @@ Note that: only use ``_NUM_THREADS``. 
Joblib exposes a context manager for finer control over the number of threads in its workers (see joblib docs linked below). -- Joblib is currently unable to avoid oversubscription in a - multi-threading context. It can only do so with the ``loky`` backend - (which spawns processes). +- `threadpoolctl` internally manages the numbers of threads used by OpenMP + and BLAS implementations for scikit-learn implementations. You will find additional details about joblib mitigation of oversubscription in `joblib documentation `_. +You will find additional details about parallelism in numerical python libraries +in `this document from Thomas J. Fan `_. Configuration switches ----------------------- Python runtime .............. -:func:`sklearn.set_config` controls the following behaviors: - -`assume_finite` -~~~~~~~~~~~~~~~ - -Used to skip validation, which enables faster computations but may lead to -segmentation faults if the data contains NaNs. - -`working_memory` -~~~~~~~~~~~~~~~~ - -The optimal size of temporary arrays used by some algorithms. +:func:`sklearn.set_config` controls the following behaviors. .. _environment_variable: From 06b6340f5e21baef45edaa93af9b1c32b6dae5a1 Mon Sep 17 00:00:00 2001 From: Julien Jerphanion Date: Mon, 21 Nov 2022 10:49:47 +0100 Subject: [PATCH 2/8] STY Break line for bullet point list --- doc/computing/parallelism.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/doc/computing/parallelism.rst b/doc/computing/parallelism.rst index bd72e6fee823a..541b03d0898f1 100644 --- a/doc/computing/parallelism.rst +++ b/doc/computing/parallelism.rst @@ -102,6 +102,7 @@ such as MKL, OpenBLAS or BLIS. You can control the exact number of threads used by BLAS for each library using environment variables, namely: + - ``MKL_NUM_THREADS`` sets the number of threads MKL uses, - ``OPENBLAS_NUM_THREADS`` sets the number of threads OpenBLAS uses - ``BLIS_NUM_THREADS`` sets the number of threads BLIS uses From 617b18a59c9f59a50accf41b8537cba4659bc0a7 Mon Sep 17 00:00:00 2001 From: Julien Jerphanion Date: Mon, 21 Nov 2022 11:44:11 +0100 Subject: [PATCH 3/8] DOC Fix duplication Co-authored-by: Tim Head --- doc/computing/parallelism.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/computing/parallelism.rst b/doc/computing/parallelism.rst index 541b03d0898f1..8c9d5756f640e 100644 --- a/doc/computing/parallelism.rst +++ b/doc/computing/parallelism.rst @@ -83,7 +83,7 @@ will use as many threads as possible, i.e. as many threads as logical cores. You can control the exact number of threads that are used either: - - via the ``OMP_NUM_THREADS`` environment variable, e.g. for instance when: + - via the ``OMP_NUM_THREADS`` environment variable, for instance when running a python script: .. 
prompt:: bash $ From 2549650b9a10d4a285f248bc90284dd683ca9999 Mon Sep 17 00:00:00 2001 From: Julien Jerphanion Date: Tue, 22 Nov 2022 11:57:18 +0100 Subject: [PATCH 4/8] DOC Update comment to mention the joblib-based backends' use --- sklearn/metrics/pairwise.py | 18 ++++++++++++------ sklearn/neighbors/_base.py | 16 +++++++++++----- 2 files changed, 23 insertions(+), 11 deletions(-) diff --git a/sklearn/metrics/pairwise.py b/sklearn/metrics/pairwise.py index e6abd596b0000..1ccff8ae8c8b7 100644 --- a/sklearn/metrics/pairwise.py +++ b/sklearn/metrics/pairwise.py @@ -688,9 +688,12 @@ def pairwise_distances_argmin_min( values = values.flatten() indices = indices.flatten() else: - # TODO: once BaseDistanceReductionDispatcher supports distance metrics - # for boolean datasets, we won't need to fallback to - # pairwise_distances_chunked anymore. + # Joblib-based backend, which is used when user-defined callables + # are passed for metric. + + # This won't be used in the future once PairwiseDistancesReductions support: + # - DistanceMetrics which work on supposedly binary data + # - CSR-dense and dense-CSR case if 'euclidean' in metric. # Turn off check for finiteness because this is costly and because arrays # have already been validated. @@ -800,9 +803,12 @@ def pairwise_distances_argmin(X, Y, *, axis=1, metric="euclidean", metric_kwargs ) indices = indices.flatten() else: - # TODO: once BaseDistanceReductionDispatcher supports distance metrics - # for boolean datasets, we won't need to fallback to - # pairwise_distances_chunked anymore. + # Joblib-based backend, which is used when user-defined callables + # are passed for metric. + + # This won't be used in the future once PairwiseDistancesReductions support: + # - DistanceMetrics which work on supposedly binary data + # - CSR-dense and dense-CSR case if 'euclidean' in metric. # Turn off check for finiteness because this is costly and because arrays # have already been validated. diff --git a/sklearn/neighbors/_base.py b/sklearn/neighbors/_base.py index 07b4e4d5996ff..99506bd2fe1ed 100644 --- a/sklearn/neighbors/_base.py +++ b/sklearn/neighbors/_base.py @@ -839,9 +839,12 @@ class from an array representing our data set and ask who's ) elif self._fit_method == "brute": - # TODO: should no longer be needed once ArgKmin - # is extended to accept sparse and/or float32 inputs. + # Joblib-based backend, which is used when user-defined callables + # are passed for metric. + # This won't be used in the future once PairwiseDistancesReductions support: + # - DistanceMetrics which work on supposedly binary data + # - CSR-dense and dense-CSR case if 'euclidean' in metric. reduce_func = partial( self._kneighbors_reduce_func, n_neighbors=n_neighbors, @@ -1173,9 +1176,12 @@ class from an array representing our data set and ask who's ) elif self._fit_method == "brute": - # TODO: should no longer be needed once we have Cython-optimized - # implementation for radius queries, with support for sparse and/or - # float32 inputs. + # Joblib-based backend, which is used when user-defined callables + # are passed for metric. + + # This won't be used in the future once PairwiseDistancesReductions support: + # - DistanceMetrics which work on supposedly binary data + # - CSR-dense and dense-CSR case if 'euclidean' in metric. 
# for efficiency, use squared euclidean distances if self.effective_metric_ == "euclidean": From 5de3c0aba058e29ebfc89c42b13c63bf8303c0b8 Mon Sep 17 00:00:00 2001 From: Julien Jerphanion Date: Tue, 22 Nov 2022 12:00:32 +0100 Subject: [PATCH 5/8] DOC Be more precise --- doc/computing/parallelism.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/computing/parallelism.rst b/doc/computing/parallelism.rst index 8c9d5756f640e..dd19eac47bd88 100644 --- a/doc/computing/parallelism.rst +++ b/doc/computing/parallelism.rst @@ -21,7 +21,7 @@ This can be either done: - with lower-level parallelism via OpenMP, used in C or Cython code. In this case, parallelism is always done using threads and specifying ``n_jobs`` *has no effect*. Implementations relying on this parallelism are generally - more performant by several orders of magnitude. + more performant than joblib-based implementations by up to two orders of magnitude. - with lower-level parallelism via BLAS, used by NumPy and SciPy for generic operations on arrays. From 35916a2c96bc1ac4265e9aa83423f81089f10e52 Mon Sep 17 00:00:00 2001 From: Julien Jerphanion Date: Wed, 23 Nov 2022 09:40:58 +0100 Subject: [PATCH 6/8] DOC Make the instruction more explicit Co-authored-by: Olivier Grisel --- doc/computing/parallelism.rst | 24 ++++++++++++++++++++---- sklearn/neighbors/_base.py | 6 ++++-- 2 files changed, 24 insertions(+), 6 deletions(-) diff --git a/doc/computing/parallelism.rst b/doc/computing/parallelism.rst index dd19eac47bd88..f34b675ad4aa3 100644 --- a/doc/computing/parallelism.rst +++ b/doc/computing/parallelism.rst @@ -13,7 +13,8 @@ Parallelism Some scikit-learn estimators and utilities parallelize costly operations using multiple CPU cores. -This can be either done: +Depending on the type of estimator and sometimes the values of the +constructor parameters, this is either done: - with higher-level parallelism via `joblib `_. In this case, the number of threads or processes can be controlled with the @@ -97,7 +98,7 @@ Parallel NumPy and SciPy routines from numerical libraries .......................................................... scikit-learn relies heavily on NumPy and SciPy, which internally call -multi-threaded linear algebra routines of BLAS implemented in libraries +multi-threaded linear algebra routines (BLAS & LAPACK) implemented in libraries such as MKL, OpenBLAS or BLIS. You can control the exact number of threads used by BLAS for each library @@ -107,6 +108,16 @@ using environment variables, namely: - ``OPENBLAS_NUM_THREADS`` sets the number of threads OpenBLAS uses - ``BLIS_NUM_THREADS`` sets the number of threads BLIS uses +Note that BLAS & LAPACK implementations can also be impacted by +`OMP_NUM_THREADS`. To check whether this is the case in your environment, +you can inspect how the number of threads effectively used by those libraries +is affected when running the the following command in a bash or zsh terminal +for different values of `OMP_NUM_THREADS`:: + +.. prompt:: bash $ + + OMP_NUM_THREADS=2 python -m threadpoolctl -i numpy scipy + .. note:: At the time of writing (2022), NumPy and SciPy packages which are distributed on pypi.org (i.e. the ones installed via ``pip install``) @@ -157,8 +168,13 @@ Note that: only use ``_NUM_THREADS``. Joblib exposes a context manager for finer control over the number of threads in its workers (see joblib docs linked below). -- `threadpoolctl` internally manages the numbers of threads used by OpenMP - and BLAS implementations for scikit-learn implementations. 
+- When joblib is configured to use the ``threading`` backend, there is no + mechanism to avoid oversubscription when calling into parallel native + libraries in the joblib-managed threads. +- All scikit-learn estimators that explicitly rely on OpenMP in their Cython code + always use `threadpoolctl` internally to automatically adapt the numbers of + threads used by OpenMP and potentially nested BLAS calls so as to avoid + oversubscription. You will find additional details about joblib mitigation of oversubscription in `joblib documentation diff --git a/sklearn/neighbors/_base.py b/sklearn/neighbors/_base.py index 99506bd2fe1ed..f54585d8f41de 100644 --- a/sklearn/neighbors/_base.py +++ b/sklearn/neighbors/_base.py @@ -842,7 +842,8 @@ class from an array representing our data set and ask who's # Joblib-based backend, which is used when user-defined callables # are passed for metric. - # This won't be used in the future once PairwiseDistancesReductions support: + # This won't be used in the future once PairwiseDistancesReductions + # supports: # - DistanceMetrics which work on supposedly binary data # - CSR-dense and dense-CSR case if 'euclidean' in metric. reduce_func = partial( @@ -1179,7 +1180,8 @@ class from an array representing our data set and ask who's # Joblib-based backend, which is used when user-defined callables # are passed for metric. - # This won't be used in the future once PairwiseDistancesReductions support: + # This won't be used in the future once PairwiseDistancesReductions + # supports: # - DistanceMetrics which work on supposedly binary data # - CSR-dense and dense-CSR case if 'euclidean' in metric. From 118bdf3a779c5dd5c5635c0e6fe252645b8eab14 Mon Sep 17 00:00:00 2001 From: Julien Jerphanion Date: Wed, 23 Nov 2022 10:22:32 +0100 Subject: [PATCH 7/8] DOC Trim bullet point list and add a paragraph of explanations Co-authored-by: Olivier Grisel --- doc/computing/parallelism.rst | 13 ++++++++----- sklearn/neighbors/_base.py | 4 ++-- 2 files changed, 10 insertions(+), 7 deletions(-) diff --git a/doc/computing/parallelism.rst b/doc/computing/parallelism.rst index f34b675ad4aa3..b32d997070568 100644 --- a/doc/computing/parallelism.rst +++ b/doc/computing/parallelism.rst @@ -17,15 +17,18 @@ Depending on the type of estimator and sometimes the values of the constructor parameters, this is either done: - with higher-level parallelism via `joblib `_. - In this case, the number of threads or processes can be controlled with the - ``n_jobs`` parameter. - with lower-level parallelism via OpenMP, used in C or Cython code. - In this case, parallelism is always done using threads and specifying - ``n_jobs`` *has no effect*. Implementations relying on this parallelism are generally - more performant than joblib-based implementations by up to two orders of magnitude. - with lower-level parallelism via BLAS, used by NumPy and SciPy for generic operations on arrays. +The `n_jobs` parameter of estimators always controls the amount of parallelism +managed by joblib (processes or threads depending on the joblib backend). +The thread-level parallelism managed by OpenMP in scikit-learn's own Cython code +or by BLAS & LAPACK libraries used by NumPy and SciPy operations in scikit-learn +is always controlled by environment variables or `threadpoolctl` as explained below. +Note that some estimators can leverage all three kinds of parallelism at different +points of their training and prediction methods. We describe these 3 types of parallelism in the following subsections in more details. 
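+For instance, the following sketch shows how the two kinds of control can be
+combined programmatically; the estimator, the dataset and the thread limits
+chosen here are purely illustrative::
+
+    from threadpoolctl import threadpool_limits
+
+    from sklearn.datasets import make_classification
+    from sklearn.ensemble import RandomForestClassifier
+
+    X, y = make_classification(n_samples=1000, random_state=0)
+
+    # joblib-based parallelism is requested through the ``n_jobs`` parameter.
+    clf = RandomForestClassifier(n_jobs=2, random_state=0)
+
+    # OpenMP and BLAS thread-level parallelism can be capped with
+    # threadpoolctl instead of environment variables.
+    with threadpool_limits(limits=2):
+        clf.fit(X, y)
+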
Higher-level parallelism with joblib diff --git a/sklearn/neighbors/_base.py b/sklearn/neighbors/_base.py index f54585d8f41de..3b01824a3a73a 100644 --- a/sklearn/neighbors/_base.py +++ b/sklearn/neighbors/_base.py @@ -843,7 +843,7 @@ class from an array representing our data set and ask who's # are passed for metric. # This won't be used in the future once PairwiseDistancesReductions - # supports: + # support: # - DistanceMetrics which work on supposedly binary data # - CSR-dense and dense-CSR case if 'euclidean' in metric. reduce_func = partial( @@ -1181,7 +1181,7 @@ class from an array representing our data set and ask who's # are passed for metric. # This won't be used in the future once PairwiseDistancesReductions - # supports: + # support: # - DistanceMetrics which work on supposedly binary data # - CSR-dense and dense-CSR case if 'euclidean' in metric. From ede91d210b1fd7bfd0d388069297e3f62c6ed2ea Mon Sep 17 00:00:00 2001 From: Julien Jerphanion Date: Wed, 23 Nov 2022 10:39:28 +0100 Subject: [PATCH 8/8] DOC Document `sklearn.{set_config,config_context}` and SKLEARN_PAIRWISE_DIST_CHUNK_SIZE --- doc/computing/parallelism.rst | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/doc/computing/parallelism.rst b/doc/computing/parallelism.rst index b32d997070568..97e3e2866083f 100644 --- a/doc/computing/parallelism.rst +++ b/doc/computing/parallelism.rst @@ -189,15 +189,16 @@ in `this document from Thomas J. Fan