From 91411ea990bcac65d187054f7e0fef8895bef413 Mon Sep 17 00:00:00 2001 From: Julien Jerphanion Date: Fri, 18 Nov 2022 15:06:49 +0100 Subject: [PATCH 1/8] DOC Update "Parallelism, resource management, and configuration" section --- doc/computing/parallelism.rst | 112 +++++++++++++++++----------------- 1 file changed, 56 insertions(+), 56 deletions(-) diff --git a/doc/computing/parallelism.rst b/doc/computing/parallelism.rst index 382fa8938b5ca..bd72e6fee823a 100644 --- a/doc/computing/parallelism.rst +++ b/doc/computing/parallelism.rst @@ -10,22 +10,25 @@ Parallelism, resource management, and configuration Parallelism ----------- -Some scikit-learn estimators and utilities can parallelize costly operations -using multiple CPU cores, thanks to the following components: +Some scikit-learn estimators and utilities parallelize costly operations +using multiple CPU cores. -- via the `joblib `_ library. In - this case the number of threads or processes can be controlled with the - ``n_jobs`` parameter. -- via OpenMP, used in C or Cython code. +This can be either done: -In addition, some of the numpy routines that are used internally by -scikit-learn may also be parallelized if numpy is installed with specific -numerical libraries such as MKL, OpenBLAS, or BLIS. +- with higher-level parallelism via `joblib `_. + In this case, the number of threads or processes can be controlled with the + ``n_jobs`` parameter. +- with lower-level parallelism via OpenMP, used in C or Cython code. + In this case, parallelism is always done using threads and specifying + ``n_jobs`` *has no effect*. Implementations relying on this parallelism are generally + more performant by several orders of magnitude. +- with lower-level parallelism via BLAS, used by NumPy and SciPy for generic operations + on arrays. -We describe these 3 scenarios in the following subsections. +We describe these 3 types of parallelism in the following subsections in more details. -Joblib-based parallelism -........................ +Higher-level parallelism with joblib +.................................... When the underlying implementation uses joblib, the number of workers (threads or processes) that are spawned in parallel can be controlled via the @@ -33,15 +36,16 @@ When the underlying implementation uses joblib, the number of workers .. note:: - Where (and how) parallelization happens in the estimators is currently - poorly documented. Please help us by improving our docs and tackle `issue - 14228 `_! + Where (and how) parallelization happens in the estimators using joblib by + specifying `n_jobs` is currently poorly documented. + Please help us by improving our docs and tackle `issue 14228 + `_! Joblib is able to support both multi-processing and multi-threading. Whether joblib chooses to spawn a thread or a process depends on the **backend** that it's using. -Scikit-learn generally relies on the ``loky`` backend, which is joblib's +scikit-learn generally relies on the ``loky`` backend, which is joblib's default backend. Loky is a multi-processing backend. When doing multi-processing, in order to avoid duplicating the memory in each process (which isn't reasonable with big datasets), joblib will create a `memmap @@ -70,40 +74,46 @@ that increasing the number of workers is always a good thing. In some cases it can be highly detrimental to performance to run multiple copies of some estimators or functions in parallel (see oversubscription below). -OpenMP-based parallelism -........................ 
+Lower-level parallelism with OpenMP +................................... OpenMP is used to parallelize code written in Cython or C, relying on -multi-threading exclusively. By default (and unless joblib is trying to -avoid oversubscription), the implementation will use as many threads as -possible. +multi-threading exclusively. By default, the implementations using OpenMP +will use as many threads as possible, i.e. as many threads as logical cores. -You can control the exact number of threads that are used via the -``OMP_NUM_THREADS`` environment variable: +You can control the exact number of threads that are used either: -.. prompt:: bash $ + - via the ``OMP_NUM_THREADS`` environment variable, e.g. for instance when: + running a python script: + + .. prompt:: bash $ - OMP_NUM_THREADS=4 python my_script.py + OMP_NUM_THREADS=4 python my_script.py -Parallel Numpy routines from numerical libraries -................................................ + - or via `threadpoolctl` as explained by `this piece of documentation + `_. -Scikit-learn relies heavily on NumPy and SciPy, which internally call -multi-threaded linear algebra routines implemented in libraries such as MKL, -OpenBLAS or BLIS. +Parallel NumPy and SciPy routines from numerical libraries +.......................................................... -The number of threads used by the OpenBLAS, MKL or BLIS libraries can be set -via the ``MKL_NUM_THREADS``, ``OPENBLAS_NUM_THREADS``, and -``BLIS_NUM_THREADS`` environment variables. +scikit-learn relies heavily on NumPy and SciPy, which internally call +multi-threaded linear algebra routines of BLAS implemented in libraries +such as MKL, OpenBLAS or BLIS. -Please note that scikit-learn has no direct control over these -implementations. Scikit-learn solely relies on Numpy and Scipy. +You can control the exact number of threads used by BLAS for each library +using environment variables, namely: + - ``MKL_NUM_THREADS`` sets the number of threads MKL uses, + - ``OPENBLAS_NUM_THREADS`` sets the number of threads OpenBLAS uses + - ``BLIS_NUM_THREADS`` sets the number of threads BLIS uses .. note:: - At the time of writing (2019), NumPy and SciPy packages distributed on - pypi.org (used by ``pip``) and on the conda-forge channel are linked - with OpenBLAS, while conda packages shipped on the "defaults" channel - from anaconda.org are linked by default with MKL. + At the time of writing (2022), NumPy and SciPy packages which are + distributed on pypi.org (i.e. the ones installed via ``pip install``) + and on the conda-forge channel (i.e. the ones installed via + ``conda install --channel conda-forge``) are linked with OpenBLAS, while + NumPy and SciPy packages shipped on the ``defaults`` conda + channel from Anaconda.org (i.e. the ones installed via ``conda install``) + are linked by default with MKL. Oversubscription: spawning too many threads @@ -120,8 +130,8 @@ with ``n_jobs=8`` over a OpenMP). Each instance of :class:`~sklearn.ensemble.HistGradientBoostingClassifier` will spawn 8 threads (since you have 8 CPUs). That's a total of ``8 * 8 = 64`` threads, which -leads to oversubscription of physical CPU resources and to scheduling -overhead. +leads to oversubscription of threads for physical CPU resources and thus +to scheduling overhead. Oversubscription can arise in the exact same fashion with parallelized routines from MKL, OpenBLAS or BLIS that are nested in joblib calls. @@ -146,14 +156,15 @@ Note that: only use ``_NUM_THREADS``. 
Joblib exposes a context manager for finer control over the number of threads in its workers (see joblib docs linked below). -- Joblib is currently unable to avoid oversubscription in a - multi-threading context. It can only do so with the ``loky`` backend - (which spawns processes). +- `threadpoolctl` internally manages the numbers of threads used by OpenMP + and BLAS implementations for scikit-learn implementations. You will find additional details about joblib mitigation of oversubscription in `joblib documentation `_. +You will find additional details about parallelism in numerical python libraries +in `this document from Thomas J. Fan `_. Configuration switches ----------------------- Python runtime .............. -:func:`sklearn.set_config` controls the following behaviors: - -`assume_finite` -~~~~~~~~~~~~~~~ - -Used to skip validation, which enables faster computations but may lead to -segmentation faults if the data contains NaNs. - -`working_memory` -~~~~~~~~~~~~~~~~ - -The optimal size of temporary arrays used by some algorithms. +:func:`sklearn.set_config` controls the following behaviors. .. _environment_variable: From 06b6340f5e21baef45edaa93af9b1c32b6dae5a1 Mon Sep 17 00:00:00 2001 From: Julien Jerphanion Date: Mon, 21 Nov 2022 10:49:47 +0100 Subject: [PATCH 2/8] STY Break line for bullet point list --- doc/computing/parallelism.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/doc/computing/parallelism.rst b/doc/computing/parallelism.rst index bd72e6fee823a..541b03d0898f1 100644 --- a/doc/computing/parallelism.rst +++ b/doc/computing/parallelism.rst @@ -102,6 +102,7 @@ such as MKL, OpenBLAS or BLIS. You can control the exact number of threads used by BLAS for each library using environment variables, namely: + - ``MKL_NUM_THREADS`` sets the number of threads MKL uses, - ``OPENBLAS_NUM_THREADS`` sets the number of threads OpenBLAS uses - ``BLIS_NUM_THREADS`` sets the number of threads BLIS uses From 617b18a59c9f59a50accf41b8537cba4659bc0a7 Mon Sep 17 00:00:00 2001 From: Julien Jerphanion Date: Mon, 21 Nov 2022 11:44:11 +0100 Subject: [PATCH 3/8] DOC Fix duplication Co-authored-by: Tim Head --- doc/computing/parallelism.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/computing/parallelism.rst b/doc/computing/parallelism.rst index 541b03d0898f1..8c9d5756f640e 100644 --- a/doc/computing/parallelism.rst +++ b/doc/computing/parallelism.rst @@ -83,7 +83,7 @@ will use as many threads as possible, i.e. as many threads as logical cores. You can control the exact number of threads that are used either: - - via the ``OMP_NUM_THREADS`` environment variable, e.g. for instance when: + - via the ``OMP_NUM_THREADS`` environment variable, for instance when running a python script: .. 
prompt:: bash $ From 2549650b9a10d4a285f248bc90284dd683ca9999 Mon Sep 17 00:00:00 2001 From: Julien Jerphanion Date: Tue, 22 Nov 2022 11:57:18 +0100 Subject: [PATCH 4/8] DOC Update comment to mention the joblib-based backends' use --- sklearn/metrics/pairwise.py | 18 ++++++++++++------ sklearn/neighbors/_base.py | 16 +++++++++++----- 2 files changed, 23 insertions(+), 11 deletions(-) diff --git a/sklearn/metrics/pairwise.py b/sklearn/metrics/pairwise.py index e6abd596b0000..1ccff8ae8c8b7 100644 --- a/sklearn/metrics/pairwise.py +++ b/sklearn/metrics/pairwise.py @@ -688,9 +688,12 @@ def pairwise_distances_argmin_min( values = values.flatten() indices = indices.flatten() else: - # TODO: once BaseDistanceReductionDispatcher supports distance metrics - # for boolean datasets, we won't need to fallback to - # pairwise_distances_chunked anymore. + # Joblib-based backend, which is used when user-defined callables + # are passed for metric. + + # This won't be used in the future once PairwiseDistancesReductions support: + # - DistanceMetrics which work on supposedly binary data + # - CSR-dense and dense-CSR case if 'euclidean' in metric. # Turn off check for finiteness because this is costly and because arrays # have already been validated. @@ -800,9 +803,12 @@ def pairwise_distances_argmin(X, Y, *, axis=1, metric="euclidean", metric_kwargs ) indices = indices.flatten() else: - # TODO: once BaseDistanceReductionDispatcher supports distance metrics - # for boolean datasets, we won't need to fallback to - # pairwise_distances_chunked anymore. + # Joblib-based backend, which is used when user-defined callables + # are passed for metric. + + # This won't be used in the future once PairwiseDistancesReductions support: + # - DistanceMetrics which work on supposedly binary data + # - CSR-dense and dense-CSR case if 'euclidean' in metric. # Turn off check for finiteness because this is costly and because arrays # have already been validated. diff --git a/sklearn/neighbors/_base.py b/sklearn/neighbors/_base.py index 07b4e4d5996ff..99506bd2fe1ed 100644 --- a/sklearn/neighbors/_base.py +++ b/sklearn/neighbors/_base.py @@ -839,9 +839,12 @@ class from an array representing our data set and ask who's ) elif self._fit_method == "brute": - # TODO: should no longer be needed once ArgKmin - # is extended to accept sparse and/or float32 inputs. + # Joblib-based backend, which is used when user-defined callables + # are passed for metric. + # This won't be used in the future once PairwiseDistancesReductions support: + # - DistanceMetrics which work on supposedly binary data + # - CSR-dense and dense-CSR case if 'euclidean' in metric. reduce_func = partial( self._kneighbors_reduce_func, n_neighbors=n_neighbors, @@ -1173,9 +1176,12 @@ class from an array representing our data set and ask who's ) elif self._fit_method == "brute": - # TODO: should no longer be needed once we have Cython-optimized - # implementation for radius queries, with support for sparse and/or - # float32 inputs. + # Joblib-based backend, which is used when user-defined callables + # are passed for metric. + + # This won't be used in the future once PairwiseDistancesReductions support: + # - DistanceMetrics which work on supposedly binary data + # - CSR-dense and dense-CSR case if 'euclidean' in metric. 
# for efficiency, use squared euclidean distances if self.effective_metric_ == "euclidean": From 5de3c0aba058e29ebfc89c42b13c63bf8303c0b8 Mon Sep 17 00:00:00 2001 From: Julien Jerphanion Date: Tue, 22 Nov 2022 12:00:32 +0100 Subject: [PATCH 5/8] DOC Be more precise --- doc/computing/parallelism.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/computing/parallelism.rst b/doc/computing/parallelism.rst index 8c9d5756f640e..dd19eac47bd88 100644 --- a/doc/computing/parallelism.rst +++ b/doc/computing/parallelism.rst @@ -21,7 +21,7 @@ This can be either done: - with lower-level parallelism via OpenMP, used in C or Cython code. In this case, parallelism is always done using threads and specifying ``n_jobs`` *has no effect*. Implementations relying on this parallelism are generally - more performant by several orders of magnitude. + more performant than joblib-based implementations by up to two orders of magnitude. - with lower-level parallelism via BLAS, used by NumPy and SciPy for generic operations on arrays. From 35916a2c96bc1ac4265e9aa83423f81089f10e52 Mon Sep 17 00:00:00 2001 From: Julien Jerphanion Date: Wed, 23 Nov 2022 09:40:58 +0100 Subject: [PATCH 6/8] DOC Make the instruction more explicit Co-authored-by: Olivier Grisel --- doc/computing/parallelism.rst | 24 ++++++++++++++++++++---- sklearn/neighbors/_base.py | 6 ++++-- 2 files changed, 24 insertions(+), 6 deletions(-) diff --git a/doc/computing/parallelism.rst b/doc/computing/parallelism.rst index dd19eac47bd88..f34b675ad4aa3 100644 --- a/doc/computing/parallelism.rst +++ b/doc/computing/parallelism.rst @@ -13,7 +13,8 @@ Parallelism Some scikit-learn estimators and utilities parallelize costly operations using multiple CPU cores. -This can be either done: +Depending on the type of estimator and sometimes the values of the +constructor parameters, this is either done: - with higher-level parallelism via `joblib `_. In this case, the number of threads or processes can be controlled with the @@ -97,7 +98,7 @@ Parallel NumPy and SciPy routines from numerical libraries .......................................................... scikit-learn relies heavily on NumPy and SciPy, which internally call -multi-threaded linear algebra routines of BLAS implemented in libraries +multi-threaded linear algebra routines (BLAS & LAPACK) implemented in libraries such as MKL, OpenBLAS or BLIS. You can control the exact number of threads used by BLAS for each library @@ -107,6 +108,16 @@ using environment variables, namely: - ``OPENBLAS_NUM_THREADS`` sets the number of threads OpenBLAS uses - ``BLIS_NUM_THREADS`` sets the number of threads BLIS uses +Note that BLAS & LAPACK implementations can also be impacted by +`OMP_NUM_THREADS`. To check whether this is the case in your environment, +you can inspect how the number of threads effectively used by those libraries +is affected when running the the following command in a bash or zsh terminal +for different values of `OMP_NUM_THREADS`:: + +.. prompt:: bash $ + + OMP_NUM_THREADS=2 python -m threadpoolctl -i numpy scipy + .. note:: At the time of writing (2022), NumPy and SciPy packages which are distributed on pypi.org (i.e. the ones installed via ``pip install``) @@ -157,8 +168,13 @@ Note that: only use ``_NUM_THREADS``. Joblib exposes a context manager for finer control over the number of threads in its workers (see joblib docs linked below). -- `threadpoolctl` internally manages the numbers of threads used by OpenMP - and BLAS implementations for scikit-learn implementations. 
+- When joblib is configured to use the ``threading`` backend, there is no + mechanism to avoid oversubscription when calling into parallel native + libraries in the joblib-managed threads. +- All scikit-learn estimators that explicitly rely on OpenMP in their Cython code + always use `threadpoolctl` internally to automatically adapt the numbers of + threads used by OpenMP and potentially nested BLAS calls so as to avoid + oversubscription. You will find additional details about joblib mitigation of oversubscription in `joblib documentation diff --git a/sklearn/neighbors/_base.py b/sklearn/neighbors/_base.py index 99506bd2fe1ed..f54585d8f41de 100644 --- a/sklearn/neighbors/_base.py +++ b/sklearn/neighbors/_base.py @@ -842,7 +842,8 @@ class from an array representing our data set and ask who's # Joblib-based backend, which is used when user-defined callables # are passed for metric. - # This won't be used in the future once PairwiseDistancesReductions support: + # This won't be used in the future once PairwiseDistancesReductions + # supports: # - DistanceMetrics which work on supposedly binary data # - CSR-dense and dense-CSR case if 'euclidean' in metric. reduce_func = partial( @@ -1179,7 +1180,8 @@ class from an array representing our data set and ask who's # Joblib-based backend, which is used when user-defined callables # are passed for metric. - # This won't be used in the future once PairwiseDistancesReductions support: + # This won't be used in the future once PairwiseDistancesReductions + # supports: # - DistanceMetrics which work on supposedly binary data # - CSR-dense and dense-CSR case if 'euclidean' in metric. From 118bdf3a779c5dd5c5635c0e6fe252645b8eab14 Mon Sep 17 00:00:00 2001 From: Julien Jerphanion Date: Wed, 23 Nov 2022 10:22:32 +0100 Subject: [PATCH 7/8] DOC Trim bullet point list and add a paragraph of explanations Co-authored-by: Olivier Grisel --- doc/computing/parallelism.rst | 13 ++++++++----- sklearn/neighbors/_base.py | 4 ++-- 2 files changed, 10 insertions(+), 7 deletions(-) diff --git a/doc/computing/parallelism.rst b/doc/computing/parallelism.rst index f34b675ad4aa3..b32d997070568 100644 --- a/doc/computing/parallelism.rst +++ b/doc/computing/parallelism.rst @@ -17,15 +17,18 @@ Depending on the type of estimator and sometimes the values of the constructor parameters, this is either done: - with higher-level parallelism via `joblib `_. - In this case, the number of threads or processes can be controlled with the - ``n_jobs`` parameter. - with lower-level parallelism via OpenMP, used in C or Cython code. - In this case, parallelism is always done using threads and specifying - ``n_jobs`` *has no effect*. Implementations relying on this parallelism are generally - more performant than joblib-based implementations by up to two orders of magnitude. - with lower-level parallelism via BLAS, used by NumPy and SciPy for generic operations on arrays. +The `n_jobs` parameter of estimators always controls the amount of parallelism +managed by joblib (processes or threads depending on the joblib backend). +The thread-level parallelism managed by OpenMP in scikit-learn's own Cython code +or by BLAS & LAPACK libraries used by NumPy and SciPy operations in scikit-learn +is always controlled by environment variables or `threadpoolctl` as explained below. +Note that some estimators can leverage all three kinds of parallelism at different +points of their training and prediction methods. We describe these 3 types of parallelism in the following subsections in more details. 
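+For instance, the following sketch shows how the two kinds of control can be
+combined programmatically; the estimator, the dataset and the thread limits
+chosen here are purely illustrative::
+
+    from threadpoolctl import threadpool_limits
+
+    from sklearn.datasets import make_classification
+    from sklearn.ensemble import RandomForestClassifier
+
+    X, y = make_classification(n_samples=1000, random_state=0)
+
+    # joblib-based parallelism is requested through the ``n_jobs`` parameter.
+    clf = RandomForestClassifier(n_jobs=2, random_state=0)
+
+    # OpenMP and BLAS thread-level parallelism can be capped with
+    # threadpoolctl instead of environment variables.
+    with threadpool_limits(limits=2):
+        clf.fit(X, y)
+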
Higher-level parallelism with joblib diff --git a/sklearn/neighbors/_base.py b/sklearn/neighbors/_base.py index f54585d8f41de..3b01824a3a73a 100644 --- a/sklearn/neighbors/_base.py +++ b/sklearn/neighbors/_base.py @@ -843,7 +843,7 @@ class from an array representing our data set and ask who's # are passed for metric. # This won't be used in the future once PairwiseDistancesReductions - # supports: + # support: # - DistanceMetrics which work on supposedly binary data # - CSR-dense and dense-CSR case if 'euclidean' in metric. reduce_func = partial( @@ -1181,7 +1181,7 @@ class from an array representing our data set and ask who's # are passed for metric. # This won't be used in the future once PairwiseDistancesReductions - # supports: + # support: # - DistanceMetrics which work on supposedly binary data # - CSR-dense and dense-CSR case if 'euclidean' in metric. From ede91d210b1fd7bfd0d388069297e3f62c6ed2ea Mon Sep 17 00:00:00 2001 From: Julien Jerphanion Date: Wed, 23 Nov 2022 10:39:28 +0100 Subject: [PATCH 8/8] DOC Document `sklearn.{set_config,config_context}` and SKLEARN_PAIRWISE_DIST_CHUNK_SIZE --- doc/computing/parallelism.rst | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/doc/computing/parallelism.rst b/doc/computing/parallelism.rst index b32d997070568..97e3e2866083f 100644 --- a/doc/computing/parallelism.rst +++ b/doc/computing/parallelism.rst @@ -189,15 +189,16 @@ in `this document from Thomas J. Fan