MNT New helper for effective number of OpenMP threads #14196

Conversation
@NicolasHug you might also be interested in this.
(Oops sorry, wrong button!)

I'm confused by the fact that the default (None) means "use as many threads as possible" when the parameter relates to OpenMP, but means "only use 1 thread (unless a backend is specified...)" when it relates to joblib.

I'm also not sure whether it's right to use the same parameter name "n_jobs" for both OpenMP and joblib. Why not have something like n_omp_threads?
The idea is that we let the BLAS use as many threads as possible, so why not also do it for our low-level multi-threaded parts (by default).
It would be less user friendly to have estimators with n_jobs and estimators with n_omp_threads. However, it would be confusing that n_jobs=None does not have the same behavior for all estimators.
I feel the opposite: it would give them more control, and they would know what n_jobs refers to. Hiding too much of the logic can also be detrimental to ease of use, sometimes.

If we support both n_jobs and n_omp_threads, the default for n_omp_threads could be 'auto', which would mean "use as many threads as possible, avoiding over-subscription".
The drawback of using multiple kw arguments is that it makes it less
consistent to use and to discover in the user API.
I would rather see some good documentation about the underlying parallelism
for each docstring with `n_jobs` to explain the default behavior.
Also, we discussed with @ogrisel about changing the default behavior for
`n_jobs=None` with the threading backend to the maximal number of threads,
which would make it easier to explain (thread based parallelism vs process
based parallelism behavior).
I agree changing `n_jobs` to something like `n_workers` would be
semantically better (even for joblib) but I guess it is very complicated.
On Wed, 26 Jun 2019 at 19:07, Nicolas Hug <notifications@github.com> wrote:
… It would be less user friendly to have estimators with n_jobs and
estimators with n_omp_threads
I feel the opposite: it would give them more control, and they would know
what n_jobs refers to. Hiding too much of the logic can also be
detrimental to ease of use, sometimes.
If we support both n_jobs and n_omp_threads, the default for n_omp_threads
could be 'auto', which would mean "use as many threads as possible,
avoiding over-subscription". E.g. with 8 cores, a GridSearch with n_jobs=4
would result in n_omp_threads set to 2 for the underlying estimator. If a
user sets n_omp_threads to an int, then they might risk over/under
subscription (but that's their problem).
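The over-subscription arithmetic in the quoted example (8 cores, a GridSearch with n_jobs=4 leaving 2 OpenMP threads per worker) can be sketched as follows. The helper name is hypothetical; this only illustrates the proposed 'auto' behavior, not an actual scikit-learn function:

```python
def omp_threads_per_worker(n_cpus, n_joblib_workers):
    """Split available CPUs across joblib workers so that each worker's
    OpenMP thread pool does not oversubscribe the machine.

    Hypothetical helper illustrating the 'auto' idea discussed above.
    """
    # Each joblib worker gets an equal share of the CPUs,
    # but always at least one thread.
    return max(1, n_cpus // n_joblib_workers)


# 8 cores, GridSearch with n_jobs=4 -> 2 OpenMP threads per estimator fit
print(omp_threads_per_worker(8, 4))  # 2
# More workers than CPUs: never go below 1 thread
print(omp_threads_per_worker(2, 4))  # 1
```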
If this change happens in joblib, it would definitely make sense to have the same behavior for OpenMP-based parallelism.
But the default backend is loky, right? So for …
There are a few places where we currently use …
Also, there are places where we now use the loky multiprocessing backend where we might want to switch to the threading joblib backend or OpenMP in the future. It would be a bit awkward if we have to deprecate …

The way parallelism is done for an estimator is an implementation detail that should be mostly transparent to the user.
I don't disagree. But then the default should be consistent.
But it is hard to decide which rule you should be consistent with:

1 - By default, C-level threadpools use all cores, so for thread-based parallelism we could use the same …

The rationale behind following rule 1 is that using all threads tends to always improve performance, so we should use it (as it is done in …
This is currently blocking us from adding OpenMP code and I think we should merge it; it's just a private function. There is still some time to decide what the public API should look like. Is it still a WIP @jeremiedbb ? For OpenMP maybe it could make sense to not even expose …
The WIP tag was until there's a consensus :)
I agree
I'm in favor of exposing …

However this can be decided later, outside of this PR; maybe we should open a dedicated issue so that others can give their opinion.
For me a WIP tag means that the author hasn't finished the work or is planning to add more things. So if you are asking for feedback you should remove it, IMO.
+1 for opening an issue about it listing the possible options.
Since it's only a private helper. cf the discussion on the public API in #14265 (comment)
sklearn/utils/openmp_helpers.pyx (Outdated)

```
"""Determine the effective number of threads used for parallel OpenMP calls

- For ``n_threads = None``, returns the minimum between
  ``openmp.omp_get_max_threads()`` and ``joblib.effective_n_jobs(-1)``.
```
I'm not sure I understand the purpose of also checking joblib.effective_n_jobs(-1)? Is this for cases where you want to run code in a joblib context manager?
It's because effective_n_jobs is more clever than cpu_count or omp_get_max_threads in some cases. I don't recall in which situations, but @tomMoral should know better.
Ok, would be nice to have a comment explaining this then.
The loky implementation of cpu_count better handles containers and CPU affinity (see its docstring for a detailed description). However, it might be better to call joblib.cpu_count here instead of effective_n_jobs, as it also calls loky.cpu_count under the hood.
Reading the docstring, setting LOKY_MAX_CPU_COUNT will affect the output of _openmp_effective_n_threads(). I don't think that makes sense?
Ok, maybe I'm overthinking it. I'm interested in @ogrisel's opinion on this.
What I wanted to do is just a conservative way to find the max number of cores to use. I think that omp_get_num_procs could overestimate it in some situations (e.g. containers, although I'm not sure about that). A more robust method is joblib's effective_n_jobs.

I think you should not see this as controlling the number of threads through joblib, but rather as using joblib to gather more information in order to set the most appropriate number of threads.
On the other hand, I agree that the fact that it depends on the backend is a bit brittle. Tom's proposal is better.
@ogrisel what's your take on this?
I like the fact that we do not use 32 OpenMP threads by default when running in a docker container that can only run with a 2000 mCPU quota in the cgroups. This is typically what we have on CI systems such as Travis or Circle: we run in docker containers with a CPU quota, but openmp.omp_get_max_threads() would naively return 32, leading to severe oversubscription that causes a significant slowdown of the computation.
@NicolasHug so you would rather duplicate the code from loky just to remove the dependency on the LOKY_MAX_CPU_COUNT env variable? We could do that but that feels a bit overkill. I don't expect many people to mess around with LOKY_MAX_CPU_COUNT.
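For reference, the cgroups-aware CPU count discussed above (what loky's cpu_count adds over a naive count) essentially boils down to reading the CFS quota. A simplified, testable model of the logic, assuming cgroup v1 where the quota lives in /sys/fs/cgroup/cpu/cpu.cfs_quota_us and cpu.cfs_period_us (values are passed as arguments here rather than read from the filesystem):

```python
def cgroup_aware_cpu_count(os_cpu_count, cfs_quota_us=-1, cfs_period_us=100000):
    """Simplified model of a cgroups-aware CPU count (loky-style sketch).

    os_cpu_count: what os.cpu_count() reports (e.g. 32 on a big CI host).
    cfs_quota_us: CPU time allowed per accounting period; -1 means no quota.
    cfs_period_us: length of the accounting period in microseconds.
    """
    if cfs_quota_us > 0 and cfs_period_us > 0:
        # A 2000 mCPU quota means quota/period = 2 -> use 2 threads, not 32.
        return max(1, min(os_cpu_count, cfs_quota_us // cfs_period_us))
    # No quota set: fall back to the host CPU count.
    return os_cpu_count


print(cgroup_aware_cpu_count(32, cfs_quota_us=200000))  # container with a 2-CPU quota -> 2
print(cgroup_aware_cpu_count(32))                       # bare metal, no quota -> 32
```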
Implementation lgtm.
I am completely OK with that. The problem I have is with
You can't say that and then claim that …
The point is: does it make sense to cap …
I am in favor of setting this limit because I have not seen a situation where oversubscription helped, and it can lead to catastrophic behaviors. Also, I think the documentation issue should not influence the choice.
Can you expand a bit please? That seems contradictory with what we say in #15116:
I strongly disagree here. If we can't explain it properly, users won't understand it. There's no point in providing software that users cannot understand. And I'm not even talking about the maintenance cost that comes with complex designs.
In my understanding, for MKL and OpenBLAS, the environment variables are capped by the number of CPUs. With the following script, it should not use 100 CPUs but the maximal number of CPUs available on your machine:
but @jeremiedbb knows better than me on this matter. This means that we would copy the behavior from …
I agree that it is necessary to be able to explain our API choice, but it seems a bit much to say that we won't be able to explain that, overall, the maximal number of threads is capped by the number of CPUs.
Thanks @NicolasHug for clarifying your concern. We weren't arguing on the same thing the whole time :)
That's right.

If you set …

If you set …
The fact that we use a joblib function in this helper does not mean it has an interaction with joblib parallelism tools. To bound by the number of CPUs we need a way to find that number. Moreover, we want a method clever enough for docker-like situations. It happens that joblib already has one, so we're better off using it instead of re-implementing it ourselves. So about the documentation, we only need to say that …
Sorry for the confusion.

Capping …

But capping …

This is also @rth's concern, as far as I can tell:
Regarding this
I was referring here to having special cases for handling prange-based estimators, instead of using the uniform way advertised in #15116. But that's another issue which is independent of the one above. I'm OK with discussing this one when/if the time comes.
Could you clarify the difference? It seems to be the same to me. If you have a virtual machine like docker, the config of the virtual machine gives you a number of CPUs given by …
This is the difference between using 32 cores instead of the 2 that are actually available in e.g. a CI, as detailed in #14196 (comment).

Basically: capping to 32 makes sense. But we should not cap to 2 when the user sets …
They may be, but I'm not sure it's part of the OpenMP spec. For instance, libgomp does not seem to limit the number of threads: https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fTHREAD_005fLIMIT.html#OMP_005fTHREAD_005fLIMIT

So doing something like

```python
if os.environ.get("OMP_NUM_THREADS"):
    # fall back to user provided number of threads
    max_n_threads = openmp.omp_get_max_threads()
else:
    max_n_threads = min(openmp.omp_get_max_threads(), cpu_count())
```

would be less disruptive I think. Whether there are cases where it would be useful is beside the point; rather, we should avoid silent and surprising behavior. Personally I don't read the docs for BLIS, OpenBLAS and MKL in detail, but I know that setting …

Of course this sidesteps the fact that each of those libraries also has its own env variable to control this, but well..
I don't see the difference between …

In a docker image with 2 CPUs, the fact that OpenMP sees 32 CPUs is not a feature but a defect.
Indeed it does not. It's the whole purpose of this helper.
What if I did not set the env var, but I changed the number of threads with
Yes, but it will be capped to the number of CPUs (or physical cores for MKL). And it's preferred to use MKL_NUM_THREADS (or equivalent).
Now that we write code with OpenMP, we are one of those libraries, just like MKL. So we need to implement some custom function. If you want, we can have a SKLEARN_OMP_NUM_THREADS env var which has priority over OMP_NUM_THREADS, just like MKL. If a user sets it, it will be the number of OpenMP threads no matter the number of CPUs (not like MKL). But please show me an example where it's really worth using more threads than CPUs with scikit-learn.
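The precedence proposed in the comment above could be sketched like this. Note that SKLEARN_OMP_NUM_THREADS is only a hypothetical variable suggested in this discussion (it is not an existing scikit-learn setting), and the function is an illustrative model, not real library code:

```python
def resolve_max_omp_threads(env, omp_max_threads, n_cpus):
    """Sketch of the proposed precedence for the max OpenMP thread count.

    env: a dict standing in for os.environ.
    omp_max_threads: models openmp.omp_get_max_threads() (which already
        reflects OMP_NUM_THREADS when that variable is set).
    n_cpus: models a cgroups-aware CPU count.
    """
    if "SKLEARN_OMP_NUM_THREADS" in env:
        # Library-specific override: honored even above the number of CPUs.
        return int(env["SKLEARN_OMP_NUM_THREADS"])
    if "OMP_NUM_THREADS" in env:
        # Explicit user request via OpenMP's own variable: trust it as-is.
        return omp_max_threads
    # No explicit request: cap by the CPU count to avoid oversubscription.
    return min(omp_max_threads, n_cpus)


print(resolve_max_omp_threads({"SKLEARN_OMP_NUM_THREADS": "8"}, 32, 2))  # 8
print(resolve_max_omp_threads({}, 32, 2))                                # 2
```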
Probably none. But that's not the point. The point is to respect what users want. We don't want scikit-learn to be an outsider in the way it treats …
One use case could be when either
So by clipping user-provided …

Another use case is simply to evaluate the behavior of the system under over-subscription (e.g. on a multi-tenant or multi-process system). If using 2 threads per CPU core produces a 10x slowdown (related to #15078), it's interesting to know; it probably means that we don't understand something, and it could be worth investigating.

A third use case is when one runs on some VM environment one doesn't fully understand (e.g. a cloud provider) and wants to see what the best number of threads is, so they increase …
Because it takes ~5 seconds to prepend an environment variable to any script (accounting for misspellings) and look at the result. While there are fewer people who know about the existence of …
I mostly use OMP_NUM_THREADS for everything, because I don't really want to think about the backend. What's the advantage of a separate env variable, short of wanting to specify the parallelism level only for those few particular scikit-learn routines?

As a side note, this PR really received way too many comments for the few lines it adds. Maybe in such cases a quick call over a cup of coffee/tea would sort things out faster :)
@NicolasHug I disagree with this point. @rth thanks for the detailed use cases. I agree that giving more flexibility to the user could be a good thing. My concern is that, in what you proposed, only the env var works, not the function.
it does not take more time to do
as of today. Tomorrow billions of people will use threadpoolctl :D
It's not the recommended way. Do it at your own risk! :)
Let's do that.
I'm sorry, but whether XGBoost or LightGBM expose n_jobs is completely irrelevant to the discussion here. I'm a user, I want to benchmark XGBoost, LightGBM and sklearn. I expect …

Yes, I understand that other libraries like MKL will cap to the number of CPUs: 1) whether they should is debatable, 2) it is still a much more reasonable default than what we are about to introduce here, and 3) MKL isn't comparable to sklearn in terms of abstraction level.

Again, I'm not against capping to the number of actually available CPUs: this is a good default to have. But users should be able to bypass that default. That's what a default is for. I don't understand why you are so keen on forbidding users to use what they want to use. On top of the reasons @rth mentioned and mine, there probably are many others that we don't know about yet. We don't know everything.

That's all I have on this.
@NicolasHug you missed one thing in my last comment. I actually agreed that we can let some flexibility. In fact I never really was against that. What I'm really against since the beginning is that it can only be controlled by the OMP_NUM_THREADS env var and not by the omp_set_num_threads function.
It's relevant in the sense that you're saying we should do as they do. But we don't do as they do for some OpenMP-related behavior (e.g. n_jobs), so what is special about this one (the env var)?
Alright, so the consensus is to implement something like #14196 (comment), so that we give more control to the user if the env var is explicitly set? I agree with Jeremie that the env var is not so special compared to some code that would explicitly call omp_set_num_threads, but omp_get_num_threads will not tell us whether this was set explicitly and programmatically, or whether the value stems from the default CPU detection, which is broken with cgroups.

So +1 to not apply the cpu_count() cap when the env var is set.

@jeremiedbb please feel free to update this PR.
sklearn/utils/_openmp_helpers.pyx (Outdated)

```
"""Determine the effective number of threads used for parallel OpenMP calls

- For ``n_threads = None``, returns the minimum between the number of cpus
  and ``openmp.omp_get_max_threads()``.
```
This should be rephrased to be more precise: "the number of cpus taking cgroups quotas into account and openmp.omp_get_max_threads(). cgroups quotas can typically be set by tools such as Docker."

Note that openmp.omp_get_max_threads() already takes the number of CPUs (including SMT) and CPU affinity mask constraints into account by default for all OpenMP implementations as far as I know (I have not checked, though). So there is no need to mention those here.
Small nitpicks, but otherwise LGTM.
sklearn/utils/_openmp_helpers.pyx (Outdated)

```
# Fall back to user provided number of threads making it possible
# to exceed the number of cpus.
# It's however inconsistent with `omp_set_num_threads` which can't
# be used for this purpose.
```
I would rather remove this last sentence. This is a detail that only 1% of the potential readership will understand and will confuse the other 99%.
ok, but I do it unwillingly 😒
sklearn/utils/_openmp_helpers.pyx (Outdated)

```
    return n_threads
ELSE:
    # OpenMP not supported => sequential mode
```
Suggested change:

```diff
- # OpenMP not supported => sequential mode
+ # OpenMP disabled at build-time => sequential mode
```
sklearn/utils/_openmp_helpers.pyx (Outdated)

```
IF SKLEARN_OPENMP_SUPPORTED:
    import os
    cimport openmp
    from . import cpu_count
```
Why not from joblib? It's deprecated in utils
yep, I forgot that...
sklearn/utils/_openmp_helpers.pyx (Outdated)

```
"""Determine the effective number of threads to be used for OpenMP calls

- For ``n_threads = None``, returns the minimum between
  ``openmp.omp_get_max_threads()`` and the number of cpus, taking cgroups
```
Unless OMP_NUM_THREADS is explicitly set, in which case it takes precedence?
sklearn/utils/_openmp_helpers.pyx (Outdated)

```
    Cgroups quotas can typically be set by tools such as Docker.
    The result of ``omp_get_max_threads`` can be influenced by environment
    variable ``OMP_NUM_THREADS`` or at runtime by ``omp_set_num_threads``.

- For ``n_threads > 0``, use this as the maximal number of threads for
```
I would use the same form as above: "for ..., return ...". Same below.
LGTM, thanks @jeremiedbb
Some PRs are waiting for a helper to go from n_jobs to the number of threads to set in prange (which does not accept None or negative values) (see #13264).

This PR proposes an equivalent of joblib.effective_n_jobs but with a different behavior when n_jobs = None: in that case OpenMP can use as many threads as possible. It's done using both omp_get_max_threads and joblib.effective_n_jobs, because loky's effective_n_jobs is more clever than just returning the cpu_count.

Not sure about the name though.

cc @ogrisel @tomMoral
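A minimal pure-Python model of the decision logic debated in this thread may help summarize it. The real helper is a Cython function; the parameter names, the modeling of omp_get_max_threads()/cpu_count as plain arguments, and the treatment of an explicitly-set OMP_NUM_THREADS are illustrative assumptions, not the final API:

```python
def effective_n_threads(n_threads=None, omp_max_threads=4, cpu_count=4,
                        omp_num_threads_set=False):
    """Model of the helper: pick an effective OpenMP thread count.

    omp_max_threads models openmp.omp_get_max_threads(); cpu_count models
    joblib's cgroups-aware CPU count; omp_num_threads_set models whether
    the user explicitly set OMP_NUM_THREADS.
    """
    if n_threads == 0:
        raise ValueError("n_threads = 0 is invalid; use None for the default")
    if omp_num_threads_set:
        # Respect an explicit user request, even beyond the number of CPUs.
        max_n_threads = omp_max_threads
    else:
        # Default: cap by the CPU count to avoid oversubscription
        # (e.g. a 32-thread pool in a 2-CPU container).
        max_n_threads = min(omp_max_threads, cpu_count)
    if n_threads is None:
        return max_n_threads
    return n_threads  # a positive request is taken as-is


print(effective_n_threads(None, omp_max_threads=32, cpu_count=2))  # 2
print(effective_n_threads(None, omp_max_threads=4, cpu_count=2,
                          omp_num_threads_set=True))               # 4
```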