MNT New helper for effective number of OpenMP threads #14196

Conversation
@NicolasHug you might also be interested in this.
(Oops sorry, wrong button!)

I'm confused by the fact that the default (None) means "use as many threads as possible" when the parameter relates to OpenMP, but means "only use 1 thread (unless a backend is specified...)" when it relates to joblib.

I'm also not sure whether it's right to use the same parameter name "n_jobs" for both OpenMP and joblib. Why not have something like n_omp_threads?
The idea is that we let the BLAS use as many threads as possible, so why not also do it for our low-level multi-threaded parts (by default).
It would be less user friendly to have estimators with n_jobs and estimators with n_omp_threads. However, it would be confusing that n_jobs=None does not have the same behavior for all estimators.
I feel the opposite: it would give them more control, and they would know what n_jobs refers to. Hiding too much of the logic can also be detrimental to ease of use, sometimes.

If we support both n_jobs and n_omp_threads, the default for n_omp_threads could be 'auto', which would mean "use as many threads as possible, avoiding over-subscription".
The drawback of using multiple kw arguments is that it makes it less
consistent to use and to discover in the user API.
I would rather see some good documentation about the underlying parallelism
for each docstring with `n_jobs` to explain the default behavior.
Also, we discussed with @ogrisel about changing the default behavior for
`n_jobs=None` with the threading backend to the maximal number of threads,
which would make it easier to explain (thread based parallelism vs process
based parallelism behavior).
I agree changing `n_jobs` to something like `n_workers` would be
semantically better (even for joblib) but I guess it is very complicated.
On Wed, 26 Jun 2019 at 19:07, Nicolas Hug <notifications@github.com> wrote:
… It would be less user friendly to have estimators with n_jobs and
estimators with n_omp_threads
I feel the opposite: it would give them more control, and they would know
what n_jobs refers to. Hiding too much of the logic can also be
detrimental to ease of use, sometimes.
If we support both n_jobs and n_omp_threads, the default for n_omp_threads
could be 'auto', which would mean "use as many threads as possible,
avoiding over-subscription". E.g. with 8 cores, a GridSearch with n_jobs=4
would result in n_omp_threads set to 2 for the underlying estimator. If a
user sets n_omp_threads to an int, then they might risk over/under
subscription (but that's their problem).
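The over-subscription arithmetic in the quoted example (8 cores, a GridSearch with n_jobs=4 leaving 2 OpenMP threads per worker) can be sketched as follows. The helper name is hypothetical; this only illustrates the proposed 'auto' behavior, not an actual scikit-learn function:

```python
def omp_threads_per_worker(n_cpus, n_joblib_workers):
    """Split available CPUs across joblib workers so that each worker's
    OpenMP thread pool does not oversubscribe the machine.

    Hypothetical helper illustrating the 'auto' idea discussed above.
    """
    # Each joblib worker gets an equal share of the CPUs,
    # but always at least one thread.
    return max(1, n_cpus // n_joblib_workers)


# 8 cores, GridSearch with n_jobs=4 -> 2 OpenMP threads per estimator fit
print(omp_threads_per_worker(8, 4))  # 2
# More workers than CPUs: never go below 1 thread
print(omp_threads_per_worker(2, 4))  # 1
```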
If this change happens in joblib, it would definitely make sense to have the same behavior for OpenMP-based parallelism.
But the default backend is loky, right? So for …
There are a few places where we currently use …
Also, there are places where we now use the loky multiprocessing backend where we might want to switch to the threading joblib backend or OpenMP in the future. It would be a bit awkward if we have to deprecate …

The way parallelism is done for an estimator is an implementation detail that should be mostly transparent to the user.
I don't disagree. But then the default should be consistent.
But it is hard to decide which rule you should be consistent with:

1 - By default, C-level threadpools use all cores, so for thread-based parallelism we could use the same …

The rationale behind following rule 1 is that using all threads tends to always improve performance, so we should use it (as it is done in …
This is currently blocking us from adding OpenMP code and I think we should merge it; it's just a private function. There is still some time to decide what the public API should look like. Is it still a WIP @jeremiedbb ? For OpenMP maybe it could make sense to not even expose …
The WIP tag was until there's a consensus :)
I agree
I'm in favor of exposing …

However this can be decided later, outside of this PR; maybe we should open a dedicated issue so that others can give their opinion.
For me a WIP tag means that the author hasn't finished the work or is planning to add more things. So if you are asking for feedback you should remove it, IMO.
+1 for opening an issue about it listing the possible options.
Since it's only a private helper. cf the discussion on the public API in #14265 (comment)
sklearn/utils/openmp_helpers.pyx (Outdated)

```
"""Determine the effective number of threads used for parallel OpenMP calls

- For ``n_threads = None``, returns the minimum between
  ``openmp.omp_get_max_threads()`` and ``joblib.effective_n_jobs(-1)``.
```
I'm not sure I understand the purpose of also checking joblib.effective_n_jobs(-1)? Is this for cases where you want to run code in a joblib context manager?
It's because effective_n_jobs is more clever than cpu_count or omp_get_max_threads in some cases. I don't recall in which situations, but @tomMoral should know better.
Ok, would be nice to have a comment explaining this then.
The loky implementation of cpu_count better handles containers and CPU affinity (see its docstring for a detailed description). However, it might be better to call joblib.cpu_count here instead of effective_n_jobs, as it also calls loky.cpu_count under the hood.
Reading the docstring, setting LOKY_MAX_CPU_COUNT will affect the output of _openmp_effective_n_threads(). I don't think that makes sense?
Ok, maybe I'm overthinking it. I'm interested in @ogrisel's opinion on this.
What I wanted to do is just a conservative way to find the max number of cores to use. I think that omp_get_num_procs could overestimate it in some situations (e.g. containers, although I'm not sure about that). A more robust method is joblib's effective_n_jobs.

I think you should not see this as controlling the number of threads through joblib, but rather as using joblib to gather more information in order to set the most appropriate number of threads.
On the other hand, I agree that the fact that it depends on the backend is a bit brittle. Tom's proposal is better.
@ogrisel what's your take on this?
I like the fact that we do not use 32 OpenMP threads by default when running in a docker container that can only run with a 2000 mCPU quota in the cgroups. This is typically what we have on CI systems such as Travis or Circle: we run in docker containers with a CPU quota, but openmp.omp_get_max_threads() would naively return 32, leading to severe oversubscription that causes a significant slowdown of the computation.
@NicolasHug so you would rather duplicate the code from loky just to remove the dependency on the LOKY_MAX_CPU_COUNT env variable? We could do that but that feels a bit overkill. I don't expect many people to mess around with LOKY_MAX_CPU_COUNT.
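For reference, the cgroups-aware CPU count discussed above (what loky's cpu_count adds over a naive count) essentially boils down to reading the CFS quota. A simplified, testable model of the logic, assuming cgroup v1 where the quota lives in /sys/fs/cgroup/cpu/cpu.cfs_quota_us and cpu.cfs_period_us (values are passed as arguments here rather than read from the filesystem):

```python
def cgroup_aware_cpu_count(os_cpu_count, cfs_quota_us=-1, cfs_period_us=100000):
    """Simplified model of a cgroups-aware CPU count (loky-style sketch).

    os_cpu_count: what os.cpu_count() reports (e.g. 32 on a big CI host).
    cfs_quota_us: CPU time allowed per accounting period; -1 means no quota.
    cfs_period_us: length of the accounting period in microseconds.
    """
    if cfs_quota_us > 0 and cfs_period_us > 0:
        # A 2000 mCPU quota means quota/period = 2 -> use 2 threads, not 32.
        return max(1, min(os_cpu_count, cfs_quota_us // cfs_period_us))
    # No quota set: fall back to the host CPU count.
    return os_cpu_count


print(cgroup_aware_cpu_count(32, cfs_quota_us=200000))  # container with a 2-CPU quota -> 2
print(cgroup_aware_cpu_count(32))                       # bare metal, no quota -> 32
```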
Implementation lgtm.
I am completely OK with that. The problem I have is with
You can't say that and then claim that …
The point is: does it make sense to cap …
I am in favor of setting this limit because I have not seen a situation where oversubscription helped, and it can lead to catastrophic behaviors. Also, I think the documentation issue should not influence the choice.
Can you expand a bit please? That seems contradictory with what we say in #15116:
I strongly disagree here. If we can't explain it properly, users won't understand it. There's no point in providing software that users cannot understand. And I'm not even talking about the maintenance cost that comes with complex designs.
In my understanding, for MKL and OpenBLAS, the environment variables are capped by the number of CPUs. With the following script, it should not use 100 CPUs but the maximal number of CPUs available on your machine:
but @jeremiedbb knows better than me on this matter. This means that we would copy the behavior from …
I agree that it is necessary to be able to explain our API choice, but it seems a bit much to say that we won't be able to explain that, overall, the maximal number of threads is capped by the number of CPUs.
Thanks @NicolasHug for clarifying your concern. We weren't arguing on the same thing the whole time :)
That's right.

If you set …

If you set …
The fact that we use a joblib function in this helper does not mean it has an interaction with joblib parallelism tools. To bound by the number of CPUs we need a way to find that number. Moreover, we want a method clever enough for docker-like situations. It happens that joblib already has one, so we're better off using it instead of re-implementing it ourselves. So about the documentation, we only need to say that …
Sorry for the confusion.

Capping …

But capping …

This is also @rth's concern, as far as I can tell:
Regarding this
I was referring here to having special cases for handling prange-based estimators, instead of using the uniform way advertised in #15116. But that's another issue which is independent of the one above. I'm OK with discussing this one when/if the time comes.
Could you clarify the difference? It seems to be the same to me. If you have a virtual machine like docker, the config of the virtual machine gives you a number of CPUs given by …
This is the difference between using 32 cores instead of the 2 that are actually available in e.g. a CI, as detailed in #14196 (comment).

Basically: capping to 32 makes sense. But we should not cap to 2 when the user sets …
They may be, but I'm not sure it's part of the OpenMP spec. For instance, libgomp does not seem to limit the number of threads: https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fTHREAD_005fLIMIT.html#OMP_005fTHREAD_005fLIMIT

So doing something like

```python
if os.environ.get("OMP_NUM_THREADS"):
    # fall back to user provided number of threads
    max_n_threads = openmp.omp_get_max_threads()
else:
    max_n_threads = min(openmp.omp_get_max_threads(), cpu_count())
```

would be less disruptive I think. Whether there are cases where it would be useful is beside the point; rather, we should avoid silent and surprising behavior. Personally I don't read the docs for BLIS, OpenBLAS and MKL in detail, but I know that setting …

Of course this sidesteps the fact that each of those libraries also has its own env variable to control this, but well..
I don't see the difference between …

In a docker image with 2 CPUs, the fact that OpenMP sees 32 CPUs is not a feature but a defect.
Indeed it does not. It's the whole purpose of this helper.
What if I did not set the env var, but I changed the number of threads with
Yes, but it will be capped to the number of CPUs (or physical cores for MKL). And it's preferred to use MKL_NUM_THREADS (or equivalent).
Now that we write code with OpenMP, we are one of those libraries, just like MKL. So we need to implement some custom function. If you want, we can have a SKLEARN_OMP_NUM_THREADS env var which has priority over OMP_NUM_THREADS, just like MKL. If a user sets it, it will be the number of OpenMP threads no matter the number of CPUs (not like MKL). But please show me an example where it's really worth using more threads than CPUs with scikit-learn.
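The precedence proposed in the comment above could be sketched like this. Note that SKLEARN_OMP_NUM_THREADS is only a hypothetical variable suggested in this discussion (it is not an existing scikit-learn setting), and the function is an illustrative model, not real library code:

```python
def resolve_max_omp_threads(env, omp_max_threads, n_cpus):
    """Sketch of the proposed precedence for the max OpenMP thread count.

    env: a dict standing in for os.environ.
    omp_max_threads: models openmp.omp_get_max_threads() (which already
        reflects OMP_NUM_THREADS when that variable is set).
    n_cpus: models a cgroups-aware CPU count.
    """
    if "SKLEARN_OMP_NUM_THREADS" in env:
        # Library-specific override: honored even above the number of CPUs.
        return int(env["SKLEARN_OMP_NUM_THREADS"])
    if "OMP_NUM_THREADS" in env:
        # Explicit user request via OpenMP's own variable: trust it as-is.
        return omp_max_threads
    # No explicit request: cap by the CPU count to avoid oversubscription.
    return min(omp_max_threads, n_cpus)


print(resolve_max_omp_threads({"SKLEARN_OMP_NUM_THREADS": "8"}, 32, 2))  # 8
print(resolve_max_omp_threads({}, 32, 2))                                # 2
```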
Probably none. But that's not the point. The point is to respect what users want. We don't want scikit-learn to be an outsider in the way it treats …
One use case could be when either
So by clipping user-provided …

Another use case is simply to evaluate the behavior of the system under over-subscription (e.g. on a multi-tenant or multi-process system). If using 2 threads per CPU core produces a 10x slowdown (related to #15078), it's interesting to know; it probably means that we don't understand something, and it could be worth investigating.

A third use case is when one runs on some VM environment one doesn't fully understand (e.g. a cloud provider) and wants to see what the best number of threads is, so they increase …
Because it takes ~5 seconds to prepend an environment variable to any script (accounting for misspellings) and look at the result. While there are fewer people who know about the existence of …
I mostly use OMP_NUM_THREADS for everything, because I don't really want to think about the backend. What's the advantage of a separate env variable, short of wanting to specify the parallelism level only for those few particular scikit-learn routines?

As a side note, this PR really received way too many comments for the few lines it adds. Maybe in such cases a quick call over a cup of coffee/tea would sort things out faster :)
@NicolasHug I disagree with this point. @rth thanks for the detailed use cases. I agree that giving more flexibility to the user could be a good thing. My concern is that, in what you proposed, only the env var works, not the function.
it does not take more time to do
as of today. Tomorrow billions of people will use threadpoolctl :D
It's not the recommended way. Do it at your own risk! :)
Let's do that.
I'm sorry, but whether XGBoost or LightGBM expose n_jobs is completely irrelevant to the discussion here. I'm a user, I want to benchmark XGBoost, LightGBM and sklearn. I expect …

Yes, I understand that other libraries like MKL will cap to the number of CPUs: 1) whether they should is debatable, 2) it is still a much more reasonable default than what we are about to introduce here, and 3) MKL isn't comparable to sklearn in terms of abstraction level.

Again, I'm not against capping to the number of actually available CPUs: this is a good default to have. But users should be able to bypass that default. That's what a default is for. I don't understand why you are so keen on forbidding users to use what they want to use. On top of the reasons @rth mentioned and mine, there probably are many others that we don't know about yet. We don't know everything.

That's all I have on this.
@NicolasHug you missed one thing in my last comment. I actually agreed that we can let some flexibility. In fact I never really was against that. What I'm really against since the beginning is that it can only be controlled by the OMP_NUM_THREADS env var and not by the omp_set_num_threads function.
It's relevant in the sense that you're saying we should do as they do. But we don't do as they do for some OpenMP-related behavior (e.g. n_jobs), so what is special about this one (the env var)?
Alright, so the consensus is to implement something like #14196 (comment), so that we give more control to the user if the env var is explicitly set? I agree with Jeremie that the env var is not so special compared to some code that would explicitly call omp_set_num_threads, but omp_get_num_threads will not tell us whether this was set explicitly and programmatically, or whether the value stems from the default CPU detection, which is broken with cgroups.

So +1 to not apply the cpu_count() cap when the env var is set.

@jeremiedbb please feel free to update this PR.
sklearn/utils/_openmp_helpers.pyx (Outdated)

```
"""Determine the effective number of threads used for parallel OpenMP calls

- For ``n_threads = None``, returns the minimum between the number of cpus
  and ``openmp.omp_get_max_threads()``.
```
This should be rephrased to be more precise: "the number of cpus taking cgroups quotas into account and openmp.omp_get_max_threads(). cgroups quotas can typically be set by tools such as Docker."

Note that openmp.omp_get_max_threads() already takes the number of CPUs (including SMT) and CPU affinity mask constraints into account by default for all OpenMP implementations as far as I know (I have not checked, though). So there is no need to mention those here.
Small nitpicks, but otherwise LGTM.
sklearn/utils/_openmp_helpers.pyx (Outdated)

```
# Fall back to user provided number of threads making it possible
# to exceed the number of cpus.
# It's however inconsistent with `omp_set_num_threads` which can't
# be used for this purpose.
```
I would rather remove this last sentence. This is a detail that only 1% of the potential readership will understand and will confuse the other 99%.
ok, but I do it unwillingly 😒
sklearn/utils/_openmp_helpers.pyx (Outdated)

```
    return n_threads
ELSE:
    # OpenMP not supported => sequential mode
```
Suggested change:

```diff
- # OpenMP not supported => sequential mode
+ # OpenMP disabled at build-time => sequential mode
```
sklearn/utils/_openmp_helpers.pyx (Outdated)

```
IF SKLEARN_OPENMP_SUPPORTED:
    import os
    cimport openmp
    from . import cpu_count
```
Why not from joblib? It's deprecated in utils
yep, I forgot that...
sklearn/utils/_openmp_helpers.pyx (Outdated)

```
"""Determine the effective number of threads to be used for OpenMP calls

- For ``n_threads = None``, returns the minimum between
  ``openmp.omp_get_max_threads()`` and the number of cpus, taking cgroups
```
Unless OMP_NUM_THREADS is explicitly set, in which case it takes precedence?
sklearn/utils/_openmp_helpers.pyx (Outdated)

```
    Cgroups quotas can typically be set by tools such as Docker.
    The result of ``omp_get_max_threads`` can be influenced by environment
    variable ``OMP_NUM_THREADS`` or at runtime by ``omp_set_num_threads``.

- For ``n_threads > 0``, use this as the maximal number of threads for
```
I would use the same form as above: "for ..., return ...". Same below.
LGTM, thanks @jeremiedbb
Some PRs are waiting for a helper to go from n_jobs to the number of threads to set in prange (which does not accept None or negative values) (see #13264).

This PR proposes an equivalent of joblib.effective_n_jobs but with a different behavior when n_jobs = None: in that case OpenMP can use as many threads as possible. It's done using both omp_get_max_threads and joblib.effective_n_jobs, because loky's effective_n_jobs is more clever than just returning the cpu_count.

Not sure about the name though.

cc @ogrisel @tomMoral
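A minimal pure-Python model of the decision logic debated in this thread may help summarize it. The real helper is a Cython function; the parameter names, the modeling of omp_get_max_threads()/cpu_count as plain arguments, and the treatment of an explicitly-set OMP_NUM_THREADS are illustrative assumptions, not the final API:

```python
def effective_n_threads(n_threads=None, omp_max_threads=4, cpu_count=4,
                        omp_num_threads_set=False):
    """Model of the helper: pick an effective OpenMP thread count.

    omp_max_threads models openmp.omp_get_max_threads(); cpu_count models
    joblib's cgroups-aware CPU count; omp_num_threads_set models whether
    the user explicitly set OMP_NUM_THREADS.
    """
    if n_threads == 0:
        raise ValueError("n_threads = 0 is invalid; use None for the default")
    if omp_num_threads_set:
        # Respect an explicit user request, even beyond the number of CPUs.
        max_n_threads = omp_max_threads
    else:
        # Default: cap by the CPU count to avoid oversubscription
        # (e.g. a 32-thread pool in a 2-CPU container).
        max_n_threads = min(omp_max_threads, cpu_count)
    if n_threads is None:
        return max_n_threads
    return n_threads  # a positive request is taken as-is


print(effective_n_threads(None, omp_max_threads=32, cpu_count=2))  # 2
print(effective_n_threads(None, omp_max_threads=4, cpu_count=2,
                          omp_num_threads_set=True))               # 4
```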