gh-74028: Introduce a prefetch parameter to Executor.map to handle large iterators #114975
Conversation
(force-pushed from 46ea84e to 67b7b0f)
Code itself aside, I don't think this PR solves the current issue. Yes, with the […]. Also, this design means the worker processes won't do any work until the user requests the results, which is very counter-intuitive. If the user maps a large amount of data, they would expect the data to be processed in the background by the executor, not to wait until they ask for the results. So, from my personal perspective, this design is not what we want for the executor. If you want to fix this, you need more than this.
Read the docs again; the docs clearly state:

> the iterables are collected immediately rather than lazily;

So this could break backward compatibility. Also, the original issue is solvable with *chunksize*. Unfortunately, I don't think this PR is worth more effort.
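The eager collection quoted from the docs can be demonstrated with a small script (a sketch; `ThreadPoolExecutor` is used here only to keep the example fast, and the `source` generator is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

consumed = []

def source():
    # Record each item the executor pulls from the input iterable.
    for i in range(5):
        consumed.append(i)
        yield i

with ThreadPoolExecutor(max_workers=2) as ex:
    results = ex.map(lambda x: x * 10, source())
    # map() has already drained the whole generator at this point,
    # even though no result has been requested yet.
    assert consumed == [0, 1, 2, 3, 4]
    out = list(results)

print(out)  # → [0, 10, 20, 30, 40]
```

This is why an unbounded input iterator hangs today: `map()` tries to submit every element before returning.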
Thanks for the discussion! Would love to hear your thoughts as well, @graingert @rdarder @kumaraditya303 @brianquinlan.
(force-pushed from f18eecf to fb6f498)
I'm not sure whether I'm following fully here, but my intention was that if we use the default or […]
Do we mean the issue in bpo-29842 as well?
No worries at all, less work for me if that's the case :)
(force-pushed from fb6f498 to 203c8b2)
My wording is inaccurate in that comment. What I really meant is: with the […]

No, that one will not be solved by […]. I think the fundamental factor we need to weigh before proceeding with this is the gain vs. the cost. What do we get from the change?
I'm not saying no; that's just how I consider this issue. This might be why the original issue has been sitting there for a couple of years - it might not be a serious issue to the users.
Thanks for the comments! @gaogaotiantian
Happy to update the documentation if helpful.
I guess the main rationale is to give users the flexibility to choose how many items from the input iterator they would like to have processed at a time. This is certainly useful in the infinite-iterator case, and I would imagine it to be useful as well in cases where I don't want too many tasks in flight at one time.
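As a rough illustration of the idea (not the PR's actual implementation), a prefetching map can be sketched in user land on top of the existing `Executor.submit` API; `lazy_map` and its `prefetch` argument are hypothetical names for this sketch:

```python
import collections
import itertools
from concurrent.futures import ThreadPoolExecutor

def lazy_map(executor, fn, iterable, prefetch=2):
    """Sketch of a prefetching map: only `prefetch` tasks are in flight
    at once, and a new task is submitted each time a result is consumed,
    so even an unbounded iterable stays bounded in memory."""
    it = iter(iterable)
    queue = collections.deque(
        executor.submit(fn, arg) for arg in itertools.islice(it, prefetch)
    )

    def results():
        while queue:
            fut = queue.popleft()
            try:
                # Keep the pipeline full: one new submit per consumed result.
                queue.append(executor.submit(fn, next(it)))
            except StopIteration:
                pass
            yield fut.result()

    return results()

with ThreadPoolExecutor(max_workers=2) as ex:
    # Works on an infinite iterator: only a handful of tasks exist at a time.
    doubled = lazy_map(ex, lambda x: x * 2, itertools.count(), prefetch=4)
    first_five = list(itertools.islice(doubled, 5))

print(first_five)  # → [0, 2, 4, 6, 8]
```

Note the trade-off the reviewers raised: with this design, tasks beyond the prefetched window are only submitted as results are consumed.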
(force-pushed from 203c8b2 to 5a5dbe6)
(force-pushed from 5a5dbe6 to acab150)
Sorry, but I still do not get the rationale for it. I understood your statements, but they did not convince me that this is a feature that is worth the effort. The main concern is the usage - the current implementation will not submit tasks beyond the prefetched ones until a result is asked for, and that just does not feel right to me. You'll need some core dev behind this anyway (I'm not one). If you can find someone who likes the idea, you might be able to make some progress on this PR.
Hey @gpshead, sorry for tagging, but since this is concurrency-related, I thought you might be interested.
TLDR: RTFM

Once upon a time, in a countryside farm in Belgium... At first, the upgrade of databases was straightforward. But, as time passed, the size of the databases grew, and some CPU-intensive computations took so much time that a solution needed to be found. Fortunately, the Python standard library has the perfect module for this task: `concurrent.futures`.

Then, Python 3.10 appeared, and the usage of `ProcessPoolExecutor` started to sometimes hang for no apparent reason. Soon, our hero found out he wasn't the only one to suffer from this issue[^1]. Unfortunately, the proposed solution looked overkill. Still, it revealed that the issue had already been known[^2] for a few years. Although an official patch wasn't ready to be committed, the discussion about its legitimacy[^3] led our hero to a nicer solution.

By default, `ProcessPoolExecutor.map` submits elements one by one to the pool. This is pretty inefficient when there are a lot of elements to process. This can be changed by using a large value for the *chunksize* argument. Who would have thought that a bigger chunk size would solve a performance issue? As always, the answer was in the documentation[^4].

[^1]: https://stackoverflow.com/questions/74633896/processpoolexecutor-using-map-hang-on-large-load
[^2]: python/cpython#74028
[^3]: python/cpython#114975 (review)
[^4]: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.map

closes #94

Signed-off-by: Nicolas Seinlet (nse) <nse@odoo.com>
Hi, FYI here is a follow-up PR: #125663 🙏🏻
Introduce a `prefetch` parameter to `Executor.map`, so that large and even unbounded iterators can be handled. This is a continuation of #18566, with backward compatibility, which is to say that when the new `prefetch` parameter is not specified, we default to the current behaviour.

cc @graingert @rdarder @kumaraditya303 @brianquinlan
📚 Documentation preview 📚: https://cpython-previews--114975.org.readthedocs.build/