Inline async_executor for deeper integration with bevy_tasks #20331
Conversation
You added a new feature but didn't update the readme. Please run […]
Are the slower benches just noise?
Some of them. The benchmarks that run under a hundred microseconds and/or do more than just run empty systems are more prone to noise, since thread wake-up times alone can be upwards of 10-30us each on my machine. That said, some of the differences deserve more scrutiny to see whether they are replicable. The new polling and scheduling functions are designed to minimize contention when working with a large number of tasks, but they could easily add overhead when there is little or no work to do.
Currently the single-threaded scope creates its own executor instead of spawning tasks on the thread-local executor. This may sometimes prevent some futures from running while the scope completes. Is that acceptable? It is certainly simpler than trying to spawn them on the larger thread-local scope, and I've been thinking about adopting it in the other PR.
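For context, a minimal sketch of the pattern being described, using `async_executor`'s public types (an illustration of the design choice, not the PR's actual scope code): the scope drives a private executor to completion, so futures already queued on the shared thread-local executor make no progress in the meantime.

```rust
use async_executor::LocalExecutor;
use futures_lite::future;

fn single_threaded_scope() {
    // The scope creates its own short-lived executor...
    let local = LocalExecutor::new();
    let a = local.spawn(async { 1 });
    let b = local.spawn(async { 2 });
    // ...and only drives *that* executor until its tasks finish. Anything
    // spawned on the shared thread-local executor is not polled here.
    let sum = future::block_on(local.run(async { a.await + b.await }));
    assert_eq!(sum, 3);
}
```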
This is not a full review. Mostly just did a quick skim of the code.
I'm generally on board with adopting the async_executor code, so we can customize it to our needs. Adding the ability to queue local tasks in this PR shows how much it simplifies some of the above logic. But if we're doing this, we should probably steal some of their tests and run Miri on bevy_tasks.
I wouldn't be surprised if some of the regressions are due to us now checking the "local queue" more often, when it was previously a ratio of one tick of LOCAL_EXECUTOR to 100 ticks of the TaskPool executor, as in the sketch below.
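A hedged sketch of the old interleaving as I read this comment (names approximate bevy_tasks' internals; the exact loop differs):

```rust
use async_executor::{Executor, LocalExecutor};

// Before this PR, a task-pool thread ticked the shared executor far more
// often than the thread-local one (roughly 100:1), so local tasks were
// polled comparatively rarely. Inlining the local queue into the core
// executor means it is now checked on every poll instead.
async fn old_style_run_loop(executor: &Executor<'_>, local_executor: &LocalExecutor<'_>) {
    loop {
        for _ in 0..100 {
            executor.tick().await; // run one task from the shared queue
        }
        local_executor.tick().await; // then one thread-local task
    }
}
```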
CI is finally green!
Revisiting this: this PR definitely seems to increase the base cost to poll for new work and to spawn and reschedule tasks, so many of the very short benchmarks that basically just run a (near-)empty schedule or handle a small number of tasks show some regressions. That said, the high-contention cases are showing notable improvements.
While true, this should be a very cheap operation. Other than the TLS access on each poll, checking for local tasks is just an unsynchronized bounds check. The biggest cost might be the failover to pull from the "thread-locked" queue, which is unbounded (an atomic linked list) and thus has more baseline overhead on each poll, even without any contention.
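A hypothetical illustration of why that check is cheap (my names, not the PR's): with the local queue stored in TLS behind an `UnsafeCell`, "is there local work?" is a plain `VecDeque` pop with no atomics involved.

```rust
use async_task::Runnable;
use std::cell::UnsafeCell;
use std::collections::VecDeque;

thread_local! {
    // Only this thread ever touches the queue, so no synchronization is
    // needed; popping is an unsynchronized bounds check plus a pointer move.
    static LOCAL_QUEUE: UnsafeCell<VecDeque<Runnable>> =
        UnsafeCell::new(VecDeque::new());
}

fn pop_local_task() -> Option<Runnable> {
    LOCAL_QUEUE.with(|queue| {
        // SAFETY: the reference never escapes this closure, and the queue is
        // never aliased across threads.
        unsafe { (*queue.get()).pop_front() }
    })
}
```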
Objective
Improve the performance of Bevy's async task executor and clean up the code surrounding `TaskPool::scope`.
Solution
- Inline `async_executor`'s code: fold `LocalExecutor` and `ThreadExecutor` into the core `Executor` implementation, and poll their queues in `Executor::run`.
- Use `!Send` types for the local queues (`Mutex` -> `UnsafeCell`, `ConcurrentQueue` -> `VecDeque`).
- Avoid `Arc` clones by using `&'static` references to thread-local queues (see the sketch after this list).
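As a rough sketch of the `&'static` point above (illustrative, not the PR's code): leaking or lazily initializing the executor in a static yields a `Copy` reference that can be handed to every thread without reference counting.

```rust
use async_executor::Executor;
use std::sync::OnceLock;

// One global executor, created on first use and never dropped. Handing out
// `&'static Executor` instead of cloning an `Arc<Executor>` removes the
// atomic refcount traffic from spawn-heavy paths.
fn global_executor() -> &'static Executor<'static> {
    static EXECUTOR: OnceLock<Executor<'static>> = OnceLock::new();
    EXECUTOR.get_or_init(Executor::new)
}

fn spawn_detached() {
    let executor = global_executor(); // a plain reference: Copy, no refcount
    executor.spawn(async { /* work */ }).detach();
}
```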
This is a breaking change.
- `ThreadExecutor` is now `ThreadSpawner` (modeled below) and forbids ticking and running the executor, only allowing spawning tasks onto the target thread.
- The `Executor` and `LocalExecutor` wrappers are no longer public.
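To make the first bullet concrete, a self-contained toy model of the narrowed surface (the channel-based internals and signatures are my assumptions, not the PR's): the spawner can only enqueue work for its target thread; only code on that thread can actually drive the queue.

```rust
use async_task::Runnable;
use std::future::Future;
use std::sync::mpsc;

// Can only *send* work to the target thread; there is deliberately no
// `tick`/`run` method, matching the restriction described above.
#[derive(Clone)]
struct ThreadSpawner(mpsc::Sender<Runnable>);

impl ThreadSpawner {
    fn spawn<T: Send + 'static>(
        &self,
        future: impl Future<Output = T> + Send + 'static,
    ) -> async_task::Task<T> {
        let sender = self.0.clone();
        let (runnable, task) = async_task::spawn(future, move |runnable| {
            // Each wake-up re-queues the task on the target thread; a closed
            // channel simply cancels the task in this toy model.
            let _ = sender.send(runnable);
        });
        runnable.schedule();
        task
    }
}
```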
Testing
I've only tested this against a few examples: 3d_scene and many_foxes.
This implementation did seem to break animation systems. EDIT: That turned out to be #20383.
Performance Testing
All of the benchmarks, including a multithreaded schedule run, are included below. Overall, this seems to be a significant win whenever there is a very large number of tasks (i.e. the `empty_archetypes/par_for_each` benchmarks), with some being upwards of 2x faster due to lower contention.
Future Work
- `StaticExecutor` optimization for the static usages of the TaskPools. Would likely need a new implementation of "Optimize TaskPools for use in static variables" (#12990).
- `rayon::spawn_broadcast`-style APIs for more aggressively waking up threads than the current implementation allows.
- Replace `ThreadSpawner` with a more integrated `Executor::spawn_to_thread(ThreadId, Fut)` API, and use the ThreadIds tracked by `NonSend` resources to schedule those systems.
- One-shot (`FnOnce`) tasks for lower-overhead execution; this would avoid some extra atomics and state management.
- A `block_on` implementation that ticks the local and thread-locked queues (sketched below).
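A hedged sketch of what that last bullet could look like (using the `parking` and `waker-fn` crates; `pop_local_task` is the hypothetical TLS helper sketched earlier, and none of this is the PR's code): the blocking caller drains local runnables while the awaited future is pending, instead of parking immediately.

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll};

fn block_on_ticking_local<F: Future>(future: F) -> F::Output {
    let mut future = pin!(future);
    let (parker, unparker) = parking::pair();
    let waker = waker_fn::waker_fn(move || {
        unparker.unpark();
    });
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(output) = future.as_mut().poll(&mut cx) {
            return output;
        }
        // Prefer running queued local work over sleeping, so tasks spawned
        // on this thread keep making progress while we block.
        match pop_local_task() {
            Some(runnable) => {
                runnable.run();
            }
            None => parker.park(),
        }
    }
}
```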