Inline async_executor for deeper integration with bevy_tasks #20331

Open
wants to merge 55 commits into
base: main
from

Conversation

james7132
Member

@james7132 james7132 commented Jul 30, 2025

Objective

Improve the performance of Bevy's async task executor and clean up the code surrounding TaskPool::scope.

Solution

  • Inline a copy of async_executor's code
  • Move the TLS LocalExecutor and ThreadExecutor into the core Executor implementation, and poll their queues in Executor::run.
  • Use !Send types for local queues (Mutex -> UnsafeCell, ConcurrentQueue -> VecDeque).
  • Avoid extra Arc clones by using &'static references to thread-local queues.
  • Avoid extra contention on the global injector queue, and minimize extra allocations by opportunistically pushing onto thread-local queues when done on "executor-owned" threads.
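The queueing idea in the list above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in this PR: `LocalQueue`, `Placement`, `set_worker`, `spawn`, and `next_job` are hypothetical names, a `Mutex<VecDeque>` stands in for the global injector queue, and jobs are plain closures rather than async tasks.

```rust
use std::cell::{Cell, UnsafeCell};
use std::collections::VecDeque;
use std::sync::Mutex;

/// A unit of work. It returns a value only so callers can observe which job ran.
type Job = Box<dyn FnOnce() -> u32 + Send>;

/// Where `spawn` placed a job.
#[derive(Debug, PartialEq)]
enum Placement {
    Local,
    Injector,
}

/// Hypothetical per-thread queue: a plain VecDeque behind an UnsafeCell, no lock.
struct LocalQueue {
    jobs: UnsafeCell<VecDeque<Job>>,
}

impl LocalQueue {
    fn new() -> Self {
        Self { jobs: UnsafeCell::new(VecDeque::new()) }
    }

    fn push(&self, job: Job) {
        // SAFETY: the queue only lives in a thread_local, so only the owning
        // thread can reach it, and this borrow ends before the method returns.
        let jobs = unsafe { &mut *self.jobs.get() };
        jobs.push_back(job);
    }

    fn pop(&self) -> Option<Job> {
        // SAFETY: same reasoning as in `push`.
        let jobs = unsafe { &mut *self.jobs.get() };
        jobs.pop_front()
    }
}

thread_local! {
    static LOCAL: LocalQueue = LocalQueue::new();
    // Assumption: worker threads flag themselves as "executor-owned" on startup.
    static IS_WORKER: Cell<bool> = Cell::new(false);
}

/// Stand-in for the shared injector queue (the contended path).
static INJECTOR: Mutex<VecDeque<Job>> = Mutex::new(VecDeque::new());

fn set_worker(is_worker: bool) {
    IS_WORKER.with(|w| w.set(is_worker));
}

/// Push onto the thread-local queue when possible, else onto the injector.
fn spawn(job: Job) -> Placement {
    if IS_WORKER.with(|w| w.get()) {
        LOCAL.with(|q| q.push(job));
        Placement::Local
    } else {
        INJECTOR.lock().unwrap().push_back(job);
        Placement::Injector
    }
}

/// Poll the uncontended local queue first, then fall back to the injector.
fn next_job() -> Option<Job> {
    LOCAL
        .with(|q| q.pop())
        .or_else(|| INJECTOR.lock().unwrap().pop_front())
}
```

On a thread that has called `set_worker(true)`, `spawn` never touches the lock, and `next_job` only takes it once the local queue is empty.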

This is a breaking change.

  • ThreadExecutor is now ThreadSpawner and forbids ticking and running the executor, only allowing spawning tasks to the target thread.
  • Executor and LocalExecutor wrappers are no longer public.
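To illustrate what "spawn-only" means here, the sketch below models the split with a standard-library channel. It is not the `ThreadSpawner` API from this PR: `SpawnHandle`, `OwnerQueue`, `thread_queue`, and `run_pending` are hypothetical names, and jobs are plain closures that report the thread they ran on.

```rust
use std::marker::PhantomData;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread::{self, ThreadId};

/// A job reports the thread it ran on so the behaviour is observable.
type Job = Box<dyn FnOnce() -> ThreadId + Send>;

/// Hypothetical spawn-only handle: cloneable and sendable, with no way to
/// tick or run the queue it feeds.
#[derive(Clone)]
struct SpawnHandle {
    target: ThreadId,
    tx: Sender<Job>,
}

impl SpawnHandle {
    fn target(&self) -> ThreadId {
        self.target
    }

    /// Queue a job for the target thread. Returns false if the target's
    /// queue has already been dropped.
    fn spawn(&self, job: impl FnOnce() -> ThreadId + Send + 'static) -> bool {
        self.tx.send(Box::new(job)).is_ok()
    }
}

/// Hypothetical owner side. The raw-pointer marker makes it !Send, so only
/// the thread that created it can run the queued jobs.
struct OwnerQueue {
    rx: Receiver<Job>,
    _not_send: PhantomData<*const ()>,
}

impl OwnerQueue {
    /// Run everything queued so far and report where each job ran.
    fn run_pending(&self) -> Vec<ThreadId> {
        self.rx.try_iter().map(|job| job()).collect()
    }
}

/// Create a queue bound to the calling thread plus a handle for other threads.
fn thread_queue() -> (SpawnHandle, OwnerQueue) {
    let (tx, rx) = channel();
    (
        SpawnHandle { target: thread::current().id(), tx },
        OwnerQueue { rx, _not_send: PhantomData },
    )
}
```

A handle cloned into another thread can enqueue work, but the work still runs on the thread that called `thread_queue`, because only that thread can call `run_pending`.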

Testing

I've only tested this against a few examples: 3d_scene and many_foxes. The current implementation does seem to break animation systems. EDIT: That seems to be #20383.

Performance Testing

All of the benchmarks that include a multithreaded schedule run are listed below. Overall, this seems to be a significant win whenever there is a very large number of tasks (e.g. the empty_archetypes/par_for_each benchmarks), with some running upwards of 2x faster due to lower contention.

group                                                                                                     bevy_executor                            main
-----                                                                                                     -------------                            ----
added_archetypes/archetype_count/100                                                                      1.00     34.8±0.47µs        ? ?/sec      1.08     37.6±1.61µs        ? ?/sec
added_archetypes/archetype_count/1000                                                                     1.00    381.5±7.53µs        ? ?/sec      1.04    398.5±6.26µs        ? ?/sec
added_archetypes/archetype_count/10000                                                                    1.00      9.8±0.86ms        ? ?/sec      1.11     10.9±0.62ms        ? ?/sec
busy_systems/01x_entities_03_systems                                                                      1.00     24.3±0.53µs        ? ?/sec      1.62     39.3±3.69µs        ? ?/sec
busy_systems/01x_entities_09_systems                                                                      1.05     62.2±1.46µs        ? ?/sec      1.00     59.1±2.83µs        ? ?/sec
busy_systems/01x_entities_15_systems                                                                      1.04     99.4±1.59µs        ? ?/sec      1.00     95.6±7.88µs        ? ?/sec
busy_systems/03x_entities_03_systems                                                                      1.00     38.6±0.69µs        ? ?/sec      1.05     40.5±2.31µs        ? ?/sec
busy_systems/03x_entities_09_systems                                                                      1.06    109.7±2.50µs        ? ?/sec      1.00    103.3±5.59µs        ? ?/sec
busy_systems/03x_entities_15_systems                                                                      1.15    179.6±4.66µs        ? ?/sec      1.00    156.6±6.52µs        ? ?/sec
busy_systems/05x_entities_03_systems                                                                      1.06     56.0±1.45µs        ? ?/sec      1.00     52.6±1.78µs        ? ?/sec
busy_systems/05x_entities_09_systems                                                                      1.17    160.8±8.20µs        ? ?/sec      1.00    137.6±7.71µs        ? ?/sec
busy_systems/05x_entities_15_systems                                                                      1.14    258.6±7.00µs        ? ?/sec      1.00   226.7±10.04µs        ? ?/sec
contrived/01x_entities_03_systems                                                                         1.02     12.4±0.31µs        ? ?/sec      1.00     12.2±0.84µs        ? ?/sec
contrived/01x_entities_09_systems                                                                         1.00     29.4±0.73µs        ? ?/sec      1.08     31.6±3.33µs        ? ?/sec
contrived/01x_entities_15_systems                                                                         1.03     46.4±0.76µs        ? ?/sec      1.00     45.1±1.80µs        ? ?/sec
contrived/03x_entities_03_systems                                                                         1.00     22.2±0.58µs        ? ?/sec      1.10     24.3±1.01µs        ? ?/sec
contrived/03x_entities_09_systems                                                                         1.00     54.5±1.49µs        ? ?/sec      1.13     61.4±4.13µs        ? ?/sec
contrived/03x_entities_15_systems                                                                         1.00     90.9±2.19µs        ? ?/sec      1.10   100.0±10.73µs        ? ?/sec
contrived/05x_entities_03_systems                                                                         1.45     47.1±1.39µs        ? ?/sec      1.00     32.5±2.69µs        ? ?/sec
contrived/05x_entities_09_systems                                                                         1.28    131.2±2.31µs        ? ?/sec      1.00    102.4±4.47µs        ? ?/sec
contrived/05x_entities_15_systems                                                                         1.35    217.7±3.76µs        ? ?/sec      1.00    161.5±6.58µs        ? ?/sec
empty_archetypes/for_each/10                                                                              1.26      2.3±0.13µs        ? ?/sec      1.00  1821.2±136.06ns        ? ?/sec
empty_archetypes/for_each/100                                                                             1.30      2.3±0.07µs        ? ?/sec      1.00  1765.7±51.08ns        ? ?/sec
empty_archetypes/for_each/1000                                                                            1.16      2.6±0.12µs        ? ?/sec      1.00      2.3±0.08µs        ? ?/sec
empty_archetypes/for_each/10000                                                                           1.00      8.3±0.28µs        ? ?/sec      1.71     14.3±1.61µs        ? ?/sec
empty_archetypes/iter/10                                                                                  1.41      2.3±0.06µs        ? ?/sec      1.00  1613.4±36.14ns        ? ?/sec
empty_archetypes/iter/100                                                                                 1.36      2.3±0.06µs        ? ?/sec      1.00  1673.3±32.25ns        ? ?/sec
empty_archetypes/iter/1000                                                                                1.25      2.5±0.12µs        ? ?/sec      1.00      2.0±0.08µs        ? ?/sec
empty_archetypes/iter/10000                                                                               1.00      5.0±0.27µs        ? ?/sec      2.68     13.5±1.42µs        ? ?/sec
empty_archetypes/par_for_each/10                                                                          1.00      4.9±0.16µs        ? ?/sec      2.87     14.0±1.77µs        ? ?/sec
empty_archetypes/par_for_each/100                                                                         1.00      4.8±0.07µs        ? ?/sec      2.81     13.5±1.47µs        ? ?/sec
empty_archetypes/par_for_each/1000                                                                        1.00      5.4±0.23µs        ? ?/sec      2.65     14.2±0.98µs        ? ?/sec
empty_archetypes/par_for_each/10000                                                                       1.00     16.8±0.55µs        ? ?/sec      1.22     20.6±1.08µs        ? ?/sec
empty_systems/0_systems                                                                                   1.16     10.9±0.17ns        ? ?/sec      1.00      9.4±0.53ns        ? ?/sec
empty_systems/1000_systems                                                                                1.03   459.3±29.67µs        ? ?/sec      1.00   448.0±30.66µs        ? ?/sec
empty_systems/100_systems                                                                                 1.00     37.5±1.29µs        ? ?/sec      1.04     38.8±1.42µs        ? ?/sec
empty_systems/10_systems                                                                                  1.00      6.4±0.52µs        ? ?/sec      1.52      9.7±0.73µs        ? ?/sec
empty_systems/2_systems                                                                                   1.00      2.6±0.13µs        ? ?/sec      2.30      6.0±0.96µs        ? ?/sec
empty_systems/4_systems                                                                                   1.00      4.3±0.19µs        ? ?/sec      2.10      8.9±0.71µs        ? ?/sec
for_each_iter                                                                                             1.05     23.3±1.29ms        ? ?/sec      1.00     22.1±0.35ms        ? ?/sec
for_each_par_iter/threads/1                                                                               1.00     11.1±0.09ms        ? ?/sec      1.29     14.2±4.69ms        ? ?/sec
for_each_par_iter/threads/16                                                                              1.00      2.0±0.05ms        ? ?/sec      1.04      2.1±0.16ms        ? ?/sec
for_each_par_iter/threads/2                                                                               1.01      7.5±0.11ms        ? ?/sec      1.00      7.5±0.07ms        ? ?/sec
for_each_par_iter/threads/32                                                                              1.00      2.0±0.03ms        ? ?/sec      1.11      2.2±0.07ms        ? ?/sec
for_each_par_iter/threads/4                                                                               1.02      4.6±0.07ms        ? ?/sec      1.00      4.5±0.05ms        ? ?/sec
for_each_par_iter/threads/8                                                                               1.03      3.0±0.13ms        ? ?/sec      1.00      2.9±0.10ms        ? ?/sec
many_maps_iter                                                                                            1.02     23.0±0.79ms        ? ?/sec      1.00     22.6±0.60ms        ? ?/sec
many_maps_par_iter/threads/1                                                                              1.02     21.5±1.30ms        ? ?/sec      1.00     21.1±1.70ms        ? ?/sec
many_maps_par_iter/threads/16                                                                             1.00      2.1±0.06ms        ? ?/sec      1.48      3.1±0.08ms        ? ?/sec
many_maps_par_iter/threads/2                                                                              1.02     11.6±0.33ms        ? ?/sec      1.00     11.4±0.15ms        ? ?/sec
many_maps_par_iter/threads/32                                                                             1.00      2.1±0.06ms        ? ?/sec      1.03      2.1±0.07ms        ? ?/sec
many_maps_par_iter/threads/4                                                                              1.01      7.9±0.24ms        ? ?/sec      1.00      7.8±0.23ms        ? ?/sec
many_maps_par_iter/threads/8                                                                              1.00      3.1±0.24ms        ? ?/sec      1.56      4.9±0.09ms        ? ?/sec
no_archetypes/system_count/0                                                                              1.00     13.2±0.10ns        ? ?/sec      1.00     13.1±0.22ns        ? ?/sec
no_archetypes/system_count/10                                                                             1.00    107.5±3.07ns        ? ?/sec      1.03    110.4±3.48ns        ? ?/sec
no_archetypes/system_count/100                                                                            1.00   942.8±25.67ns        ? ?/sec      1.02   965.7±24.47ns        ? ?/sec
overhead_iter                                                                                             1.02      0.2±0.00ns        ? ?/sec      1.00      0.2±0.00ns        ? ?/sec
overhead_par_iter/threads/1                                                                               1.00     21.6±0.84µs        ? ?/sec      1.05     22.6±1.38µs        ? ?/sec
overhead_par_iter/threads/16                                                                              1.00     32.4±3.66µs        ? ?/sec      1.22     39.7±1.91µs        ? ?/sec
overhead_par_iter/threads/2                                                                               1.00     26.8±0.58µs        ? ?/sec      1.04     27.9±2.35µs        ? ?/sec
overhead_par_iter/threads/32                                                                              1.00     37.2±2.18µs        ? ?/sec      1.05     39.0±3.05µs        ? ?/sec
overhead_par_iter/threads/4                                                                               1.00     30.1±0.59µs        ? ?/sec      1.10     33.0±2.05µs        ? ?/sec
overhead_par_iter/threads/8                                                                               1.00     31.2±0.64µs        ? ?/sec      1.18     36.7±1.34µs        ? ?/sec
par_iter_simple/hybrid                                                                                    1.00     61.4±9.19µs        ? ?/sec      1.43     87.7±4.90µs        ? ?/sec
par_iter_simple/with_0_fragment                                                                           1.00     34.6±4.18µs        ? ?/sec      1.39     48.2±5.16µs        ? ?/sec
par_iter_simple/with_1000_fragment                                                                        1.00     46.1±4.91µs        ? ?/sec      1.42     65.6±5.16µs        ? ?/sec
par_iter_simple/with_100_fragment                                                                         1.00     36.6±5.48µs        ? ?/sec      1.35     49.3±2.98µs        ? ?/sec
par_iter_simple/with_10_fragment                                                                          1.00     34.9±4.18µs        ? ?/sec      1.35     47.2±2.96µs        ? ?/sec
param/combinator_system/8_dyn_params_system                                                               1.54      2.8±0.15µs        ? ?/sec      1.00  1804.0±184.09ns        ? ?/sec
param/combinator_system/8_piped_systems                                                                   1.13      2.7±0.15µs        ? ?/sec      1.00      2.4±0.67µs        ? ?/sec
param/combinator_system/8_variant_param_set_system                                                        1.36      2.8±0.36µs        ? ?/sec      1.00      2.1±0.24µs        ? ?/sec
run_condition/no/1000_systems                                                                             1.01     36.2±3.26µs        ? ?/sec      1.00     35.8±0.50µs        ? ?/sec
run_condition/no/100_systems                                                                              1.06      2.2±0.12µs        ? ?/sec      1.00      2.1±0.10µs        ? ?/sec
run_condition/no/10_systems                                                                               1.12    303.8±1.32ns        ? ?/sec      1.00    272.2±6.74ns        ? ?/sec
run_condition/yes/1000_systems                                                                            1.10   679.2±29.21µs        ? ?/sec      1.00   619.5±87.51µs        ? ?/sec
run_condition/yes/100_systems                                                                             1.00     46.4±3.53µs        ? ?/sec      1.17     54.2±5.73µs        ? ?/sec
run_condition/yes/10_systems                                                                              1.00      5.5±0.29µs        ? ?/sec      1.47      8.1±1.18µs        ? ?/sec
run_condition/yes_using_query/1000_systems                                                                1.00    419.5±6.49µs        ? ?/sec      1.09   457.6±44.89µs        ? ?/sec
run_condition/yes_using_query/100_systems                                                                 1.00     39.5±0.78µs        ? ?/sec      1.14     45.0±5.11µs        ? ?/sec
run_condition/yes_using_query/10_systems                                                                  1.00      6.7±0.55µs        ? ?/sec      1.10      7.4±0.58µs        ? ?/sec
run_condition/yes_using_resource/1000_systems                                                             1.07   435.1±13.28µs        ? ?/sec      1.00    405.2±6.88µs        ? ?/sec
run_condition/yes_using_resource/100_systems                                                              1.00     38.0±1.17µs        ? ?/sec      1.01     38.3±1.66µs        ? ?/sec
run_condition/yes_using_resource/10_systems                                                               1.00      6.7±0.72µs        ? ?/sec      1.46      9.7±0.66µs        ? ?/sec
run_empty_schedule/MultiThreaded                                                                          1.05      9.9±0.09ns        ? ?/sec      1.00      9.5±0.29ns        ? ?/sec
run_empty_schedule/Simple                                                                                 1.00     10.2±0.22ns        ? ?/sec      1.03     10.5±0.44ns        ? ?/sec
run_empty_schedule/SingleThreaded                                                                         1.00     13.1±0.07ns        ? ?/sec      1.05     13.7±0.60ns        ? ?/sec
schedule/base                                                                                             1.00     20.4±0.74µs        ? ?/sec      1.61     33.0±2.16µs        ? ?/sec

Future Work

  • Implement the StaticExecutor optimization for the static usages of the TaskPools. This would likely need a new implementation of #12990 (Optimize TaskPools for use in static variables).
  • Add support for rayon::spawn_broadcast style APIs for more aggressively waking up threads than what the current implementation allows.
  • Replace ThreadSpawner with a more integrated Executor::spawn_to_thread(ThreadId, Fut) API instead, and use the ThreadIds tracked by NonSend resources to schedule those systems.
  • Support non-async Tasks (i.e. FnOnce) for lower-overhead execution, would avoid some extra atomics and state management.
  • Create a custom block_on implementation that ticks the local and thread-locked queues.
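As a rough illustration of the last item, a `block_on` that ticks a local queue between polls could look like the sketch below. It uses only the standard library and is an assumption about the shape of such a function, not a planned API: `block_on_ticking` and `ThreadWaker` are hypothetical names, and the local queue is modelled as a `VecDeque` of plain closures.

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Wakes the blocked thread by unparking it.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Hypothetical block_on: while the future is pending, run queued local jobs
/// instead of sleeping, and only park once the local queue is empty.
fn block_on_ticking<F: Future>(
    fut: F,
    local: &mut VecDeque<Box<dyn FnOnce()>>,
) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(output) = fut.as_mut().poll(&mut cx) {
            return output;
        }
        match local.pop_front() {
            // Tick one local job, then poll again. The extra poll may be
            // spurious, which futures are required to tolerate.
            Some(job) => job(),
            // Nothing to tick: sleep until the waker unparks this thread.
            None => thread::park(),
        }
    }
}
```

The point of ticking is that a future waiting on work sitting in the same thread's local queue can still make progress, whereas a plain parking `block_on` would sleep while that work stays queued.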

@james7132 james7132 added this to the 0.18 milestone Jul 30, 2025
@james7132 james7132 requested a review from hymm July 30, 2025 04:00
@james7132 james7132 added C-Performance A change motivated by improving speed, memory usage or compile times C-Code-Quality A section of code that is hard to understand or change A-Tasks Tools for parallel and async work M-Needs-Migration-Guide A breaking change to Bevy's public API that needs to be noted in a migration guide X-Controversial There is active debate or serious implications around merging this PR S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help labels Jul 30, 2025
Contributor

You added a new feature but didn't update the readme. Please run cargo run -p build-templated-pages -- update features to update it, and commit the file change.

@hymm
Contributor

hymm commented Aug 5, 2025

Are the slower benches just noise?

@james7132
Member Author

james7132 commented Aug 5, 2025

Some of them, particularly those that take under a hundred microseconds and/or do more than just run empty systems, are more prone to noise, with thread wake-up times being upwards of 10-30us each on my machine.

That said, some of them do deserve more scrutiny to see if those differences are replicable. The new polling and scheduling functions are designed to minimize contention when working with a large number of tasks, but could easily add overhead when there's little or no work to do.

@NthTensor
Contributor

Currently the single-threaded scope creates its own executor instead of spawning tasks on the thread-local executor. This may sometimes prevent some futures from running while the scope completes. Is that acceptable?

It certainly is simpler than trying to spawn them on the larger thread local scope, and I've been thinking about adopting it in the other PR.

Contributor

@hymm hymm left a comment

This is not a full review. Mostly just did a quick skim of the code.

I'm generally on board with adopting the async_executor code so we can better customize it to our needs. Adding the ability to queue local tasks in this PR shows how much it simplifies some of the above logic. But if we're doing this, we should probably steal some of their tests and run miri on bevy_tasks.

I wouldn't be surprised if some of the regressions are due to us now checking the "local queue" more often, when it was a ratio of one tick of LOCAL_EXECUTOR to 100 ticks of the TaskPool executor before.
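The effect of such a tick ratio can be shown with a toy model. This is only a sketch of the scheduling pattern described above, not the old or new executor code: `run_with_ratio` is a hypothetical function, and tasks are reduced to integer IDs so the run order is visible.

```rust
use std::collections::VecDeque;

/// Drains both queues, giving the local queue one tick for every
/// `pool_ticks_per_local` ticks of the pool queue. Returns the run order.
fn run_with_ratio(
    pool: &mut VecDeque<u32>,
    local: &mut VecDeque<u32>,
    pool_ticks_per_local: usize,
) -> Vec<u32> {
    // A ratio of zero would never drain the pool, so treat it as one.
    let ratio = pool_ticks_per_local.max(1);
    let mut order = Vec::new();
    while !pool.is_empty() || !local.is_empty() {
        // Up to `ratio` pool ticks...
        for _ in 0..ratio {
            match pool.pop_front() {
                Some(task) => order.push(task),
                None => break,
            }
        }
        // ...followed by a single local tick.
        if let Some(task) = local.pop_front() {
            order.push(task);
        }
    }
    order
}
```

With pool tasks 0 through 4, local tasks 100 and 101, and a ratio of 2, the run order is 0, 1, 100, 2, 3, 101, 4. With a ratio of 100, both local tasks run only after the whole pool queue has drained, so a lower ratio trades more frequent local checks for lower local-task latency.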

@james7132
Member Author

CI is finally green!

Are the slower benches just noise?

Revisiting this, this PR definitely seems to increase the base cost to poll for new work and to spawn and reschedule tasks, so a lot of the very short benchmarks that basically just run a (near) empty schedule or handle a small number of tasks seem to show some regressions. That said, the high-contention cases are showing notable improvements.

I wouldn't be surprised if some of the regressions are due to us now checking the "local queue" more often, when it was a ratio of one tick of LOCAL_EXECUTOR to 100 ticks of the TaskPool executor before.

While true, this should be a very cheap operation. Other than the TLS access on each poll, checking for local tasks is just an unsynchronized bounds check. The biggest cost might be the failover to pull from the "thread-locked" queue, which is unbounded (atomic linked list) and thus has more baseline overhead on each poll, even without any contention.

@james7132 james7132 requested review from hymm and NthTensor August 12, 2025 01:53
Labels
A-Tasks Tools for parallel and async work C-Code-Quality A section of code that is hard to understand or change C-Performance A change motivated by improving speed, memory usage or compile times M-Needs-Migration-Guide A breaking change to Bevy's public API that needs to be noted in a migration guide X-Controversial There is active debate or serious implications around merging this PR

3 participants