Inline async_executor for deeper integration with bevy_tasks #20331

james7132 · 2025-07-30T04:00:41Z

Objective

Improve the performance of Bevy's async task executor and clean up the code surrounding TaskPool::scope.

Solution

Inline a copy of async_executor's code
Move the TLS LocalExecutor and ThreadExecutor into the core Executor implementation, and poll their queues in Executor::run.
Use !Send types for local queues (Mutex -> UnsafeCell, ConcurrentQueue -> VecDeque).
Avoid extra Arc clones by using &'static references to thread-local queues.
Avoid extra contention on the global injector queue, and minimize extra allocations by opportunistically pushing onto thread-local queues when done on "executor-owned" threads.
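The `Mutex -> UnsafeCell` and `ConcurrentQueue -> VecDeque` swap above can be illustrated with a minimal sketch. It is not the PR's code: `Job`, `LocalQueue`, `spawn_local`, and `drain_local` are hypothetical names, and boxed closures stand in for the runnable tasks that a real executor would store. It only shows why a queue that lives in a thread-local and is `!Send`/`!Sync` needs no locking.

```rust
use std::cell::UnsafeCell;
use std::collections::VecDeque;

/// Hypothetical stand-in for a runnable task (not the PR's task type).
type Job = Box<dyn FnOnce()>;

/// A queue touched only by its owning thread, so it needs neither a
/// `Mutex` nor atomics. `UnsafeCell` makes the type `!Sync`, and the
/// non-`Send` boxed closures make it `!Send`.
struct LocalQueue {
    jobs: UnsafeCell<VecDeque<Job>>,
}

impl LocalQueue {
    fn push(&self, job: Job) {
        // SAFETY: the queue is only reachable through a thread-local, and
        // this mutable borrow ends before the method returns.
        let jobs = unsafe { &mut *self.jobs.get() };
        jobs.push_back(job);
    }

    fn pop(&self) -> Option<Job> {
        // SAFETY: as above. The job is moved out before it runs, so a job
        // that pushes more work never overlaps with this borrow.
        let jobs = unsafe { &mut *self.jobs.get() };
        jobs.pop_front()
    }
}

thread_local! {
    static LOCAL_QUEUE: LocalQueue = LocalQueue {
        jobs: UnsafeCell::new(VecDeque::new()),
    };
}

/// Pushes work onto the current thread's queue without synchronization.
fn spawn_local(job: impl FnOnce() + 'static) {
    LOCAL_QUEUE.with(|q| q.push(Box::new(job)));
}

/// Runs jobs until the current thread's queue is empty and returns how
/// many were run. Jobs may enqueue further jobs while draining.
fn drain_local() -> usize {
    let mut ran = 0;
    while let Some(job) = LOCAL_QUEUE.with(|q| q.pop()) {
        job();
        ran += 1;
    }
    ran
}
```

Calling `spawn_local` from inside a running job is fine in this sketch, because `drain_local` releases its access to the queue before invoking the job.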

This is a breaking change.

ThreadExecutor is now ThreadSpawner and forbids ticking and running the executor, only allowing spawning tasks to the target thread.
Executor and LocalExecutor wrappers are no longer public.
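To make the spawn-only restriction concrete, here is a rough sketch of the shape such an API could take. It is an assumption-laden illustration, not the actual `ThreadSpawner`: `Spawner`, `ThreadQueue`, `thread_queue`, and `run_pending` are hypothetical names, and a `std::sync::mpsc` channel of boxed closures stands in for the executor's thread-locked task queue.

```rust
use std::marker::PhantomData;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical job type; the real executor schedules async tasks.
type Job = Box<dyn FnOnce() + Send>;

/// Spawn-only handle. It can be cloned and sent to other threads, but it
/// exposes no way to tick or run the queue it feeds.
#[derive(Clone)]
struct Spawner {
    tx: Sender<Job>,
}

impl Spawner {
    /// Queues a job for the target thread. Returns `false` if the target
    /// side has already been dropped.
    fn spawn(&self, job: impl FnOnce() + Send + 'static) -> bool {
        self.tx.send(Box::new(job)).is_ok()
    }
}

/// Owner side of the queue. The raw-pointer marker makes it `!Send`, so it
/// stays on the thread that created it.
struct ThreadQueue {
    rx: Receiver<Job>,
    _not_send: PhantomData<*const ()>,
}

impl ThreadQueue {
    /// Runs every job queued so far on the owning thread and returns the
    /// number of jobs run.
    fn run_pending(&self) -> usize {
        let mut ran = 0;
        while let Ok(job) = self.rx.try_recv() {
            job();
            ran += 1;
        }
        ran
    }
}

/// Creates a connected spawner/queue pair for the current thread.
fn thread_queue() -> (Spawner, ThreadQueue) {
    let (tx, rx) = channel();
    (Spawner { tx }, ThreadQueue { rx, _not_send: PhantomData })
}
```

Only the thread that owns the `ThreadQueue` half can execute the work, which mirrors the restriction described above: other threads may hand tasks to the target thread but cannot drive it.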

Testing

I've only tested this against a few examples: 3d_scene and many_foxes. ~~This current implementation does seem to break animation systems.~~ EDIT: That seems to be #20383.

Performance Testing

All of the benchmarks that include a multithreaded schedule run are listed below. Overall, this seems to be a significant win whenever there are a very large number of tasks (i.e. the `empty_archetypes/par_for_each` benchmarks), with some being upwards of 2x faster due to lower contention.

group                                                                                                     bevy_executor                            main
-----                                                                                                     -------------                            ----
added_archetypes/archetype_count/100                                                                      1.00     34.8±0.47µs        ? ?/sec      1.08     37.6±1.61µs        ? ?/sec
added_archetypes/archetype_count/1000                                                                     1.00    381.5±7.53µs        ? ?/sec      1.04    398.5±6.26µs        ? ?/sec
added_archetypes/archetype_count/10000                                                                    1.00      9.8±0.86ms        ? ?/sec      1.11     10.9±0.62ms        ? ?/sec
busy_systems/01x_entities_03_systems                                                                      1.00     24.3±0.53µs        ? ?/sec      1.62     39.3±3.69µs        ? ?/sec
busy_systems/01x_entities_09_systems                                                                      1.05     62.2±1.46µs        ? ?/sec      1.00     59.1±2.83µs        ? ?/sec
busy_systems/01x_entities_15_systems                                                                      1.04     99.4±1.59µs        ? ?/sec      1.00     95.6±7.88µs        ? ?/sec
busy_systems/03x_entities_03_systems                                                                      1.00     38.6±0.69µs        ? ?/sec      1.05     40.5±2.31µs        ? ?/sec
busy_systems/03x_entities_09_systems                                                                      1.06    109.7±2.50µs        ? ?/sec      1.00    103.3±5.59µs        ? ?/sec
busy_systems/03x_entities_15_systems                                                                      1.15    179.6±4.66µs        ? ?/sec      1.00    156.6±6.52µs        ? ?/sec
busy_systems/05x_entities_03_systems                                                                      1.06     56.0±1.45µs        ? ?/sec      1.00     52.6±1.78µs        ? ?/sec
busy_systems/05x_entities_09_systems                                                                      1.17    160.8±8.20µs        ? ?/sec      1.00    137.6±7.71µs        ? ?/sec
busy_systems/05x_entities_15_systems                                                                      1.14    258.6±7.00µs        ? ?/sec      1.00   226.7±10.04µs        ? ?/sec
contrived/01x_entities_03_systems                                                                         1.02     12.4±0.31µs        ? ?/sec      1.00     12.2±0.84µs        ? ?/sec
contrived/01x_entities_09_systems                                                                         1.00     29.4±0.73µs        ? ?/sec      1.08     31.6±3.33µs        ? ?/sec
contrived/01x_entities_15_systems                                                                         1.03     46.4±0.76µs        ? ?/sec      1.00     45.1±1.80µs        ? ?/sec
contrived/03x_entities_03_systems                                                                         1.00     22.2±0.58µs        ? ?/sec      1.10     24.3±1.01µs        ? ?/sec
contrived/03x_entities_09_systems                                                                         1.00     54.5±1.49µs        ? ?/sec      1.13     61.4±4.13µs        ? ?/sec
contrived/03x_entities_15_systems                                                                         1.00     90.9±2.19µs        ? ?/sec      1.10   100.0±10.73µs        ? ?/sec
contrived/05x_entities_03_systems                                                                         1.45     47.1±1.39µs        ? ?/sec      1.00     32.5±2.69µs        ? ?/sec
contrived/05x_entities_09_systems                                                                         1.28    131.2±2.31µs        ? ?/sec      1.00    102.4±4.47µs        ? ?/sec
contrived/05x_entities_15_systems                                                                         1.35    217.7±3.76µs        ? ?/sec      1.00    161.5±6.58µs        ? ?/sec
empty_archetypes/for_each/10                                                                              1.26      2.3±0.13µs        ? ?/sec      1.00  1821.2±136.06ns        ? ?/sec
empty_archetypes/for_each/100                                                                             1.30      2.3±0.07µs        ? ?/sec      1.00  1765.7±51.08ns        ? ?/sec
empty_archetypes/for_each/1000                                                                            1.16      2.6±0.12µs        ? ?/sec      1.00      2.3±0.08µs        ? ?/sec
empty_archetypes/for_each/10000                                                                           1.00      8.3±0.28µs        ? ?/sec      1.71     14.3±1.61µs        ? ?/sec
empty_archetypes/iter/10                                                                                  1.41      2.3±0.06µs        ? ?/sec      1.00  1613.4±36.14ns        ? ?/sec
empty_archetypes/iter/100                                                                                 1.36      2.3±0.06µs        ? ?/sec      1.00  1673.3±32.25ns        ? ?/sec
empty_archetypes/iter/1000                                                                                1.25      2.5±0.12µs        ? ?/sec      1.00      2.0±0.08µs        ? ?/sec
empty_archetypes/iter/10000                                                                               1.00      5.0±0.27µs        ? ?/sec      2.68     13.5±1.42µs        ? ?/sec
empty_archetypes/par_for_each/10                                                                          1.00      4.9±0.16µs        ? ?/sec      2.87     14.0±1.77µs        ? ?/sec
empty_archetypes/par_for_each/100                                                                         1.00      4.8±0.07µs        ? ?/sec      2.81     13.5±1.47µs        ? ?/sec
empty_archetypes/par_for_each/1000                                                                        1.00      5.4±0.23µs        ? ?/sec      2.65     14.2±0.98µs        ? ?/sec
empty_archetypes/par_for_each/10000                                                                       1.00     16.8±0.55µs        ? ?/sec      1.22     20.6±1.08µs        ? ?/sec
empty_systems/0_systems                                                                                   1.16     10.9±0.17ns        ? ?/sec      1.00      9.4±0.53ns        ? ?/sec
empty_systems/1000_systems                                                                                1.03   459.3±29.67µs        ? ?/sec      1.00   448.0±30.66µs        ? ?/sec
empty_systems/100_systems                                                                                 1.00     37.5±1.29µs        ? ?/sec      1.04     38.8±1.42µs        ? ?/sec
empty_systems/10_systems                                                                                  1.00      6.4±0.52µs        ? ?/sec      1.52      9.7±0.73µs        ? ?/sec
empty_systems/2_systems                                                                                   1.00      2.6±0.13µs        ? ?/sec      2.30      6.0±0.96µs        ? ?/sec
empty_systems/4_systems                                                                                   1.00      4.3±0.19µs        ? ?/sec      2.10      8.9±0.71µs        ? ?/sec
for_each_iter                                                                                             1.05     23.3±1.29ms        ? ?/sec      1.00     22.1±0.35ms        ? ?/sec
for_each_par_iter/threads/1                                                                               1.00     11.1±0.09ms        ? ?/sec      1.29     14.2±4.69ms        ? ?/sec
for_each_par_iter/threads/16                                                                              1.00      2.0±0.05ms        ? ?/sec      1.04      2.1±0.16ms        ? ?/sec
for_each_par_iter/threads/2                                                                               1.01      7.5±0.11ms        ? ?/sec      1.00      7.5±0.07ms        ? ?/sec
for_each_par_iter/threads/32                                                                              1.00      2.0±0.03ms        ? ?/sec      1.11      2.2±0.07ms        ? ?/sec
for_each_par_iter/threads/4                                                                               1.02      4.6±0.07ms        ? ?/sec      1.00      4.5±0.05ms        ? ?/sec
for_each_par_iter/threads/8                                                                               1.03      3.0±0.13ms        ? ?/sec      1.00      2.9±0.10ms        ? ?/sec
many_maps_iter                                                                                            1.02     23.0±0.79ms        ? ?/sec      1.00     22.6±0.60ms        ? ?/sec
many_maps_par_iter/threads/1                                                                              1.02     21.5±1.30ms        ? ?/sec      1.00     21.1±1.70ms        ? ?/sec
many_maps_par_iter/threads/16                                                                             1.00      2.1±0.06ms        ? ?/sec      1.48      3.1±0.08ms        ? ?/sec
many_maps_par_iter/threads/2                                                                              1.02     11.6±0.33ms        ? ?/sec      1.00     11.4±0.15ms        ? ?/sec
many_maps_par_iter/threads/32                                                                             1.00      2.1±0.06ms        ? ?/sec      1.03      2.1±0.07ms        ? ?/sec
many_maps_par_iter/threads/4                                                                              1.01      7.9±0.24ms        ? ?/sec      1.00      7.8±0.23ms        ? ?/sec
many_maps_par_iter/threads/8                                                                              1.00      3.1±0.24ms        ? ?/sec      1.56      4.9±0.09ms        ? ?/sec
no_archetypes/system_count/0                                                                              1.00     13.2±0.10ns        ? ?/sec      1.00     13.1±0.22ns        ? ?/sec
no_archetypes/system_count/10                                                                             1.00    107.5±3.07ns        ? ?/sec      1.03    110.4±3.48ns        ? ?/sec
no_archetypes/system_count/100                                                                            1.00   942.8±25.67ns        ? ?/sec      1.02   965.7±24.47ns        ? ?/sec
overhead_iter                                                                                             1.02      0.2±0.00ns        ? ?/sec      1.00      0.2±0.00ns        ? ?/sec
overhead_par_iter/threads/1                                                                               1.00     21.6±0.84µs        ? ?/sec      1.05     22.6±1.38µs        ? ?/sec
overhead_par_iter/threads/16                                                                              1.00     32.4±3.66µs        ? ?/sec      1.22     39.7±1.91µs        ? ?/sec
overhead_par_iter/threads/2                                                                               1.00     26.8±0.58µs        ? ?/sec      1.04     27.9±2.35µs        ? ?/sec
overhead_par_iter/threads/32                                                                              1.00     37.2±2.18µs        ? ?/sec      1.05     39.0±3.05µs        ? ?/sec
overhead_par_iter/threads/4                                                                               1.00     30.1±0.59µs        ? ?/sec      1.10     33.0±2.05µs        ? ?/sec
overhead_par_iter/threads/8                                                                               1.00     31.2±0.64µs        ? ?/sec      1.18     36.7±1.34µs        ? ?/sec
par_iter_simple/hybrid                                                                                    1.00     61.4±9.19µs        ? ?/sec      1.43     87.7±4.90µs        ? ?/sec
par_iter_simple/with_0_fragment                                                                           1.00     34.6±4.18µs        ? ?/sec      1.39     48.2±5.16µs        ? ?/sec
par_iter_simple/with_1000_fragment                                                                        1.00     46.1±4.91µs        ? ?/sec      1.42     65.6±5.16µs        ? ?/sec
par_iter_simple/with_100_fragment                                                                         1.00     36.6±5.48µs        ? ?/sec      1.35     49.3±2.98µs        ? ?/sec
par_iter_simple/with_10_fragment                                                                          1.00     34.9±4.18µs        ? ?/sec      1.35     47.2±2.96µs        ? ?/sec
param/combinator_system/8_dyn_params_system                                                               1.54      2.8±0.15µs        ? ?/sec      1.00  1804.0±184.09ns        ? ?/sec
param/combinator_system/8_piped_systems                                                                   1.13      2.7±0.15µs        ? ?/sec      1.00      2.4±0.67µs        ? ?/sec
param/combinator_system/8_variant_param_set_system                                                        1.36      2.8±0.36µs        ? ?/sec      1.00      2.1±0.24µs        ? ?/sec
run_condition/no/1000_systems                                                                             1.01     36.2±3.26µs        ? ?/sec      1.00     35.8±0.50µs        ? ?/sec
run_condition/no/100_systems                                                                              1.06      2.2±0.12µs        ? ?/sec      1.00      2.1±0.10µs        ? ?/sec
run_condition/no/10_systems                                                                               1.12    303.8±1.32ns        ? ?/sec      1.00    272.2±6.74ns        ? ?/sec
run_condition/yes/1000_systems                                                                            1.10   679.2±29.21µs        ? ?/sec      1.00   619.5±87.51µs        ? ?/sec
run_condition/yes/100_systems                                                                             1.00     46.4±3.53µs        ? ?/sec      1.17     54.2±5.73µs        ? ?/sec
run_condition/yes/10_systems                                                                              1.00      5.5±0.29µs        ? ?/sec      1.47      8.1±1.18µs        ? ?/sec
run_condition/yes_using_query/1000_systems                                                                1.00    419.5±6.49µs        ? ?/sec      1.09   457.6±44.89µs        ? ?/sec
run_condition/yes_using_query/100_systems                                                                 1.00     39.5±0.78µs        ? ?/sec      1.14     45.0±5.11µs        ? ?/sec
run_condition/yes_using_query/10_systems                                                                  1.00      6.7±0.55µs        ? ?/sec      1.10      7.4±0.58µs        ? ?/sec
run_condition/yes_using_resource/1000_systems                                                             1.07   435.1±13.28µs        ? ?/sec      1.00    405.2±6.88µs        ? ?/sec
run_condition/yes_using_resource/100_systems                                                              1.00     38.0±1.17µs        ? ?/sec      1.01     38.3±1.66µs        ? ?/sec
run_condition/yes_using_resource/10_systems                                                               1.00      6.7±0.72µs        ? ?/sec      1.46      9.7±0.66µs        ? ?/sec
run_empty_schedule/MultiThreaded                                                                          1.05      9.9±0.09ns        ? ?/sec      1.00      9.5±0.29ns        ? ?/sec
run_empty_schedule/Simple                                                                                 1.00     10.2±0.22ns        ? ?/sec      1.03     10.5±0.44ns        ? ?/sec
run_empty_schedule/SingleThreaded                                                                         1.00     13.1±0.07ns        ? ?/sec      1.05     13.7±0.60ns        ? ?/sec
schedule/base                                                                                             1.00     20.4±0.74µs        ? ?/sec      1.61     33.0±2.16µs        ? ?/sec

Future Work

Implement the StaticExecutor optimization for the static usages of the TaskPools. This would likely need a new implementation of #12990 (Optimize TaskPools for use in static variables).
Add support for rayon::spawn_broadcast style APIs for more aggressively waking up threads than what the current implementation allows.
Replace ThreadSpawner with a more integrated Executor::spawn_to_thread(ThreadId, Fut) API instead, and use the ThreadIds tracked by NonSend resources to schedule those systems.
Support non-async Tasks (i.e. FnOnce) for lower-overhead execution, which would avoid some extra atomics and state management.
Create a custom block_on implementation that ticks the local and thread-locked queues.
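As a rough illustration of the last item, a `block_on` that makes progress on thread-local work between polls could look like the sketch below. It is not from the PR: `queue_local`, `tick_local`, and `ThreadWaker` are hypothetical names, a `RefCell<VecDeque<_>>` of closures stands in for the local task queue, and the thread-locked queue is left out entirely.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Wakes the blocked thread by unparking it.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

thread_local! {
    // Hypothetical local queue; closures stand in for local tasks.
    static LOCAL: RefCell<VecDeque<Box<dyn FnOnce()>>> = RefCell::new(VecDeque::new());
}

/// Queues a job on the current thread's local queue.
fn queue_local(job: impl FnOnce() + 'static) {
    LOCAL.with(|q| q.borrow_mut().push_back(Box::new(job)));
}

/// Runs at most one local job. Returns `true` if a job was run.
fn tick_local() -> bool {
    // The borrow is released before the job runs, so jobs may queue more work.
    let job = LOCAL.with(|q| q.borrow_mut().pop_front());
    match job {
        Some(job) => {
            job();
            true
        }
        None => false,
    }
}

/// Drives `fut` to completion on the current thread, ticking the local
/// queue whenever the future is pending instead of parking immediately.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        // Park only when there was no local work to run. A wake-up that
        // arrived earlier leaves an unpark token, so `park` returns at once.
        if !tick_local() {
            thread::park();
        }
    }
}
```

The trade-off in this sketch is that the future is re-polled after every local job, even without a wake-up. Spurious polls are permitted by the `Future` contract, but a production implementation would likely track whether the waker actually fired.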

github-actions · 2025-07-30T04:03:35Z

You added a new feature but didn't update the readme. Please run `cargo run -p build-templated-pages -- update features` to update it, and commit the file change.

hymm · 2025-08-05T19:29:35Z

Are the slower benches just noise?

james7132 · 2025-08-05T19:52:50Z

Some of them, particularly those that take under a hundred microseconds and/or do more than just run empty systems, are more prone to noise, with thread wake-up times being upwards of 10-30µs each on my machine.

That said, some of them do deserve more scrutiny to see if those differences are reproducible. The new polling and scheduling functions are designed to minimize contention when working with a large number of tasks, but could easily add overhead when there's little or no work to do.

NthTensor · 2025-08-06T13:18:20Z

Currently the single-threaded scope creates its own executor instead of spawning tasks on the thread-local executor. This may sometimes prevent some futures from running while the scope completes. Is that acceptable?

It certainly is simpler than trying to spawn them on the larger thread local scope, and I've been thinking about adopting it in the other PR.

hymm

This is not a full review. Mostly just did a quick skim of the code.

I'm generally on board with adopting the async executor code, so we can better customize it to our needs. Adding the ability to queue local tasks in this PR shows how much it simplifies some of the above logic. But if we're doing this, we should probably steal some of their tests and run Miri on bevy_tasks.

I wouldn't be surprised if some of the regressions are due to us now checking the "local queue" more often, when it was a ratio of one tick of LOCAL_EXECUTOR to 100 ticks of the TaskPool executor before.

crates/bevy_ecs/src/schedule/executor/multi_threaded.rs

crates/bevy_tasks/src/single_threaded_task_pool.rs

crates/bevy_tasks/src/task_pool.rs

james7132 · 2025-08-10T22:35:56Z

CI is finally green!

> Are the slower benches just noise?

Revisiting this, this PR definitely seems to increase the base cost to poll for new work and to spawn and reschedule tasks, so a lot of the very short benchmarks that basically just run a (near) empty schedule or handle a small number of tasks seem to show some regressions. That said, the high-contention cases are showing notable improvements.

> I wouldn't be surprised if some of the regressions are due to us now checking the "local queue" more often, when it was a ratio of one tick of LOCAL_EXECUTOR to 100 ticks of the TaskPool executor before.

While true, this should be a very cheap operation. Other than the TLS access on each poll, checking for local tasks is just an unsynchronized bounds check. The biggest cost might be the failover to pull from the "thread-locked" queue, which is unbounded (atomic linked list) and thus has more baseline overhead on each poll, even without any contention.
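The cost argument above can be pictured with a small sketch of a per-poll lookup order. This is an assumption, not the PR's implementation: the comment only establishes that the local check comes before the failover to the thread-locked queue, so placing the global injector last is a guess. `Worker`, `Source`, and `next_task` are hypothetical names, integers stand in for tasks, and `Mutex<VecDeque<_>>` stands in for the lock-free queues.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Which queue a task came from (for illustration only).
#[derive(Debug, PartialEq)]
enum Source {
    Local,
    ThreadLocked,
    Global,
}

/// Hypothetical per-thread view of the queues.
struct Worker<'a> {
    /// Owner-only queue: checking it is a plain emptiness check.
    local: VecDeque<u32>,
    /// Tasks pinned to this thread by other threads. A mutex stands in
    /// for the unbounded atomic linked list mentioned above.
    thread_locked: &'a Mutex<VecDeque<u32>>,
    /// Shared injector queue, assumed here to be checked last.
    global: &'a Mutex<VecDeque<u32>>,
}

impl Worker<'_> {
    /// Returns the next task, preferring the cheapest queue first.
    fn next_task(&mut self) -> Option<(Source, u32)> {
        // 1. Unsynchronized local check: no atomics, no locking.
        if let Some(task) = self.local.pop_front() {
            return Some((Source::Local, task));
        }
        // 2. Failover to the thread-locked queue, which pays a
        //    synchronization cost on every poll that reaches it.
        if let Some(task) = self.thread_locked.lock().unwrap().pop_front() {
            return Some((Source::ThreadLocked, task));
        }
        // 3. Finally, the shared queue that every worker contends on.
        self.global
            .lock()
            .unwrap()
            .pop_front()
            .map(|task| (Source::Global, task))
    }
}
```

In this picture, an idle worker pays for steps 2 and 3 on every poll, which is consistent with the regressions seen in the near-empty benchmarks, while a busy worker mostly stays in step 1.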

james7132 added this to the 0.18 milestone Jul 30, 2025

james7132 requested a review from hymm July 30, 2025 04:00

alice-i-cecile requested a review from NthTensor July 30, 2025 05:11

james7132 mentioned this pull request Jul 30, 2025

Feature: Spawning jobs/tasks to a target thread NthTensor/Forte#31

Open

james7132 force-pushed the bevy_executor branch 2 times, most recently from 67f2a00 to ce22ca3 Compare August 3, 2025 01:55

james7132 added 17 commits August 2, 2025 19:39

Inline async_executor

3691690

Merge LocalExecutor into the Executor implementation

54b0bbc

Merge ThreadEecutors into the new forked async_executor

2b8ed66

Move stealer queues to TLS storage. Avoid extra Arc allocations

1121542

Queue tasks directly back onto local queues to avoid global injector …

01ac223

…queue contention

Use UnsafeCell instead of RefCell

1dab08e

Optimize access of thread locked tasks

0b036fe

Clean up unnecessary unsafe use

45b5e15

Address potential unsoundness with ThreadSpawner

ab45573

Fix some CI errors

05f1f40

Update docs

0f226bf

Format TOML files

9c28473

Make note on ThreadLocal soundness hole

7f8c932

Allow clippy warning

f84028e

Fix typos

f44d302

Try to fix CI outside of miri

227258e

Remove depnedency on concurrent-queue

d88035d

Properly handle thread destruction and recycling

93c696b

james7132 added 3 commits August 6, 2025 00:48

Switch to a solution that doesn't require TLS access in thread destru…

686d429

…ctors

Try to provide a blocking solution to TaskPool::scope in single threa…

aa64f03

…ded spaces

Merge branch 'main' into bevy_executor

1d73b78

james7132 added 7 commits August 9, 2025 12:06

CI fixes

79c31b7

Merge branch 'main' into bevy_executor

fc875d9

Fix up the build, and hopefully CI

4dcb37c

Shut up Clippy

0923f02

Remove test println

43a09d7

Fix for portable atomics

8f313f7

Fix for web builds

4a8b6b0

james7132 force-pushed the bevy_executor branch from a025261 to 4a8b6b0 Compare August 10, 2025 18:29

hymm reviewed Aug 10, 2025

james7132 added 11 commits August 10, 2025 22:05

Reduce async type genreation, code indirection, and handle single-thr…

7e5cf6c

…eaded cancellation

Fix docs and move expect attribute

eb689b5

Complete the comment

7ec07af

Add async_executor's tests

4a4434c

Run Miri on bevy_tasks in CI

2ce5537

Make stealing a non-blocking operation

8af9886

Merge branch 'main' into bevy_executor

cfc6af4

Cache-pad the thread locals and disable multithreaded polling when th…

74d301c

…e feature is not enabled

Fix Miri job to run them in sequence

898166b

Fix up README and Clippy

2b96789

It's clippy

f5737f7

james7132 requested review from hymm and NthTensor August 12, 2025 01:53
