ls: parallelize with rayon #7990


Open · tim-day-387 wants to merge 1 commit into main

Conversation

tim-day-387

I work on the Lustre filesystem. Lustre is a parallel filesystem (i.e. scale-out network filesystem) commonly used in HPC and AI.

I'm interested in making the Rust coreutils perform well on Lustre and other networked filesystems. I've started with ls -l (a historic challenge for Lustre, due to the high volume of statx calls). The latest versions of Lustre (using statahead or multiple metadata servers) can improve the performance of sequential statx calls significantly. But parallelizing the calls with ls -l could also net huge gains - especially on older systems.

I have a simple patch which parallelizes the statx calls using rayon. When iterating over the readdir() results in enter_directory(), we can preemptively cache the entries' metadata by calling get_metadata_no_flush() within a rayon parallel iterator.
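
Roughly, the prefetch looks like the sketch below. This is illustrative only: the get_metadata_no_flush stand-in here is a plain symlink_metadata call, not the actual cached helper from the patch.

```rust
use rayon::prelude::*;
use std::fs::{self, Metadata};
use std::io;
use std::path::PathBuf;

// Stand-in for the patch's cached-metadata helper.
fn get_metadata_no_flush(path: &PathBuf) -> io::Result<Metadata> {
    fs::symlink_metadata(path)
}

// Issue the blocking statx calls from rayon workers so the metadata
// is already cached when the sequential output pass runs.
fn prefetch_metadata(entries: &[PathBuf]) {
    entries.par_iter().for_each(|path| {
        let _ = get_metadata_no_flush(path);
    });
}
```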

I've done a small benchmark that shows around a 2.5x improvement when running ls -l on a directory with 10,000 files:

```
root@big-debian-desktop-kvm:/mnt/lustre/testdir# hyperfine --warmup 2 --prepare "echo 3 > /proc/sys/vm/drop_caches" "/workspace/coreutils/target/release/coreutils ls -l /mnt/lustre/testdir > /dev/null" "ls -l /mnt/lustre/testdir > /dev/null"
Benchmark 1: /workspace/coreutils/target/release/coreutils ls -l /mnt/lustre/testdir > /dev/null
  Time (mean ± σ):     47.553 s ±  0.108 s    [User: 0.078 s, System: 41.647 s]
  Range (min … max):   47.311 s … 47.658 s    10 runs
 
Benchmark 2: ls -l /mnt/lustre/testdir > /dev/null
  Time (mean ± σ):     127.087 s ±  0.351 s    [User: 0.046 s, System: 42.776 s]
  Range (min … max):   126.790 s … 127.961 s    10 runs
 
Summary
  /workspace/coreutils/target/release/coreutils ls -l /mnt/lustre/testdir > /dev/null ran
    2.67 ± 0.01 times faster than ls -l /mnt/lustre/testdir > /dev/null
```

This compares parallelized Rust ls -l with Debian's packaged GNU ls version 9.7-2. Without this patch, they performed nearly identically. I used the latest Lustre development branch. The entire filesystem (client, 1 metadata server, 2 object servers) is colocated on the same node. I'm using the in-memory storage backend for Lustre. I've artificially added a 2ms network delay to each RPC. I'm not using statahead. In a real setup, I'd expect the performance to scale linearly with the core count of the client.

I don't think this PR is yet in good enough shape to be merged. I have a few open questions:

  1. This patch unconditionally stats each file, which is bad for the regular ls case, so some refactoring is needed.
  2. When a utility does parallelization "under the hood", how do we handle default configuration? Rayon has an environment variable RAYON_NUM_THREADS that defaults to the core count. This would be overkill for local NVMe, but critical for larger scale systems.
  3. Each thread ought to start at different offsets in the directory. If we had a statahead of 10 entries and a directory of 100 entries, it wouldn't make sense to assign 10 threads to stat each of those entries individually. They should start at offsets of 10. I'm not yet sure how to do this with Rayon.
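
For question 3, one possible shape is a rough sketch like the one below, under the assumption of a fixed statahead window (the window size here is illustrative): hand each worker a contiguous window of entries rather than individual entries, so the per-worker access pattern still looks sequential to statahead.

```rust
use rayon::prelude::*;
use std::path::PathBuf;

// Illustrative value; the real window would come from the filesystem.
const STATAHEAD_WINDOW: usize = 10;

// Each rayon worker takes a contiguous window of entries, so workers
// start at offsets 0, 10, 20, ... instead of interleaving single stats.
fn prefetch_strided(entries: &[PathBuf]) {
    entries.par_chunks(STATAHEAD_WINDOW).for_each(|window| {
        for path in window {
            let _ = std::fs::symlink_metadata(path);
        }
    });
}
```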

I'm interested in getting some feedback before doing another revision.

Parallelize the stat calls within ls using rayon.
When iterating over the readdir() results in
enter_directory(), we preemptively cache the entries'
metadata by calling get_metadata_no_flush() within
a rayon parallel iterator.

Signed-off-by: Timothy Day <timday@amazon.com>
@sylvestre
Contributor

cool stuff, bravo :)


GNU testsuite comparison:

GNU test failed: tests/ls/follow-slink. tests/ls/follow-slink is passing on 'main'. Maybe you have to rebase?

@jtracey
Contributor

jtracey commented May 25, 2025

2. When a utility does parallelization "under the hood", how do we handle default configuration? Rayon has an environment variable RAYON_NUM_THREADS that defaults to the core count. This would be overkill for local NVMe, but critical for larger scale systems.

Memory-backed storage will obviously be CPU bound; are you saying that's also the case for typical network-backed stores? If so, then IMO a default thread count of 1 unless RAYON_NUM_THREADS is set could work, and systems that want it can export it globally. Otherwise, a new environment variable for uutils could be in order. Either way, I think a default of 1 thread is right for now; fancy logic for a different default could maybe come later. (Aside from the potential performance overhead: as someone who has done long-running experiments where I had to be careful around Linux's ~4 million PID_MAX_LIMIT, I'd be pretty annoyed if some batched ls calls on a system with a few hundred cores killed the experiment ;))
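
In sketch form, that default could look like this (assuming the global pool is configured once at startup; rayon already honors RAYON_NUM_THREADS on its own, so the only change is falling back to 1 instead of the core count):

```rust
use std::env;

// Honor RAYON_NUM_THREADS if the user set it, but fall back to a
// single thread instead of rayon's default of one-per-core.
fn init_thread_pool() {
    let threads = env::var("RAYON_NUM_THREADS")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(1);
    rayon::ThreadPoolBuilder::new()
        .num_threads(threads)
        .build_global()
        .expect("failed to build global rayon thread pool");
}
```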

3. Each thread ought to start at different offsets in the directory. If we had a statahead of 10 entries and a directory of 100 entries, it wouldn't make sense to assign 10 threads to stat each of those entries individually. They should start at offsets of 10. I'm not yet sure how to do this with Rayon.

If I understand the problem correctly, I suspect the answer is creating a wrapper type for the collected slice, implementing Iterator on it so that it yields the items in the desired order, then using .into_par_iter().
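
A rough sketch of that shape (hypothetical names throughout; note that a custom Iterator can't call .into_par_iter() directly unless it also implements IntoParallelIterator, but rayon's par_bridge() gives a similar effect):

```rust
use rayon::iter::{ParallelBridge, ParallelIterator};
use std::path::PathBuf;

// Wrapper over the collected slice that yields entries in a custom
// order (here: contiguous windows), then bridges onto the rayon pool.
struct Windows<'a> {
    entries: &'a [PathBuf],
    window: usize,
    next: usize,
}

impl<'a> Iterator for Windows<'a> {
    type Item = &'a [PathBuf];
    fn next(&mut self) -> Option<Self::Item> {
        if self.next >= self.entries.len() {
            return None;
        }
        let end = (self.next + self.window).min(self.entries.len());
        let out = &self.entries[self.next..end];
        self.next = end;
        Some(out)
    }
}

fn prefetch(entries: &[PathBuf], window: usize) {
    Windows { entries, window, next: 0 }
        .par_bridge()
        .for_each(|w| {
            for path in w {
                let _ = std::fs::symlink_metadata(path);
            }
        });
}
```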

@tim-day-387
Author

Memory-backed storage will obviously be CPU bound; are you saying that's also the case for typical network-backed stores?

Network storage will be IO bound - you want to minimize the number of RPCs you have to send to remote storage. For each RPC, you pay a fixed round-trip-time cost. I've tested on a real filesystem and got results similar to those I posted above. In my micro-benchmark, I simulated round-trip time by adding a 2ms delay (i.e. I added an mdelay(2); in Lustre's loopback network driver) to each RPC. That makes it similarly IO bound.

If you can't avoid an RPC, you have two options for improving performance:

  1. Batching (i.e. statahead or readahead) - If the application is using traditional blocking IO and you can't rewrite it, you have to rely on predictive statahead/readahead to preemptively fetch data and cache it.
  2. Parallelization - If RPCs don't have a strict dependency on each other, they should be sent in parallel. That can be done in the kernel (if the application is using async IO or the request can be sharded). Otherwise, the application needs its own thread pool to send and wait on requests.

My approach uses a rayon thread pool issuing blocking statx calls via the Rust std Metadata implementation. Those blocking calls would also benefit from statahead.

Alternatively, we could use io_uring to submit the requests. This would require an io_uring crate to make syscalls rather than the Rust standard library - so the implementation would be more involved. And I'm not aware of a good native Rust io_uring crate. But the Rust standard library is problematic for other reasons - specifically, its overzealous use of STATX_BASIC_STATS | STATX_BTIME on all statx calls. I'm going to open a separate PR for that. But if we stop using Rust's std for fetching metadata, maybe we just use io_uring?
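
For illustration, a direct statx call with a narrower mask might look like the sketch below (using the rustix crate, which is just one candidate; the exact field set ls -l needs is an assumption here):

```rust
use rustix::fs::{statx, AtFlags, Statx, StatxFlags, CWD};

fn stat_for_ls(path: &str) -> rustix::io::Result<Statx> {
    // Ask only for the fields ls -l formats, instead of
    // STATX_BASIC_STATS | STATX_BTIME on every call.
    statx(
        CWD,
        path,
        AtFlags::SYMLINK_NOFOLLOW,
        StatxFlags::TYPE
            | StatxFlags::MODE
            | StatxFlags::NLINK
            | StatxFlags::UID
            | StatxFlags::GID
            | StatxFlags::SIZE
            | StatxFlags::MTIME,
    )
}
```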

Any thoughts on using io_uring vs. rayon?

Otherwise, a new environment variable for uutils could be in order.

Is there precedent for other uutils environment variables?

I think a default of 1 thread is right for now; fancy logic for a different default could maybe come later

I think that's reasonable. We could set the default thread count based on the filesystem type returned by statfs. But we don't have to decide that now.

If I understand the problem correctly, I suspect the answer is creating a wrapper type for the collected slice, implementing Iterator on it so that it yields the items in the desired order, then using .into_par_iter().

I'll give that a try.
