ls: parallelize with rayon #7990


Open · tim-day-387 wants to merge 1 commit into main

Conversation

tim-day-387

I work on the Lustre filesystem. Lustre is a parallel filesystem (i.e. scale-out network filesystem) commonly used in HPC and AI.

I'm interested in making the Rust coreutils perform well on Lustre and other networked filesystems. I've started with ls -l (a historic challenge for Lustre, due to the high volume of statx calls). The latest versions of Lustre (using statahead or multiple metadata servers) can improve the performance of sequential statx calls significantly. But parallelizing the calls with ls -l could also net huge gains - especially on older systems.

I have a simple patch which parallelizes the statx calls using rayon. When iterating over the readdir() results in enter_directory(), we can preemptively cache the entries' metadata by calling get_metadata_no_flush() within a rayon parallel iterator.
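
Roughly, the prefetch looks like the sketch below. This is illustrative only: the get_metadata_no_flush stand-in here is a plain symlink_metadata call, not the actual cached helper from the patch.

```rust
use rayon::prelude::*;
use std::fs::{self, Metadata};
use std::io;
use std::path::PathBuf;

// Stand-in for the patch's cached-metadata helper.
fn get_metadata_no_flush(path: &PathBuf) -> io::Result<Metadata> {
    fs::symlink_metadata(path)
}

// Issue the blocking statx calls from rayon workers so the metadata
// is already cached when the sequential output pass runs.
fn prefetch_metadata(entries: &[PathBuf]) {
    entries.par_iter().for_each(|path| {
        let _ = get_metadata_no_flush(path);
    });
}
```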

I've done a small benchmark that shows around a 2.5x improvement when running ls -l on a directory with 10,000 files:

```
root@big-debian-desktop-kvm:/mnt/lustre/testdir# hyperfine --warmup 2 --prepare "echo 3 > /proc/sys/vm/drop_caches" "/workspace/coreutils/target/release/coreutils ls -l /mnt/lustre/testdir > /dev/null" "ls -l /mnt/lustre/testdir > /dev/null"
Benchmark 1: /workspace/coreutils/target/release/coreutils ls -l /mnt/lustre/testdir > /dev/null
  Time (mean ± σ):     47.553 s ±  0.108 s    [User: 0.078 s, System: 41.647 s]
  Range (min … max):   47.311 s … 47.658 s    10 runs
 
Benchmark 2: ls -l /mnt/lustre/testdir > /dev/null
  Time (mean ± σ):     127.087 s ±  0.351 s    [User: 0.046 s, System: 42.776 s]
  Range (min … max):   126.790 s … 127.961 s    10 runs
 
Summary
  /workspace/coreutils/target/release/coreutils ls -l /mnt/lustre/testdir > /dev/null ran
    2.67 ± 0.01 times faster than ls -l /mnt/lustre/testdir > /dev/null
```

This compares parallelized Rust ls -l with Debian's packaged GNU ls version 9.7-2. Without this patch, they performed nearly identically. I used the latest Lustre development branch. The entire filesystem (client, 1 metadata server, 2 object servers) is colocated on the same node. I'm using the in-memory storage backend for Lustre. I've artificially added a 2ms network delay to each RPC. I'm not using statahead. In a real setup, I'd expect the performance to scale linearly with the core count of the client.

I don't think this PR is yet in good enough shape to be merged. I have a few open questions:

  1. This patch unconditionally stats each file, which is bad for the regular ls case, so some refactoring is needed.
  2. When a utility does parallelization "under the hood", how do we handle default configuration? Rayon has an environment variable RAYON_NUM_THREADS that defaults to the core count. This would be overkill for local NVMe, but critical for larger scale systems.
  3. Each thread ought to start at different offsets in the directory. If we had a statahead of 10 entries and a directory of 100 entries, it wouldn't make sense to assign 10 threads to stat each of those entries individually. They should start at offsets of 10. I'm not yet sure how to do this with Rayon.
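
For question 3, one possible shape is a rough sketch like the one below, under the assumption of a fixed statahead window (the window size here is illustrative): hand each worker a contiguous window of entries rather than individual entries, so the per-worker access pattern still looks sequential to statahead.

```rust
use rayon::prelude::*;
use std::path::PathBuf;

// Illustrative value; the real window would come from the filesystem.
const STATAHEAD_WINDOW: usize = 10;

// Each rayon worker takes a contiguous window of entries, so workers
// start at offsets 0, 10, 20, ... instead of interleaving single stats.
fn prefetch_strided(entries: &[PathBuf]) {
    entries.par_chunks(STATAHEAD_WINDOW).for_each(|window| {
        for path in window {
            let _ = std::fs::symlink_metadata(path);
        }
    });
}
```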

I'm interested in getting some feedback before doing another revision.

Parallelize the stat calls within ls using rayon.
When iterating over the readdir() results in
enter_directory(), we preemptively cache the entries'
metadata by calling get_metadata_no_flush() within
a rayon parallel iterator.

Signed-off-by: Timothy Day <timday@amazon.com>
@sylvestre
Contributor

cool stuff, bravo :)


GNU testsuite comparison:

GNU test failed: tests/ls/follow-slink. tests/ls/follow-slink is passing on 'main'. Maybe you have to rebase?

@jtracey
Contributor

jtracey commented May 25, 2025

2. When a utility does parallelization "under the hood", how do we handle default configuration? Rayon has an environment variable RAYON_NUM_THREADS that defaults to the core count. This would be overkill for local NVMe, but critical for larger scale systems.

Memory-backed storage will obviously be CPU bound; are you saying that's also the case for typical network-backed stores? If so, then IMO a default thread count of 1 unless RAYON_NUM_THREADS is set could work, and systems that want it can export it globally. Otherwise, a new environment variable for uutils could be in order. Either way, I think a default of 1 thread is right for now; fancy logic for a different default could maybe come later. (Aside from the potential performance overhead: as someone who has done long-running experiments where I had to be careful around Linux's ~4 million PID_MAX_LIMIT, I'd be pretty annoyed if some batched ls calls on a system with a few hundred cores killed the experiment ;))
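
In sketch form, that default could look like this (assuming the global pool is configured once at startup; rayon already honors RAYON_NUM_THREADS on its own, so the only change is falling back to 1 instead of the core count):

```rust
use std::env;

// Honor RAYON_NUM_THREADS if the user set it, but fall back to a
// single thread instead of rayon's default of one-per-core.
fn init_thread_pool() {
    let threads = env::var("RAYON_NUM_THREADS")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(1);
    rayon::ThreadPoolBuilder::new()
        .num_threads(threads)
        .build_global()
        .expect("failed to build global rayon thread pool");
}
```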

3. Each thread ought to start at different offsets in the directory. If we had a statahead of 10 entries and a directory of 100 entries, it wouldn't make sense to assign 10 threads to stat each of those entries individually. They should start at offsets of 10. I'm not yet sure how to do this with Rayon.

If I understand the problem correctly, I suspect the answer is creating a wrapper type for the collected slice, implementing Iterator on it so that it yields the items in the desired order, then using .into_par_iter().
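
A rough sketch of that shape (hypothetical names throughout; note that a custom Iterator can't call .into_par_iter() directly unless it also implements IntoParallelIterator, but rayon's par_bridge() gives a similar effect):

```rust
use rayon::iter::{ParallelBridge, ParallelIterator};
use std::path::PathBuf;

// Wrapper over the collected slice that yields entries in a custom
// order (here: contiguous windows), then bridges onto the rayon pool.
struct Windows<'a> {
    entries: &'a [PathBuf],
    window: usize,
    next: usize,
}

impl<'a> Iterator for Windows<'a> {
    type Item = &'a [PathBuf];
    fn next(&mut self) -> Option<Self::Item> {
        if self.next >= self.entries.len() {
            return None;
        }
        let end = (self.next + self.window).min(self.entries.len());
        let out = &self.entries[self.next..end];
        self.next = end;
        Some(out)
    }
}

fn prefetch(entries: &[PathBuf], window: usize) {
    Windows { entries, window, next: 0 }
        .par_bridge()
        .for_each(|w| {
            for path in w {
                let _ = std::fs::symlink_metadata(path);
            }
        });
}
```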

@tim-day-387
Author

Memory-backed storage will obviously be CPU bound; are you saying that's also the case for typical network-backed stores?

Network storage will be IO bound - you want to minimize the number of RPCs you have to send to remote storage. For each RPC, you pay a fixed round-trip-time cost. I've tested on a real filesystem and got results similar to those I posted above. In my micro-benchmark, I simulated round-trip time by adding a 2ms delay (i.e. I added an mdelay(2); in Lustre's loopback network driver) to each RPC. That makes it similarly IO bound.

If you can't avoid an RPC, you have two options for improving performance:

  1. Batching (i.e. statahead or readahead) - If the application is using traditional blocking IO and you can't rewrite it, you have to rely on predictive statahead/readahead to preemptively fetch data and cache it.
  2. Parallelization - If RPCs don't have a strict dependency on each other, they should be sent in parallel. That can be done in the kernel (if the application is using async IO or the request can be sharded). Otherwise, the application needs its own thread pool to send and wait on requests.

My approach uses a rayon thread pool issuing blocking statx calls via the Rust std Metadata implementation. Those blocking calls would also benefit from statahead.

Alternatively, we could use io_uring to submit the requests. This would require an io_uring crate to make syscalls rather than the Rust standard library - so the implementation would be more involved. And I'm not aware of a good native Rust io_uring crate. But the Rust standard library is problematic for other reasons - specifically, its overzealous use of STATX_BASIC_STATS | STATX_BTIME on all statx calls. I'm going to open a separate PR for that. But if we stop using Rust's std for fetching metadata, maybe we just use io_uring?
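
For illustration, a direct statx call with a narrower mask might look like the sketch below (using the rustix crate, which is just one candidate; the exact field set ls -l needs is an assumption here):

```rust
use rustix::fs::{statx, AtFlags, Statx, StatxFlags, CWD};

fn stat_for_ls(path: &str) -> rustix::io::Result<Statx> {
    // Ask only for the fields ls -l formats, instead of
    // STATX_BASIC_STATS | STATX_BTIME on every call.
    statx(
        CWD,
        path,
        AtFlags::SYMLINK_NOFOLLOW,
        StatxFlags::TYPE
            | StatxFlags::MODE
            | StatxFlags::NLINK
            | StatxFlags::UID
            | StatxFlags::GID
            | StatxFlags::SIZE
            | StatxFlags::MTIME,
    )
}
```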

Any thoughts on using io_uring vs. rayon?

Otherwise, a new environment variable for uutils could be in order.

Is there precedent for other uutils environment variables?

I think a default of 1 thread is right for now; fancy logic for a different default could maybe come later

I think that's reasonable. We could set the default thread count based on the filesystem type returned by statfs. But we don't have to decide that now.

If I understand the problem correctly, I suspect the answer is creating a wrapper type for the collected slice, implementing Iterator on it so that it yields the items in the desired order, then using .into_par_iter().

I'll give that a try.
