ls: parallelize with rayon #7990
base: main
Conversation
Parallelize the stat calls within ls using rayon. When iterating over the `readdir()` results in `enter_directory()`, we preemptively cache the entries' metadata by calling `get_metadata_no_flush()` within a rayon parallel iterator.

Signed-off-by: Timothy Day <timday@amazon.com>
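For readers skimming the thread, here is a minimal sketch of the technique the description refers to. `Entry` and `prefetch_metadata` are hypothetical stand-ins for ls's `PathData` and `get_metadata_no_flush()`; only the shape of the rayon loop mirrors the patch.

```rust
// Sketch, not the actual uutils code: walk the entry list once in parallel
// so the symlink_metadata() calls (statx under the hood) overlap in flight,
// caching each result for the later, order-sensitive sequential pass.
use rayon::prelude::*;
use std::fs::{self, Metadata};
use std::io;
use std::path::PathBuf;

struct Entry {
    path: PathBuf,
    metadata: Option<io::Result<Metadata>>,
}

fn prefetch_metadata(entries: &mut [Entry]) {
    entries.par_iter_mut().for_each(|entry| {
        // Cache the result; the sequential formatting pass reuses it
        // instead of issuing its own statx call.
        entry.metadata = Some(fs::symlink_metadata(&entry.path));
    });
}
```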
cool stuff, bravo :)
GNU testsuite comparison:
Memory-backed storage will obviously be CPU bound; are you saying that's also the case for typical network-backed stores? If so, then IMO a default thread count of 1, unless explicitly overridden, seems safer.
If I understand the problem correctly, I suspect the answer is creating a wrapper type for the collected slice and implementing Iterator on it so that it yields the items in the desired order.
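An illustrative sketch of that wrapper idea (all names here are hypothetical, not from the patch): the parallel workers finish in arbitrary order, so each result is tagged with its original index, and the wrapper's Iterator impl replays the items in input order.

```rust
struct Ordered<T> {
    items: Vec<Option<T>>,
    next: usize,
}

impl<T> Ordered<T> {
    // `tagged` holds (original_index, item) pairs in any order, e.g. as
    // collected from a rayon pipeline. Assumes every index in 0..len
    // appears exactly once.
    fn new(tagged: Vec<(usize, T)>) -> Self {
        let mut items: Vec<Option<T>> = Vec::new();
        items.resize_with(tagged.len(), || None);
        for (idx, item) in tagged {
            items[idx] = Some(item);
        }
        Ordered { items, next: 0 }
    }
}

impl<T> Iterator for Ordered<T> {
    type Item = T;

    fn next(&mut self) -> Option<T> {
        // Yield slot `next` and advance; iteration ends past the last slot.
        let item = self.items.get_mut(self.next)?.take();
        self.next += 1;
        item
    }
}
```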
Network storage will be IO bound - you want to minimize the number of RPCs you have to send to remote storage. For each RPC, you have to pay a fixed round-trip-time cost. I've tested on a real filesystem and got similar results to those I posted above. In my micro-benchmark, I simulated round-trip-time by adding a 2ms delay to each RPC. If you can't avoid an RPC, you have two options for improving performance: batch multiple operations into a single RPC, or issue the RPCs in parallel so their round trips overlap.
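To put the round-trip cost in perspective with the numbers above: 10,000 sequential `statx` RPCs at a 2ms RTT spend roughly 10,000 × 2ms = 20s just waiting on the network, while keeping 16 requests in flight at a time would ideally cut that to about 20s / 16 ≈ 1.25s.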
My approach is using a rayon thread pool doing blocking `statx` calls. Alternatively, we could use an async runtime. Any thoughts on using one approach over the other?
Is there precedent for other uutils environment variables?
I think that's reasonable. We could set the default number of threads based on the filesystem type returned from `statfs`.
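For concreteness, here is a hedged sketch of what that could look like on Linux with the `libc` crate. The magic numbers for Lustre and NFS and the "1 thread locally, core count on network filesystems" policy are illustrative assumptions, not settled uutils behavior.

```rust
// Choose a default thread count from the filesystem's magic number as
// returned by statfs(2). Linux-only sketch.
use std::ffi::CString;
use std::os::unix::ffi::OsStrExt;
use std::path::Path;

const LUSTRE_SUPER_MAGIC: i64 = 0x0BD0_0BD0;
const NFS_SUPER_MAGIC: i64 = 0x6969;

fn default_threads(dir: &Path) -> usize {
    let c_path = CString::new(dir.as_os_str().as_bytes()).unwrap();
    let mut buf: libc::statfs = unsafe { std::mem::zeroed() };
    let rc = unsafe { libc::statfs(c_path.as_ptr(), &mut buf) };
    if rc == 0 && matches!(buf.f_type as i64, LUSTRE_SUPER_MAGIC | NFS_SUPER_MAGIC) {
        // Network filesystem: parallel statx calls hide round-trip latency.
        std::thread::available_parallelism().map_or(1, |n| n.get())
    } else {
        // Local filesystem (or statfs failed): stay sequential by default.
        1
    }
}
```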
I'll give that a try.
I work on the Lustre filesystem. Lustre is a parallel filesystem (i.e. a scale-out network filesystem) commonly used in HPC and AI.

I'm interested in making the Rust coreutils perform well on Lustre and other networked filesystems. I've started with `ls -l` (a historic challenge for Lustre, due to the high volume of `statx` calls). The latest versions of Lustre (using statahead or multiple metadata servers) can improve the performance of sequential `statx` calls significantly. But parallelizing the calls with `ls -l` could also net huge gains - especially on older systems.

I have a simple patch which parallelizes the `statx` calls using rayon. When iterating over the `readdir()` results in `enter_directory()`, we can preemptively cache the entries' metadata by calling `get_metadata_no_flush()` within a rayon parallel iterator.

I've done a small benchmark that shows around a 2.5x improvement when running `ls -l` on a directory with 10,000 files. This compares parallelized Rust `ls -l` with Debian's packaged GNU `ls` version `9.7-2`; without this patch, they performed nearly identically. I used the latest Lustre development branch. The entire filesystem (client, 1 metadata server, 2 object servers) is collocated on the same node. I'm using the in-memory storage backend for Lustre. I've artificially added a 2ms network delay to each RPC. I'm not using statahead. In a real setup, I'd expect the performance to scale linearly with the core count of the client.

I don't think this PR is yet in good enough shape to be merged. I have a few open questions:

- The parallel iterator doesn't preserve the entry ordering needed in the `ls` case, so some refactor is needed.
- The thread count is controlled with `RAYON_NUM_THREADS`, which defaults to the core count (see the sketch after this comment). This would be overkill for local NVMe, but critical for larger-scale systems.

I'm interested in getting some feedback before doing another revision.
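Since the thread-count question is still open, here is one possible shape for it: a dedicated rayon pool sized from `RAYON_NUM_THREADS` (rayon's global pool already honors this variable and defaults to the core count), so ls could control parallelism without reconfiguring the global pool. The function name and fallback policy here are illustrative, pending the feedback requested above.

```rust
use rayon::ThreadPoolBuilder;

// Build a pool sized by RAYON_NUM_THREADS, falling back to the core count.
fn stat_pool() -> rayon::ThreadPool {
    let threads = std::env::var("RAYON_NUM_THREADS")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or_else(|| {
            std::thread::available_parallelism().map_or(1, |n| n.get())
        });
    ThreadPoolBuilder::new()
        .num_threads(threads)
        .build()
        .expect("failed to build rayon thread pool")
}
```

Callers would then run the metadata prefetch inside `pool.install(...)`, leaving the rest of the process on rayon's defaults.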