shuf: Add `--random-seed`, make `--random-source` GNU-compatible, report write failures, optimize #7585

blyxxyz · 2025-03-26T18:37:22Z

Add a --random-seed option that allows using a string as a seed for deterministic output. For example: shuf --random-seed=🦀 -i1-10.

This comes with a forward compatibility commitment: we have to keep using SHA-3 and ChaCha12 and the current (in-tree) algorithms whenever --random-seed is passed. Details and motivation:

coreutils/src/uu/shuf/src/random_seed.rs

Lines 12 to 61 in e89aaf8

    
           /// Reproducible seeded random number generation. 
        
           /// 
        
           /// The behavior should stay the same between releases, so don't change it without 
        
           /// a very good reason. 
        
           /// 
        
           /// # How it works 
        
           /// 
        
           /// - Take a Unicode string as the seed. 
        
           /// 
        
           /// - Encode this seed as UTF-8. 
        
           /// 
        
           /// - Take the SHA3-256 hash of the encoded seed. 
        
           /// 
        
           /// - Use that hash as the input for a [`rand_chacha`] ChaCha12 RNG. 
        
           ///   (We don't touch the nonce, so that's probably zero.) 
        
           /// 
        
           /// - Take 64-bit samples from the RNG. 
        
           /// 
        
           /// - Use Lemire's method to generate uniformly distributed integers and: 
        
           /// 
        
           ///   - With --repeat, use these to pick elements from ranges. 
        
           /// 
        
           ///   - Without --repeat, use these to do left-to-right modern Fisher-Yates. 
        
           /// 
        
           /// # Why it works like this 
        
           /// 
        
           /// - Unicode string: Greatest common denominator between platforms. Windows doesn't 
        
           ///   let you pass raw bytes as a CLI argument and that would be bad practice anyway. 
        
           ///   A decimal or hex number would work but this is much more flexible without being 
        
           ///   unmanageable. 
        
           /// 
        
           ///   (Footgun: if the user passes a filename we won't read from the file but the 
        
           ///   command will run anyway.) 
        
           /// 
        
           /// - UTF-8: That's what Rust likes and it's the least unreasonable Unicode encoding. 
        
           /// 
        
           /// - SHA3-256: We want to make good use of the entire user input and SHA-3 is 
        
           ///   state of the art. ChaCha12 takes a 256-bit seed. 
        
           /// 
        
           /// - ChaCha12: [`rand`]'s default rng as of writing. Seems state of the art. 
        
           /// 
        
           /// - 64-bit samples: We could often get away with 32-bit samples but let's keep things 
        
           ///   simple and only use one width. (There doesn't seem to be much of a performance hit.) 
        
           /// 
        
           /// - Lemire, Fisher-Yates: These are very easy to implement and maintain ourselves. 
        
           ///   `rand` provides fancier implementations but only promises reproducibility within 
        
           ///   patch releases: https://rust-random.github.io/book/crate-reprod.html 
        
           /// 
        
           ///   Strictly speaking even `ChaCha12` is subject to breakage. But since it's a very 
        
           ///   specific algorithm I assume it's safe in practice.

Resolves Implementation of shuf --random-source #2528.

I'd like some input on the design. With this interface it's possible that a user passes a filename as an argument without noticing that it uses the filename itself as the seed rather than the contents of the file. Are we OK with that? I tried to make the help text as clear as possible:

--random-seed <STRING> seed with STRING for reproducible output

Make the behavior of --random-source mostly compatible with GNU. I managed to black box reverse engineer their RNG (which is luckily very simple and similar to our old implementation). So now people who use this trick for reproducibility will be able to switch to uutils and still get the old output they're relying on. (Except for one weirdly specific invocation for which I've added a #[ignore] test.)
Use u64 instead of usize for --input-range and --head-count. It's now possible to generate integers above u32::MAX on 32-bit systems.
Replace the nonrepeating integer generator. The new one has better reproducibility: it matches GNU and gives identical output to seq | shuf. (The performance is practically the same.)
Handle --random-source I/O errors properly. They used to cause a panic (which due to buffering quirks would get printed above the output rather than at the end). Now they're reported normally.
Flush the BufWriter at the end. If we don't do this then write errors get ignored.
Use a larger write buffer (8K → 64K).
Use the itoa crate to format integers. With --repeat and --input-range this was a bottleneck.

I also experimented with I/O tricks like vectored writes but they turned out to be too situational, so not a lot of performance work. shuf -r -n10000000 -i0-10 now runs twice as fast though.

This is important since the output is buffered and errors may end up ignored otherwise. `shuf -e a b c > /dev/full` now errors while it didn't before.

…rrors

This gives a 1.8× speedup for `shuf -r -n1000000 -i1-1024`.

When the --random-source option is used uutils shuf now gives identical output to GNU shuf in many (but not all) cases. This is helpful to users who use it to get deterministic output, e.g. by combining it with `openssl` as suggested in the GNU info pages. I reverse engineered the algorithm from GNU shuf's output. There may be bugs. Other modes of shuffling still use `rand`'s `ThreadRng`, though they now sample a uniform distribution directly without going through the slice helper trait. Additionally, switch from `usize` to `u64` for `--input-range` and `--head-count`. This way the same range of numbers can be generated on 32-bit platforms as on 64-bit platforms.

This adds a new option to get reproducible output from a seed. This was already possible with --random-source, but doing that properly was tricky and had poor performance. Adding this option implies a commitment to keep using the exact same algorithms in the future. For that reason we only use third-party libraries for well-known algorithms and implement our own distributions on top of that. ----- As a teenager on King's Day I once used `shuf` for divination. People paid €0.50 to enter a cramped tent and sat down next to me behind an old netbook. I would ask their name and their sun sign and pipe this information into `shuf --random-source=/dev/stdin`, which selected pseudo-random dictionary words and `tee`d them into `espeak`. If someone's name was too short `shuf` crashed with an end of file error. --random-seed would have worked better.

github-actions · 2025-03-26T19:22:39Z

GNU testsuite comparison:

Skip an intermittent issue tests/misc/stdbuf (fails in this run but passes in the 'main' branch)

We used to use a clever homegrown way to sample integers. But GNU shuf with --random-source observably uses Fisher-Yates, and the output of the old version depended on a heuristic (making it dangerous for --random-seed). So now we do Fisher-Yates here, just like we do for other inputs. In deterministic modes the output for --input-range is identical that for piping `seq` into `shuf`. We imitate the old algorithm's method for keeping the resource use in check. The performance of the new version is very close to that of the old version: I haven't found any cases where it's much faster or much slower.

github-actions · 2025-03-26T20:00:54Z

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

sylvestre · 2025-03-27T11:38:52Z

Could you please make the gnu random source test pass ? (it is fine to adjust the upstream gnu test)
thanks

blyxxyz · 2025-03-27T19:34:26Z

Is there such a test? I can only find a test for passing --random-source multiple times. tests/shuf/shuf-reservoir and tests/shuf/shuf pass on my machine.

sylvestre · 2025-03-30T09:22:17Z

sorry, i was confusing with shred :( sorry

Not sure why I added this. The test can stay though.

blyxxyz added 7 commits March 26, 2025 09:22

shuf: Move NonrepeatingIterator to own module

bbf65ba

shuf: correctness: Flush output after writing

fa386da

This is important since the output is buffered and errors may end up ignored otherwise. `shuf -e a b c > /dev/full` now errors while it didn't before.

shuf: perf: Bump output buffer to 64KB

bab5343

shuf: correctness: Do not use panics to report --random-source read e…

f6df6a6

…rrors

shuf: perf: Use itoa for integer formatting

b6d1b80

This gives a 1.8× speedup for `shuf -r -n1000000 -i1-1024`.

blyxxyz force-pushed the clean-shuf-2 branch from 535b0c0 to f34dda2 Compare March 26, 2025 19:26

blyxxyz mentioned this pull request Mar 31, 2025

Flush BufWriters, clean up error reporting #7622

Merged

blyxxyz added 3 commits April 30, 2025 13:55

Merge remote-tracking branch 'origin/main' into clean-shuf-2

a8fdcc1

shuf: Remove pointless heuristic

10bc746

Not sure why I added this. The test can stay though.

shuf: Add tests

2f06a48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

shuf: Add `--random-seed`, make `--random-source` GNU-compatible, report write failures, optimize #7585

shuf: Add `--random-seed`, make `--random-source` GNU-compatible, report write failures, optimize #7585

Uh oh!

blyxxyz commented Mar 26, 2025

Uh oh!

github-actions bot commented Mar 26, 2025

Uh oh!

github-actions bot commented Mar 26, 2025

Uh oh!

sylvestre commented Mar 27, 2025

Uh oh!

blyxxyz commented Mar 27, 2025

Uh oh!

sylvestre commented Mar 30, 2025

Uh oh!

Uh oh!

	/// Reproducible seeded random number generation.
	///
	/// The behavior should stay the same between releases, so don't change it without
	/// a very good reason.
	///
	/// # How it works
	///
	/// - Take a Unicode string as the seed.
	///
	/// - Encode this seed as UTF-8.
	///
	/// - Take the SHA3-256 hash of the encoded seed.
	///
	/// - Use that hash as the input for a [`rand_chacha`] ChaCha12 RNG.
	/// (We don't touch the nonce, so that's probably zero.)
	///
	/// - Take 64-bit samples from the RNG.
	///
	/// - Use Lemire's method to generate uniformly distributed integers and:
	///
	/// - With --repeat, use these to pick elements from ranges.
	///
	/// - Without --repeat, use these to do left-to-right modern Fisher-Yates.
	///
	/// # Why it works like this
	///
	/// - Unicode string: Greatest common denominator between platforms. Windows doesn't
	/// let you pass raw bytes as a CLI argument and that would be bad practice anyway.
	/// A decimal or hex number would work but this is much more flexible without being
	/// unmanageable.
	///
	/// (Footgun: if the user passes a filename we won't read from the file but the
	/// command will run anyway.)
	///
	/// - UTF-8: That's what Rust likes and it's the least unreasonable Unicode encoding.
	///
	/// - SHA3-256: We want to make good use of the entire user input and SHA-3 is
	/// state of the art. ChaCha12 takes a 256-bit seed.
	///
	/// - ChaCha12: [`rand`]'s default rng as of writing. Seems state of the art.
	///
	/// - 64-bit samples: We could often get away with 32-bit samples but let's keep things
	/// simple and only use one width. (There doesn't seem to be much of a performance hit.)
	///
	/// - Lemire, Fisher-Yates: These are very easy to implement and maintain ourselves.
	/// `rand` provides fancier implementations but only promises reproducibility within
	/// patch releases: https://rust-random.github.io/book/crate-reprod.html
	///
	/// Strictly speaking even `ChaCha12` is subject to breakage. But since it's a very
	/// specific algorithm I assume it's safe in practice.

Uh oh!

shuf: Add --random-seed, make --random-source GNU-compatible, report write failures, optimize #7585

Are you sure you want to change the base?

shuf: Add --random-seed, make --random-source GNU-compatible, report write failures, optimize #7585

Uh oh!

Conversation

blyxxyz commented Mar 26, 2025

Uh oh!

github-actions bot commented Mar 26, 2025

Uh oh!

github-actions bot commented Mar 26, 2025

Uh oh!

sylvestre commented Mar 27, 2025

Uh oh!

blyxxyz commented Mar 27, 2025

Uh oh!

sylvestre commented Mar 30, 2025

Uh oh!

Uh oh!

shuf: Add `--random-seed`, make `--random-source` GNU-compatible, report write failures, optimize #7585

shuf: Add `--random-seed`, make `--random-source` GNU-compatible, report write failures, optimize #7585