Skip to content

shuf: Add --random-seed, make --random-source GNU-compatible, report write failures, optimize #7585

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

blyxxyz
Copy link
Contributor

@blyxxyz blyxxyz commented Mar 26, 2025

  • Add a --random-seed option that allows using a string as a seed for deterministic output. For example: shuf --random-seed=🦀 -i1-10.

    This comes with a forward compatibility commitment: we have to keep using SHA-3 and ChaCha12 and the current (in-tree) algorithms whenever --random-seed is passed. Details and motivation:

    /// Reproducible seeded random number generation.
    ///
    /// The behavior should stay the same between releases, so don't change it without
    /// a very good reason.
    ///
    /// # How it works
    ///
    /// - Take a Unicode string as the seed.
    ///
    /// - Encode this seed as UTF-8.
    ///
    /// - Take the SHA3-256 hash of the encoded seed.
    ///
    /// - Use that hash as the input for a [`rand_chacha`] ChaCha12 RNG.
    /// (We don't touch the nonce, so that's probably zero.)
    ///
    /// - Take 64-bit samples from the RNG.
    ///
    /// - Use Lemire's method to generate uniformly distributed integers and:
    ///
    /// - With --repeat, use these to pick elements from ranges.
    ///
    /// - Without --repeat, use these to do left-to-right modern Fisher-Yates.
    ///
    /// # Why it works like this
    ///
    /// - Unicode string: Greatest common denominator between platforms. Windows doesn't
    /// let you pass raw bytes as a CLI argument and that would be bad practice anyway.
    /// A decimal or hex number would work but this is much more flexible without being
    /// unmanageable.
    ///
    /// (Footgun: if the user passes a filename we won't read from the file but the
    /// command will run anyway.)
    ///
    /// - UTF-8: That's what Rust likes and it's the least unreasonable Unicode encoding.
    ///
    /// - SHA3-256: We want to make good use of the entire user input and SHA-3 is
    /// state of the art. ChaCha12 takes a 256-bit seed.
    ///
    /// - ChaCha12: [`rand`]'s default rng as of writing. Seems state of the art.
    ///
    /// - 64-bit samples: We could often get away with 32-bit samples but let's keep things
    /// simple and only use one width. (There doesn't seem to be much of a performance hit.)
    ///
    /// - Lemire, Fisher-Yates: These are very easy to implement and maintain ourselves.
    /// `rand` provides fancier implementations but only promises reproducibility within
    /// patch releases: https://rust-random.github.io/book/crate-reprod.html
    ///
    /// Strictly speaking even `ChaCha12` is subject to breakage. But since it's a very
    /// specific algorithm I assume it's safe in practice.

    Resolves Implementation of shuf --random-source #2528.

    I'd like some input on the design. With this interface it's possible that a user passes a filename as an argument without noticing that it uses the filename itself as the seed rather than the contents of the file. Are we OK with that? I tried to make the help text as clear as possible:

    --random-seed <STRING> seed with STRING for reproducible output

  • Make the behavior of --random-source mostly compatible with GNU. I managed to black box reverse engineer their RNG (which is luckily very simple and similar to our old implementation). So now people who use this trick for reproducibility will be able to switch to uutils and still get the old output they're relying on. (Except for one weirdly specific invocation for which I've added a #[ignore] test.)

  • Use u64 instead of usize for --input-range and --head-count. It's now possible to generate integers above u32::MAX on 32-bit systems.

  • Replace the nonrepeating integer generator. The new one has better reproducibility: it matches GNU and gives identical output to seq | shuf. (The performance is practically the same.)

  • Handle --random-source I/O errors properly. They used to cause a panic (which due to buffering quirks would get printed above the output rather than at the end). Now they're reported normally.

  • Flush the BufWriter at the end. If we don't do this then write errors get ignored.

  • Use a larger write buffer (8K → 64K).

  • Use the itoa crate to format integers. With --repeat and --input-range this was a bottleneck.

I also experimented with I/O tricks like vectored writes but they turned out to be too situational, so not a lot of performance work. shuf -r -n10000000 -i0-10 now runs twice as fast though.

blyxxyz added 7 commits March 26, 2025 09:22
This is important since the output is buffered and errors may end up
ignored otherwise.

`shuf -e a b c > /dev/full` now errors while it didn't before.
This gives a 1.8× speedup for `shuf -r -n1000000 -i1-1024`.
When the --random-source option is used uutils shuf now gives
identical output to GNU shuf in many (but not all) cases. This is
helpful to users who use it to get deterministic output, e.g. by
combining it with `openssl` as suggested in the GNU info pages.

I reverse engineered the algorithm from GNU shuf's output. There may
be bugs.

Other modes of shuffling still use `rand`'s `ThreadRng`, though they
now sample a uniform distribution directly without going through the
slice helper trait.

Additionally, switch from `usize` to `u64` for `--input-range` and
`--head-count`. This way the same range of numbers can be generated on
32-bit platforms as on 64-bit platforms.
This adds a new option to get reproducible output from a seed. This
was already possible with --random-source, but doing that properly was
tricky and had poor performance.

Adding this option implies a commitment to keep using the exact same
algorithms in the future. For that reason we only use third-party
libraries for well-known algorithms and implement our own
distributions on top of that.

-----

As a teenager on King's Day I once used `shuf` for divination. People
paid €0.50 to enter a cramped tent and sat down next to me behind an
old netbook. I would ask their name and their sun sign and pipe this
information into `shuf --random-source=/dev/stdin`, which selected
pseudo-random dictionary words and `tee`d them into `espeak`.

If someone's name was too short `shuf` crashed with an end of file
error. --random-seed would have worked better.
Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/misc/stdbuf (fails in this run but passes in the 'main' branch)

We used to use a clever homegrown way to sample integers. But GNU shuf
with --random-source observably uses Fisher-Yates, and the output of
the old version depended on a heuristic (making it dangerous for
--random-seed).

So now we do Fisher-Yates here, just like we do for other inputs. In
deterministic modes the output for --input-range is identical that for
piping `seq` into `shuf`.

We imitate the old algorithm's method for keeping the resource use in
check. The performance of the new version is very close to that of the
old version: I haven't found any cases where it's much faster or much
slower.
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/stdbuf (passes in this run but fails in the 'main' branch)

@sylvestre
Copy link
Contributor

Could you please make the gnu random source test pass ? (it is fine to adjust the upstream gnu test)
thanks

@blyxxyz
Copy link
Contributor Author

blyxxyz commented Mar 27, 2025

Is there such a test? I can only find a test for passing --random-source multiple times. tests/shuf/shuf-reservoir and tests/shuf/shuf pass on my machine.

@sylvestre
Copy link
Contributor

sorry, i was confusing with shred :( sorry

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implementation of shuf --random-source
2 participants