-
-
Notifications
You must be signed in to change notification settings - Fork 1.5k
shuf: Add --random-seed
, make --random-source
GNU-compatible, report write failures, optimize
#7585
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
blyxxyz
wants to merge
11
commits into
uutils:main
Choose a base branch
from
blyxxyz:clean-shuf-2
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This is important since the output is buffered and errors may end up ignored otherwise. `shuf -e a b c > /dev/full` now errors while it didn't before.
This gives a 1.8× speedup for `shuf -r -n1000000 -i1-1024`.
When the --random-source option is used uutils shuf now gives identical output to GNU shuf in many (but not all) cases. This is helpful to users who use it to get deterministic output, e.g. by combining it with `openssl` as suggested in the GNU info pages. I reverse engineered the algorithm from GNU shuf's output. There may be bugs. Other modes of shuffling still use `rand`'s `ThreadRng`, though they now sample a uniform distribution directly without going through the slice helper trait. Additionally, switch from `usize` to `u64` for `--input-range` and `--head-count`. This way the same range of numbers can be generated on 32-bit platforms as on 64-bit platforms.
This adds a new option to get reproducible output from a seed. This was already possible with --random-source, but doing that properly was tricky and had poor performance. Adding this option implies a commitment to keep using the exact same algorithms in the future. For that reason we only use third-party libraries for well-known algorithms and implement our own distributions on top of that. ----- As a teenager on King's Day I once used `shuf` for divination. People paid €0.50 to enter a cramped tent and sat down next to me behind an old netbook. I would ask their name and their sun sign and pipe this information into `shuf --random-source=/dev/stdin`, which selected pseudo-random dictionary words and `tee`d them into `espeak`. If someone's name was too short `shuf` crashed with an end of file error. --random-seed would have worked better.
GNU testsuite comparison:
|
We used to use a clever homegrown way to sample integers. But GNU shuf with --random-source observably uses Fisher-Yates, and the output of the old version depended on a heuristic (making it dangerous for --random-seed). So now we do Fisher-Yates here, just like we do for other inputs. In deterministic modes the output for --input-range is identical that for piping `seq` into `shuf`. We imitate the old algorithm's method for keeping the resource use in check. The performance of the new version is very close to that of the old version: I haven't found any cases where it's much faster or much slower.
GNU testsuite comparison:
|
Could you please make the gnu random source test pass ? (it is fine to adjust the upstream gnu test) |
Is there such a test? I can only find a test for passing |
sorry, i was confusing with shred :( sorry |
Not sure why I added this. The test can stay though.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Add a
--random-seed
option that allows using a string as a seed for deterministic output. For example:shuf --random-seed=🦀 -i1-10
.This comes with a forward compatibility commitment: we have to keep using SHA-3 and ChaCha12 and the current (in-tree) algorithms whenever
--random-seed
is passed. Details and motivation:coreutils/src/uu/shuf/src/random_seed.rs
Lines 12 to 61 in e89aaf8
Resolves Implementation of
shuf --random-source
#2528.I'd like some input on the design. With this interface it's possible that a user passes a filename as an argument without noticing that it uses the filename itself as the seed rather than the contents of the file. Are we OK with that? I tried to make the help text as clear as possible:
--random-seed <STRING> seed with STRING for reproducible output
Make the behavior of
--random-source
mostly compatible with GNU. I managed to black box reverse engineer their RNG (which is luckily very simple and similar to our old implementation). So now people who use this trick for reproducibility will be able to switch to uutils and still get the old output they're relying on. (Except for one weirdly specific invocation for which I've added a#[ignore]
test.)Use
u64
instead ofusize
for--input-range
and--head-count
. It's now possible to generate integers aboveu32::MAX
on 32-bit systems.Replace the nonrepeating integer generator. The new one has better reproducibility: it matches GNU and gives identical output to
seq | shuf
. (The performance is practically the same.)Handle
--random-source
I/O errors properly. They used to cause a panic (which due to buffering quirks would get printed above the output rather than at the end). Now they're reported normally.Flush the
BufWriter
at the end. If we don't do this then write errors get ignored.Use a larger write buffer (8K → 64K).
Use the
itoa
crate to format integers. With--repeat
and--input-range
this was a bottleneck.I also experimented with I/O tricks like vectored writes but they turned out to be too situational, so not a lot of performance work.
shuf -r -n10000000 -i0-10
now runs twice as fast though.