Skip to content

Sort merge chunking opens an extra file #6944

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
karlmcdowall opened this issue Dec 9, 2024 · 0 comments · Fixed by #6957
Closed

Sort merge chunking opens an extra file #6944

karlmcdowall opened this issue Dec 9, 2024 · 0 comments · Fixed by #6957
Labels

Comments

@karlmcdowall
Copy link
Contributor

As part of investigating this issue, found that the chunking logic for merges opens one file more than necessary.

Issue is here...

// Merge already sorted `MergeInput`s.
pub fn merge_with_file_limit<
    'a,
    M: MergeInput + 'static,
    F: ExactSizeIterator<Item = UResult<M>>,
    Tmp: WriteableTmpFile + 'static,
>(
    files: F,
    settings: &'a GlobalSettings,
    tmp_dir: &mut TmpDirWrapper,
) -> UResult<FileMerger<'a>> {
    if files.len() > settings.merge_batch_size {
        let mut remaining_files = files.len();
        let batches = files.chunks(settings.merge_batch_size);
        let mut batches = batches.into_iter();
        let mut temporary_files = vec![];
        while remaining_files != 0 {
            // Work around the fact that `Chunks` is not an `ExactSizeIterator`.
            remaining_files = remaining_files.saturating_sub(settings.merge_batch_size);

// ***ISSUE IS ON THE LINE BELOW*** The call to batches.next() advances the underlying iterator, opening the file too soon.
            let merger = merge_without_limit(batches.next().unwrap(), settings)?;         
            let mut tmp_file =
                Tmp::create(tmp_dir.next_file()?, settings.compress_prog.as_deref())?;
            merger.write_all_to(settings, tmp_file.as_write())?;
            temporary_files.push(tmp_file.finished_writing()?);
        }

So, for example, if you had a batch size of 2, the call to batches.next() above would open the first file, then merge_without_limit would iterate two more times (i.e. batch size of two) resulting in three open input-files rather than just the two required. This might seem like a minor issue, but it's impacting GNU compatibility for tests that run under ulimit with limited file descriptors available.

I have a fix for this that I plan to PR shortly.

karlmcdowall added a commit to karlmcdowall/coreutils that referenced this issue Dec 13, 2024
Fix bug uutils#6944
Rework the way batching is done with sort such that it doesn't open
more input files than necessary. Previously, the code would always
open one extra input file which causes problems in ulimit scenarios.
Add additional test case.
karlmcdowall added a commit to karlmcdowall/coreutils that referenced this issue Dec 13, 2024
Fix bug uutils#6944
Rework the way batching is done with sort such that it doesn't open
more input files than necessary. Previously, the code would always
open one extra input file which causes problems in ulimit scenarios.
Add additional test case.
karlmcdowall added a commit to karlmcdowall/coreutils that referenced this issue Dec 13, 2024
Fix bug uutils#6944
Rework the way batching is done with sort such that it doesn't open
more input files than necessary. Previously, the code would always
open one extra input file which causes problems in ulimit scenarios.
Add additional test case.
karlmcdowall added a commit to karlmcdowall/coreutils that referenced this issue Dec 13, 2024
Fix bug uutils#6944
Rework the way batching is done with sort such that it doesn't open
more input files than necessary. Previously, the code would always
open one extra input file which causes problems in ulimit scenarios.
Add additional test case.
karlmcdowall added a commit to karlmcdowall/coreutils that referenced this issue Dec 14, 2024
Fix bug uutils#6944
Rework the way batching is done with sort such that it doesn't open
more input files than necessary. Previously, the code would always
open one extra input file which causes problems in ulimit scenarios.
Add additional test case.
karlmcdowall added a commit to karlmcdowall/coreutils that referenced this issue Dec 14, 2024
Fix bug uutils#6944
Rework the way batching is done with sort such that it doesn't open
more input files than necessary. Previously, the code would always
open one extra input file which causes problems in ulimit scenarios.
Add additional test case.
karlmcdowall added a commit to karlmcdowall/coreutils that referenced this issue Dec 14, 2024
Fix bug uutils#6944
Rework the way batching is done with sort such that it doesn't open
more input files than necessary. Previously, the code would always
open one extra input file which causes problems in ulimit scenarios.
Add additional test case.
karlmcdowall added a commit to karlmcdowall/coreutils that referenced this issue Dec 15, 2024
Fix bug uutils#6944
Rework the way batching is done with sort such that it doesn't open
more input files than necessary. Previously, the code would always
open one extra input file which causes problems in ulimit scenarios.
Add additional test case.
karlmcdowall added a commit to karlmcdowall/coreutils that referenced this issue Dec 15, 2024
Fix bug uutils#6944
Rework the way batching is done with sort such that it doesn't open
more input files than necessary. Previously, the code would always
open one extra input file which causes problems in ulimit scenarios.
Add additional test case.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants