Skip to content

Sort merge chunking opens an extra file #6944

Closed
@karlmcdowall

Description

@karlmcdowall

As part of investigating this issue, found that the chunking logic for merges opens one file more than necessary.

Issue is here...

// Merge already sorted `MergeInput`s.
pub fn merge_with_file_limit<
    'a,
    M: MergeInput + 'static,
    F: ExactSizeIterator<Item = UResult<M>>,
    Tmp: WriteableTmpFile + 'static,
>(
    files: F,
    settings: &'a GlobalSettings,
    tmp_dir: &mut TmpDirWrapper,
) -> UResult<FileMerger<'a>> {
    if files.len() > settings.merge_batch_size {
        let mut remaining_files = files.len();
        let batches = files.chunks(settings.merge_batch_size);
        let mut batches = batches.into_iter();
        let mut temporary_files = vec![];
        while remaining_files != 0 {
            // Work around the fact that `Chunks` is not an `ExactSizeIterator`.
            remaining_files = remaining_files.saturating_sub(settings.merge_batch_size);

// ***ISSUE IS ON THE LINE BELOW*** The call to batches.next() advances the underlying iterator, opening the file too soon.
            let merger = merge_without_limit(batches.next().unwrap(), settings)?;         
            let mut tmp_file =
                Tmp::create(tmp_dir.next_file()?, settings.compress_prog.as_deref())?;
            merger.write_all_to(settings, tmp_file.as_write())?;
            temporary_files.push(tmp_file.finished_writing()?);
        }

So, for example, if you had a batch size of 2, the call to batches.next() above would open the first file, then merge_without_limit would iterate two more times (i.e. batch size of two) resulting in three open input-files rather than just the two required. This might seem like a minor issue, but it's impacting GNU compatibility for tests that run under ulimit with limited file descriptors available.

I have a fix for this that I plan to PR shortly.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions