Closed
Description
While investigating this issue, I found that the chunking logic for merges opens one more file than necessary. The problem is here:
```rust
// Merge already sorted `MergeInput`s.
pub fn merge_with_file_limit<
    'a,
    M: MergeInput + 'static,
    F: ExactSizeIterator<Item = UResult<M>>,
    Tmp: WriteableTmpFile + 'static,
>(
    files: F,
    settings: &'a GlobalSettings,
    tmp_dir: &mut TmpDirWrapper,
) -> UResult<FileMerger<'a>> {
    if files.len() > settings.merge_batch_size {
        let mut remaining_files = files.len();
        let batches = files.chunks(settings.merge_batch_size);
        let mut batches = batches.into_iter();
        let mut temporary_files = vec![];
        while remaining_files != 0 {
            // Work around the fact that `Chunks` is not an `ExactSizeIterator`.
            remaining_files = remaining_files.saturating_sub(settings.merge_batch_size);
            // ***ISSUE IS ON THE LINE BELOW*** The call to `batches.next()`
            // advances the underlying iterator, opening the file too soon.
            let merger = merge_without_limit(batches.next().unwrap(), settings)?;
            let mut tmp_file =
                Tmp::create(tmp_dir.next_file()?, settings.compress_prog.as_deref())?;
            merger.write_all_to(settings, tmp_file.as_write())?;
            temporary_files.push(tmp_file.finished_writing()?);
        }
        // ...
```
So, for example, with a batch size of 2, the call to `batches.next()` above opens the first file, and then `merge_without_limit` iterates two more times (i.e. the batch size), resulting in three open input files rather than the two required. This might seem like a minor issue, but it breaks GNU compatibility for tests that run under `ulimit` with a limited number of file descriptors available.
I have a fix for this that I plan to PR shortly.