Skip to content

Conversation

zhitkoff
Copy link
Contributor

@zhitkoff zhitkoff commented Oct 29, 2023

This PR adds

  • full implementation for ---io-blksize option
  • refactoring of how splitting into N chunks works for --number strategies
  • handling input size of stdin stream and file size for files at /dev, /proc, /sys and similar locations that either report 0 for file size, while actually having some content OR report size greater than actual content

passes GNU tests/b-chunk.sh

@github-actions
Copy link

GNU testsuite comparison:

Congrats! The gnu test tests/split/b-chunk is no longer failing!

@zhitkoff
Copy link
Contributor Author

@tertsdiepraam would you mind reviewing this one ?

@zhitkoff zhitkoff marked this pull request as draft November 1, 2023 21:53
@zhitkoff
Copy link
Contributor Author

zhitkoff commented Nov 1, 2023

setting back to draft - even though it passes the GNU test, the implementation for ---io-blksize option is incomplete, needs more work

Copy link

github-actions bot commented Nov 2, 2023

GNU testsuite comparison:

Congrats! The gnu test tests/rm/rm2 is no longer failing!
Congrats! The gnu test tests/split/b-chunk is no longer failing!
Skip an intermittent issue tests/rm/rm1

Copy link

github-actions bot commented Nov 8, 2023

GNU testsuite comparison:

Congrats! The gnu test tests/split/b-chunk is no longer failing!

Copy link

github-actions bot commented Nov 8, 2023

GNU testsuite comparison:

Congrats! The gnu test tests/split/b-chunk is no longer failing!

@zhitkoff zhitkoff marked this pull request as ready for review November 8, 2023 02:52
@zhitkoff
Copy link
Contributor Author

zhitkoff commented Nov 8, 2023

@tertsdiepraam @sylvestre this one is ready for review

}
}
None => b'\n',
};

let io_blksize: Option<usize> = if let Some(s) = matches.get_one::<String>(OPT_IO_BLKSIZE) {
Copy link
Contributor

@sylvestre sylvestre Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about ?

let io_blksize: Option<usize> = matches.get_one::<String>(OPT_IO_BLKSIZE).map(|s| {
    parse_size_u64(s)
        .ok()
        .and_then(|n| {
            let n = n.try_into().map_err(|_| SettingsError::InvalidIOBlockSize(s.to_string()))?;
            if n > OPT_IO_BLKSIZE_MAX {
                Err(SettingsError::InvalidIOBlockSize(s.to_string()))
            } else {
                Ok(n)
            }
        })
}).transpose()?;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case it would "eat" possible error returned by parse_size_u64(), so it cannot be bubbled up to from()->Result. Also, since ok() converts it to Option and the and_then() expects an option to be returned from the closure, it is not straitforward to return the SettingsError in either places within the closure inside closure - i.e. doing return Err(SettingsError::InvalidIOBlockSize(s.to_string())) or Some(Err(SettingsError::InvalidIOBlockSize(s.to_string()))) would not work : as return is not for from()->Result function but closure, and wrapping in Some() to satisfy and_then() signature we will end up with Option<Option<Result<>>> as a retuning value from closure from map() instead of Option<Result<>> ... could unwrap(), but then still missing/eating error from parse_size_u64() ...
I could be missing something though

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please ignore my comment then :(

// empty STDIN stream,
// and files with true file size 0
// will also fit here
input_size = num_bytes;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about an early exit ?

Suggested change
input_size = num_bytes;
Ok(num_bytes)

let mut tmp_fd = File::open(Path::new(input))?;
let end = tmp_fd.seek(SeekFrom::End(0))?;
if end > 0 {
input_size = end;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same:

Suggested change
input_size = end;
Ok(end)

Comment on lines 657 to 663
let read_limit: u64 = if let Some(n) = io_blksize {
*n
} else {
OPT_IO_BLKSIZE_MAX
}
.try_into()
.unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let read_limit: u64 = if let Some(n) = io_blksize {
*n
} else {
OPT_IO_BLKSIZE_MAX
}
.try_into()
.unwrap();
let read_limit = io_blksize.unwrap_or(OPT_IO_BLKSIZE_MAX) as u64;

// could report incorrect file size via `metadata.len()`
// either `0` while there is content
// or a size larger than actual content
input_size = get_irregular_input_size(input, reader, buf, io_blksize)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, early return

Suggested change
input_size = get_irregular_input_size(input, reader, buf, io_blksize)?;
get_irregular_input_size(input, reader, buf, io_blksize)

Comment on lines 747 to 753
// Regular file
let metadata = metadata(input)?;
input_size = metadata.len();
// Double check the size if `metadata.len()` reports `0`
if input_size == 0 {
input_size = get_irregular_input_size(input, reader, buf, io_blksize)?;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Regular file
let metadata = metadata(input)?;
input_size = metadata.len();
// Double check the size if `metadata.len()` reports `0`
if input_size == 0 {
input_size = get_irregular_input_size(input, reader, buf, io_blksize)?;
}
let metadata = metadata(input)?;
let size = metadata.len();
// If metadata reports size 0, use get_irregular_input_size to double-check
if size == 0 {
get_irregular_input_size(input, reader, buf, io_blksize)
} else {
Ok(size)
}

Copy link

github-actions bot commented Nov 8, 2023

GNU testsuite comparison:

Congrats! The gnu test tests/split/b-chunk is no longer failing!

Copy link
Member

@tertsdiepraam tertsdiepraam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive work! Seeing that GNU test pass is great!

Comment on lines 58 to 60
// The ---io parameter is consumed and ignored.
// The parameter is included to make GNU coreutils tests pass.
static OPT_IO: &str = "-io";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example from GNU:

❯ split ---
split: option '---io-blksize' requires an argument
Try 'split --help' for more information.
❯ split ---io-bl
split: option '---io-blksize' requires an argument
Try 'split --help' for more information.

This is not a separate option, but one of those things where a long argument can be shortened to the first few letters if its unambiguous. clap supports this too and I think it's turned on (it's infer_long_args(true)) so do we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, looks like it is the case. will remove

// of bytes per chunk.
//
// Get the size of the input in bytes
let initial_buf: &mut Vec<u8> = &mut vec![];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let initial_buf: &mut Vec<u8> = &mut vec![];
let initial_buf = Vec::new();

Comment on lines 1248 to 1251
if usize::BITS < 64 {
let _: usize = num_chunks
.try_into()
.map_err(|_| USimpleError::new(1, "Number of chunks too big"))?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change? The original code makes sense to me. If we need it to fit in a usize why wouldn't we make it a usize?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it made sense for the original code because num_chunks was later used in method calls that require usize like writers.iter_mut().take(num_chunks - 1), and there was a test to test for that error - looks like only for code coverage %. I left it in to keep the test, but looking at it more closely - it does not need to be usize, so I will remove it along with the test that checked for it.

Comment on lines 723 to 726
if input == "-"
|| input.starts_with("/dev/")
|| input.starts_with("/proc/")
|| input.starts_with("/sys/")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels inherently brittle and I'm wondering whether there's a better way to do it. For example, we could have links to all of these filesystems and that should still work. Based on the syscalls they make, I don't think GNU has exceptions like, though I'm not entirely sure what they are doing instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they might just try to get the length first and try this other method if that returns 0? I'm not sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in GNU they first go straight to reading it into a buffer and then compare that to what metadata/filesystem is reporting + do seek on the file if those disagree (as well as trying to copy things arround into temp file/buffer, etc) and some other edge cases. I tried to avoid reading file into a buffer first for everything, but could re-implement to closer follow GNU approach

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think doing what GNU does is the right call! Nice investigative work! (As a precaution: remember not to look at the GNU source code)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be ready to go

Comment on lines 1282 to 1285
loop {
let chunk_size = chunk_size_base + (chunk_size_reminder > i) as u64;
let buf: &mut Vec<u8> = &mut vec![];
i += 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this work?

for i in 1.. {
    ...
}

Copy link

github-actions bot commented Nov 9, 2023

GNU testsuite comparison:

Congrats! The gnu test tests/split/b-chunk is no longer failing!
GNU test failed: tests/tail/truncate. tests/tail/truncate is passing on 'main'. Maybe you have to rebase?

Copy link

github-actions bot commented Nov 9, 2023

GNU testsuite comparison:

Congrats! The gnu test tests/split/b-chunk is no longer failing!

@zhitkoff
Copy link
Contributor Author

zhitkoff commented Nov 9, 2023

@tertsdiepraam @sylvestre
made changes based on review, please let me know if you have any more suggestions/comments

Copy link

GNU testsuite comparison:

Congrats! The gnu test tests/split/b-chunk is no longer failing!

@zhitkoff
Copy link
Contributor Author

@tertsdiepraam @sylvestre I have another set of changes ready that pass next Gnu test for split 'l-chunk', but they are based on code included in this PR. Should I add them into this PR or wait until this one is resolved and then submit those in a new one?

@sylvestre sylvestre merged commit eb00c19 into uutils:main Nov 17, 2023
zhitkoff added a commit to zhitkoff/coreutils that referenced this pull request Nov 17, 2023
---------

Co-authored-by: Terts Diepraam <terts.diepraam@gmail.com>
Co-authored-by: Daniel Hofstetter <daniel.hofstetter@42dh.com>
Co-authored-by: Brandon Elam Barker <brandon.barker@gmail.com>
Co-authored-by: Kostiantyn Hryshchuk <statheres@gmail.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
@zhitkoff zhitkoff deleted the split-b-chunk branch December 3, 2023 21:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants