-
-
Notifications
You must be signed in to change notification settings - Fork 1.5k
csplit: rewrite with one function per pattern type #7806
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
GNU testsuite comparison:
|
1 similar comment
GNU testsuite comparison:
|
It is still in draft mode :) |
89302d5
to
257f26f
Compare
GNU testsuite comparison:
|
7f0ef59
to
77d8a07
Compare
GNU testsuite comparison:
|
This pull request fixes several bugs in `csplit` by rewriting the code to have one function per pattern type. The functions are * `up_to_line_loop` for patterns like `123 {45}` and `123 {*}`, * `up_to_match_pos_offset_repeat` for `/regex/+123 {45}`, * `up_to_match_pos_offset_always` for `/regex/+123 {*}`, * `up_to_match_neg_offset_repeat` for `/regex/-123 {45}`, * `up_to_match_neg_offset_always` for `/regex/-123 {*}`, * `skip_to_match_pos_offset_repeat` for `%regex%+123 {45}`, * `skip_to_match_pos_offset_always` for `%regex%+123 {*}`, * `skip_to_match_neg_offset_repeat` for `%regex%-123 {45}`, * `skip_to_match_neg_offset_always` for `%regex%-123 {*}`, The command-line options, global state, and per-chunk state live in the `Splitter` struct (which replaces the `SplitWriter` and `InputSplitter` structs). There are a ton of ugly corner cases where different types of patterns interact; most of these should be called out with comments. I added some tests that were already passing, but were missing test coverage, including * same regex pattern twice in a row, as in `/foo/ /foo/` * line number out of range on empty input, as in `: | csplit - 1`, * skip to match with negative offset repeating indefinitely, as in `csplit %foo%-5 {*}`. Some bugs fixed include: * missing a regex match on the final line of a line number pattern, as in `seq 20 | csplit - 10 /10/`, * missing output `0\n` when input is a directory, * uutils#7286, a missing empty file with `--suppress-matched`. Some bugs remain: * The GNU test file `tests/csplit/csplit-suppress-matched` remains failing. I'm hoping that the change in this pull request will make it easier to reason about csplit and fix the remaining failures. * Probably related to the previous item, there is some weird behavior of GNU csplit that I could not understand. This was already partially covered by a test named `repeat_everything`. I've extracted just the relevant part of that into its own test named `test_repeat_line_after_match_suppress_matched` that has TODO comments for the behavior that differs from GNU `csplit`. Fixes uutils#7286
77d8a07
to
6e9d4f8
Compare
GNU testsuite comparison:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors csplit by rewriting pattern handling into dedicated functions per pattern type and fixes several bugs. Key changes include:
- Refactored pattern matching logic to use one function per pattern type.
- Expanded test coverage to validate various edge cases and updated expected outputs.
- Consolidated global and per-chunk state in the Splitter struct.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
File | Description |
---|---|
tests/by-util/test_csplit.rs | New and updated tests to cover edge cases and validate new behavior. |
src/uu/csplit/src/patterns.rs | Simplified pattern execution by replacing an iterator with a function. |
Comments suppressed due to low confidence (3)
tests/by-util/test_csplit.rs:1558
- [nitpick] The Windows section in test_directory_input_file contains a commented-out stdout expectation. Please clarify the intended behavior on Windows to ensure the test covers the proper output.
// .stdout_is("0\n")
src/uu/csplit/src/patterns.rs:51
- [nitpick] Consider renaming the parameter 'i' to a more descriptive name, such as 'iteration_count', for improved readability.
pub(crate) fn should_continue(&self, i: usize) -> bool {
tests/by-util/test_csplit.rs:1388
- The change in test_line_num_range_with_up_to_match3 from failing to succeeding, with output '18\n0\n123\n', differs from previous behavior. Confirm that this behavior change is intentional and well documented.
.succeeds()
This pull request fixes several bugs in csplit by rewriting the code to have one function per pattern type. The functions are
up_to_line_repeat
for patterns like123 {45}
up_to_line_always
for123 {*}
up_to_match_pos_offset_repeat
for/regex/+123 {45}
up_to_match_pos_offset_always
for/regex/+123 {*}
up_to_match_neg_offset_repeat
for/regex/-123 {45}
up_to_match_neg_offset_always
for/regex/-123 {*}
skip_to_match_pos_offset_repeat
for%regex%+123 {45}
skip_to_match_pos_offset_always
for%regex%+123 {*}
skip_to_match_neg_offset_repeat
for%regex%-123 {45}
skip_to_match_neg_offset_always
for%regex%-123 {*}
The command-line options, global state, and per-chunk state live in the
Splitter
struct (which replaces theSplitWriter
andInputSplitter
structs). There are a ton of ugly corner cases where different types of patterns interact. I've tried to call those out with comments.I added some tests that were already passing, but were missing test coverage, including
/foo/ /foo/
: | csplit - 1
,csplit %foo%-5 {*}
.Some bugs fixed include:
seq 20 | csplit - 10 /10/
,0\n
when input is a directory,--suppress-matched
.Some bugs remain:
tests/csplit/csplit-suppress-matched
remains failing. I'm hoping that the change in this pull request will make it easier to reason about csplit and fix the remaining failures.repeat_everything
. I've extracted just the relevant part of that into its own test namedtest_repeat_line_after_match_suppress_matched
that has TODO comments for the behavior that differs from GNU csplit.Fixes #7286