Skip to content

csplit: rewrite with one function per pattern type #7806

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jfinkels
Copy link
Collaborator

@jfinkels jfinkels commented Apr 20, 2025

This pull request fixes several bugs in csplit by rewriting the code to have one function per pattern type. The functions are

  • up_to_line_repeat for patterns like 123 {45}
  • up_to_line_always for 123 {*}
  • up_to_match_pos_offset_repeat for /regex/+123 {45}
  • up_to_match_pos_offset_always for /regex/+123 {*}
  • up_to_match_neg_offset_repeat for /regex/-123 {45}
  • up_to_match_neg_offset_always for /regex/-123 {*}
  • skip_to_match_pos_offset_repeat for %regex%+123 {45}
  • skip_to_match_pos_offset_always for %regex%+123 {*}
  • skip_to_match_neg_offset_repeat for %regex%-123 {45}
  • skip_to_match_neg_offset_always for %regex%-123 {*}

The command-line options, global state, and per-chunk state live in the Splitter struct (which replaces the SplitWriter and InputSplitter structs). There are a ton of ugly corner cases where different types of patterns interact. I've tried to call those out with comments.

I added some tests that were already passing, but were missing test coverage, including

  • same regex pattern twice in a row, as in /foo/ /foo/
  • line number out of range on empty input, as in : | csplit - 1,
  • skip to match with negative offset repeating indefinitely, as in csplit %foo%-5 {*}.

Some bugs fixed include:

Some bugs remain:

  • The GNU test file tests/csplit/csplit-suppress-matched remains failing. I'm hoping that the change in this pull request will make it easier to reason about csplit and fix the remaining failures.
  • Probably related to the previous item, there is some weird behavior of GNU csplit that I could not understand. This was already partially covered by a test named repeat_everything. I've extracted just the relevant part of that into its own test named test_repeat_line_after_match_suppress_matched that has TODO comments for the behavior that differs from GNU csplit.

Fixes #7286

Copy link

GNU testsuite comparison:

GNU test failed: tests/misc/tee. tests/misc/tee is passing on 'main'. Maybe you have to rebase?
Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

1 similar comment
Copy link

GNU testsuite comparison:

GNU test failed: tests/misc/tee. tests/misc/tee is passing on 'main'. Maybe you have to rebase?
Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

@sylvestre
Copy link
Contributor

It is still in draft mode :)

@jfinkels jfinkels force-pushed the csplit-simplify-2 branch from 89302d5 to 257f26f Compare April 27, 2025 02:39
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)
Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)

@jfinkels jfinkels force-pushed the csplit-simplify-2 branch from 7f0ef59 to 77d8a07 Compare April 27, 2025 18:04
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)

This pull request fixes several bugs in `csplit` by rewriting the code
to have one function per pattern type. The functions are

* `up_to_line_loop` for patterns like `123 {45}` and `123 {*}`,
* `up_to_match_pos_offset_repeat` for `/regex/+123 {45}`,
* `up_to_match_pos_offset_always` for `/regex/+123 {*}`,
* `up_to_match_neg_offset_repeat` for `/regex/-123 {45}`,
* `up_to_match_neg_offset_always` for `/regex/-123 {*}`,
* `skip_to_match_pos_offset_repeat` for `%regex%+123 {45}`,
* `skip_to_match_pos_offset_always` for `%regex%+123 {*}`,
* `skip_to_match_neg_offset_repeat` for `%regex%-123 {45}`,
* `skip_to_match_neg_offset_always` for `%regex%-123 {*}`,

The command-line options, global state, and per-chunk state live in the
`Splitter` struct (which replaces the `SplitWriter` and `InputSplitter`
structs). There are a ton of ugly corner cases where different types of
patterns interact; most of these should be called out with comments.

I added some tests that were already passing, but were missing test
coverage, including

* same regex pattern twice in a row, as in `/foo/ /foo/`
* line number out of range on empty input, as in `: | csplit - 1`,
* skip to match with negative offset repeating indefinitely, as in
  `csplit %foo%-5 {*}`.

Some bugs fixed include:

* missing a regex match on the final line of a line number pattern, as
  in `seq 20 | csplit - 10 /10/`,
* missing output `0\n` when input is a directory,
* uutils#7286, a missing empty file with `--suppress-matched`.

Some bugs remain:

* The GNU test file `tests/csplit/csplit-suppress-matched` remains
  failing. I'm hoping that the change in this pull request will make it
  easier to reason about csplit and fix the remaining failures.
* Probably related to the previous item, there is some weird behavior of
  GNU csplit that I could not understand. This was already partially
  covered by a test named `repeat_everything`. I've extracted just the
  relevant part of that into its own test named
  `test_repeat_line_after_match_suppress_matched` that has TODO comments
  for the behavior that differs from GNU `csplit`.

Fixes uutils#7286
@jfinkels jfinkels force-pushed the csplit-simplify-2 branch from 77d8a07 to 6e9d4f8 Compare April 27, 2025 20:33
Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)

@sylvestre sylvestre requested a review from Copilot April 27, 2025 21:19
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors csplit by rewriting pattern handling into dedicated functions per pattern type and fixes several bugs. Key changes include:

  • Refactored pattern matching logic to use one function per pattern type.
  • Expanded test coverage to validate various edge cases and updated expected outputs.
  • Consolidated global and per-chunk state in the Splitter struct.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
tests/by-util/test_csplit.rs New and updated tests to cover edge cases and validate new behavior.
src/uu/csplit/src/patterns.rs Simplified pattern execution by replacing an iterator with a function.
Comments suppressed due to low confidence (3)

tests/by-util/test_csplit.rs:1558

  • [nitpick] The Windows section in test_directory_input_file contains a commented-out stdout expectation. Please clarify the intended behavior on Windows to ensure the test covers the proper output.
        // .stdout_is("0\n")

src/uu/csplit/src/patterns.rs:51

  • [nitpick] Consider renaming the parameter 'i' to a more descriptive name, such as 'iteration_count', for improved readability.
pub(crate) fn should_continue(&self, i: usize) -> bool {

tests/by-util/test_csplit.rs:1388

  • The change in test_line_num_range_with_up_to_match3 from failing to succeeding, with output '18\n0\n123\n', differs from previous behavior. Confirm that this behavior change is intentional and well documented.
        .succeeds()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

csplit: --suppress-matched incorrectly elides last empty file
2 participants