Skip to content

gh-131878: Fix input of unicode characters with two or more code points in new pyrepl on Windows #131901

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

sergey-miryanov
Copy link
Contributor

@sergey-miryanov sergey-miryanov commented Mar 30, 2025

If unicode characters with two or more codepoints (ñ or é) typed or pasted, then it should be converted to bytes before passing to eventqueue.push.

Also fixed handling of exceptions while decoding buffer. If exception occurred then buffer should be flushed to prevent mixing it with following control commands (for example "\x1b[201").

- with two codepoints or more
@python-cla-bot
Copy link

python-cla-bot bot commented Apr 6, 2025

All commit authors signed the Contributor License Agreement.

CLA signed

@chris-eibl
Copy link
Member

LGTM. I cannot do a thorough review, but at least confirm, that with this PR I can enter äÄöÖüÜ, etc, again 🚀

Copy link
Member

@tomasr8 tomasr8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested with a Czech keyboard and everything works. Thanks!

sergey-miryanov and others added 2 commits April 26, 2025 21:31
Co-authored-by: Tomas R. <tomas.roun8@gmail.com>
@chris-eibl
Copy link
Member

LGTM. I cannot do a thorough review, but at least confirm, that with this PR I can enter äÄöÖüÜ, etc, again 🚀

Had some time now: this needs fixes for Linux. Otherwise looks good, thanks @sergey-miryanov!

@tomasr8
Copy link
Member

tomasr8 commented Apr 27, 2025

This breaks Linux, because only one char is inserted in UnixConsole.get_event() [...]

Since this wasn't caught by the tests, we should also probably add a test for that.

@chris-eibl
Copy link
Member

chris-eibl commented Apr 27, 2025

I have suggested the def test_push_unicode_character_two_bytes(self): which tests exactly that.

@sergey-miryanov
Copy link
Contributor Author

@chris-eibl Please take a look! I would like to keep tests for paste mode to be sure that escape sequences not mixed with input if error occurred. But I'm OK to remove them if you think it is not necessary.

@chris-eibl
Copy link
Member

LGTM. Sure, more tests are always better. Though, AFAIU, there is nothing special about paste mode here:

  • test_push_unicode_character_two_bytes_in_paste_mode just asserts that a valid "two byte char" within a sequence of "one byte chars" is working as expected.
  • likewise, test_push_unicode_character_as_str_in_paste_mode asserts that when an exception during a sequence of feeding bytes via push is caught, we can continue pushing afterwards.

IMHO, both are reasonable, but "paste mode" makes them look too special about pasting?

So I'd rename them and use arbitrary "one byte chars" in there.

@sergey-miryanov
Copy link
Contributor Author

Sequence of "one byte chars" is a control command to enable "paste mode". Originally in gh-131878 problem aroise from case where the wrongly passed unicode string puts to the buffer and mixed with following control command that disables paste mode and asserted in those code path. So, I want to check that we don't "corrupt" chars buffer.

@chris-eibl
Copy link
Member

Yupp, I know. But as said, there is nothing special about bracketed pasting here. From the perspective of the eventqueue test, these are just arbitrary bytes. Whether bracketed pasting is working correctly, is tested in

def test_bracketed_paste(self):

If you'd like to showcase that sequences of "one byte chars" interleaved with "multi byte chars" work correctly via "paste mode", then maybe just add a comment?

And small nit: ~ is missing in the escape sequences, see

paste_start = "\x1b[200~"
paste_end = "\x1b[201~"

To me, those two tests have merit, but are too special about "paste mode".

@sergey-miryanov
Copy link
Contributor Author

@chris-eibl It seems that I did not correctly assume how paste-mode works (shame on me!). I removed one test because the same code path is covered by test_push_unicode_character_two_bytes. And renamed and simplified a bit test_push_single_chars_and_unicode_character_as_str (not sure the name is ok though).

@chris-eibl
Copy link
Member

I don't have a better name - the comment in there clarifies what we intend to test (not much tbh).
Just a small nit on the comment.

Otherwise LGTM. Thanks for bearing with me!

Co-authored-by: Chris Eibl <138194463+chris-eibl@users.noreply.github.com>
@sergey-miryanov
Copy link
Contributor Author

@chris-eibl Thanks for the review and patience!

Copy link
Member

@chris-eibl chris-eibl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @sergey-miryanov!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants