Skip to content

gh-124008: Fix calculation of the number of written bytes for the Windows console #124059

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Nov 27, 2024

Conversation

serhiy-storchaka
Copy link
Member

@serhiy-storchaka serhiy-storchaka commented Sep 13, 2024

Since MultiByteToWideChar()/WideCharToMultiByte() is not reversible if the data contains invalid UTF-8 sequences, use binary search to calculate the number of written bytes from the number of written characters.

Also fix handling of memory allocation failures.

…he Windows console

Since MultiByteToWideChar()/WideCharToMultiByte() is not reversible if
the data contains invalid UTF-8 sequences, use binary search to
calculate the number of written bytes from the number of written
characters.

Also fix writing incomplete UTF-8 sequences.

Also fix handling of memory allocation failures.
@serhiy-storchaka serhiy-storchaka requested a review from a team November 5, 2024 08:46
@serhiy-storchaka
Copy link
Member Author

@zooba, @vstinner, do you mind to take a look at this PR?

@zooba
Copy link
Member

zooba commented Nov 13, 2024

I keep trying to review and I keep getting confused 😆

Perhaps we should revise this entire piece of functionality, do one big conversion (and hope that anyone writing more than 32KB to the console in one go has enough RAM), and loop over that? Then we're only dealing with UTF-16 pairs, which ought to be simpler.

@cmaloney
Copy link
Contributor

The RAM constraint should be gone with Windows 8+ (gh-121940), just the "interactivity" / Ctrl-C interruptability (if can get to WaitForMultipleObjectsEx would solve that and is where I've been trying to aim recently, but also lots of steps to get there).

@zooba
Copy link
Member

zooba commented Nov 13, 2024

The RAM constraint should be gone with Windows 8+

I was referring to the memory we have to allocate to store a UTF-16 version of the string. If someone decides to write 2GB worth of ASCII in one go, we'll need another 4GB to store the UTF-16 encoded version, even before we get to passing it to the console APIs.

@serhiy-storchaka
Copy link
Member Author

This PR consists of three parts:

  • fixed _find_last_utf8_boundary();
  • _wchar_to_utf8_count() to calculate the number of partially written bytes;
  • numerous other fixes, mainly missed PyErr_NoMemory().

Even if the 32KiB limit is gone, there is still the 2GiB or 4GiB limit, so the code will stay almost the same. It cannot help. Unless we get rid of MultiByteToWideChar() and use double conversion with Python UTF-8 and UTF-16 codecs, but this would be a much larger change, and the code may be less efficient.

@zooba
Copy link
Member

zooba commented Nov 21, 2024

I think it's the binary search to find the converted length that upsets me the most.

Does it get any simpler if we arbitrarily cut at 32K bytes, count back to a complete UTF-8 character, and then write from there? Rather than trying to fit within 32KB worth of UTF-16, which is no longer a real limit on supported platforms.

@serhiy-storchaka
Copy link
Member Author

No, it will not get much simpler.

Note that the code to find the initial length was already here. I only fixed some corner cases in it.

We still need binary search to map the number of actually written wchars to the number of UTF-8 bytes. Even if we do as you suggested, this will only save ~10 lines of code (relatively simple in comparison of the rest of this PR). We can do this in the following PR. For now, keeping this code means less changes.

@zooba
Copy link
Member

zooba commented Nov 25, 2024

Yeah, that's annoying... but you're right.

Okay, I have no further objections. Thanks for your work on this.

@serhiy-storchaka
Copy link
Member Author

Thank you for your review @zooba.

@serhiy-storchaka serhiy-storchaka merged commit 3cf83d9 into python:main Nov 27, 2024
39 checks passed
@miss-islington-app
Copy link

Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.12, 3.13.
🐍🍒⛏🤖

@serhiy-storchaka serhiy-storchaka deleted the windows-console-write branch November 27, 2024 11:38
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Nov 27, 2024
…he Windows console (pythonGH-124059)

Since MultiByteToWideChar()/WideCharToMultiByte() is not reversible if
the data contains invalid UTF-8 sequences, use binary search to
calculate the number of written bytes from the number of written
characters.

Also fix writing incomplete UTF-8 sequences.

Also fix handling of memory allocation failures.
(cherry picked from commit 3cf83d9)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Nov 27, 2024
…he Windows console (pythonGH-124059)

Since MultiByteToWideChar()/WideCharToMultiByte() is not reversible if
the data contains invalid UTF-8 sequences, use binary search to
calculate the number of written bytes from the number of written
characters.

Also fix writing incomplete UTF-8 sequences.

Also fix handling of memory allocation failures.
(cherry picked from commit 3cf83d9)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
@bedevere-app
Copy link

bedevere-app bot commented Nov 27, 2024

GH-127325 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Nov 27, 2024
@bedevere-app
Copy link

bedevere-app bot commented Nov 27, 2024

GH-127326 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.12 only security fixes label Nov 27, 2024
serhiy-storchaka added a commit that referenced this pull request Nov 27, 2024
…the Windows console (GH-124059) (GH-127326)

Since MultiByteToWideChar()/WideCharToMultiByte() is not reversible if
the data contains invalid UTF-8 sequences, use binary search to
calculate the number of written bytes from the number of written
characters.

Also fix writing incomplete UTF-8 sequences.

Also fix handling of memory allocation failures.
(cherry picked from commit 3cf83d9)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
@bedevere-bot
Copy link

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot AMD64 RHEL8 Refleaks 3.12 has failed when building commit c3bb32d.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/#/builders/1125/builds/735) and take a look at the build logs.
  4. Check if the failure is related to this commit (c3bb32d) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/1125/builds/735

Failed tests:

  • test_complex

Failed subtests:

  • test_truediv - test.test_complex.ComplexTest.test_truediv

Summary of the results of the build (if available):

==

Click to see traceback logs
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.12.cstratak-RHEL8-x86_64.refleak/build/Lib/test/test_complex.py", line 107, in test_truediv
    self.check_div(complex(random(), random()),
  File "/home/buildbot/buildarea/3.12.cstratak-RHEL8-x86_64.refleak/build/Lib/test/test_complex.py", line 84, in check_div
    self.assertClose(q, y)
  File "/home/buildbot/buildarea/3.12.cstratak-RHEL8-x86_64.refleak/build/Lib/test/test_complex.py", line 77, in assertClose
    self.assertCloseAbs(x.imag, y.imag, eps)
  File "/home/buildbot/buildarea/3.12.cstratak-RHEL8-x86_64.refleak/build/Lib/test/test_complex.py", line 72, in assertCloseAbs
    self.assertTrue(abs((x-y)/y) < eps)
AssertionError: False is not true

@vstinner
Copy link
Member

test_truediv - test.test_complex.ComplexTest.test_truediv

It's unrelated and I cannot reproduce the issue on the buildbot (nor on my laptop). I don't know what's going on.

serhiy-storchaka added a commit that referenced this pull request Nov 30, 2024
…the Windows console (GH-124059) (GH-127325)

Since MultiByteToWideChar()/WideCharToMultiByte() is not reversible if
the data contains invalid UTF-8 sequences, use binary search to
calculate the number of written bytes from the number of written
characters.

Also fix writing incomplete UTF-8 sequences.

Also fix handling of memory allocation failures.
(cherry picked from commit 3cf83d9)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
ebonnal pushed a commit to ebonnal/cpython that referenced this pull request Jan 12, 2025
…he Windows console (pythonGH-124059)

Since MultiByteToWideChar()/WideCharToMultiByte() is not reversible if
the data contains invalid UTF-8 sequences, use binary search to
calculate the number of written bytes from the number of written
characters.

Also fix writing incomplete UTF-8 sequences.

Also fix handling of memory allocation failures.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants