-
-
Notifications
You must be signed in to change notification settings - Fork 31.8k
gh-124008: Fix calculation of the number of written bytes for the Windows console #124059
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-124008: Fix calculation of the number of written bytes for the Windows console #124059
Conversation
…he Windows console Since MultiByteToWideChar()/WideCharToMultiByte() is not reversible if the data contains invalid UTF-8 sequences, use binary search to calculate the number of written bytes from the number of written characters. Also fix writing incomplete UTF-8 sequences. Also fix handling of memory allocation failures.
I keep trying to review and I keep getting confused 😆 Perhaps we should revise this entire piece of functionality, do one big conversion (and hope that anyone writing more than 32KB to the console in one go has enough RAM), and loop over that? Then we're only dealing with UTF-16 pairs, which ought to be simpler. |
The RAM constraint should be gone with Windows 8+ (gh-121940), just the "interactivity" / Ctrl-C interruptability (if can get to |
I was referring to the memory we have to allocate to store a UTF-16 version of the string. If someone decides to write 2GB worth of ASCII in one go, we'll need another 4GB to store the UTF-16 encoded version, even before we get to passing it to the console APIs. |
This PR consists of three parts:
Even if the 32KiB limit is gone, there is still the 2GiB or 4GiB limit, so the code will stay almost the same. It cannot help. Unless we get rid of |
I think it's the binary search to find the converted length that upsets me the most. Does it get any simpler if we arbitrarily cut at 32K bytes, count back to a complete UTF-8 character, and then write from there? Rather than trying to fit within 32KB worth of UTF-16, which is no longer a real limit on supported platforms. |
No, it will not get much simpler. Note that the code to find the initial length was already here. I only fixed some corner cases in it. We still need binary search to map the number of actually written wchars to the number of UTF-8 bytes. Even if we do as you suggested, this will only save ~10 lines of code (relatively simple in comparison of the rest of this PR). We can do this in the following PR. For now, keeping this code means less changes. |
Yeah, that's annoying... but you're right. Okay, I have no further objections. Thanks for your work on this. |
Thank you for your review @zooba. |
Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.12, 3.13. |
…he Windows console (pythonGH-124059) Since MultiByteToWideChar()/WideCharToMultiByte() is not reversible if the data contains invalid UTF-8 sequences, use binary search to calculate the number of written bytes from the number of written characters. Also fix writing incomplete UTF-8 sequences. Also fix handling of memory allocation failures. (cherry picked from commit 3cf83d9) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
…he Windows console (pythonGH-124059) Since MultiByteToWideChar()/WideCharToMultiByte() is not reversible if the data contains invalid UTF-8 sequences, use binary search to calculate the number of written bytes from the number of written characters. Also fix writing incomplete UTF-8 sequences. Also fix handling of memory allocation failures. (cherry picked from commit 3cf83d9) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
GH-127325 is a backport of this pull request to the 3.13 branch. |
GH-127326 is a backport of this pull request to the 3.12 branch. |
…the Windows console (GH-124059) (GH-127326) Since MultiByteToWideChar()/WideCharToMultiByte() is not reversible if the data contains invalid UTF-8 sequences, use binary search to calculate the number of written bytes from the number of written characters. Also fix writing incomplete UTF-8 sequences. Also fix handling of memory allocation failures. (cherry picked from commit 3cf83d9) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
|
It's unrelated and I cannot reproduce the issue on the buildbot (nor on my laptop). I don't know what's going on. |
…the Windows console (GH-124059) (GH-127325) Since MultiByteToWideChar()/WideCharToMultiByte() is not reversible if the data contains invalid UTF-8 sequences, use binary search to calculate the number of written bytes from the number of written characters. Also fix writing incomplete UTF-8 sequences. Also fix handling of memory allocation failures. (cherry picked from commit 3cf83d9) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
…he Windows console (pythonGH-124059) Since MultiByteToWideChar()/WideCharToMultiByte() is not reversible if the data contains invalid UTF-8 sequences, use binary search to calculate the number of written bytes from the number of written characters. Also fix writing incomplete UTF-8 sequences. Also fix handling of memory allocation failures.
Since MultiByteToWideChar()/WideCharToMultiByte() is not reversible if the data contains invalid UTF-8 sequences, use binary search to calculate the number of written bytes from the number of written characters.
Also fix handling of memory allocation failures.