Skip to content

gh-119609, PEP 756: Add PyUnicode_Export() function #123738

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 28 commits into from

Conversation

vstinner
Copy link
Member

@vstinner vstinner commented Sep 5, 2024

Add PyUnicode_Export(), PyUnicode_GetBufferFormat() and PyUnicode_Import() functions to the limited C API.


📚 Documentation preview 📚: https://cpython-previews--123738.org.readthedocs.build/

Add PyUnicode_Export(), PyUnicode_GetBufferFormat() and
PyUnicode_Import() functions to the limited C API.
@bedevere-app
Copy link

bedevere-app bot commented Sep 5, 2024

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@vstinner
Copy link
Member Author

vstinner commented Sep 5, 2024

I have made the requested changes; please review again.

@bedevere-app
Copy link

bedevere-app bot commented Sep 5, 2024

Thanks for making the requested changes!

@mdboom: please review the changes made to this pull request.

@bedevere-app bedevere-app bot requested a review from mdboom September 5, 2024 16:56
@vstinner
Copy link
Member Author

vstinner commented Sep 5, 2024

@mdboom @picnixz: Thanks for your reviews. I think that I addressed most, if not all, of them :-)

Copy link
Member

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A final nitpick on my side (sorry but I only skimmed through the implementation since I don't have much energy now...).

A bit off-topic, but do we use the PRI* macros in the code base? I saw that you used the %i for formatting a uint32_t value, which usually works, but I wondered whether you prefer using the platform-dependent ones.

@vstinner vstinner requested a review from rhettinger as a code owner September 5, 2024 18:34
@vstinner
Copy link
Member Author

vstinner commented Sep 5, 2024

A side effect of this change is to add the __release_buffer__() method to the built-in str type.

I had to implement collections.UserString.__release_buffer__() to fix test_collections (the UserString simply raises NotImplementedError).

@vstinner
Copy link
Member Author

@serhiy-storchaka: I updated the PR to use _PyUnicode_EncodeUTF16() and _PyUnicode_EncodeUTF32(), and address your other comments.

@vstinner
Copy link
Member Author

I had to remove the check "last character in a NUL character" in tests, since _PyUnicode_EncodeUTF16() and _PyUnicode_EncodeUTF32() don't write such last NUL character.

@encukou
Copy link
Member

encukou commented Sep 12, 2024

I had to remove the check "last character in a NUL character" in tests, since _PyUnicode_EncodeUTF16() and _PyUnicode_EncodeUTF32() don't write such last NUL character.

That's a security vulnerability waiting to happen.

Since the internal buffers do have the terminating NUL, and in most cases we expose those, people will expect the NUL even if we'd explicitly document that it's not guaranteed. IMO, we need to add it.

@vstinner
Copy link
Member Author

@encukou:

Since the internal buffers do have the terminating NUL, and in most cases we expose those, people will expect the NUL even if we'd explicitly document that it's not guaranteed. IMO, we need to add it.

@serhiy-storchaka: Sorry, I reverted the "Use _PyUnicode_EncodeUTF16() and _PyUnicode_EncodeUTF32()" change to get back the NUL trailing character.

@vstinner
Copy link
Member Author

I'm not sure if we should guarantee that the exported buffer ends with a NUL character. I'm not sure that all Python implementations will be able to provide such guarantee in an efficient way (without having to allocate a temporary buffer for that).

@encukou
Copy link
Member

encukou commented Sep 12, 2024

We should. As long as the API is used from C, exported strings should be NUL-terminated for safety.
Another implementation can add a function like XPyUnicode_Export_Raw; if it becomes popular CPython can adopt it as an alias of PyUnicode_Export.

@vstinner
Copy link
Member Author

We should. As long as the API is used from C, exported strings should be NUL-terminated for safety.

I suggest to continue this discussion at: capi-workgroup/decisions#33 (comment)

@vstinner vstinner changed the title gh-119609: Add PyUnicode_Export() function gh-119609, PEP 756: Add PyUnicode_Export() function Sep 17, 2024
@vstinner
Copy link
Member Author

vstinner commented Nov 5, 2024

I withdrawn my PEP 756.

@vstinner vstinner closed this Nov 5, 2024
@vstinner vstinner deleted the unicode_view branch November 5, 2024 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants