Skip to content

Implement latin_1 in Rust #3046

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Oct 1, 2021
Merged

Conversation

fanninpm
Copy link
Contributor

@fanninpm fanninpm commented Sep 13, 2021

Based on the PyPy implementation. The encoding function needs some work with respect to error handling. EDIT: Now based on @coolreader18's ascii codec.

Requesting @coolreader18 as a potential reviewer, as they wrote the architecture for the encodings module and the error handling function(s).

Copy link
Member

@coolreader18 coolreader18 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yea, so far this looks good! That is the correct way to add a new encoding; if that's what you were looking for. I'm not sure if unicode_encode_ucs1 is reusable for other encodings, since anything >ucs1 would be some way of splitting the ucs2/ucs4 into parts, at which point it's just a different encoding.

@fanninpm
Copy link
Contributor Author

I'm not sure if unicode_encode_ucs1 is reusable for other encodings, since anything >ucs1 would be some way of splitting the ucs2/ucs4 into parts, at which point it's just a different encoding.

unicode_encode_ucs1 is applicable to latin_1 and ascii, so that's why I put that function outside the module.

That is the correct way to add a new encoding; if that's what you were looking for.

While I was implementing this, I came across something that stymied me. In PyPy, the unicode_encode_ucs1 function references an error handler that is not yet implemented here. How would that error handler be implemented in Rust?

@coolreader18
Copy link
Member

coolreader18 commented Sep 22, 2021

Originally posted this in #3118 but it makes more sense here

re: the ucs1 helper function, I think it makes more sense to abstract over utf8/ascii than ascii/latin1, since we use a utf8 str type (so we can just validate + extend_from_slice for utf8/ascii) whereas CPython has UCS1 as a buffer (so they can just validate + memcpy for ascii/latin1). Finishing up #3118 made me remember why I did that; in our implementation of codecs, ascii and utf8 just naturally have a lot of shared code while the alternative is true in CPython.

This implementation is patterned off of the ascii codec.
@fanninpm fanninpm marked this pull request as ready for review September 24, 2021 01:36
Copy link
Member

@youknowone youknowone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me. @coolreader18 could you review this again?

@youknowone youknowone merged commit 2eb6c68 into RustPython:main Oct 1, 2021
@fanninpm fanninpm deleted the latin-1-encoding branch October 1, 2021 22:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants