Skip to content

Add str.encode for utf-8 #901

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 29, 2019
Merged

Conversation

youknowone
Copy link
Member

No description provided.

@@ -25,6 +25,18 @@ pub struct PyByteInner {
pub elements: Vec<u8>,
}

pub fn vec_from_str(value: &str, encoding: &str, vm: &VirtualMachine) -> PyResult<Vec<u8>> {
let encoding = encoding.to_lowercase();
if encoding == "utf-8" || encoding == "u8" || encoding == "utf8" || encoding == "utf_8" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking bytes.decode I drafted a normalize_encoding like cpython. maybe it could be usable :

//same algorithm as cpython
pub fn normalize_encoding(encoding: &str) -> String {
    let mut res = String::new();
    let mut punct = false;

    for c in encoding.chars() {
        if c.is_alphanumeric() || c == '.' {
            if punct && !res.is_empty() {
                res.push('_')
            }
            res.push(c.to_ascii_lowercase());
            punct = false;
        } else {
            punct = true;
        }
    }
    res
}

you'll see on the same file some ideas to deal with errors.
https://github.com/jgirardet/RustPython/blob/1736d63932d5b7b4a6093eed62019255545888b2/vm/src/stdlib/_codecs.rs#L66

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both the function and the codec module looks really good. And I just recognized it is not a part of 847 😞 . Will you create a small PR with the part to adapt? Or do you prefer I import some required parts from your work? Just waiting for future and editing later will work too if it will be finally merged in future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I 'm waiting #847 to be merge, and next there is another big PR (depending of 847) with other bytes method. I 'll wait everything is reviewed/merge( it's much work) and then I will propose something for bytes.decode.
For no just take what you need, i'll see after.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this function in objbyteinner.rs, thanks

@@ -25,6 +25,18 @@ pub struct PyByteInner {
pub elements: Vec<u8>,
}

pub fn vec_from_str(value: &str, encoding: &str, vm: &VirtualMachine) -> PyResult<Vec<u8>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe implementing from_string to pybyteinner ?

Copy link
Member Author

@youknowone youknowone May 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was one of my early idea but then I found it is hard to turns PyByteInner into PyBytes because there is no interface to create a PyBytes from inner. Do you have a good solution for that? For now, I think creating a vector is good enough for this kind of minimal charset transition API. Because it doesn't use any feature of PyBytesInner and easy to transit to it once it is required. Tell me if I missed something. I feel like I am very new to the str/bytes (actually for everything in this project, but especially).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a try_from_object in #847 which creates a PyByteInner from a PyObjectRef(which could be bytes, byterray, memoryview). I think it make the API more consitent to use use a PyByteInner::from_string and the use PyByteInner.elements
It's a personal opinion and I'm not in charge of this project.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added PyBytes::from_string and PyByteInner::from_string

#[pymethod]
fn encode(
&self,
encoding: OptionalArg<PyObjectRef>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use PyStringRef to avoid type checking for "encoding" and "errors" wich will always be string

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By trying it, I found I can't control the exception message with it. Do you have any idea about it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RustPython doesn't aim to have exactly the same error message as cpython. But it's better to let the type checking "automatic" in function type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about error messages. If we can keep them the same or similar with small cost, why not? Is there priority about this? @windelbouwman

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error messages to my opinion do not have to be similar to cpython. While I worked with rust, I find the error messages super useful in rust, maybe we could try something new in rustpython as well with regards to error messages. So in this case, I agree with @jgirardet to use the type checker with PyStringRef in this case.

},
)?;

let bytes = PyBytes::new(objbyteinner::vec_from_str(&self.value, &encoding, vm)?);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe vm.ctx.new_bytes ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is exactly what I needed

@codecov-io
Copy link

codecov-io commented May 1, 2019

Codecov Report

Merging #901 into master will increase coverage by 0.05%.
The diff coverage is 62.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #901      +/-   ##
==========================================
+ Coverage   64.98%   65.03%   +0.05%     
==========================================
  Files          96       96              
  Lines       16755    16773      +18     
  Branches     3734     3734              
==========================================
+ Hits        10888    10909      +21     
  Misses       3348     3348              
+ Partials     2519     2516       -3
Impacted Files Coverage Δ
vm/src/obj/objstr.rs 73.89% <42.85%> (-0.62%) ⬇️
vm/src/obj/objbytes.rs 76.76% <50%> (-0.46%) ⬇️
vm/src/function.rs 58.25% <60%> (+0.04%) ⬆️
vm/src/obj/objbyteinner.rs 72.88% <77.27%> (+0.31%) ⬆️
vm/src/obj/objfloat.rs 72.54% <0%> (-0.29%) ⬇️
vm/src/obj/objdict.rs 73.38% <0%> (+0.18%) ⬆️
vm/src/stdlib/os.rs 70.46% <0%> (+0.56%) ⬆️
vm/src/obj/objset.rs 74.66% <0%> (+1.07%) ⬆️
vm/src/dictdatatype.rs 71.91% <0%> (+3.2%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f113342...59476c6. Read the comment docs.

@youknowone youknowone force-pushed the str-encode branch 2 times, most recently from 9d06c91 to 785fb0f Compare May 6, 2019 19:56
@@ -314,6 +323,20 @@ impl ByteInnerSplitlinesOptions {
}

impl PyByteInner {
pub fn from_string(value: &str, encoding: &str, vm: &VirtualMachine) -> PyResult<Self> {
let normalized = normalize_encoding(encoding);
if normalized == "utf_8" || normalized == "utf8" || normalized == "u8" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

u8 is in encoding.aliases so I think it is redundant here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking at cpython code unicode.encode uses a getdefaultencoding function which return "utf-8" (which is a different behaviour than bytes.decode). You could do it add it and at the same time add the getdefaultencoding to the sys module

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About u8, we don't have codec for now. So it is not redundant at this time. We can refactor it later once we have codec. About getdefaultencoding, it is good point, Thanks. I will go with it for next step.

@windelbouwman
Copy link
Contributor

@youknowone I guess this PR looks pretty good. Could you have a look at the conflicted file? Then I could merge this.

@windelbouwman windelbouwman merged commit 121cd43 into RustPython:master May 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants