Add str.encode for utf-8 #901

youknowone · 2019-05-01T04:49:34Z

No description provided.

jgirardet · 2019-05-01T05:36:25Z

vm/src/obj/objbyteinner.rs

@@ -25,6 +25,18 @@ pub struct PyByteInner {
    pub elements: Vec<u8>,
 }

+pub fn vec_from_str(value: &str, encoding: &str, vm: &VirtualMachine) -> PyResult<Vec<u8>> {
+    let encoding = encoding.to_lowercase();
+    if encoding == "utf-8" || encoding == "u8" || encoding == "utf8" || encoding == "utf_8" {


While thinking about bytes.decode, I drafted a normalize_encoding function like CPython's. Maybe it could be useful here:

// same algorithm as CPython
pub fn normalize_encoding(encoding: &str) -> String {
    let mut res = String::new();
    let mut punct = false;
    for c in encoding.chars() {
        if c.is_alphanumeric() || c == '.' {
            if punct && !res.is_empty() {
                res.push('_')
            }
            res.push(c.to_ascii_lowercase());
            punct = false;
        } else {
            punct = true;
        }
    }
    res
}

You'll see some ideas for dealing with errors in the same file:
https://github.com/jgirardet/RustPython/blob/1736d63932d5b7b4a6093eed62019255545888b2/vm/src/stdlib/_codecs.rs#L66
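To make the normalization concrete, here is a minimal, self-contained sketch. It repeats the drafted function and adds a helper, `is_utf8_alias`, which is hypothetical and not part of the PR; the alias list is simply the set of spellings compared against in the diff.

```rust
/// Normalize an encoding name: keep alphanumerics and '.', lowercase ASCII
/// letters, and replace each run of other characters between kept characters
/// with a single '_'. Leading and trailing runs are dropped.
pub fn normalize_encoding(encoding: &str) -> String {
    let mut res = String::new();
    let mut punct = false;
    for c in encoding.chars() {
        if c.is_alphanumeric() || c == '.' {
            // Emit one separator for the whole preceding run of punctuation,
            // but never at the very start of the result.
            if punct && !res.is_empty() {
                res.push('_');
            }
            res.push(c.to_ascii_lowercase());
            punct = false;
        } else {
            punct = true;
        }
    }
    res
}

/// Hypothetical helper (not in the PR): true when the name normalizes to one
/// of the UTF-8 spellings checked in the diff.
pub fn is_utf8_alias(encoding: &str) -> bool {
    matches!(normalize_encoding(encoding).as_str(), "utf_8" | "utf8" | "u8")
}
```

With this sketch, `"UTF-8"` normalizes to `"utf_8"` and `"  Latin--1 "` to `"latin_1"`, so callers only need to compare against the normalized spellings.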

Both the function and the codec module look really good. I just realized it is not part of #847 😞. Will you create a small PR with the part to adapt, or would you prefer that I import the required parts from your work? Waiting and editing later would work too, if it will eventually be merged.

I'm waiting for #847 to be merged, and after that there is another big PR (depending on #847) with other bytes methods. I'll wait until everything is reviewed and merged (it's a lot of work), and then I will propose something for bytes.decode.
For now, just take what you need; I'll see about the rest afterwards.

I added this function in objbyteinner.rs, thanks

jgirardet · 2019-05-01T05:38:39Z

vm/src/obj/objbyteinner.rs

@@ -25,6 +25,18 @@ pub struct PyByteInner {
    pub elements: Vec<u8>,
 }

+pub fn vec_from_str(value: &str, encoding: &str, vm: &VirtualMachine) -> PyResult<Vec<u8>> {


Maybe implement from_string on PyByteInner?

That was one of my early ideas, but I found it hard to turn a PyByteInner into a PyBytes because there is no interface to create a PyBytes from an inner value. Do you have a good solution for that? For now, I think creating a vector is good enough for this kind of minimal charset transition API, because it doesn't use any feature of PyByteInner and is easy to move over once that is required. Tell me if I missed something. I feel like I am very new to str/bytes (actually to everything in this project, but especially this).

There is a try_from_object in #847 which creates a PyByteInner from a PyObjectRef (which could be bytes, bytearray, or memoryview). I think it makes the API more consistent to use a PyByteInner::from_string and then use PyByteInner.elements.
It's a personal opinion and I'm not in charge of this project.

added PyBytes::from_string and PyByteInner::from_string
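The constructor-style API discussed in this thread can be sketched without any RustPython types. In the block below, `ByteInner` and `UnknownEncoding` are hypothetical stand-ins for PyByteInner and for the Python exception the real code would raise through the VM; the normalization is deliberately simplified, and only UTF-8 spellings are accepted, as in the PR.

```rust
/// Hypothetical stand-in for PyByteInner: a thin wrapper around the raw bytes.
#[derive(Debug, PartialEq)]
pub struct ByteInner {
    pub elements: Vec<u8>,
}

/// Hypothetical error type; the real code would raise a Python exception instead.
#[derive(Debug, PartialEq)]
pub struct UnknownEncoding(pub String);

impl ByteInner {
    /// Build the byte container from a string, given an encoding name.
    pub fn from_string(value: &str, encoding: &str) -> Result<Self, UnknownEncoding> {
        // Simplified normalization (an assumption): lowercase and treat '-' as '_'.
        let normalized = encoding.to_lowercase().replace('-', "_");
        match normalized.as_str() {
            // A Rust &str is already valid UTF-8, so encoding is a plain copy.
            "utf_8" | "utf8" | "u8" => Ok(ByteInner {
                elements: value.as_bytes().to_vec(),
            }),
            _ => Err(UnknownEncoding(encoding.to_string())),
        }
    }
}
```

Callers that only need the raw bytes can still read `.elements`, which keeps the vector-based and container-based styles interchangeable.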

jgirardet · 2019-05-01T05:39:35Z

vm/src/obj/objstr.rs

+    #[pymethod]
+    fn encode(
+        &self,
+        encoding: OptionalArg<PyObjectRef>,


Use PyStringRef to avoid manual type checking for "encoding" and "errors", which will always be strings.

When I tried it, I found I can't control the exception message that way. Do you have any ideas about that?

RustPython doesn't aim to have exactly the same error messages as CPython. But it's better to let the function signature handle the type checking automatically.

I am not sure about error messages. If we can keep them the same or similar at small cost, why not? Is there a priority for this? @windelbouwman

In my opinion, the error messages do not have to be similar to CPython's. While working with Rust, I have found its error messages super useful; maybe we could try something new in RustPython as well with regard to error messages. So here I agree with @jgirardet: use the type checker with PyStringRef.
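The trade-off in this thread (a custom message from a manual check versus an automatic check driven by the parameter type) can be illustrated with a small standalone sketch. All names here (`Object`, `StrArg`, `encode_manual`, `encode_typed`, `call_encode_typed`) are hypothetical and only mimic the roles of PyObjectRef, PyStringRef, and the binding layer; this is not RustPython's actual argument machinery.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for a dynamically typed Python object.
#[derive(Debug, Clone, PartialEq)]
pub enum Object {
    Str(String),
    Int(i64),
}

/// Hypothetical typed wrapper, playing the role PyStringRef plays in the review.
#[derive(Debug, PartialEq)]
pub struct StrArg(pub String);

impl TryFrom<Object> for StrArg {
    type Error = String;

    // The conversion is written once and produces one generic message.
    fn try_from(obj: Object) -> Result<Self, Self::Error> {
        match obj {
            Object::Str(s) => Ok(StrArg(s)),
            other => Err(format!("expected str, got {:?}", other)),
        }
    }
}

/// Manual style: the method accepts any object and checks it itself,
/// which allows a method-specific error message.
pub fn encode_manual(encoding: Object) -> Result<String, String> {
    match encoding {
        Object::Str(s) => Ok(s),
        _ => Err("encode() argument 1 must be str".to_string()),
    }
}

/// Typed style: the signature states the requirement; the body has no check.
pub fn encode_typed(encoding: StrArg) -> String {
    encoding.0
}

/// What a binding layer would do before calling `encode_typed`.
pub fn call_encode_typed(arg: Object) -> Result<String, String> {
    Ok(encode_typed(StrArg::try_from(arg)?))
}
```

The typed variant gives up the method-specific wording but removes the check from every method body, which is the direction the reviewers settle on.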

jgirardet · 2019-05-01T05:41:18Z

vm/src/obj/objstr.rs

+           },
+        )?;
+
+        let bytes = PyBytes::new(objbyteinner::vec_from_str(&self.value, &encoding, vm)?);


Maybe vm.ctx.new_bytes?

Thanks, this is exactly what I needed

codecov-io · 2019-05-01T13:52:01Z

Codecov Report

Merging #901 into master will increase coverage by 0.05%.
The diff coverage is 62.22%.

@@            Coverage Diff             @@
##           master     #901      +/-   ##
==========================================
+ Coverage   64.98%   65.03%   +0.05%     
==========================================
  Files          96       96              
  Lines       16755    16773      +18     
  Branches     3734     3734              
==========================================
+ Hits        10888    10909      +21     
  Misses       3348     3348              
+ Partials     2519     2516       -3

Impacted Files	Coverage Δ
vm/src/obj/objstr.rs	`73.89% <42.85%> (-0.62%)`	⬇️
vm/src/obj/objbytes.rs	`76.76% <50%> (-0.46%)`	⬇️
vm/src/function.rs	`58.25% <60%> (+0.04%)`	⬆️
vm/src/obj/objbyteinner.rs	`72.88% <77.27%> (+0.31%)`	⬆️
vm/src/obj/objfloat.rs	`72.54% <0%> (-0.29%)`	⬇️
vm/src/obj/objdict.rs	`73.38% <0%> (+0.18%)`	⬆️
vm/src/stdlib/os.rs	`70.46% <0%> (+0.56%)`	⬆️
vm/src/obj/objset.rs	`74.66% <0%> (+1.07%)`	⬆️
vm/src/dictdatatype.rs	`71.91% <0%> (+3.2%)`	⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f113342...59476c6. Read the comment docs.

jgirardet · 2019-05-07T08:23:50Z

vm/src/obj/objbyteinner.rs

@@ -314,6 +323,20 @@ impl ByteInnerSplitlinesOptions {
 }

 impl PyByteInner {
+    pub fn from_string(value: &str, encoding: &str, vm: &VirtualMachine) -> PyResult<Self> {
+        let normalized = normalize_encoding(encoding);
+        if normalized == "utf_8" || normalized == "utf8" || normalized == "u8" {


u8 is in encodings.aliases, so I think it is redundant here.

Looking at the CPython code, unicode.encode uses a getdefaultencoding function which returns "utf-8" (which is different behaviour from bytes.decode). You could add that here and, at the same time, add getdefaultencoding to the sys module.

About u8: we don't have codecs for now, so it is not redundant at this time. We can refactor it later once we have them. About getdefaultencoding: good point, thanks. I will go with it as the next step.
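The default-encoding idea raised above can be sketched as follows. `get_default_encoding`, `resolve_encoding`, and `encode` are hypothetical names for illustration; the only fact taken from the thread is that the default is "utf-8". Errors are plain strings here rather than Python exceptions.

```rust
/// Hypothetical counterpart of sys.getdefaultencoding(); per the review
/// comment, the value is "utf-8".
pub fn get_default_encoding() -> &'static str {
    "utf-8"
}

/// Use the caller's encoding when given, otherwise fall back to the default.
pub fn resolve_encoding(encoding: Option<&str>) -> &str {
    encoding.unwrap_or(get_default_encoding())
}

/// Minimal encode: only UTF-8 spellings are supported in this sketch.
pub fn encode(value: &str, encoding: Option<&str>) -> Result<Vec<u8>, String> {
    let name = resolve_encoding(encoding).to_lowercase();
    match name.as_str() {
        "utf-8" | "utf_8" | "utf8" | "u8" => Ok(value.as_bytes().to_vec()),
        other => Err(format!("unknown encoding: {}", other)),
    }
}
```

Keeping the fallback in one function means a later `sys.getdefaultencoding` binding and `str.encode` would share a single source for the default.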

windelbouwman · 2019-05-29T12:00:33Z

@youknowone I guess this PR looks pretty good. Could you have a look at the conflicted file? Then I could merge this.

jgirardet reviewed May 1, 2019

youknowone force-pushed the str-encode branch from 257c9e8 to 5acf3ed Compare May 1, 2019 13:06

youknowone force-pushed the str-encode branch 2 times, most recently from 9d06c91 to 785fb0f Compare May 6, 2019 19:56

jgirardet reviewed May 7, 2019

youknowone mentioned this pull request May 11, 2019

Add sys.getfilesystemencoding, sys.getfilesystemencodeerrors #960

Merged

jgirardet and others added 3 commits May 29, 2019 21:24

normalize_encoding

ad357d0

Add str.encode for utf-8

7f2560c

PyBytes::from_string

59476c6

youknowone force-pushed the str-encode branch from 785fb0f to 59476c6 Compare May 29, 2019 12:29

windelbouwman merged commit 121cd43 into RustPython:master May 29, 2019
