-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Add str.encode for utf-8 #901
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
vm/src/obj/objbyteinner.rs
Outdated
@@ -25,6 +25,18 @@ pub struct PyByteInner { | |||
pub elements: Vec<u8>, | |||
} | |||
|
|||
pub fn vec_from_str(value: &str, encoding: &str, vm: &VirtualMachine) -> PyResult<Vec<u8>> { | |||
let encoding = encoding.to_lowercase(); | |||
if encoding == "utf-8" || encoding == "u8" || encoding == "utf8" || encoding == "utf_8" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking bytes.decode I drafted a normalize_encoding like cpython. maybe it could be usable :
//same algorithm as cpython
pub fn normalize_encoding(encoding: &str) -> String {
let mut res = String::new();
let mut punct = false;
for c in encoding.chars() {
if c.is_alphanumeric() || c == '.' {
if punct && !res.is_empty() {
res.push('_')
}
res.push(c.to_ascii_lowercase());
punct = false;
} else {
punct = true;
}
}
res
}
you'll see on the same file some ideas to deal with errors.
https://github.com/jgirardet/RustPython/blob/1736d63932d5b7b4a6093eed62019255545888b2/vm/src/stdlib/_codecs.rs#L66
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both the function and the codec module looks really good. And I just recognized it is not a part of 847 😞 . Will you create a small PR with the part to adapt? Or do you prefer I import some required parts from your work? Just waiting for future and editing later will work too if it will be finally merged in future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I 'm waiting #847 to be merge, and next there is another big PR (depending of 847) with other bytes method. I 'll wait everything is reviewed/merge( it's much work) and then I will propose something for bytes.decode.
For no just take what you need, i'll see after.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this function in objbyteinner.rs
, thanks
vm/src/obj/objbyteinner.rs
Outdated
@@ -25,6 +25,18 @@ pub struct PyByteInner { | |||
pub elements: Vec<u8>, | |||
} | |||
|
|||
pub fn vec_from_str(value: &str, encoding: &str, vm: &VirtualMachine) -> PyResult<Vec<u8>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe implementing from_string
to pybyteinner ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was one of my early idea but then I found it is hard to turns PyByteInner
into PyBytes
because there is no interface to create a PyBytes
from inner. Do you have a good solution for that? For now, I think creating a vector is good enough for this kind of minimal charset transition API. Because it doesn't use any feature of PyBytesInner
and easy to transit to it once it is required. Tell me if I missed something. I feel like I am very new to the str/bytes (actually for everything in this project, but especially).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is a try_from_object in #847 which creates a PyByteInner from a PyObjectRef(which could be bytes, byterray, memoryview). I think it make the API more consitent to use use a PyByteInner::from_string and the use PyByteInner.elements
It's a personal opinion and I'm not in charge of this project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added PyBytes::from_string and PyByteInner::from_string
#[pymethod] | ||
fn encode( | ||
&self, | ||
encoding: OptionalArg<PyObjectRef>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use PyStringRef to avoid type checking for "encoding" and "errors" wich will always be string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By trying it, I found I can't control the exception message with it. Do you have any idea about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RustPython doesn't aim to have exactly the same error message as cpython. But it's better to let the type checking "automatic" in function type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about error messages. If we can keep them the same or similar with small cost, why not? Is there priority about this? @windelbouwman
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error messages to my opinion do not have to be similar to cpython. While I worked with rust, I find the error messages super useful in rust, maybe we could try something new in rustpython as well with regards to error messages. So in this case, I agree with @jgirardet to use the type checker with PyStringRef
in this case.
vm/src/obj/objstr.rs
Outdated
}, | ||
)?; | ||
|
||
let bytes = PyBytes::new(objbyteinner::vec_from_str(&self.value, &encoding, vm)?); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe vm.ctx.new_bytes
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is exactly what I needed
Codecov Report
@@ Coverage Diff @@
## master #901 +/- ##
==========================================
+ Coverage 64.98% 65.03% +0.05%
==========================================
Files 96 96
Lines 16755 16773 +18
Branches 3734 3734
==========================================
+ Hits 10888 10909 +21
Misses 3348 3348
+ Partials 2519 2516 -3
Continue to review full report at Codecov.
|
9d06c91
to
785fb0f
Compare
@@ -314,6 +323,20 @@ impl ByteInnerSplitlinesOptions { | |||
} | |||
|
|||
impl PyByteInner { | |||
pub fn from_string(value: &str, encoding: &str, vm: &VirtualMachine) -> PyResult<Self> { | |||
let normalized = normalize_encoding(encoding); | |||
if normalized == "utf_8" || normalized == "utf8" || normalized == "u8" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
u8
is in encoding.aliases so I think it is redundant here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking at cpython code unicode.encode uses a getdefaultencoding function which return "utf-8" (which is a different behaviour than bytes.decode). You could do it add it and at the same time add the getdefaultencoding to the sys module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About u8
, we don't have codec for now. So it is not redundant at this time. We can refactor it later once we have codec. About getdefaultencoding, it is good point, Thanks. I will go with it for next step.
@youknowone I guess this PR looks pretty good. Could you have a look at the conflicted file? Then I could merge this. |
No description provided.