Skip to content

Add str.encode for utf-8 #901

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 29, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 10 additions & 0 deletions tests/snippets/strings.py
Original file line number Diff line number Diff line change
Expand Up @@ -206,3 +206,13 @@ def try_mutate_str():
word[0] = 'x'

assert_raises(TypeError, try_mutate_str)

ss = ['Hello', '안녕', '👋']
bs = [b'Hello', b'\xec\x95\x88\xeb\x85\x95', b'\xf0\x9f\x91\x8b']

for s, b in zip(ss, bs):
assert s.encode() == b

for s, b, e in zip(ss, bs, ['u8', 'U8', 'utf-8', 'UTF-8', 'utf_8']):
assert s.encode(e) == b
# assert s.encode(encoding=e) == b
11 changes: 11 additions & 0 deletions vm/src/function.rs
Original file line number Diff line number Diff line change
Expand Up @@ -374,6 +374,17 @@ impl<T> OptionalArg<T> {
Missing => f(),
}
}

pub fn map_or_else<U, D, F>(self, default: D, f: F) -> U
where
D: FnOnce() -> U,
F: FnOnce(T) -> U,
{
match self {
Present(value) => f(value),
Missing => default(),
}
}
}

impl<T> FromArgs for OptionalArg<T>
Expand Down
47 changes: 35 additions & 12 deletions vm/src/obj/objbyteinner.rs
Original file line number Diff line number Diff line change
Expand Up @@ -74,24 +74,33 @@ pub struct ByteInnerNewOptions {
encoding: OptionalArg<PyStringRef>,
}

//same algorithm as cpython
pub fn normalize_encoding(encoding: &str) -> String {
let mut res = String::new();
let mut punct = false;

for c in encoding.chars() {
if c.is_alphanumeric() || c == '.' {
if punct && !res.is_empty() {
res.push('_')
}
res.push(c.to_ascii_lowercase());
punct = false;
} else {
punct = true;
}
}
res
}

impl ByteInnerNewOptions {
pub fn get_value(self, vm: &VirtualMachine) -> PyResult<PyByteInner> {
// First handle bytes(string, encoding[, errors])
if let OptionalArg::Present(enc) = self.encoding {
if let OptionalArg::Present(eval) = self.val_option {
if let Ok(input) = eval.downcast::<PyString>() {
let encoding = enc.as_str();
if encoding.to_lowercase() == "utf8" || encoding.to_lowercase() == "utf-8"
// TODO: different encoding
{
return Ok(PyByteInner {
elements: input.value.as_bytes().to_vec(),
});
} else {
return Err(
vm.new_value_error(format!("unknown encoding: {}", encoding)), //should be lookup error
);
}
let inner = PyByteInner::from_string(&input.value, enc.as_str(), vm)?;
return Ok(inner);
} else {
return Err(vm.new_type_error("encoding without a string argument".to_string()));
}
Expand Down Expand Up @@ -311,6 +320,20 @@ impl ByteInnerSplitlinesOptions {
}

impl PyByteInner {
pub fn from_string(value: &str, encoding: &str, vm: &VirtualMachine) -> PyResult<Self> {
let normalized = normalize_encoding(encoding);
if normalized == "utf_8" || normalized == "utf8" || normalized == "u8" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

u8 is in encoding.aliases so I think it is redundant here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking at cpython code unicode.encode uses a getdefaultencoding function which return "utf-8" (which is a different behaviour than bytes.decode). You could do it add it and at the same time add the getdefaultencoding to the sys module

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About u8, we don't have codec for now. So it is not redundant at this time. We can refactor it later once we have codec. About getdefaultencoding, it is good point, Thanks. I will go with it for next step.

Ok(PyByteInner {
elements: value.as_bytes().to_vec(),
})
} else {
// TODO: different encoding
Err(
vm.new_value_error(format!("unknown encoding: {}", encoding)), // should be lookup error
)
}
}

pub fn repr(&self) -> PyResult<String> {
let mut res = String::with_capacity(self.elements.len());
for i in self.elements.iter() {
Expand Down
7 changes: 7 additions & 0 deletions vm/src/obj/objbytes.rs
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,13 @@ impl PyBytes {
inner: PyByteInner { elements },
}
}

pub fn from_string(value: &str, encoding: &str, vm: &VirtualMachine) -> PyResult<Self> {
Ok(PyBytes {
inner: PyByteInner::from_string(value, encoding, vm)?,
})
}

pub fn get_value(&self) -> &[u8] {
&self.inner.elements
}
Expand Down
26 changes: 26 additions & 0 deletions vm/src/obj/objstr.rs
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ use crate::pyobject::{
};
use crate::vm::VirtualMachine;

use super::objbytes::PyBytes;
use super::objdict::PyDict;
use super::objint::{self, PyInt};
use super::objnone::PyNone;
Expand Down Expand Up @@ -957,6 +958,31 @@ impl PyString {
}
}
}

#[pymethod]
fn encode(
&self,
encoding: OptionalArg<PyObjectRef>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use PyStringRef to avoid type checking for "encoding" and "errors" wich will always be string

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By trying it, I found I can't control the exception message with it. Do you have any idea about it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RustPython doesn't aim to have exactly the same error message as cpython. But it's better to let the type checking "automatic" in function type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about error messages. If we can keep them the same or similar with small cost, why not? Is there priority about this? @windelbouwman

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error messages to my opinion do not have to be similar to cpython. While I worked with rust, I find the error messages super useful in rust, maybe we could try something new in rustpython as well with regards to error messages. So in this case, I agree with @jgirardet to use the type checker with PyStringRef in this case.

_errors: OptionalArg<PyObjectRef>,
vm: &VirtualMachine,
) -> PyResult {
let encoding = encoding.map_or_else(
|| Ok("utf-8".to_string()),
|v| {
if objtype::isinstance(&v, &vm.ctx.str_type()) {
Ok(get_value(&v))
} else {
Err(vm.new_type_error(format!(
"encode() argument 1 must be str, not {}",
v.class().name
)))
}
},
)?;

let encoded = PyBytes::from_string(&self.value, &encoding, vm)?;
Ok(encoded.into_pyobject(vm)?)
}
}

impl PyValue for PyString {
Expand Down