Skip to content

Marshaling support for ints, floats, strs, lists, dicts, tuples #3506

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Feb 21, 2022

Conversation

jakearmendariz
Copy link
Contributor

My work on #3458. Marshaling still needs support for tuples, dicts, sets, etc. But I wanted to post a PR for some feedback (and possibly an early merge).

How It Works

When marshaled, each buffer is prefixed with a byte indicating the type of data. Then while decoding we can check the first byte to see which datatype the buffer holds.

Lists are more complicated because they have different types, and can be recursive (lists inside of lists). Each list has a prefix of its size, and then each element's data is stored directly after another 8 bytes indicating its size.
Resulting in [LIST_BYTE, <bytes for size>, <size><element>,<size><element>...]

Already there are some improvement areas. Most notably, each list/element size is stored as usize (u64) and that can probably be changed to a u32.

Testing

I added a python file for testing. It passes for RustPython, marshaling/unmarshaling primitives and lists.

However, CPython marshals their data differently. I am not sure if they are supposed to match, however, I thought it was best to optimize the code to fit the RustPython Datastructures and not worry about how CPython handles marshaling.

Closing

Please let me know if there are any problems or more idiomatic ways of writing this. I've never written rust code outside of personal projects so I expect some issues.

Thanks!

@@ -30,25 +78,108 @@ mod decl {
Ok(())
}

/// Read the next 8 bytes of a slice, convert to usize
/// Has side effect of increasing position pointer
fn eat_usize(bytes: &[u8], position: &mut usize) -> usize {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Errors could be stemming from this function. Not all architectures have a usize of 8 bytes (could be 2, 4, 8). So instead of reading usize, I am going to switch marshaling to use u32 despite architecture, then convert back to usize. So the only possible error is an element or list being longer than 2^32 bytes, in which case we send an error that it can't be marshaled

@@ -11,7 +11,7 @@ use crate::{
sliceable::wrap_index,
types::{Constructor, Unconstructible},
PyObject, PyObjectPayload, PyObjectRef, PyObjectView, PyObjectWrap, PyRef, PyResult,
TryFromBorrowedObject, TypeProtocol, VirtualMachine,
TryFromBorrowedObject, TypeProtocol, VirtualMachine, PyValue,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cargo fmt issues stemming from this file

@youknowone
Copy link
Member

youknowone commented Dec 30, 2021

Hi, I am sorry for late review, but I think I will take a few more vacations so it might be delayed a few more days, I am sorry!

We don't need data compatibility with CPython, because marshal requires the serialization used only in the exact python version. So please don't worry about it. For fmt issue, please run cargo fmt --all before commit. It works great for most of cases.

I have no idea for mac/ubuntu test failures for now.

@jakearmendariz jakearmendariz changed the title Marshaling support for ints, floats, strs, lists Marshaling support for ints, floats, strs, lists, dicts, tuples Jan 11, 2022
@jakearmendariz
Copy link
Contributor Author

I added support for dicts and tuples as well. I'll stop adding features until after a review.

Thanks and I hope to hear back about this PR soon!

Copy link
Member

@youknowone youknowone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am really sorry for late review. To make a little excuse, recently I was horribly occupied by my main job.

The core approach looks good. I have a few suggestions about Rust and the project conventions.
I think you brought test_marshal.py from CPython. Could you revise your commit message to mark the CPython version you brougt it from? here is a nice example: https://github.com/RustPython/RustPython/pull/3474/commits

I left other suggestions on the codes.

Comment on lines +72 to +74
pub(crate) fn from_entries(entries: DictContentType) -> Self {
Self { entries }
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are encapsulating DictContentType from outside. If we can avoid to expose this type as more as possible, I would like it.
Could you check if you can refactor PyDict::merge_object a little bit to expose merge from iterator part? it starts from line 108.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still working on this, everything else is updated. Thanks for all the feedback btw!

Copy link
Member

@youknowone youknowone Jan 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great! please let me know if you meet any blocker

Comment on lines 83 to 84
let pytuple_items: Vec<PyObjectRef> = pytuple.fast_getitems();
let mut tuple_bytes = dump_list(pytuple_items.iter(), vm)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let pytuple_items: Vec<PyObjectRef> = pytuple.fast_getitems();
let mut tuple_bytes = dump_list(pytuple_items.iter(), vm)?;
let mut tuple_bytes = dump_list(pytuple.as_slice().iter(), vm)?;

let dict = DictContentType::default();
for elem in iterable {
let items = match_class!(match elem.clone() {
pytuple @ PyTuple => pytuple.fast_getitems(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pytuple @ PyTuple => pytuple.fast_getitems(),
pytuple @ PyTuple => pytuple.as_slice().to_vec(),

Comment on lines 89 to 92
pub(crate) fn fast_getitems(&self) -> Vec<PyObjectRef> {
(*self.elements.clone()).to_vec()
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub(crate) fn fast_getitems(&self) -> Vec<PyObjectRef> {
(*self.elements.clone()).to_vec()
}

this function is almost covered by as_slice()

Comment on lines 93 to 95
pub(crate) fn from_elements(elements: Box<[PyObjectRef]>) -> Self {
Self { elements }
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyTuple::new_ref(elements, vm) is same as PyTuple::from_elements(elements).into_pyobject(vm)

let mut byte_list = size_to_bytes(pyobjs.len(), vm)?.to_vec();
// For each element, dump into binary, then add its length and value.
for element in pyobjs {
let element_bytes: PyBytes = dumps(element.clone(), vm)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reuse dumps for this, I think we need another function not to create PyBytes.
How about renaming dumps to _dumps by changing it to returns Vec<u8> and adding a new dumps as a thin wrapper? Then calling _dumps will be simpler than this one.

}
pystr @ PyStr => {
let mut str_bytes = pystr.as_str().as_bytes().to_vec();
str_bytes.insert(0, STR_BYTE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than insert(0, ...), swapping the order will prevent huge memcpy in bad cases. same for list and tuples.

})
.collect();
// Converts list of tuples to list, dump into binary
let mut dict_bytes = dumps(elements.into_pyobject(vm), vm)?.deref().to_vec();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is a list, can dump_list be reused?

Comment on lines 126 to 135
let length_as_u32 =
u32::from_le_bytes(match bytes[*position..(*position + 4)].try_into() {
Ok(length_as_u32) => length_as_u32,
Err(_) => {
return Err(
vm.new_buffer_error("Could not read u32 size from byte array".to_owned())
)
}
});
*position += 4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing slice itself is a preferred way to having a fixed buffer and position. (function signature does not match to this code for now)

Suggested change
let length_as_u32 =
u32::from_le_bytes(match bytes[*position..(*position + 4)].try_into() {
Ok(length_as_u32) => length_as_u32,
Err(_) => {
return Err(
vm.new_buffer_error("Could not read u32 size from byte array".to_owned())
)
}
});
*position += 4;
let (len_bytes, other_bytes) = bytes.split_at(4);
let len = u32::from_le_bytes(len_bytes.try_into().map_err(|_| vm.new_buffer_error("Could not read u32 size from byte array".to_owned())?);
bytes = other_bytes

Ok(PyCode {
code: vm.map_codeobj(code),
})
match buf[0] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buf.split_first() will be helpful

@youknowone
Copy link
Member

Could you also please rebase the code? Since broken CI is recently fixed, it will be green after rebase.

@youknowone
Copy link
Member

youknowone commented Jan 24, 2022

Most of project using git including us prefer patches(PR) never use merge but only rebases. Could you try rebase if possible? When upstream is the official repository, git rebase upstream/main will be helpful. Otherwise it will be squashed into single commit except for CPython-originated parts. (it does not mean it is a problem)

@jakearmendariz jakearmendariz force-pushed the main branch 2 times, most recently from 879043e to feb3b31 Compare January 27, 2022 19:31
@jakearmendariz
Copy link
Contributor Author

I tried to rebase some of the earlier commits by squashing each of them into the first. However, it looks like my git push --force-with-lease brought in many other commits to this PR. I still need to fetch upstream into this PR as well, but I am struggling to do this without creating more merge commits/problems.

Any tips haha? Should I create an entirely new PR, not sure if it's too far gone.
Thanks again for the help and feedback!

@youknowone
Copy link
Member

The history looks twisted a bit, but I think running git fetch upstream before git rebase upstream/main again will fix the problem. Please let me know if you feel hard to fix the problems yourself.

You don't need to make another PR, because one PR is always bound to one branch. You can fix it by force push whenever you want.

@fanninpm
Copy link
Contributor

I just use git push --force when I rebase.

@youknowone
Copy link
Member

@jakearmendariz I rebased it and pushed it to your github branch

@jakearmendariz
Copy link
Contributor Author

Sorry I have been so slow at responding and making changes, took a lot of hard classes this quarter and midterms have been killer.

A couple of notes, test_marshal.py was a file I created for testing marshaling, I didn't take it from CPython. Let me know if I should remove it or make it more clear that the file wasn't copied from CPython.

Also, I couldn't refactor from_entries to use for marshaling. Not sure if it's a blocker for this PR, or something I can try to fix in a subsequent PR.

I think that addresses all of the PR comments, everything else should have been fixed.

Thanks again for the help!

@fanninpm
Copy link
Contributor

test_marshal.py was a file I created for testing marshaling, I didn't take it from CPython. Let me know if I should remove it or make it more clear that the file wasn't copied from CPython.

There's no real need to worry about that. A similar situation exists for test_dis.py.

@youknowone
Copy link
Member

For test_marshal.py, could you put it in extra_tests/snippets rather than Lib/test?

@jakearmendariz
Copy link
Contributor Author

Done!

@fanninpm
Copy link
Contributor

For test_marshal.py, could you put it in extra_tests/snippets rather than Lib/test?

I thought about that, but the snippets are run by both CPython and RustPython, and I'm guessing there's a chance that the snippet would fail when CPython runs it.

@jakearmendariz
Copy link
Contributor Author

For test_marshal.py, could you put it in extra_tests/snippets rather than Lib/test?

I thought about that, but the snippets are run by both CPython and RustPython, and I'm guessing there's a chance that the snippet would fail when CPython runs it.

Yep, the snippets failed the tests because of marshal.py (CPython runs it)
Should I move the test_marshal.py back to Lib/test?

@fanninpm
Copy link
Contributor

Should I move the test_marshal.py back to Lib/test?

Go ahead and move it back, (optionally) adding a note explaining that RustPython's version of test_marshal.py is necessarily different from CPython's version.

@youknowone
Copy link
Member

I don't think putting original tests to Lib/test is a preferrable way. Rather than that, we don't need to test specific serialization form of marshal - because it is implementation-specific. I suggest to test roundtrip in the test instead of byte-by-byte comparison

@jakearmendariz
Copy link
Contributor Author

I suggest to test roundtrip in the test instead of byte-by-byte comparison

Updated! No byte-str comparison, passes test for RustPython and CPython

@youknowone
Copy link
Member

@jakearmendariz thank you for long time effort. there is a error for test_exceptions.py related to marshal. Could you also check this error please?

======================================================================
ERROR: testRaising (test.test_exceptions.ExceptionTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/RustPython/RustPython/vm/pylib-crate/Lib/test/test_exceptions.py", line 67, in testRaising
    unlink(TESTFN)
  File "/home/runner/work/RustPython/RustPython/vm/pylib-crate/Lib/test/test_exceptions.py", line 63, in testRaising
    pass
  File "/home/runner/work/RustPython/RustPython/vm/pylib-crate/Lib/test/test_exceptions.py", line 63, in testRaising
    pass
  File "/home/runner/work/RustPython/RustPython/vm/pylib-crate/Lib/test/test_exceptions.py", line 61, in testRaising
    marshal.loads(b'')
ValueError: EOF where object expected.

@jakearmendariz
Copy link
Contributor Author

Changed to send an EOF error when an empty byte string is passed in!

Copy link
Member

@youknowone youknowone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great! thank you very much!

@youknowone youknowone merged commit ef90d09 into RustPython:main Feb 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants