Marshaling support for ints, floats, strs, lists, dicts, tuples #3506

jakearmendariz · 2021-12-20T07:51:47Z

My work on #3458. Marshaling still needs support for tuples, dicts, sets, etc. But I wanted to post a PR for some feedback (and possibly an early merge).

How It Works

When marshaled, each buffer is prefixed with a byte indicating the type of data. Then while decoding we can check the first byte to see which datatype the buffer holds.

Lists are more complicated because their elements can have different types and can be nested (lists inside of lists). Each list is prefixed with its element count, and each element's data is stored directly after 8 bytes indicating that element's size.
This results in [LIST_BYTE, <bytes for size>, <size><element>, <size><element>, ...]
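The layout described above can be sketched in a few lines of Rust. This is only an illustration under stated assumptions: the tag values, the 8-byte integer payload, and the little-endian encoding of sizes are hypothetical choices, and the function signatures are simplified (no VirtualMachine, no PyObjectRef), so it is not the PR's actual code.

```rust
// Hypothetical tag values; the PR does not state the actual constants.
const INT_BYTE: u8 = b'i';
const LIST_BYTE: u8 = b'[';

/// Encode an integer as its tag byte followed by 8 little-endian bytes.
/// (A simplification: real Python ints are arbitrary precision.)
fn dump_int(value: i64) -> Vec<u8> {
    let mut out = vec![INT_BYTE];
    out.extend_from_slice(&value.to_le_bytes());
    out
}

/// Encode already-marshaled elements as
/// [LIST_BYTE, <count>, <size><element>, <size><element>, ...].
/// Counts and sizes are written as 8-byte values, as in the first draft.
fn dump_list(elements: &[Vec<u8>]) -> Vec<u8> {
    let mut out = vec![LIST_BYTE];
    out.extend_from_slice(&(elements.len() as u64).to_le_bytes());
    for element in elements {
        out.extend_from_slice(&(element.len() as u64).to_le_bytes());
        out.extend_from_slice(element);
    }
    out
}
```

Because each element carries its own size, a nested list is just another element: its bytes are produced by dump_list and passed to the outer call like any other payload.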

Already there are some improvement areas. Most notably, each list/element size is stored as usize (u64) and that can probably be changed to a u32.

Testing

I added a Python file for testing. It passes on RustPython, marshaling and unmarshaling primitives and lists.

CPython marshals its data differently, though. I am not sure whether the two are supposed to match, but I thought it was best to optimize the code to fit the RustPython data structures and not worry about how CPython handles marshaling.

Closing

Please let me know if there are any problems or more idiomatic ways of writing this. I've never written Rust code outside of personal projects, so I expect some issues.

Thanks!

jakearmendariz · 2021-12-20T18:12:02Z

vm/src/stdlib/marshal.rs

@@ -30,25 +78,108 @@ mod decl {
        Ok(())
    }

+    /// Read the next 8 bytes of a slice, convert to usize
+    /// Has side effect of increasing position pointer
+    fn eat_usize(bytes: &[u8], position: &mut usize) -> usize {


The errors could be stemming from this function. Not all architectures have an 8-byte usize (it could be 2, 4, or 8 bytes). So instead of reading a usize, I am going to switch marshaling to use u32 regardless of architecture, then convert back to usize. The only possible error is then an element or list whose size does not fit in a u32 (2^32 or more), in which case we raise an error saying it can't be marshaled.
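A minimal sketch of that fixed-width size encoding follows. The name size_to_bytes appears later in the diff, but its signature here is an assumption, bytes_to_size is a hypothetical helper, and String stands in for the VM's error type.

```rust
use std::convert::TryFrom;

/// Encode a length as 4 little-endian bytes, independent of the target's
/// usize width. Fails if the length does not fit in a u32.
fn size_to_bytes(size: usize) -> Result<[u8; 4], String> {
    let size = u32::try_from(size)
        .map_err(|_| format!("size {} is too large to marshal", size))?;
    Ok(size.to_le_bytes())
}

/// Decode 4 little-endian bytes back into a usize.
/// The conversion is checked because usize may be narrower than 32 bits.
fn bytes_to_size(bytes: [u8; 4]) -> Result<usize, String> {
    usize::try_from(u32::from_le_bytes(bytes))
        .map_err(|_| "size does not fit in usize on this target".to_owned())
}
```

Using checked conversions in both directions keeps the on-disk width fixed at 4 bytes while still reporting an error instead of silently truncating.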

jakearmendariz · 2021-12-20T18:12:37Z

vm/src/protocol/buffer.rs

@@ -11,7 +11,7 @@ use crate::{
    sliceable::wrap_index,
    types::{Constructor, Unconstructible},
    PyObject, PyObjectPayload, PyObjectRef, PyObjectView, PyObjectWrap, PyRef, PyResult,
-    TryFromBorrowedObject, TypeProtocol, VirtualMachine,
+    TryFromBorrowedObject, TypeProtocol, VirtualMachine, PyValue,


Cargo fmt issues stemming from this file

youknowone · 2021-12-30T06:30:40Z

Hi, I am sorry for the late review. I will be taking a few more days of vacation, so it might be delayed a few more days. I am sorry!

We don't need data compatibility with CPython, because marshal only requires the serialized data to be readable by the exact same Python version. So please don't worry about it. For the fmt issue, please run cargo fmt --all before committing. It works great in most cases.

I have no idea about the mac/ubuntu test failures for now.

jakearmendariz · 2022-01-11T23:50:05Z

I added support for dicts and tuples as well. I'll stop adding features until after a review.

Thanks and I hope to hear back about this PR soon!

youknowone

I am really sorry for the late review. To make a small excuse, I have recently been horribly occupied by my main job.

The core approach looks good. I have a few suggestions about Rust and the project conventions.
I think you brought test_marshal.py from CPython. Could you revise your commit message to note the CPython version you brought it from? Here is a nice example: https://github.com/RustPython/RustPython/pull/3474/commits

I left other suggestions in the code.

youknowone · 2022-01-17T08:13:40Z

vm/src/builtins/dict.rs

+    pub(crate) fn from_entries(entries: DictContentType) -> Self {
+        Self { entries }
+    }


I think we are encapsulating DictContentType from the outside. If we can avoid exposing this type as much as possible, I would prefer that.
Could you check whether you can refactor PyDict::merge_object a little to expose the merge-from-iterator part? It starts at line 108.

Still working on this, everything else is updated. Thanks for all the feedback btw!

Great! Please let me know if you hit any blockers.

youknowone · 2022-01-17T08:19:16Z

vm/src/stdlib/marshal.rs

+                let pytuple_items: Vec<PyObjectRef> = pytuple.fast_getitems();
+                let mut tuple_bytes = dump_list(pytuple_items.iter(), vm)?;


Suggested change

-                let pytuple_items: Vec<PyObjectRef> = pytuple.fast_getitems();
-                let mut tuple_bytes = dump_list(pytuple_items.iter(), vm)?;
+                let mut tuple_bytes = dump_list(pytuple.as_slice().iter(), vm)?;

youknowone · 2022-01-17T08:20:14Z

vm/src/stdlib/marshal.rs

+        let dict = DictContentType::default();
+        for elem in iterable {
+            let items = match_class!(match elem.clone() {
+                pytuple @ PyTuple => pytuple.fast_getitems(),


Suggested change

-                pytuple @ PyTuple => pytuple.fast_getitems(),
+                pytuple @ PyTuple => pytuple.as_slice().to_vec(),

youknowone · 2022-01-17T08:20:45Z

vm/src/builtins/tuple.rs

+    pub(crate) fn fast_getitems(&self) -> Vec<PyObjectRef> {
+        (*self.elements.clone()).to_vec()
+    }
+


Suggested change

-    pub(crate) fn fast_getitems(&self) -> Vec<PyObjectRef> {
-        (*self.elements.clone()).to_vec()
-    }

this function is almost covered by as_slice()

youknowone · 2022-01-17T08:22:32Z

vm/src/builtins/tuple.rs

+    pub(crate) fn from_elements(elements: Box<[PyObjectRef]>) -> Self {
+        Self { elements }
+    }


PyTuple::new_ref(elements, vm) is the same as PyTuple::from_elements(elements).into_pyobject(vm)

youknowone · 2022-01-17T09:21:16Z

vm/src/stdlib/marshal.rs

+        let mut byte_list = size_to_bytes(pyobjs.len(), vm)?.to_vec();
+        // For each element, dump into binary, then add its length and value.
+        for element in pyobjs {
+            let element_bytes: PyBytes = dumps(element.clone(), vm)?;


To reuse dumps for this, I think we need another function that does not create PyBytes.
How about renaming dumps to _dumps, changing it to return Vec<u8>, and adding a new dumps as a thin wrapper? Then calling _dumps will be simpler than this.
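The split suggested here might look roughly like the following. Bytes and Value are hypothetical stand-ins for PyBytes and a Python object, the tag bytes are made up, and the VirtualMachine parameter is omitted, so treat this as a sketch of the shape rather than the project's code.

```rust
/// Stand-in for RustPython's PyBytes; hypothetical, for illustration only.
struct Bytes(Vec<u8>);

/// Stand-in for a Python object; two variants keep the sketch short.
enum Value {
    Int(i32),
    List(Vec<Value>),
}

/// Internal worker: returns plain bytes so recursive calls stay cheap.
fn _dumps(value: &Value) -> Vec<u8> {
    match value {
        Value::Int(n) => {
            let mut out = vec![b'i'];
            out.extend_from_slice(&n.to_le_bytes());
            out
        }
        Value::List(items) => {
            let mut out = vec![b'['];
            out.extend_from_slice(&(items.len() as u32).to_le_bytes());
            for item in items {
                // Recursion works on Vec<u8>; no wrapper object per element.
                let encoded = _dumps(item);
                out.extend_from_slice(&(encoded.len() as u32).to_le_bytes());
                out.extend_from_slice(&encoded);
            }
            out
        }
    }
}

/// Public entry point: a thin wrapper that builds the bytes object once.
fn dumps(value: &Value) -> Bytes {
    Bytes(_dumps(value))
}
```

Only the outermost call pays for constructing the wrapper object; every nested element is encoded through _dumps directly.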

youknowone · 2022-01-17T09:26:59Z

vm/src/stdlib/marshal.rs

+            }
+            pystr @ PyStr => {
+                let mut str_bytes = pystr.as_str().as_bytes().to_vec();
+                str_bytes.insert(0, STR_BYTE);


Rather than insert(0, ...), pushing the type byte first and then appending the payload will prevent a huge memcpy in bad cases. The same applies to lists and tuples.
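To make the difference concrete, here is a small sketch of both orderings. STR_BYTE is given a hypothetical value and the functions take a plain &str instead of a PyStr.

```rust
// Hypothetical tag value; the PR does not state the actual constant.
const STR_BYTE: u8 = b's';

/// Original pattern: inserting at index 0 shifts every existing byte
/// one position to the right (and may reallocate first).
fn dump_str_insert(s: &str) -> Vec<u8> {
    let mut out = s.as_bytes().to_vec();
    out.insert(0, STR_BYTE);
    out
}

/// Suggested pattern: reserve once, write the tag, then append the payload.
fn dump_str_tag_first(s: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(1 + s.len());
    out.push(STR_BYTE);
    out.extend_from_slice(s.as_bytes());
    out
}
```

Both functions produce identical bytes; the second simply avoids moving the whole payload after it has been copied.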

youknowone · 2022-01-17T09:30:28Z

vm/src/stdlib/marshal.rs

+                    })
+                    .collect();
+                // Converts list of tuples to list, dump into binary
+                let mut dict_bytes = dumps(elements.into_pyobject(vm), vm)?.deref().to_vec();


If this is a list, can dump_list be reused?

youknowone · 2022-01-17T09:54:01Z

vm/src/stdlib/marshal.rs

+        let length_as_u32 =
+            u32::from_le_bytes(match bytes[*position..(*position + 4)].try_into() {
+                Ok(length_as_u32) => length_as_u32,
+                Err(_) => {
+                    return Err(
+                        vm.new_buffer_error("Could not read u32 size from byte array".to_owned())
+                    )
+                }
+            });
+        *position += 4;


Advancing the slice itself is preferred over keeping a fixed buffer and a position. (The function signature does not match this code for now.)

Suggested change

-        let length_as_u32 =
-            u32::from_le_bytes(match bytes[*position..(*position + 4)].try_into() {
-                Ok(length_as_u32) => length_as_u32,
-                Err(_) => {
-                    return Err(
-                        vm.new_buffer_error("Could not read u32 size from byte array".to_owned())
-                    )
-                }
-            });
-        *position += 4;
+        let (len_bytes, other_bytes) = bytes.split_at(4);
+        let len = u32::from_le_bytes(len_bytes.try_into().map_err(|_| {
+            vm.new_buffer_error("Could not read u32 size from byte array".to_owned())
+        })?);
+        bytes = other_bytes;
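A self-contained version of the slice-advancing idea could look like this. The helper name eat_u32 and the &mut &[u8] signature are assumptions, and String stands in for the VM's buffer error. Note that split_at panics when the slice is shorter than the split point, so the sketch checks the length first.

```rust
use std::convert::TryInto;

/// Read a little-endian u32 from the front of `bytes` and advance the slice
/// past it. On failure the slice is left untouched.
fn eat_u32(bytes: &mut &[u8]) -> Result<u32, String> {
    if bytes.len() < 4 {
        return Err("Could not read u32 size from byte array".to_owned());
    }
    let (len_bytes, rest) = bytes.split_at(4);
    let array: [u8; 4] = len_bytes
        .try_into()
        .map_err(|_| "Could not read u32 size from byte array".to_owned())?;
    // Advancing the slice replaces the separate position counter.
    *bytes = rest;
    Ok(u32::from_le_bytes(array))
}
```

The caller keeps a single `&[u8]` cursor and passes `&mut cursor` to each reader, so there is no index that can drift out of sync with the buffer.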

youknowone · 2022-01-17T09:59:25Z

vm/src/stdlib/marshal.rs

-        Ok(PyCode {
-            code: vm.map_codeobj(code),
-        })
+        match buf[0] {


buf.split_first() will be helpful
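As an illustration of why split_first helps: indexing buf[0] panics on an empty buffer, while split_first returns None and hands back the remaining payload in the same step. The tag values, the Kind enum, and read_tag below are hypothetical, and String stands in for the VM's error type.

```rust
// Hypothetical tag values; the PR does not state the actual constants.
const INT_BYTE: u8 = b'i';
const STR_BYTE: u8 = b's';

#[derive(Debug, PartialEq)]
enum Kind {
    Int,
    Str,
}

/// Split off the type byte and return it together with the payload.
/// An empty buffer becomes an error instead of a panic.
fn read_tag(buf: &[u8]) -> Result<(Kind, &[u8]), String> {
    let (tag, payload) = buf
        .split_first()
        .ok_or_else(|| "EOF where object expected.".to_owned())?;
    match *tag {
        INT_BYTE => Ok((Kind::Int, payload)),
        STR_BYTE => Ok((Kind::Str, payload)),
        other => Err(format!("unknown type byte: {}", other)),
    }
}
```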

youknowone · 2022-01-17T10:01:27Z

Could you also please rebase the code? Since the broken CI was recently fixed, it will be green after the rebase.

youknowone · 2022-01-24T09:31:29Z

Most projects using git, including ours, prefer that patches (PRs) never use merges, only rebases. Could you try a rebase if possible? When upstream is the official repository, git rebase upstream/main will be helpful. Otherwise it will be squashed into a single commit, except for the CPython-originated parts. (That does not mean it is a problem.)

jakearmendariz · 2022-01-27T19:53:50Z

I tried to rebase some of the earlier commits by squashing each of them into the first. However, it looks like my git push --force-with-lease brought in many other commits to this PR. I still need to fetch upstream into this PR as well, but I am struggling to do this without creating more merge commits/problems.

Any tips haha? Should I create an entirely new PR, not sure if it's too far gone.
Thanks again for the help and feedback!

youknowone · 2022-01-28T13:59:12Z

The history looks a bit twisted, but I think running git fetch upstream and then git rebase upstream/main again will fix the problem. Please let me know if you find it hard to fix the problems yourself.

You don't need to make another PR, because one PR is always bound to one branch. You can fix it by force push whenever you want.

fanninpm · 2022-01-29T20:39:39Z

I just use git push --force when I rebase.

youknowone · 2022-02-06T05:16:20Z

@jakearmendariz I rebased it and pushed it to your github branch

jakearmendariz · 2022-02-12T04:53:50Z

Sorry I have been so slow at responding and making changes, took a lot of hard classes this quarter and midterms have been killer.

A couple of notes, test_marshal.py was a file I created for testing marshaling, I didn't take it from CPython. Let me know if I should remove it or make it more clear that the file wasn't copied from CPython.

Also, I couldn't refactor from_entries to use for marshaling. Not sure if it's a blocker for this PR, or something I can try to fix in a subsequent PR.

I think that addresses all of the PR comments, everything else should have been fixed.

Thanks again for the help!

fanninpm · 2022-02-12T04:58:19Z

test_marshal.py was a file I created for testing marshaling, I didn't take it from CPython. Let me know if I should remove it or make it more clear that the file wasn't copied from CPython.

There's no real need to worry about that. A similar situation exists for test_dis.py.

youknowone · 2022-02-13T14:39:31Z

For test_marshal.py, could you put it in extra_tests/snippets rather than Lib/test?

jakearmendariz · 2022-02-13T16:13:22Z

Done!

fanninpm · 2022-02-13T17:38:17Z

For test_marshal.py, could you put it in extra_tests/snippets rather than Lib/test?

I thought about that, but the snippets are run by both CPython and RustPython, and I'm guessing there's a chance that the snippet would fail when CPython runs it.

jakearmendariz · 2022-02-14T18:27:16Z

For test_marshal.py, could you put it in extra_tests/snippets rather than Lib/test?

I thought about that, but the snippets are run by both CPython and RustPython, and I'm guessing there's a chance that the snippet would fail when CPython runs it.

Yep, the snippets failed the tests because of marshal.py (CPython runs it)
Should I move the test_marshal.py back to Lib/test?

fanninpm · 2022-02-18T19:10:39Z

Should I move the test_marshal.py back to Lib/test?

Go ahead and move it back, (optionally) adding a note explaining that RustPython's version of test_marshal.py is necessarily different from CPython's version.

youknowone · 2022-02-18T21:44:07Z

I don't think putting original tests in Lib/test is preferable. Besides, we don't need to test the specific serialization format of marshal, because it is implementation-specific. I suggest testing the roundtrip instead of a byte-by-byte comparison.
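The roundtrip idea can be shown with a toy encoder and decoder. The project's real test is a Python file; this Rust sketch only illustrates the property being checked. Value, the tag bytes, and the roundtrips helper are all hypothetical.

```rust
use std::convert::TryInto;

/// Stand-in for a Python object; hypothetical, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i32),
    Str(String),
}

/// Toy encoder: one tag byte followed by the payload.
fn dumps(value: &Value) -> Vec<u8> {
    match value {
        Value::Int(n) => {
            let mut out = vec![b'i'];
            out.extend_from_slice(&n.to_le_bytes());
            out
        }
        Value::Str(s) => {
            let mut out = vec![b's'];
            out.extend_from_slice(s.as_bytes());
            out
        }
    }
}

/// Toy decoder: the inverse of `dumps`.
fn loads(buf: &[u8]) -> Result<Value, String> {
    let (tag, payload) = buf.split_first().ok_or("EOF where object expected.")?;
    match *tag {
        b'i' => {
            let raw: [u8; 4] = payload
                .try_into()
                .map_err(|_| "bad int payload".to_owned())?;
            Ok(Value::Int(i32::from_le_bytes(raw)))
        }
        b's' => String::from_utf8(payload.to_vec())
            .map(Value::Str)
            .map_err(|e| e.to_string()),
        other => Err(format!("unknown type byte: {}", other)),
    }
}

/// The property under test: decoding the encoding gives back the value.
/// Nothing here depends on the exact bytes produced.
fn roundtrips(value: &Value) -> bool {
    loads(&dumps(value)) == Ok(value.clone())
}
```

A test written this way keeps passing if the byte layout changes, which is why it can run unmodified against two implementations with different formats.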

jakearmendariz · 2022-02-19T00:31:35Z

I suggest to test roundtrip in the test instead of byte-by-byte comparison

Updated! No byte-str comparison, passes test for RustPython and CPython

youknowone · 2022-02-21T04:58:03Z

@jakearmendariz thank you for your long effort. There is an error in test_exceptions.py related to marshal. Could you also check this error, please?

======================================================================
ERROR: testRaising (test.test_exceptions.ExceptionTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/RustPython/RustPython/vm/pylib-crate/Lib/test/test_exceptions.py", line 67, in testRaising
    unlink(TESTFN)
  File "/home/runner/work/RustPython/RustPython/vm/pylib-crate/Lib/test/test_exceptions.py", line 63, in testRaising
    pass
  File "/home/runner/work/RustPython/RustPython/vm/pylib-crate/Lib/test/test_exceptions.py", line 63, in testRaising
    pass
  File "/home/runner/work/RustPython/RustPython/vm/pylib-crate/Lib/test/test_exceptions.py", line 61, in testRaising
    marshal.loads(b'')
ValueError: EOF where object expected.

jakearmendariz · 2022-02-21T10:16:40Z

Changed to send an EOF error when an empty byte string is passed in!

youknowone

great! thank you very much!

jakearmendariz changed the title ~~Marshaling support for ints, floats, strs, lists~~ Marshaling support for ints, floats, strs, lists, dicts, tuples Jan 11, 2022

youknowone requested changes Jan 17, 2022

jakearmendariz force-pushed the main branch 2 times, most recently from 879043e to feb3b31 Compare January 27, 2022 19:31

Add marshalling support for ints, floats, strs, lists, dict

25d2426

youknowone force-pushed the main branch from feb3b31 to 630fe1b Compare February 6, 2022 05:14

jakearmendariz force-pushed the main branch from 630fe1b to 5924f56 Compare February 13, 2022 16:12

jakearmendariz force-pushed the main branch from 5924f56 to ca6b93c Compare February 19, 2022 00:29

Changes to code/style of marshaling module

502d5c2

jakearmendariz force-pushed the main branch from ca6b93c to 502d5c2 Compare February 21, 2022 10:14

youknowone approved these changes Feb 21, 2022

youknowone merged commit ef90d09 into RustPython:main Feb 21, 2022
