-
-
Notifications
You must be signed in to change notification settings - Fork 4.2k
extend the Encoder/Decoder state API to be type-aware #13017
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cb430dd
to
b458e31
Compare
b458e31
to
48c788c
Compare
Test Results (amd64) - Integration, Bootstrap 5 files 5 suites 2h 19m 56s ⏱️ Results for commit b30e59a. ♻️ This comment has been updated with latest results. |
48c788c
to
b30e59a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
I believe we need to account for the writer schema to be provided to the decoder (as mentioned in the comment below). The "weird" failing test I showed you yesterday was due to the fact that we were providing the reader schema as the writer one (still not 100% sure how avro
could convert it, though).
@@ -97,42 +97,48 @@ def __str__(self) -> str: | |||
|
|||
|
|||
class Encoder: | |||
def encodes(self, obj: Any) -> bytes: | |||
def encodes(self, obj: Any, py_type: type = None) -> bytes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def encodes(self, obj: Any, py_type: type = None) -> bytes: | |
def encodes(self, obj: Any, py_type: type | None = None) -> bytes: |
small nit (applies everywhere)
|
||
def decode(self, file: IO[bytes]) -> Any: | ||
def decode(self, file: IO[bytes], py_type: type = None) -> Any: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following up on our discussion from yesterday, I believe we need to extend the decoder APIs and provide the writer schema.
The avro
DatumReader
needs both the writer and the reader schema (see here), and our initial prototype was wrongly providing the reader as the writer. If only the writer schema is provided, avro
assumes the reader schema is the same.
Maybe we could also get along with a **kwargs
for the initial iteration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good callout! i think this is not the right level of abstraction to do this. adding **kwargs just obfuscates the fact that it at this point has become a leaky abstraction.
my suggestion would be to find a different way to hide this. for example, we can make sure in the encoding step that the writer schema is placed into the same file, and we read it first, and then read the data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! We can definitely save everything in a single file. avro
supports this out of the box. I could easily implement it.
Since you mentioned leaky abstractions, couldn't we get rid of py_type
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how would we derive the reader schema?
ad: i don't think i really fully understand. could you share an example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the encoder, we can just type(obj)
. For the encoder, we can read the type from the writer schema we stored in the binary data. The store class might be renamed, though, and we'd need to provide an alias.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main comment was about the writer schema in the decoder signature.
As we discussed, we have been able to store it along with the serialized data in the avro
implementation (see companion PR), so that's not valid anymore.
Motivation
Currently our decoders serve as adapter only to dill pickle. Because they type information is stored in the pickle, you can de-serialize a pickle and get the correct type without having to specify it at deserialization. Unfortunately this process breaks down when you've moved or renamed a class that you want to deserialize into.
So to be able to de-couple code and data serialization, and separate reader/writer schema, we need the decoder to be type-aware. The Encoder has to be type aware for certain operations as well, for example if we pass a TypedDict to encode, we won't know which type to encode since python doesn't retain type information for typed dicts. So this is just to make the encoding API a more flexible.
Changes