Skip to content

Conversation

thrau
Copy link
Member

@thrau thrau commented Aug 18, 2025

Motivation

Currently our decoders serve as adapter only to dill pickle. Because they type information is stored in the pickle, you can de-serialize a pickle and get the correct type without having to specify it at deserialization. Unfortunately this process breaks down when you've moved or renamed a class that you want to deserialize into.

So to be able to de-couple code and data serialization, and separate reader/writer schema, we need the decoder to be type-aware. The Encoder has to be type aware for certain operations as well, for example if we pass a TypedDict to encode, we won't know which type to encode since python doesn't retain type information for typed dicts. So this is just to make the encoding API a more flexible.

Changes

  • state Encoder and Decoder APIs now optionally take a expected_type argument that allows an underlying implementation to be specific about the python type to serialize or deserialize into

Copy link

github-actions bot commented Aug 18, 2025

Test Results - Preflight, Unit

22 158 tests  ±0   20 421 ✅ ±0   6m 23s ⏱️ -25s
     1 suites ±0    1 737 💤 ±0 
     1 files   ±0        0 ❌ ±0 

Results for commit b30e59a. ± Comparison against base commit 38aecc7.

♻️ This comment has been updated with latest results.

@thrau thrau added the semver: patch Non-breaking changes which can be included in patch releases label Aug 18, 2025
Copy link

github-actions bot commented Aug 18, 2025

LocalStack Community integration with Pro

    2 files  ±0      2 suites  ±0   1h 45m 15s ⏱️ + 3m 39s
4 632 tests ±0  4 189 ✅ ±0  443 💤 ±0  0 ❌ ±0 
4 634 runs  ±0  4 189 ✅ ±0  445 💤 ±0  0 ❌ ±0 

Results for commit b30e59a. ± Comparison against base commit 38aecc7.

♻️ This comment has been updated with latest results.

@thrau thrau force-pushed the storage-with-avro branch 2 times, most recently from cb430dd to b458e31 Compare August 19, 2025 23:14
@thrau thrau added the skip-docs Pull request does not require documentation changes label Aug 27, 2025
@thrau thrau force-pushed the storage-with-avro branch from b458e31 to 48c788c Compare August 27, 2025 10:12
Copy link

github-actions bot commented Aug 27, 2025

Test Results (amd64) - Acceptance

7 tests  ±0   5 ✅ ±0   3m 13s ⏱️ +6s
1 suites ±0   2 💤 ±0 
1 files   ±0   0 ❌ ±0 

Results for commit b30e59a. ± Comparison against base commit 38aecc7.

♻️ This comment has been updated with latest results.

Copy link

github-actions bot commented Aug 27, 2025

Test Results (amd64) - Integration, Bootstrap

    5 files      5 suites   2h 19m 56s ⏱️
5 003 tests 4 402 ✅ 601 💤 0 ❌
5 009 runs  4 402 ✅ 607 💤 0 ❌

Results for commit b30e59a.

♻️ This comment has been updated with latest results.

@thrau thrau force-pushed the storage-with-avro branch from 48c788c to b30e59a Compare August 27, 2025 15:46
@thrau thrau marked this pull request as ready for review August 28, 2025 11:15
Copy link
Member

@giograno giograno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

I believe we need to account for the writer schema to be provided to the decoder (as mentioned in the comment below). The "weird" failing test I showed you yesterday was due to the fact that we were providing the reader schema as the writer one (still not 100% sure how avro could convert it, though).

@@ -97,42 +97,48 @@ def __str__(self) -> str:


class Encoder:
def encodes(self, obj: Any) -> bytes:
def encodes(self, obj: Any, py_type: type = None) -> bytes:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def encodes(self, obj: Any, py_type: type = None) -> bytes:
def encodes(self, obj: Any, py_type: type | None = None) -> bytes:

small nit (applies everywhere)


def decode(self, file: IO[bytes]) -> Any:
def decode(self, file: IO[bytes], py_type: type = None) -> Any:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up on our discussion from yesterday, I believe we need to extend the decoder APIs and provide the writer schema.

The avro DatumReader needs both the writer and the reader schema (see here), and our initial prototype was wrongly providing the reader as the writer. If only the writer schema is provided, avro assumes the reader schema is the same.
Maybe we could also get along with a **kwargs for the initial iteration.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good callout! i think this is not the right level of abstraction to do this. adding **kwargs just obfuscates the fact that it at this point has become a leaky abstraction.

my suggestion would be to find a different way to hide this. for example, we can make sure in the encoding step that the writer schema is placed into the same file, and we read it first, and then read the data.

Copy link
Member

@giograno giograno Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! We can definitely save everything in a single file. avro supports this out of the box. I could easily implement it.

Since you mentioned leaky abstractions, couldn't we get rid of py_type as well?

Copy link
Member Author

@thrau thrau Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how would we derive the reader schema?

ad: i don't think i really fully understand. could you share an example?

Copy link
Member

@giograno giograno Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the encoder, we can just type(obj). For the encoder, we can read the type from the writer schema we stored in the binary data. The store class might be renamed, though, and we'd need to provide an alias.

@giograno giograno self-requested a review August 29, 2025 08:10
Copy link
Member

@giograno giograno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main comment was about the writer schema in the decoder signature.
As we discussed, we have been able to store it along with the serialized data in the avro implementation (see companion PR), so that's not valid anymore.

@thrau thrau merged commit ffe3e4f into main Aug 29, 2025
40 checks passed
@thrau thrau deleted the storage-with-avro branch August 29, 2025 10:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver: patch Non-breaking changes which can be included in patch releases skip-docs Pull request does not require documentation changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants