Skip to content

Add set_pledged_input_size to ZstdCompressor #134938

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
emmatyping opened this issue May 30, 2025 · 7 comments
Closed

Add set_pledged_input_size to ZstdCompressor #134938

emmatyping opened this issue May 30, 2025 · 7 comments
Assignees
Labels
3.14 bugs and security fixes extension-modules C modules in the Modules dir stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@emmatyping
Copy link
Member

emmatyping commented May 30, 2025

Feature or enhancement

Proposal:

pyzstd's ZstdCompressor class had a method _set_pledged_input_size, which allowed users to set the amount of data they were going to write into a frame so it would be written into the frame header. We should support this use case in compresison.zstd.

I don't want to add a private API that is unsafe or only for advanced users, so I want to sketch out an implementation that could be used in general and catch incorrect usage:

  1. Update ZstdCompressor's struct to include two unsigned long long members current_frame_size and pledged_size, both initialized to ZSTD_CONTENTSIZE_UNKNOWN
  2. add set_pledged_size, the main difference from the pyzstd implementation is that it will update pledged_size
  3. modify ZstdCompressor's compress() and flush() to track how much data is being written to the compressor, written into current_frame_size. If the mode is FLUSH_FRAME then after writing, check that current_frame_size == pledged_size, otherwise raise a ZstdError to indicate the failure. Reset pledged_size and current_frame_size.

I think the one drawback of the above is it will notify the user if something goes wrong but if they are streaming compressed data elsewhere they could still send garbage if they use the API wrong. But that's inherently not something we can really fix.

An open question I have is should we check current_frame_size <= pledged_size at the end of writing when the mode isn't FLUSH_FRAME? I think probably yes?

cc @Rogdham, I'd be interested in your thoughts.

Has this already been discussed elsewhere?

I have already discussed this feature proposal on Discourse

Links to previous discussion of this feature:

https://discuss.python.org/t/pep-784-adding-zstandard-to-the-standard-library/87377/143

Linked PRs

@emmatyping emmatyping self-assigned this May 30, 2025
@emmatyping emmatyping added type-feature A feature request or enhancement stdlib Python modules in the Lib dir extension-modules C modules in the Modules dir 3.15 new features, bugs and security fixes labels May 30, 2025
@emmatyping
Copy link
Member Author

I added only the 3.15 label for now, but @zooba requested that @hugovk make an exception to include this for 3.14. I would be okay with that if we can agree the above implementation sounds reasonable.

@dholth
Copy link
Contributor

dholth commented May 30, 2025

In conda-package-handling we are using python-zstandard's stream_writer equivalent to compression.zstd.ZstdFile with pledged input size. This mattered because (at least python-zstandard) seemed to be eagerly allocating the maximum buffer on decompression which is too big at level 22.

@picnixz picnixz removed the 3.15 new features, bugs and security fixes label May 30, 2025
@picnixz
Copy link
Member

picnixz commented May 30, 2025

I added only the 3.15

When marked with type-feature, it's implicitly targetting main

@Rogdham
Copy link
Contributor

Rogdham commented May 31, 2025

Public API

With the safeguard of raising an exception if the size is not the right one in the end, I'm +1 on including the method as public API.

Initial values

Update ZstdCompressor's struct to include two unsigned long long members current_frame_size and pledged_size, both initialized to ZSTD_CONTENTSIZE_UNKNOWN

I think you want current_frame_size initialized to 0?

Catch incorrect usage

modify ZstdCompressor's compress() and flush() to track how much data is being written to the compressor […]

I think we don't do that in pyzstd, but still get an exception (coming from libzstd):

Too much data

>>> from pyzstd import ZstdCompressor
>>> c = ZstdCompressor()
>>> c._set_pledged_input_size(1)
>>> c.compress(b'aa')
Traceback (most recent call last):
  File "<python-input-3>", line 1, in <module>
    c.compress(b'aa')
    ~~~~~~~~~~^^^^^^^
pyzstd.ZstdError: Unable to compress zstd data: Src size is incorrect

Not enough data

>>> from pyzstd import ZstdCompressor
>>> c = ZstdCompressor()
>>> c._set_pledged_input_size(3)
>>> c.compress(b'aa')
b''
>>> c.flush()
Traceback (most recent call last):
  File "<python-input-4>", line 1, in <module>
    c.flush()
    ~~~~~~~^^
pyzstd.ZstdError: Unable to compress zstd data: Src size is incorrect

So I'm not sure if this is needed to implement it in code ourselves (if libzstd already does it).

Not enough data

An open question I have is should we check current_frame_size <= pledged_size at the end of writing when the mode isn't FLUSH_FRAME?

Based on my last point, we will get an error from libzstd in that case.

Moreover, from the standard specification, the value is expected to be the exact size:

Frame_Content_Size: This is the original (uncompressed) size.

I am personally in favor of raising an exception both if we have too much or not enough data.

Behavior and corner cases

My take on other misc points:

  • An exception is raised when set_pledged_size is called but not at a start of a frame.
  • The set_pledged_size must be called at the start of each frame, otherwise it goes back to default value.
  • set_pledged_size(0) sets the value to 0 whereas set_pledged_size(None) sets the value to ZSTD_CONTENTSIZE_UNKNOWN
  • It should be clear that users should use options={CompressionParameter.content_size_flag: False} if they don't want the size to be stored, because calling set_pledged_size(None) is not enough in some cases (e.g. comp.compress(..., ZstdCompressor.FLUSH_FRAME) called from the start of a frame)

@hugovk
Copy link
Member

hugovk commented May 31, 2025

I added only the 3.15 label for now, but @zooba requested that @hugovk make an exception to include this for 3.14. I would be okay with that if we can agree the above implementation sounds reasonable.

This makes sense to include in 3.14, please aim to merge in time for beta 3 (2025-06-17).

@hugovk hugovk added the 3.14 bugs and security fixes label May 31, 2025
@emmatyping
Copy link
Member Author

Great! I'll work on a PR for this a bit later today.

@dholth
Copy link
Contributor

dholth commented Jun 6, 2025

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.14 bugs and security fixes extension-modules C modules in the Modules dir stdlib Python modules in the Lib dir type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

5 participants