Add set_pledged_input_size to ZstdCompressor #134938

emmatyping · 2025-05-30T15:40:40Z

Feature or enhancement

Proposal:

pyzstd's ZstdCompressor class had a method _set_pledged_input_size, which allowed users to set the amount of data they were going to write into a frame so it would be written into the frame header. We should support this use case in compression.zstd.

I don't want to add a private API that is unsafe or only for advanced users, so I want to sketch out an implementation that could be used in general and catch incorrect usage:

Update ZstdCompressor's struct to include two unsigned long long members, current_frame_size and pledged_size, both initialized to ZSTD_CONTENTSIZE_UNKNOWN.
Add set_pledged_size. The main difference from the pyzstd implementation is that it will update pledged_size.
Modify ZstdCompressor's compress() and flush() to track how much data is being written to the compressor, accumulating the total in current_frame_size. If the mode is FLUSH_FRAME, then after writing, check that current_frame_size == pledged_size, and otherwise raise a ZstdError to indicate the failure. Then reset pledged_size and current_frame_size.
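The bookkeeping in the steps above can be modelled in a few lines of pure Python. This is only a sketch of the idea, not the C implementation: FrameSizeTracker, record_write, PledgeError, and the mode constants are hypothetical names invented for illustration, and the counter is assumed to start at 0 so that adding byte counts to it is meaningful.

```python
# Hypothetical pure-Python model of the proposed C-level bookkeeping.
ZSTD_CONTENTSIZE_UNKNOWN = 2**64 - 1  # same value as (0ULL - 1) in zstd.h

CONTINUE, FLUSH_BLOCK, FLUSH_FRAME = range(3)  # stand-ins for the compressor modes


class PledgeError(Exception):
    """Stand-in for ZstdError in this sketch."""


class FrameSizeTracker:
    def __init__(self):
        self.current_frame_size = 0  # assumption: count input bytes from 0
        self.pledged_size = ZSTD_CONTENTSIZE_UNKNOWN

    def set_pledged_size(self, size):
        self.pledged_size = size

    def record_write(self, nbytes, mode):
        """Account for nbytes of input passed in with the given mode."""
        self.current_frame_size += nbytes
        if mode != FLUSH_FRAME:
            return
        written, pledged = self.current_frame_size, self.pledged_size
        # The frame is over either way, so reset before reporting a mismatch.
        self.current_frame_size = 0
        self.pledged_size = ZSTD_CONTENTSIZE_UNKNOWN
        if pledged != ZSTD_CONTENTSIZE_UNKNOWN and written != pledged:
            raise PledgeError(f"pledged {pledged} bytes but wrote {written}")


tracker = FrameSizeTracker()
tracker.set_pledged_size(3)
tracker.record_write(2, CONTINUE)
try:
    tracker.record_write(0, FLUSH_FRAME)
except PledgeError as exc:
    print(exc)  # pledged 3 bytes but wrote 2
```

Resetting both fields before raising means a failed frame does not leak its pledge into the next one, which is one possible reading of "Reset pledged_size and current_frame_size".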

I think the one drawback of the above is that, although it will notify the user if something goes wrong, users who are streaming compressed data elsewhere could still send garbage if they use the API incorrectly. But that's inherently not something we can really fix.

An open question I have is should we check current_frame_size <= pledged_size at the end of writing when the mode isn't FLUSH_FRAME? I think probably yes?

cc @Rogdham, I'd be interested in your thoughts.

Has this already been discussed elsewhere?

I have already discussed this feature proposal on Discourse

Links to previous discussion of this feature:

https://discuss.python.org/t/pep-784-adding-zstandard-to-the-standard-library/87377/143

Linked PRs

emmatyping · 2025-05-30T15:44:28Z

I added only the 3.15 label for now, but @zooba requested that @hugovk make an exception to include this for 3.14. I would be okay with that if we can agree the above implementation sounds reasonable.

dholth · 2025-05-30T20:37:09Z

In conda-package-handling we use python-zstandard's stream_writer, the equivalent of compression.zstd.ZstdFile, with a pledged input size. This mattered because python-zstandard, at least, seemed to eagerly allocate the maximum buffer on decompression, which is too big at level 22.

picnixz · 2025-05-30T21:03:17Z

I added only the 3.15

When marked with type-feature, it's implicitly targeting main.

Rogdham · 2025-05-31T08:06:26Z

Public API

With the safeguard of raising an exception if the size is not the right one in the end, I'm +1 on including the method as public API.

Initial values

Update ZstdCompressor's struct to include two unsigned long long members current_frame_size and pledged_size, both initialized to ZSTD_CONTENTSIZE_UNKNOWN

I think you want current_frame_size initialized to 0?
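A short arithmetic check shows why the starting value matters. In zstd.h, ZSTD_CONTENTSIZE_UNKNOWN is defined as (0ULL - 1), the largest unsigned 64-bit value, so a byte counter that starts there wraps around on the first addition. The helper add_u64 below is a hypothetical stand-in for C unsigned long long addition.

```python
ZSTD_CONTENTSIZE_UNKNOWN = 2**64 - 1  # (0ULL - 1) in zstd.h
MASK64 = 2**64 - 1


def add_u64(a, b):
    """Add two values the way C unsigned long long arithmetic would."""
    return (a + b) & MASK64


# Counter started at the sentinel: writing 5 bytes wraps around to 4.
wrapped = add_u64(ZSTD_CONTENTSIZE_UNKNOWN, 5)
# Counter started at 0: writing 5 bytes gives 5.
counted = add_u64(0, 5)
print(wrapped, counted)  # 4 5
```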

Catch incorrect usage

modify ZstdCompressor's compress() and flush() to track how much data is being written to the compressor […]

I think we don't do that in pyzstd, but still get an exception (coming from libzstd):

Too much data

>>> from pyzstd import ZstdCompressor
>>> c = ZstdCompressor()
>>> c._set_pledged_input_size(1)
>>> c.compress(b'aa')
Traceback (most recent call last):
  File "<python-input-3>", line 1, in <module>
    c.compress(b'aa')
    ~~~~~~~~~~^^^^^^^
pyzstd.ZstdError: Unable to compress zstd data: Src size is incorrect

Not enough data

>>> from pyzstd import ZstdCompressor
>>> c = ZstdCompressor()
>>> c._set_pledged_input_size(3)
>>> c.compress(b'aa')
b''
>>> c.flush()
Traceback (most recent call last):
  File "<python-input-4>", line 1, in <module>
    c.flush()
    ~~~~~~~^^
pyzstd.ZstdError: Unable to compress zstd data: Src size is incorrect

So I'm not sure we need to implement this check ourselves if libzstd already does it.

Not enough data

An open question I have is should we check current_frame_size <= pledged_size at the end of writing when the mode isn't FLUSH_FRAME?

Based on my last point, we will get an error from libzstd in that case.

Moreover, according to the format specification, the value is expected to be the exact size:

Frame_Content_Size: This is the original (uncompressed) size.

I am personally in favor of raising an exception whether we have too much or not enough data.
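To make the Frame_Content_Size field concrete, here is a minimal sketch of reading it back out of a frame header, following the header layout in the Zstandard format specification (RFC 8878). The function name frame_content_size is hypothetical, and the sketch reads only the header; it does not validate the rest of the frame.

```python
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # magic number 0xFD2FB528, little-endian


def frame_content_size(frame: bytes):
    """Return the Frame_Content_Size stored in a Zstandard frame header,
    or None when the header does not store one."""
    if len(frame) < 5 or frame[:4] != ZSTD_MAGIC:
        raise ValueError("not a Zstandard frame")
    descriptor = frame[4]                   # Frame_Header_Descriptor
    fcs_flag = descriptor >> 6              # bits 7-6
    single_segment = (descriptor >> 5) & 1  # bit 5
    dict_id_flag = descriptor & 3           # bits 1-0

    pos = 5
    if not single_segment:
        pos += 1                            # skip Window_Descriptor
    pos += (0, 1, 2, 4)[dict_id_flag]       # skip Dictionary_ID

    if fcs_flag == 0:
        if not single_segment:
            return None                     # size not stored in this frame
        fcs_len = 1
    else:
        fcs_len = 1 << fcs_flag             # 2, 4, or 8 bytes
    field = frame[pos:pos + fcs_len]
    if len(field) != fcs_len:
        raise ValueError("truncated frame header")
    value = int.from_bytes(field, "little")
    if fcs_len == 2:
        value += 256                        # 2-byte encoding stores size - 256
    return value
```

A decoder that trusts this field can size its output buffer up front, which is why a pledged size that disagrees with the actual input has to be treated as an error rather than a hint.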

Behavior and corner cases

My take on other misc points:

An exception is raised when set_pledged_size is called anywhere other than at the start of a frame.
set_pledged_size must be called at the start of each frame; otherwise the pledged size goes back to its default value.
set_pledged_size(0) sets the value to 0 whereas set_pledged_size(None) sets the value to ZSTD_CONTENTSIZE_UNKNOWN
It should be clear that users should use options={CompressionParameter.content_size_flag: False} if they don't want the size to be stored, because calling set_pledged_size(None) is not enough in some cases (e.g. comp.compress(..., ZstdCompressor.FLUSH_FRAME) called from the start of a frame)
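The first three rules above can be sketched as a small state machine. This is a hypothetical model, not the eventual API: the class name PledgeRules, the write() method, and the choice of ValueError are all assumptions made for illustration, and the content_size_flag option from the last rule is not modelled.

```python
ZSTD_CONTENTSIZE_UNKNOWN = 2**64 - 1  # (0ULL - 1) in zstd.h


class PledgeRules:
    """Hypothetical model of when a pledge may be set and what it stores."""

    def __init__(self):
        self.at_frame_start = True
        self.pledged_size = ZSTD_CONTENTSIZE_UNKNOWN

    def set_pledged_size(self, size):
        if not self.at_frame_start:
            # Assumption: the exception type is not fixed by the discussion.
            raise ValueError("set_pledged_size must be called at the start of a frame")
        # None means "unknown"; 0 is a real pledge of an empty frame.
        self.pledged_size = ZSTD_CONTENTSIZE_UNKNOWN if size is None else size

    def write(self, end_frame=False):
        """Record a compress()/flush() call; end_frame models FLUSH_FRAME."""
        self.at_frame_start = end_frame
        if end_frame:
            # A pledge applies to one frame only, so fall back to the default.
            self.pledged_size = ZSTD_CONTENTSIZE_UNKNOWN


rules = PledgeRules()
rules.set_pledged_size(0)     # pledges an empty frame
rules.set_pledged_size(None)  # back to "unknown"
rules.write()                 # now in the middle of a frame
try:
    rules.set_pledged_size(10)
except ValueError as exc:
    print(exc)
```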

hugovk · 2025-05-31T15:40:20Z

I added only the 3.15 label for now, but @zooba requested that @hugovk make an exception to include this for 3.14. I would be okay with that if we can agree the above implementation sounds reasonable.

This makes sense to include in 3.14, please aim to merge in time for beta 3 (2025-06-17).

emmatyping · 2025-05-31T15:52:37Z

Great! I'll work on a PR for this a bit later today.

dholth · 2025-06-06T19:43:43Z

Thanks!

emmatyping self-assigned this May 30, 2025

emmatyping added the type-feature (A feature request or enhancement), stdlib (Python modules in the Lib dir), extension-modules (C modules in the Modules dir), and 3.15 (new features, bugs and security fixes) labels May 30, 2025

picnixz removed the 3.15 (new features, bugs and security fixes) label May 30, 2025

hugovk added the 3.14 (bugs and security fixes) label May 31, 2025

bedevere-app bot mentioned this issue Jun 1, 2025

gh-134938: Add set_pledged_input_size to ZstdCompressor #135010

Merged

serhiy-storchaka pushed a commit that referenced this issue Jun 5, 2025

gh-134938: Add set_pledged_input_size() to ZstdCompressor (GH-135010)

4b44b34

miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jun 5, 2025

pythongh-134938: Add set_pledged_input_size() to ZstdCompressor (pyth…

ce208e5

…onGH-135010) (cherry picked from commit 4b44b34) Co-authored-by: Emma Smith <emma@emmatyping.dev>

bedevere-app bot mentioned this issue Jun 5, 2025

[3.14] gh-134938: Add set_pledged_input_size() to ZstdCompressor (GH-135010) #135173

Merged

serhiy-storchaka pushed a commit that referenced this issue Jun 5, 2025

[3.14] gh-134938: Add set_pledged_input_size() to ZstdCompressor (GH-…

5b39741

…135010) (GH-135173) (cherry picked from commit 4b44b34) Co-authored-by: Emma Smith <emma@emmatyping.dev>

hugovk closed this as completed Jun 6, 2025

github-project-automation bot moved this to Done in lavitaconnect@MOSTAFAAMMER Jun 6, 2025
