Skip to content

Implement PEP 784 - Adding Zstandard to the Python standard library #132983

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
5 of 11 tasks
emmatyping opened this issue Apr 26, 2025 · 13 comments
Open
5 of 11 tasks

Implement PEP 784 - Adding Zstandard to the Python standard library #132983

emmatyping opened this issue Apr 26, 2025 · 13 comments
Assignees
Labels
extension-modules C modules in the Modules dir stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@emmatyping
Copy link
Member

emmatyping commented Apr 26, 2025

Feature or enhancement

This is a tracking issue for implementing PEP 784. See the PEP text for more details.

Since the diff is significant (~10k lines) I wanted to split up the PRs a bit.

Implementation Plan:

  • Add compression module just re-exporting existing compression modules. Move the _compression module.
  • Add _zstd native module with Unix build config
  • Add Windows build config for _zstd (blocked on adding libzstd to cpython-source-deps) and SBOM config.
  • Add zstd Python module with tests
  • add NEWS/What's New section
  • Add documentation for zstd
  • Decide on whether or not to check parameter types for options
  • Verify style conformance of C and Python code
  • Refactor _train_dict and _finalize_dict to share common code
  • Refactor critical sections to follow style of having lock_held functions
  • Improve error message when options value can't be converted to int

Linked PRs

@emmatyping emmatyping added extension-modules C modules in the Modules dir stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Apr 26, 2025
@emmatyping emmatyping self-assigned this Apr 26, 2025
@picnixz
Copy link
Member

picnixz commented Apr 26, 2025

It's something I should have asked on discourse, but is compression the final name of the public module? I see in the PEP that we want to mimic hashlib.algorithms_guaranteed as well and one way to mitigate the name shadowing is to actually use compressionlib as the name (we have hashlib, annotationlib, and we could have compressionlib as well; I also think compressionlib is a less likely name to be chosen but it's not the best one as well). Feel free to ignore this question as the PEP has been accepted.

@emmatyping
Copy link
Member Author

It's something I should have asked on discourse, but is compression the final name of the public module?

Yes, the decision was to use compression. A search of Github code showed that it was used very infrequently. Literally one in a million files https://discuss.python.org/t/pep-784-adding-zstandard-to-the-standard-library/87377/66

gpshead pushed a commit that referenced this issue Apr 27, 2025
…dule (GH-133018)

* Introduces `compression` package for https://peps.python.org/pep-0784/

This commit introduces the `compression` package, specified in PEP 784
to re-export the `lzma`, `bz2`, `gzip`, and `zlib` modules. Introduction
of `compression.zstd` will be completed in a future commit once the
`_zstd` module is merged.

This commit also moves the `_compression` private module to
`compression._common._streams`.

* Re-exports existing module docstrings.
freakboy3742 pushed a commit that referenced this issue Apr 28, 2025
Include compression package contents as part of installs.
vstinner added a commit to vstinner/cpython that referenced this issue Apr 28, 2025
Add an __init__ file. Fix test_tools.test_freeze.
vstinner added a commit that referenced this issue Apr 28, 2025
Add an "__init__.py" file. Fix test_tools.test_freeze.
gpshead pushed a commit that referenced this issue May 4, 2025
* Add _zstd module for https://peps.python.org/pep-0784/

This commit introduces the `_zstd` module, with bindings to libzstd from
the pyzstd project. It also includes the unix build system configuration.
Windows build system support will be integrated independently as it
depends on integration with cpython-source-deps.

* Add _zstd to modules

* Fix path for compression.zstd module

* Ignore _zstd module like _io

* Expand module state macros to improve code quality

Also removes module state references from the classes in the _zstd
module and instead uses PyType_GetModuleState()

* Remove backticks suggested in review

Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>

* Use critical sections to lock object state

This should avoid races and deadlocks.

* Remove compress/decompress and mark module as not reliant on the GIL

The `compress`/`decompress` functions will be moved to Python code for simplicity.
C implementations can always be re-added in the future.

Also, mark _zstd as not requiring the GIL.

* Lift critical section to avoid clang warning

* Respond to comments by picnixz

* Call out pyzstd explicitly in license description

Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>

* Use a much more robust implementation...

... for `get_zstd_state_from_type`

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>

* Use PyList_GetItemRef for thread safety purposes

* Use a macro for the minimum supported version

* remove const from primivite types

* Use PyMem_New in another spot

* Simplify error handling in _get_frame_size

* Another simplification of error handling in get_frame_info

* Rename _module_state to mod_state

* Rewrite comment explaining the context of the code

* Add link to pyzstd

* Add TODO about refactoring dict training code

* Use PyModule_AddObjectRef over PyModule_AddObject

PyModule_AddObject is soft-deprecated, so we should use PyModule_AddObjectRef

* Check result of OutputBufferGrow

* Simplify return logic in `add_constant_to_type`

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>

* Ignore return value of _zstd_clear()

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>

* Remove redundant comments

* Remove __reduce__ from ZstdDict

We should instead document that to pickle a dictionary a user should use
the `.dict_content` attribute.

* Use PyUnicode_FromFormat instead of a buffer

* Don't use C constants/types in error messages

* Make error messages easier to understand for Python users

* Lower minimum required version 1.4.0

* Use casts and make slot function signatures correct

* Be consistent with CPython on const usage

* Make else clauses in line with PEP 7

* Fix over-indented blocks in argument clinic

* Add critical section around ZSTD_DCtx_setParameter

* Add a TODO about refactoring critical sections

* Use Py_UNREACHABLE

* Move bytes operations out of Py_BEGIN_ALLOW_THREADS

* Add TODO about ensuring a lock is held

* Remove asserts that may not be correct

* Add TODO to make ZstdDict and others GC objects

* Make objects GC tracked

* Remove unused include

* Fix some memory issues

* Fix refleaks on module and in ZstdDict

* Update configure to check for ZDICT_finalizeDictionary

* Properly check version in configure

* exit(1) if check fails

* Use AC_RUN_IFELSE

* Use a define() to re-use version check

* Actually properly set _zstd module status based on version

---------

Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
@AA-Turner
Copy link
Member

I've had a go at adding support for building on Windows in #133366, let me know thoughts.

A

diegorusso added a commit to diegorusso/cpython that referenced this issue May 4, 2025
* origin/main: (111 commits)
  pythongh-91048: Add filename and line number to external inspection routines (pythonGH-133385)
  pythongh-131178: Add tests for `ast` command-line interface (python#133329)
  Regenerate pcbuild.sln in Visual Studio 2022 (python#133394)
  pythongh-133042: disable HACL* HMAC on Emscripten (python#133064)
  pythongh-133351: Fix remote PDB's multi-line block tab completion (python#133387)
  pythongh-109700: Improve stress tests for interpreter creation (pythonGH-109946)
  pythongh-81793: Skip tests for os.link() to symlink on Android (pythonGH-133388)
  pythongh-126835: Rename `ast_opt.c` to `ast_preprocess.c` and related stuff after moving const folding to the peephole optimizier (python#131830)
  pythongh-91048: Relax test_async_global_awaited_by to fix flakyness (python#133368)
  pythongh-132457: make staticmethod and classmethod generic (python#132460)
  pythongh-132805: annotationlib: Fix handling of non-constant values in FORWARDREF (python#132812)
  pythongh-132426: Add get_annotate_from_class_namespace replacing get_annotate_function (python#132490)
  pythongh-81793: Always call linkat() from os.link(), if available (pythonGH-132517)
  pythongh-122559: Synchronize C and Python implementation of the io module about pickling (pythonGH-122628)
  pythongh-69605: Add PyREPL import autocomplete feature to 'What's New' (python#133358)
  bpo-44172: Keep reference to original window in curses subwindow objects (pythonGH-26226)
  pythonGH-133231: Changes to executor management to support proposed `sys._jit` module (pythonGH-133287)
  pythongh-133363: Fix Cmd completion for lines beginning with `! ` (python#133364)
  pythongh-132983: Introduce `_zstd` bindings module (pythonGH-133027)
  pythonGH-91048: Add utils for printing the call stack for asyncio tasks (python#133284)
  ...
@ned-deily
Copy link
Member

AA-Turner added a commit that referenced this issue May 6, 2025
Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
Co-authored-by: Gregory P. Smith <greg@krypto.org>
Co-authored-by: Tomas R. <tomas.roun8@gmail.com>
Co-authored-by: Rogdham <contact@rogdham.net>
@AA-Turner
Copy link
Member

Windows build support, autoconf fix, and the Python wrapper now all merged. I think that should be everything required for b1, but we could use this issue to track the outstanding items from #133365.

Thanks!

A

@emmatyping
Copy link
Member Author

Windows build support, autoconf fix, and the Python wrapper now all merged. I think that should be everything required for b1, but we could use this issue to track the outstanding items from #133365.

I went ahead and added checkboxes for all of the TODOs left after review of both #133365 and #133027, as well as any TODO comments not discussed there. If I missed anything please leave a comment or edit the OP.

miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 9, 2025
…-133785)

(cherry picked from commit 98e2c3a)

Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
AA-Turner added a commit that referenced this issue May 9, 2025
…) (#133788)

GH-132983: remove empty_bytes from _zstd module state (GH-133785)
(cherry picked from commit 98e2c3a)

Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
@Rogdham
Copy link
Contributor

Rogdham commented May 9, 2025

re the race in compression

Could it be the following?

  • By sharing the ZstdCompressor instance between the threads, we share the compression context of type ZSTD_CCtx
  • From a comment in zstd.h:
    • “since v1.3.0, ZSTD_CStream and ZSTD_CCtx are the same thing”
    • “For parallel execution, use one separate ZSTD_CStream per thread.”
Details

From below, it seems that ZSTD_compressStream_generic line 6165 is writing data at the location of zcs->inBuff (zcs being the compression context), so if another threads does the same or set it to null, we have the issue.

ThreadSanitizer:DEADLYSIGNAL
    #0 <null> <null> (libc.so.6+0x16c5c7) (BuildId: d056ce83eebe65ce7e52ecfa5af5363e4863d283)
    #1 memcpy <null> (python+0xe8c98) (BuildId: dea998379ba126e7933ccc59c8ddb0390562739e)
    #2 ZSTD_limitCopy /usr/src/debug/zstd/zstd-1.5.7/lib/compress/../common/zstd_internal.h:252:9 (libzstd.so.1+0x26eae) (BuildId: 9723b93a8052010d25908aaa6174df6de760859a)
    #3 ZSTD_compressStream_generic /usr/src/debug/zstd/zstd-1.5.7/lib/compress/zstd_compress.c:6165:39 (libzstd.so.1+0x26eae)
    #4 ZSTD_compressStream2 /usr/src/debug/zstd/zstd-1.5.7/lib/compress/zstd_compress.c:6540:5 (libzstd.so.1+0x26eae)
    #5 compress_impl /redacted/./Modules/_zstd/compressor.c:453:20 (_zstd.cpython-315t-x86_64-linux-gnu.so+0x77e0) (BuildId: 74b337c2def9ea3fa795cafd5a7f41ce493b69f9)
    #6 _zstd_ZstdCompressor_compress_impl /redacted/./Modules/_zstd/compressor.c:591:15 (_zstd.cpython-315t-x86_64-linux-gnu.so+0x8cbf) (BuildId: 74b337c2def9ea3fa795cafd5a7f41ce493b69f9)
    #7 _zstd_ZstdCompressor_compress /redacted/./Modules/_zstd/clinic/compressor.c.h:171:20 (_zstd.cpython-315t-x86_64-linux-gnu.so+0x8cbf)
<redacted>

miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 10, 2025
…-133791)

(cherry picked from commit 1978904)

Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
AA-Turner added a commit that referenced this issue May 10, 2025
…) (#133792)

GH-132983: PEP 7 and Argument Clinic changes for zstd (GH-133791)
(cherry picked from commit 1978904)

Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 10, 2025
(cherry picked from commit 50b5370)

Co-authored-by: Rogdham <3994389+Rogdham@users.noreply.github.com>
AA-Turner pushed a commit that referenced this issue May 10, 2025
…133799)

gh-132983: Don't allow trailer data in ZstdFile (GH-133736)
(cherry picked from commit 50b5370)

Co-authored-by: Rogdham <3994389+Rogdham@users.noreply.github.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 10, 2025
(cherry picked from commit 1a548c0)

Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
AA-Turner added a commit that referenced this issue May 10, 2025
…133854)

gh-132983: Reduce the size of ``_zstdmodule.h`` (GH-133793)
(cherry picked from commit 1a548c0)

Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 10, 2025
(cherry picked from commit 1a87b6e)

Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
AA-Turner added a commit that referenced this issue May 10, 2025
gh-132983: Make zstd types immutable (GH-133784)
(cherry picked from commit 1a87b6e)

Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
AA-Turner added a commit that referenced this issue May 11, 2025
Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
Co-authored-by: Peter Bierma <zintensitydev@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 11, 2025
…nGH-133856)

(cherry picked from commit 878e0fb)

Co-authored-by: Rogdham <3994389+Rogdham@users.noreply.github.com>
Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
Co-authored-by: Peter Bierma <zintensitydev@gmail.com>
AA-Turner added a commit that referenced this issue May 11, 2025
…33856) (#133859)

gh-132983: Remove leftovers from EndlessZstdDecompressor (GH-133856)
(cherry picked from commit 878e0fb)

Co-authored-by: Rogdham <3994389+Rogdham@users.noreply.github.com>
Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
Co-authored-by: Peter Bierma <zintensitydev@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
extension-modules C modules in the Modules dir stdlib Python modules in the Lib dir type-feature A feature request or enhancement
Projects
Status: No status
Development

No branches or pull requests

6 participants
@Rogdham @ned-deily @AA-Turner @emmatyping @picnixz and others