gh-132983: Don't allow trailer data in ZstdFile #133736

Merged
merged 1 commit into python:main
May 10, 2025

Rogdham
Contributor

@Rogdham Rogdham commented May 9, 2025

We previously made sure that decompress raises an exception when the compressed data is followed by trailing bytes:

>>> from compression.zstd import compress, decompress
>>> invalid = compress(b'xxx') + b'yyy'
>>> decompress(invalid)
Traceback (most recent call last):
  File "<python-input-2>", line 1, in <module>
    decompress(invalid)
    ~~~~~~~~~~^^^^^^^^^
  File "/redacted/Lib/compression/zstd/__init__.py", line 157, in decompress
    results.append(decomp.decompress(data))
                   ~~~~~~~~~~~~~~~~~^^^^^^
_zstd.ZstdError: Unable to decompress zstd data: Unknown frame descriptor

Indeed, the Zstandard specification says “Zstandard compressed data is made of one or more frames”, and it does not say that arbitrary data may be appended after the last frame.

However, ZstdFile / zstd.open currently do not raise in that case and silently drop the trailing bytes:

>>> from compression.zstd import ZstdFile
>>> from io import BytesIO
>>> ZstdFile(BytesIO(invalid)).read()
b'xxx'

After this PR, the last call becomes:

>>> ZstdFile(BytesIO(invalid)).read()
Traceback (most recent call last):
  File "<python-input-5>", line 1, in <module>
    ZstdFile(BytesIO(invalid)).read()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/redacted/Lib/compression/zstd/_zstdfile.py", line 176, in read
    return self._buffer.read(size)
           ~~~~~~~~~~~~~~~~~^^^^^^
  File "/redacted/Lib/compression/_common/_streams.py", line 118, in readall
    while data := self.read(sys.maxsize):
                  ~~~~~~~~~^^^^^^^^^^^^^
  File "/redacted/Lib/compression/_common/_streams.py", line 91, in read
    data = self._decompressor.decompress(rawblock, size)
_zstd.ZstdError: Unable to decompress zstd data: Unknown frame descriptor

@Rogdham Rogdham marked this pull request as ready for review May 9, 2025 09:36
@AA-Turner AA-Turner added the needs backport to 3.14 bugs and security fixes label May 9, 2025
@emmatyping
Member

The current behavior matches LZMA. Unlike decompress, which is necessarily handed a zstd stream of one or more frames, ZstdFile may be used to parse a format that carries additional information after a zstd stream.

>>> from lzma import LZMAFile, compress
>>> from io import BytesIO
>>> invalid = compress(b'foo') + b'bar'
>>> LZMAFile(BytesIO(invalid)).read()
b'foo'
>>>
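
For the use case of a container format that stores extra bytes after a compressed stream, a decompressor object can hand the trailer back explicitly instead of relying on the file wrapper to ignore it. The sketch below uses lzma.LZMADecompressor so that it runs on Python versions without compression.zstd; the helper name split_stream_and_trailer is hypothetical, and it assumes exactly one compressed stream followed by the trailer. compression.zstd.ZstdDecompressor documents the same eof and unused_data attributes, so the same pattern should carry over, but that is not exercised here.

```python
import lzma


def split_stream_and_trailer(blob: bytes) -> tuple[bytes, bytes]:
    """Decompress one XZ stream and return (payload, trailing bytes).

    Hypothetical helper: assumes `blob` holds a single complete stream,
    optionally followed by data that belongs to the enclosing format.
    """
    decomp = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
    payload = decomp.decompress(blob)
    if not decomp.eof:
        # The end-of-stream marker was never seen, so the stream is truncated.
        raise EOFError("compressed stream is truncated")
    # Bytes found after the end of the stream are kept in unused_data.
    return payload, decomp.unused_data


blob = lzma.compress(b"foo") + b"bar"
payload, trailer = split_stream_and_trailer(blob)
print(payload, trailer)
```

With this approach the caller decides what the trailer means, rather than having it dropped silently.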

@Rogdham
Contributor Author

Rogdham commented May 9, 2025

You are right: this is the case for LZMAFile with format FORMAT_AUTO (the default), and also for BZ2File.

However, LZMAFile with format FORMAT_XZ and GzipFile both raise an exception in that case.

>>> from lzma import LZMAFile, compress, FORMAT_XZ
>>> from io import BytesIO
>>> invalid = compress(b'foo') + b'bar'
>>> LZMAFile(BytesIO(invalid), format=FORMAT_XZ).read()
Traceback (most recent call last):
  File "<python-input-3>", line 1, in <module>
    LZMAFile(BytesIO(invalid), format=FORMAT_XZ).read()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/redacted/lzma.py", line 208, in read
    return self._buffer.read(size)
           ~~~~~~~~~~~~~~~~~^^^^^^
  File "/redacted/_compression.py", line 118, in readall
    while data := self.read(sys.maxsize):
                  ~~~~~~~~~^^^^^^^^^^^^^
  File "/redacted/_compression.py", line 99, in read
    raise EOFError("Compressed file ended before the "
                   "end-of-stream marker was reached")
EOFError: Compressed file ended before the end-of-stream marker was reached

Member

@emmatyping emmatyping left a comment

Okay, this looks good then!

@Rogdham
Contributor Author

Rogdham commented May 9, 2025

In addition, consider decompress(compress(b"xxx") + b"yyy"):

  • returns b"xxx" on: lzma (format FORMAT_AUTO), bz2
  • raises an exception on: lzma (format FORMAT_XZ), gzip

Since zstd's decompress already raises an exception in that case, I would suggest doing the same for ZstdFile, for consistency.
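
For anyone who wants to reproduce this comparison locally, here is a minimal sketch covering the modules that exist before Python 3.14. The helper name trailing_data_behaviour and the labels in the dictionary are hypothetical. The result depends on the trailer bytes (b"yyy" here, as above); other trailers may be treated differently, for example if they happen to look like the start of a valid stream.

```python
import bz2
import gzip
import lzma


def trailing_data_behaviour(compress, decompress, payload=b"xxx", trailer=b"yyy"):
    """Describe what `decompress` does with valid data followed by a trailer."""
    try:
        result = decompress(compress(payload) + trailer)
    except Exception as exc:
        return f"raises {type(exc).__name__}"
    return f"returns {result!r}"


results = {
    "bz2": trailing_data_behaviour(bz2.compress, bz2.decompress),
    "gzip": trailing_data_behaviour(gzip.compress, gzip.decompress),
    "lzma (FORMAT_AUTO)": trailing_data_behaviour(lzma.compress, lzma.decompress),
    "lzma (FORMAT_XZ)": trailing_data_behaviour(
        lzma.compress,
        lambda data: lzma.decompress(data, format=lzma.FORMAT_XZ),
    ),
}

for name, outcome in results.items():
    print(f"{name}: {outcome}")
```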

@AA-Turner AA-Turner merged commit 50b5370 into python:main May 10, 2025
48 checks passed
@miss-islington-app

Thanks @Rogdham for the PR, and @AA-Turner for merging it 🌮🎉. I'm working now to backport this PR to: 3.14.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request May 10, 2025
(cherry picked from commit 50b5370)

Co-authored-by: Rogdham <3994389+Rogdham@users.noreply.github.com>
@bedevere-app

bedevere-app bot commented May 10, 2025

GH-133799 is a backport of this pull request to the 3.14 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.14 bugs and security fixes label May 10, 2025
AA-Turner pushed a commit that referenced this pull request May 10, 2025
…133799)

gh-132983: Don't allow trailer data in ZstdFile (GH-133736)
(cherry picked from commit 50b5370)

Co-authored-by: Rogdham <3994389+Rogdham@users.noreply.github.com>
@Rogdham Rogdham deleted the zstdfile-trailer-exception branch May 10, 2025 06:45
3 participants