-
-
Notifications
You must be signed in to change notification settings - Fork 31.9k
bpo-21872: fix lzma library decompresses data incompletely #14048
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@serhiy-storchaka Any particular reason you asked for me specifically to review? I don't mind making an attempt, but I'm not especially familiar with the LZMA module - so I might miss something on the low-level (C code) side. |
I consider you the |
Not sure why! I haven't done anything in this area. Are you sure you aren't mixing me up with somebody else? But I'll be happy to take a look :-) |
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
I have made the requested changes; please review again. BTW, I checked compressor's code, it looks fine.
|
Thanks for making the requested changes! @vsajip: please review the changes made to this pull request. |
closing and reopening to re-trigger CI testing. |
Thanks @animalize for the PR, and @gpshead for merging it 🌮🎉.. I'm working now to backport this PR to: 3.7, 3.8. |
…14048) * 1. add test case with wrong behavior * 2. fix bug when max_length == -1 * 3. allow b"" as valid input data for decompress_buf() * 4. when max_length >= 0, let needs_input mechanism works * add more asserts to test case (cherry picked from commit 4ffd05d) Co-authored-by: animalize <animalize@users.noreply.github.com>
GH-16054 is a backport of this pull request to the 3.8 branch. |
GH-16055 is a backport of this pull request to the 3.7 branch. |
…14048) * 1. add test case with wrong behavior * 2. fix bug when max_length == -1 * 3. allow b"" as valid input data for decompress_buf() * 4. when max_length >= 0, let needs_input mechanism works * add more asserts to test case (cherry picked from commit 4ffd05d) Co-authored-by: animalize <animalize@users.noreply.github.com>
* 1. add test case with wrong behavior * 2. fix bug when max_length == -1 * 3. allow b"" as valid input data for decompress_buf() * 4. when max_length >= 0, let needs_input mechanism works * add more asserts to test case (cherry picked from commit 4ffd05d) Co-authored-by: animalize <animalize@users.noreply.github.com>
* 1. add test case with wrong behavior * 2. fix bug when max_length == -1 * 3. allow b"" as valid input data for decompress_buf() * 4. when max_length >= 0, let needs_input mechanism works * add more asserts to test case (cherry picked from commit 4ffd05d) Co-authored-by: animalize <animalize@users.noreply.github.com>
Why this bug happens?
lzma_stream
structure is used for passing input and output buffers to liblzma library:There is a possibility: when the input and output buffers are exhausted at the same time,
lzma_stream
's internal state may still hold a few bytes that are not output.When the last data chunk inputted, and this happened to happen, those bytes will never be output.
I modified Jeffrey Kintscher's code and
_lzmamodule.c
to test all attached bad files in issue21872, here is the statistic:0 missing bytes means: although input and output buffers are exhausted at the same time, but internal state doesn't hold any bytes that can be output.
Fix the bug
To fix this bug, just let these missing bytes be output.
There are two situations, need to be handled separately:
max_length < 0
, the output buffer can grow unlimitedly.max_length >= 0
, the max length of output buffer is specified.max_length
is a parameter ofLZMADecompressor.decompress(data, max_length=-1)
function.Commit 1, add test case with wrong behavior
ISSUE_21872_DAT
comes from:more_bad_lzma_files.zip
uploaded by vnummela07_21h_ticks.bi5
in the .zip fileThe file size is 2,835 bytes.
Commit 2, fix bug when max_length < 0
If
max_length < 0
, the output buffer can grow unlimitedly.Before this commit:
When input buffer exhausted, lzs's internal state may still hold a few bytes.
If no more input data, these bytes will never be output, and
d->eof
marker will never be set.Now let's first determine if the output buffer is exhausted.
If the output buffer is exhausted, we grow the output buffer and loop again, then all decompressed data can be output.
Commit 3, allow b"" as valid input data for decompress_buf()
This commit is a prerequisite for Commit 4.
This code doesn't allow empty bytes
b""
as a valid input data fordecompress_buf()
function:Above code was introduced in issue27517.
After reading issue27517's context and changeset, we can follow the similar code for function
compress()
:Above code has the same effect for issue27517's problem.
In addition it allows
b""
as a valid input data.FYI, liblzma's document about return code
LZMA_BUF_ERROR
:Long story short:
lzma_code()
, return this error code at the second call tolzma_code()
.Commit 4, when max_length >= 0, let needs_input mechanism works
This commit lets
needs_input
mechanism in functionDecompressReader.read(size=-1)
works.When
(lzs->avail_in==0 && lzs->avail_out==0)
, lzs's internal state may still have a few bytes can be output.If no more input data, these bytes will never be output, and
d->eof
marker will never be set.Set
needs_input = 0
, thenb""
will be passed as input data next time. So it has a chance to output these bytes.Some points for code review
1, Make sure
DecompressReader.read(size=-1)
function will not loop infinitely due toneeds_input
mechanism.2, Make sure
lzma.decompress(data, format=FORMAT_AUTO, memlimit=None, filters=None)
function will not loop infinitely.3, When
max_length == 0
(almost impossible situation),LZMADecompressor.decompress(data, max_length=-1)
function has expected behavior:As expected, return
LZMA_BUF_ERROR
when the second call tolzma_code()
.https://bugs.python.org/issue21872