Skip to content

bpo-21872: fix lzma library decompresses data incompletely #14048

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from Sep 12, 2019
Merged

bpo-21872: fix lzma library decompresses data incompletely #14048

merged 6 commits into from Sep 12, 2019

Conversation

ghost
Copy link

@ghost ghost commented Jun 13, 2019

Why this bug happens?

lzma_stream structure is used for passing input and output buffers to liblzma library:

typedef struct {
	const uint8_t *next_in; /* Pointer to the next input byte. */
	size_t avail_in;    /* Number of available input bytes in next_in. */
	uint64_t total_in;  /* Total number of bytes read by liblzma. */

	uint8_t *next_out;  /* Pointer to the next output position. */
	size_t avail_out;   /* Amount of free space in next_out. */
	uint64_t total_out; /* Total number of bytes written by liblzma. */
    ...
    lzma_internal *internal; /* Internal state is not visible to applications. */
    ...
}

There is a possibility: when the input and output buffers are exhausted at the same time, lzma_stream's internal state may still hold a few bytes that are not output.
When the last data chunk inputted, and this happened to happen, those bytes will never be output.

I modified Jeffrey Kintscher's code and _lzmamodule.c to test all attached bad files in issue21872, here is the statistic:

The first column is the number of missing bytes.
The second column is number of occurrences.

 2 bytes: 96
 3 bytes: 72
11 bytes: 57
 1 bytes: 55
 6 bytes: 55
 5 bytes: 52
 0 bytes: 49  <- please note
10 bytes: 46
 4 bytes: 42
 9 bytes: 39
12 bytes: 36
 7 bytes: 32
 8 bytes: 26
14 bytes: 15
13 bytes: 14
15 bytes: 4
18 bytes: 1
27 bytes: 1

0 missing bytes means: although input and output buffers are exhausted at the same time, but internal state doesn't hold any bytes that can be output.

Fix the bug

To fix this bug, just let these missing bytes be output.

There are two situations, need to be handled separately:

  • when max_length < 0, the output buffer can grow unlimitedly.
  • when max_length >= 0, the max length of output buffer is specified.

max_length is a parameter of LZMADecompressor.decompress(data, max_length=-1) function.

Commit 1, add test case with wrong behavior

ISSUE_21872_DAT comes from:

The file size is 2,835 bytes.

Commit 2, fix bug when max_length < 0

If max_length < 0, the output buffer can grow unlimitedly.

Before this commit:

    for (;;) {
        lzret = lzma_code(lzs, LZMA_RUN);
        ...
        if (lzret == LZMA_STREAM_END) {   // stream finished
            d->eof = 1;
            break;
        } else if (lzs->avail_in == 0) {  // input buffer exhausted
            break;
        } else if (lzs->avail_out == 0) { // output buffer exhausted
            // when max_length < 0, output buffer can grow unlimitedly
            if (data_size == max_length)
                break;
            grow_output_buffer()
        }
    }

When input buffer exhausted, lzs's internal state may still hold a few bytes.
If no more input data, these bytes will never be output, and d->eof marker will never be set.

Now let's first determine if the output buffer is exhausted.
If the output buffer is exhausted, we grow the output buffer and loop again, then all decompressed data can be output.

Commit 3, allow b"" as valid input data for decompress_buf()

This commit is a prerequisite for Commit 4.

This code doesn't allow empty bytes b"" as a valid input data for decompress_buf() function:

    if (lzs->avail_in == 0)
        return PyBytes_FromStringAndSize(NULL, 0);

Above code was introduced in issue27517.

After reading issue27517's context and changeset, we can follow the similar code for function compress():

    lzret = lzma_code(lzs, LZMA_RUN);
    ...
+   if (lzret == LZMA_BUF_ERROR && lzs->avail_in == 0 && lzs->avail_out > 0)
+      lzret = LZMA_OK; /* That wasn't a real error */

Above code has the same effect for issue27517's problem.
In addition it allows b"" as a valid input data.

FYI, liblzma's document about return code LZMA_BUF_ERROR:

	LZMA_BUF_ERROR          = 10,
    /*
     * brief       No progress is possible
     *
     * This error code is returned when the coder cannot consume
     * any new input and produce any new output. The most common
     * reason for this error is that the input stream being
     * decoded is truncated or corrupt.
     *
     * This error is not fatal. Coding can be continued normally
     * by providing more input and/or more output space, if
     * possible.
     *
     * Typically the first call to lzma_code() that can do no
     * progress returns LZMA_OK instead of LZMA_BUF_ERROR. Only
     * the second consecutive call doing no progress will return
     * LZMA_BUF_ERROR. This is intentional.
     *
     * With zlib, Z_BUF_ERROR may be returned even if the
     * application is doing nothing wrong, so apps will need
     * to handle Z_BUF_ERROR specially. The above hack
     * guarantees that liblzma never returns LZMA_BUF_ERROR
     * to properly written applications unless the input file
     * is truncated or corrupt. This should simplify the
     * applications a little.
     */

Long story short:

  • When pass zero-space input or/and output buffers to lzma_code(), return this error code at the second call to lzma_code().
  • This error is not fatal, we can ignore it.

Commit 4, when max_length >= 0, let needs_input mechanism works

This commit lets needs_input mechanism in function DecompressReader.read(size=-1) works.

When (lzs->avail_in==0 && lzs->avail_out==0), lzs's internal state may still have a few bytes can be output.
If no more input data, these bytes will never be output, and d->eof marker will never be set.
Set needs_input = 0, then b"" will be passed as input data next time. So it has a chance to output these bytes.

Some points for code review

1, Make sure DecompressReader.read(size=-1) function will not loop infinitely due to needs_input mechanism.

2, Make sure lzma.decompress(data, format=FORMAT_AUTO, memlimit=None, filters=None) function will not loop infinitely.

3, When max_length == 0 (almost impossible situation), LZMADecompressor.decompress(data, max_length=-1) function has expected behavior:

>>> from _lzma import LZMADecompressor
>>> d = LZMADecompressor()
>>> d.decompress(b'', 0)
b''
>>> d.decompress(b'', 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_lzma.LZMAError: Insufficient buffer space

As expected, return LZMA_BUF_ERROR when the second call to lzma_code().

https://bugs.python.org/issue21872

@vsajip
Copy link
Member

vsajip commented Jun 13, 2019

@serhiy-storchaka Any particular reason you asked for me specifically to review? I don't mind making an attempt, but I'm not especially familiar with the LZMA module - so I might miss something on the low-level (C code) side.

@serhiy-storchaka
Copy link
Member

I consider you the lzma expert. I am going to make the review myself, but it would be nice if you make a look too.

@vsajip
Copy link
Member

vsajip commented Jun 13, 2019

I consider you the lzma expert.

Not sure why! I haven't done anything in this area. Are you sure you aren't mixing me up with somebody else?

But I'll be happy to take a look :-)

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@ghost
Copy link
Author

ghost commented Jun 15, 2019

I have made the requested changes; please review again.

BTW, I checked compressor's code, it looks fine.
I also did some hours of tests about compress/decompress, see that the data have a round-trip.

_bz2module.c has very similar code, I'll check it and its needs_input mechanism.

@bedevere-bot
Copy link

Thanks for making the requested changes!

@vsajip: please review the changes made to this pull request.

@gpshead
Copy link
Member

gpshead commented Sep 12, 2019

closing and reopening to re-trigger CI testing.

@gpshead gpshead closed this Sep 12, 2019
@gpshead gpshead reopened this Sep 12, 2019
@gpshead gpshead self-assigned this Sep 12, 2019
@gpshead gpshead merged commit 4ffd05d into python:master Sep 12, 2019
@miss-islington
Copy link
Contributor

Thanks @animalize for the PR, and @gpshead for merging it 🌮🎉.. I'm working now to backport this PR to: 3.7, 3.8.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Sep 12, 2019
…14048)

* 1. add test case with wrong behavior
* 2. fix bug when max_length == -1
* 3. allow b"" as valid input data for decompress_buf()
* 4. when max_length >= 0, let needs_input mechanism works
* add more asserts to test case
(cherry picked from commit 4ffd05d)

Co-authored-by: animalize <animalize@users.noreply.github.com>
@bedevere-bot
Copy link

GH-16054 is a backport of this pull request to the 3.8 branch.

@bedevere-bot
Copy link

GH-16055 is a backport of this pull request to the 3.7 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Sep 12, 2019
…14048)

* 1. add test case with wrong behavior
* 2. fix bug when max_length == -1
* 3. allow b"" as valid input data for decompress_buf()
* 4. when max_length >= 0, let needs_input mechanism works
* add more asserts to test case
(cherry picked from commit 4ffd05d)

Co-authored-by: animalize <animalize@users.noreply.github.com>
miss-islington added a commit that referenced this pull request Sep 12, 2019
* 1. add test case with wrong behavior
* 2. fix bug when max_length == -1
* 3. allow b"" as valid input data for decompress_buf()
* 4. when max_length >= 0, let needs_input mechanism works
* add more asserts to test case
(cherry picked from commit 4ffd05d)

Co-authored-by: animalize <animalize@users.noreply.github.com>
miss-islington added a commit that referenced this pull request Sep 12, 2019
* 1. add test case with wrong behavior
* 2. fix bug when max_length == -1
* 3. allow b"" as valid input data for decompress_buf()
* 4. when max_length >= 0, let needs_input mechanism works
* add more asserts to test case
(cherry picked from commit 4ffd05d)

Co-authored-by: animalize <animalize@users.noreply.github.com>
@ghost ghost deleted the fix_issue21872 branch September 12, 2019 14:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants