bpo-21872: fix lzma library decompresses data incompletely #14048

ghost · 2019-06-13T09:42:33Z

Why this bug happens?

lzma_stream structure is used for passing input and output buffers to liblzma library:

typedef struct {
	const uint8_t *next_in; /* Pointer to the next input byte. */
	size_t avail_in;    /* Number of available input bytes in next_in. */
	uint64_t total_in;  /* Total number of bytes read by liblzma. */

	uint8_t *next_out;  /* Pointer to the next output position. */
	size_t avail_out;   /* Amount of free space in next_out. */
	uint64_t total_out; /* Total number of bytes written by liblzma. */
    ...
    lzma_internal *internal; /* Internal state is not visible to applications. */
    ...
}

There is a possibility: when the input and output buffers are exhausted at the same time, lzma_stream's internal state may still hold a few bytes that are not output.
When the last data chunk inputted, and this happened to happen, those bytes will never be output.

I modified Jeffrey Kintscher's code and _lzmamodule.c to test all attached bad files in issue21872, here is the statistic:

The first column is the number of missing bytes.
The second column is number of occurrences.

 2 bytes: 96
 3 bytes: 72
11 bytes: 57
 1 bytes: 55
 6 bytes: 55
 5 bytes: 52
 0 bytes: 49  <- please note
10 bytes: 46
 4 bytes: 42
 9 bytes: 39
12 bytes: 36
 7 bytes: 32
 8 bytes: 26
14 bytes: 15
13 bytes: 14
15 bytes: 4
18 bytes: 1
27 bytes: 1

0 missing bytes means: although input and output buffers are exhausted at the same time, but internal state doesn't hold any bytes that can be output.

Fix the bug

To fix this bug, just let these missing bytes be output.

There are two situations, need to be handled separately:

when max_length < 0, the output buffer can grow unlimitedly.
when max_length >= 0, the max length of output buffer is specified.

max_length is a parameter of LZMADecompressor.decompress(data, max_length=-1) function.

Commit 1, add test case with wrong behavior

ISSUE_21872_DAT comes from:

https://bugs.python.org/issue21872
attached file more_bad_lzma_files.zip uploaded by vnummela
the smallest file 07_21h_ticks.bi5 in the .zip file

The file size is 2,835 bytes.

Commit 2, fix bug when max_length < 0

If max_length < 0, the output buffer can grow unlimitedly.

Before this commit:

    for (;;) {
        lzret = lzma_code(lzs, LZMA_RUN);
        ...
        if (lzret == LZMA_STREAM_END) {   // stream finished
            d->eof = 1;
            break;
        } else if (lzs->avail_in == 0) {  // input buffer exhausted
            break;
        } else if (lzs->avail_out == 0) { // output buffer exhausted
            // when max_length < 0, output buffer can grow unlimitedly
            if (data_size == max_length)
                break;
            grow_output_buffer()
        }
    }

When input buffer exhausted, lzs's internal state may still hold a few bytes.
If no more input data, these bytes will never be output, and d->eof marker will never be set.

Now let's first determine if the output buffer is exhausted.
If the output buffer is exhausted, we grow the output buffer and loop again, then all decompressed data can be output.

Commit 3, allow b"" as valid input data for decompress_buf()

This commit is a prerequisite for Commit 4.

This code doesn't allow empty bytes b"" as a valid input data for decompress_buf() function:

    if (lzs->avail_in == 0)
        return PyBytes_FromStringAndSize(NULL, 0);

Above code was introduced in issue27517.

After reading issue27517's context and changeset, we can follow the similar code for function compress():

    lzret = lzma_code(lzs, LZMA_RUN);
    ...
+   if (lzret == LZMA_BUF_ERROR && lzs->avail_in == 0 && lzs->avail_out > 0)
+      lzret = LZMA_OK; /* That wasn't a real error */

Above code has the same effect for issue27517's problem.
In addition it allows b"" as a valid input data.

FYI, liblzma's document about return code LZMA_BUF_ERROR:

	LZMA_BUF_ERROR          = 10,
    /*
     * brief       No progress is possible
     *
     * This error code is returned when the coder cannot consume
     * any new input and produce any new output. The most common
     * reason for this error is that the input stream being
     * decoded is truncated or corrupt.
     *
     * This error is not fatal. Coding can be continued normally
     * by providing more input and/or more output space, if
     * possible.
     *
     * Typically the first call to lzma_code() that can do no
     * progress returns LZMA_OK instead of LZMA_BUF_ERROR. Only
     * the second consecutive call doing no progress will return
     * LZMA_BUF_ERROR. This is intentional.
     *
     * With zlib, Z_BUF_ERROR may be returned even if the
     * application is doing nothing wrong, so apps will need
     * to handle Z_BUF_ERROR specially. The above hack
     * guarantees that liblzma never returns LZMA_BUF_ERROR
     * to properly written applications unless the input file
     * is truncated or corrupt. This should simplify the
     * applications a little.
     */

Long story short:

When pass zero-space input or/and output buffers to lzma_code(), return this error code at the second call to lzma_code().
This error is not fatal, we can ignore it.

Commit 4, when max_length >= 0, let needs_input mechanism works

This commit lets needs_input mechanism in function DecompressReader.read(size=-1) works.

When (lzs->avail_in==0 && lzs->avail_out==0), lzs's internal state may still have a few bytes can be output.
If no more input data, these bytes will never be output, and d->eof marker will never be set.
Set needs_input = 0, then b"" will be passed as input data next time. So it has a chance to output these bytes.

Some points for code review

1, Make sure DecompressReader.read(size=-1) function will not loop infinitely due to needs_input mechanism.

2, Make sure lzma.decompress(data, format=FORMAT_AUTO, memlimit=None, filters=None) function will not loop infinitely.

3, When max_length == 0 (almost impossible situation), LZMADecompressor.decompress(data, max_length=-1) function has expected behavior:

>>> from _lzma import LZMADecompressor
>>> d = LZMADecompressor()
>>> d.decompress(b'', 0)
b''
>>> d.decompress(b'', 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_lzma.LZMAError: Insufficient buffer space

As expected, return LZMA_BUF_ERROR when the second call to lzma_code().

https://bugs.python.org/issue21872

vsajip · 2019-06-13T15:49:25Z

@serhiy-storchaka Any particular reason you asked for me specifically to review? I don't mind making an attempt, but I'm not especially familiar with the LZMA module - so I might miss something on the low-level (C code) side.

serhiy-storchaka · 2019-06-13T16:33:29Z

I consider you the lzma expert. I am going to make the review myself, but it would be nice if you make a look too.

vsajip · 2019-06-13T17:08:08Z

I consider you the lzma expert.

Not sure why! I haven't done anything in this area. Are you sure you aren't mixing me up with somebody else?

But I'll be happy to take a look :-)

Lib/test/test_lzma.py

bedevere-bot · 2019-06-14T15:25:54Z

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

ghost · 2019-06-15T00:04:16Z

I have made the requested changes; please review again.

BTW, I checked compressor's code, it looks fine.
I also did some hours of tests about compress/decompress, see that the data have a round-trip.

_bz2module.c has very similar code, I'll check it and its needs_input mechanism.

bedevere-bot · 2019-06-15T00:04:19Z

Thanks for making the requested changes!

@vsajip: please review the changes made to this pull request.

gpshead · 2019-09-12T13:54:35Z

closing and reopening to re-trigger CI testing.

miss-islington · 2019-09-12T14:20:40Z

Thanks @animalize for the PR, and @gpshead for merging it 🌮🎉.. I'm working now to backport this PR to: 3.7, 3.8.
🐍🍒⛏🤖

…14048) * 1. add test case with wrong behavior * 2. fix bug when max_length == -1 * 3. allow b"" as valid input data for decompress_buf() * 4. when max_length >= 0, let needs_input mechanism works * add more asserts to test case (cherry picked from commit 4ffd05d) Co-authored-by: animalize <animalize@users.noreply.github.com>

bedevere-bot · 2019-09-12T14:20:52Z

GH-16054 is a backport of this pull request to the 3.8 branch.

bedevere-bot · 2019-09-12T14:21:02Z

GH-16055 is a backport of this pull request to the 3.7 branch.

…14048) * 1. add test case with wrong behavior * 2. fix bug when max_length == -1 * 3. allow b"" as valid input data for decompress_buf() * 4. when max_length >= 0, let needs_input mechanism works * add more asserts to test case (cherry picked from commit 4ffd05d) Co-authored-by: animalize <animalize@users.noreply.github.com>

* 1. add test case with wrong behavior * 2. fix bug when max_length == -1 * 3. allow b"" as valid input data for decompress_buf() * 4. when max_length >= 0, let needs_input mechanism works * add more asserts to test case (cherry picked from commit 4ffd05d) Co-authored-by: animalize <animalize@users.noreply.github.com>

wjssz added 4 commits June 13, 2019 14:29

1. add test case with wrong behavior

1dbbfcf

2. fix bug when max_length == -1

ad5cdd1

3. allow b"" as valid input data for decompress_buf()

454b278

4. when max_length >= 0, let needs_input mechanism works

594d525

the-knights-who-say-ni added the CLA signed label Jun 13, 2019

bedevere-bot added the awaiting review label Jun 13, 2019

serhiy-storchaka requested review from serhiy-storchaka and vsajip June 13, 2019 10:01

serhiy-storchaka added needs backport to 3.7 type-bug An unexpected behavior, bug, or error labels Jun 13, 2019

vsajip requested changes Jun 14, 2019

View reviewed changes

Lib/test/test_lzma.py Show resolved Hide resolved

Lib/test/test_lzma.py Show resolved Hide resolved

Lib/test/test_lzma.py Outdated Show resolved Hide resolved

bedevere-bot removed the awaiting review label Jun 14, 2019

bedevere-bot added the awaiting changes label Jun 14, 2019

add more asserts to test case

05951a6

bedevere-bot added awaiting change review and removed awaiting changes labels Jun 15, 2019

ghost mentioned this pull request Jun 15, 2019

Fix decompress data incompletely peterjc/backports.lzma#40

Merged

vsajip approved these changes Jun 15, 2019

View reviewed changes

bedevere-bot added awaiting merge and removed awaiting change review labels Jun 15, 2019

update NEWS entry

62ba236

gpshead closed this Sep 12, 2019

gpshead reopened this Sep 12, 2019

gpshead approved these changes Sep 12, 2019

View reviewed changes

gpshead self-assigned this Sep 12, 2019

gpshead merged commit 4ffd05d into python:master Sep 12, 2019

bedevere-bot removed the awaiting merge label Sep 12, 2019

bedevere-bot removed needs backport to 3.8 labels Sep 12, 2019

peterjc mentioned this pull request Sep 12, 2019

lzma sometimes fails to decompress a file peterjc/backports.lzma#6

Closed

ghost deleted the fix_issue21872 branch September 12, 2019 14:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bpo-21872: fix lzma library decompresses data incompletely #14048

bpo-21872: fix lzma library decompresses data incompletely #14048

ghost commented Jun 13, 2019 •

edited by bedevere-bot

Loading

vsajip commented Jun 13, 2019

serhiy-storchaka commented Jun 13, 2019

vsajip commented Jun 13, 2019

bedevere-bot commented Jun 14, 2019

ghost commented Jun 15, 2019

bedevere-bot commented Jun 15, 2019

gpshead commented Sep 12, 2019

miss-islington commented Sep 12, 2019

bedevere-bot commented Sep 12, 2019

bedevere-bot commented Sep 12, 2019

bpo-21872: fix lzma library decompresses data incompletely #14048

bpo-21872: fix lzma library decompresses data incompletely #14048

Conversation

ghost commented Jun 13, 2019 • edited by bedevere-bot Loading

Why this bug happens?

Fix the bug

Commit 1, add test case with wrong behavior

Commit 2, fix bug when max_length < 0

Commit 3, allow b"" as valid input data for decompress_buf()

Commit 4, when max_length >= 0, let needs_input mechanism works

Some points for code review

vsajip commented Jun 13, 2019

serhiy-storchaka commented Jun 13, 2019

vsajip commented Jun 13, 2019

bedevere-bot commented Jun 14, 2019

ghost commented Jun 15, 2019

bedevere-bot commented Jun 15, 2019

gpshead commented Sep 12, 2019

miss-islington commented Sep 12, 2019

bedevere-bot commented Sep 12, 2019

bedevere-bot commented Sep 12, 2019

ghost commented Jun 13, 2019 •

edited by bedevere-bot

Loading