
gh-115952: Fix potential virtual memory allocation denial of service in the pickle module #119204


Open
wants to merge 19 commits into main

Conversation

serhiy-storchaka (Member) commented May 20, 2024

Loading small data which does not even involve arbitrary code execution could consume an arbitrarily large amount of memory. There were three issues:

  • PUT and LONG_BINPUT with a large argument (the C implementation only). Since the memo is implemented in C as a contiguous dynamic array, a single opcode could cause it to be resized to an arbitrary size. The sparsity of memo indices is now limited; a dict is used for memo indices that are too sparse.
  • BINBYTES, BINBYTES8 and BYTEARRAY8 with a large argument. They allocated the bytes or bytearray object of the specified size before reading into it. They now read very large data in chunks (see the sketch after this list).
  • BINSTRING, BINUNICODE, LONG4, BINUNICODE8 and FRAME with a large argument. They read the whole data by calling the read() method of the underlying file object, which usually allocates a bytes object of the specified size before reading into it. They now read very large data in chunks.
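For the second and third bullets, the idea is to stop trusting the size field in the opcode and instead grow the buffer geometrically as data actually arrives. A minimal sketch of that idea (not the code in this PR; the names `_read_exact` and `_READ_BUF_SIZE` are illustrative):

```python
import io

_READ_BUF_SIZE = 1 << 20  # illustrative 1 MiB bound on the first read

def _read_exact(file, size):
    """Read exactly `size` bytes without pre-allocating `size` up front.

    The first request is capped at _READ_BUF_SIZE and each later request is
    at most twice the data already read, so a truncated stream fails after
    allocating only about as much as was actually available, while a genuine
    large payload still needs only O(log N) read() calls.
    """
    data = file.read(min(size, _READ_BUF_SIZE))
    while len(data) < size:
        chunk = file.read(min(size - len(data), len(data) * 2))
        if not chunk:
            raise EOFError("pickle data was truncated")
        data += chunk
    return data

# A 16-byte stream claiming a ~1 TB payload fails cheaply instead of
# triggering a ~1 TB allocation.
try:
    _read_exact(io.BytesIO(b"x" * 16), 10**12)
except EOFError as exc:
    print(exc)
```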

@gpshead gpshead marked this pull request as draft May 24, 2024 19:58
gpshead (Member) commented May 24, 2024

I've marked this as Draft for now, as the discussion about this on the security response team list is not complete. (We'll summarize that in a public issue once it has settled.)

@serhiy-storchaka serhiy-storchaka marked this pull request as ready for review June 29, 2024 08:12
eli-schwartz (Contributor) commented:
I see this has been taken out of draft. Is the security response summary available yet?

Lib/pickle.py Outdated
@@ -304,11 +309,23 @@ def readline(self):
        else:
            return self.file_readline()

    def safe_file_read(self, size):
Member

Don't use an ill-defined word such as "safe" in the name; just describe what it does. chunked_file_read() makes more sense. The same goes for SAFE_BUF_SIZE: just call it _READ_BUF_SIZE (with an underscore prefix so it isn't considered a public API).

Member Author

Note that there is _safe_read in http.client, although ironically it is vulnerable to this kind of issue.

I'm afraid that a name like chunked_file_read could be misleading, as the data is not read in fixed-size chunks. I guess this is what led you to the misunderstanding about the O(N^2) behavior.

{
    PyObject *old_item;

    if (idx >= self->memo_size) {
        if (idx > self->memo_len * 2 + (1 << 17)) {
Member

Where does (1 << 17) come from? That feels like it should be a named constant, or at least deserves a comment as to what the significance of this value is and why this limit exists.

Member Author

It is just an arbitrary constant chosen to make the existing tests pass. It should be larger than 2**16, otherwise we would need to make the input data for two tests much larger. I do not think that real code needs more than 0 or 1. I have no good name for it. The Python code does not need this, as it does not have such a limitation and is not vulnerable to this issue (but it is less efficient).

This is the part of this PR about which I have the greatest doubts. It can make the unpickler reject unusual data that is accepted now, although such data is never produced by the pickler (it can still be unpickled with the Python implementation).

I could make the C code more similar to the Python code (as a fallback). But this would add a lot of complexity, and I do not know whether there is a use case for it at all. It may be a hypothetical problem that never occurs in the real world, except in some artificial tests.

@@ -153,6 +153,9 @@ enum {

    /* Prefetch size when unpickling (disabled on unpeekable streams) */
    PREFETCH = 8192 * 16,
    /* Data larger than this will be read in chunks, to prevent extreme
       overallocation. */
    SAFE_BUF_SIZE = 1 << 20,
Member

Let's not use "SAFE" in the name; match whatever new name is chosen in the Python code.

@@ -0,0 +1,3 @@
Fix a vulnerability in the :mod:`pickle` module which allowed to consume
Member

suggested new wording:

:mod:`pickle` now avoids a potential allocated virtual memory
address space denial of service condition when
unpickling specially crafted data.

Pickle isn't for use on untrusted data and no privileges could be escalated, so calling this a vulnerability is a stretch. It also wasn't a bad one: it didn't consume memory, it just mapped virtual address space.

Member Author

It consumes memory. The mapped virtual address space is filled with data (at least in a debug build).

Pickle isn't safe for untrusted data by default, but it can be made safe (by strictly restricting what data it can handle) after resolving this issue.

bedevere-app bot commented Jun 30, 2024

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@gpshead gpshead changed the title gh-115952: Fix vulnerability in the pickle module gh-115952: Fix potential virtual memory allocation denial of service in the pickle module Jun 30, 2024
gpshead (Member) commented Jun 30, 2024

> I see this has been taken out of draft. Is the security response summary available yet?

I believe that was Serhiy indicating that more review and a resolution would be nice. I can't disagree with that, we just haven't had time.

Security discussions were left hanging and need resuming (no urgency - it's pickle - which is not intended for untrusted data). There is no reason to keep them private, this isn't what I'd really call a vulnerability.

Data that'd help decide if we should just do something like this anyways: Performance testing after the PR comments to fix O(N^2) issues are addressed.

This PR makes the unpickling code more complicated (potentially slower) and causes pickle to temporarily consume twice as much memory when unpickling large data, due to the extra chunked buffering copy. It also makes many more read calls to the underlying file while doing that, where it only made one in the past.

I'll put a "do not submit" label on the PR until that's been investigated. (It's more of an "I'm not convinced we actually want this yet, even once the implementation is in good shape".)

serhiy-storchaka (Member Author) commented:

I thought I left it in a draft state because I was planning to do something else. But after re-checking, I found nothing unfinished.

Only later did I notice that it was Gregory who put it in a draft state, but by then it was too late at night to revive the discussion.

serhiy-storchaka (Member Author) left a comment:

I'm afraid, @gpshead, that you misunderstood the problem and the solution.

  • The current code does not just allocate the address space. It fills it with data, which takes time and causes swapping (at least in a debug build). Just run the tests added in this PR against the unmodified code.
  • The complexity of the new read is O(N log N), not O(N^2). Note that the more data has already been read, the larger the next read chunk will be. And the initial chunk is pretty large (1 MiB), so most user code will not be affected at all. The algorithm is triggered only when you read very large strings. For 1 GiB it will only add 10 reads; for 1 TiB, 20 reads (see the arithmetic sketch below). If this looks like too high a cost, we can increase the initial size or the multiplier (but the code will be less effective at mitigating intentional attacks).
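A quick back-of-the-envelope check of those numbers, assuming a 1 MiB initial chunk and doubling growth (the helper below is illustrative, not part of the PR):

```python
import math

def extra_reads(total_size, first_chunk=1 << 20):
    """Extra read() calls beyond the first, if each chunk may be up to
    twice the data already read (doubling growth)."""
    if total_size <= first_chunk:
        return 0
    return math.ceil(math.log2(total_size / first_chunk))

print(extra_reads(1 << 30))  # 1 GiB -> 10
print(extra_reads(1 << 40))  # 1 TiB -> 20
```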


gpshead (Member) commented Aug 31, 2024

Note that this is still on my radar and I do intend to get back to looking at this.

@gpshead gpshead self-assigned this Aug 31, 2024
serhiy-storchaka (Member Author) commented:

I have removed the restriction on the memo keys. Now a dict is used for keys that are too sparse (see the sketch below).
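A minimal Python sketch of that strategy (the real change is in the C unpickler in Modules/_pickle.c; the class and constant below are illustrative only): dense indices keep using a flat array, and an index far beyond the number of stored entries goes into a dict instead of forcing the array to grow.

```python
class _Memo:
    """Illustrative sketch: contiguous array for dense memo indices,
    dict fallback for indices that are too sparse."""

    _SPARSITY_SLACK = 1 << 17  # same order as the constant discussed above

    def __init__(self):
        self._array = []   # dense storage, indexed directly by memo index
        self._dict = {}    # fallback for very sparse indices
        self._len = 0      # number of entries actually stored

    def put(self, idx, value):
        if idx < len(self._array):
            if self._array[idx] is None:
                self._len += 1
            self._array[idx] = value
        elif idx <= self._len * 2 + self._SPARSITY_SLACK:
            # Array growth is bounded by the number of real entries, so a
            # single PUT opcode cannot force an arbitrarily large allocation.
            self._array.extend([None] * (idx + 1 - len(self._array)))
            self._array[idx] = value
            self._len += 1
        else:
            self._dict[idx] = value
            self._len += 1

    def get(self, idx):
        if idx < len(self._array) and self._array[idx] is not None:
            return self._array[idx]
        return self._dict[idx]
```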

gpshead (Member) left a comment:

Thanks for your patience on this, sorry for blocking it for so long.

Overall I do think this will be a good change. I do wonder if it'll have negative performance impacts, but as you correctly pointed out, I was reading the code a bit wrong, so it's O(N log N) in the huge-size case vs today's O(N). There may be more (up to double-ish) RAM consumption temporarily for actual huge data and an additional copy in that case; I don't know if that practically matters or not.

But for that reason and, separately, the addition of the memo_dict field I think this should just stay in main and not be backported as a bugfix.

size <<= 1
yield stop

def test_too_large_put(self):
Member

Can you add a comment explaining why this and the next test method result in ([], []) being returned no matter what rather than an error when the values are too large? (I suspect readers with a knowledge of the specific pickle protocol may understand, but it isn't obvious otherwise)

}
}
if (self->memo_dict != NULL) {
PyObject *key = PyLong_FromSsize_t(idx);
Member

A size_t being fed into PyLong_FromSsize_t could be a problem if a size_t idx were large enough to overflow it. Is that plausible in this code?

Member Author

Good catch. Actually, the argument of _Unpickler_MemoGet() is always a Py_ssize_t cast to size_t, but it is better to always treat it as size_t here.

    if (s == (size_t)-1) {
        return -1;
    }
    res += s;
Member

To be pedantic, check for the possibility of overflow first. (Practically speaking, I don't think res or s could ever be large enough for that to actually happen.) The existing code being replaced already had this "problem", FWIW.

Member Author

When I wrote the code for all these __sizeof__ implementations, I initially added integer overflow checks, but MvL told me to remove them. If we are going to add integer overflow checks, we should add them not only here, but a few lines above (for res += self->memo_size * sizeof(PyObject *)) and in dozens of other places. That is a separate, larger issue.

@gpshead gpshead added 3.14 new features, bugs and security fixes and removed DO-NOT-MERGE needs backport to 3.12 only security fixes needs backport to 3.13 bugs and security fixes labels Sep 28, 2024
serhiy-storchaka (Member Author) left a comment:

> I do wonder if it'll have negative performance impacts

The worst case -- loading a 1 GB bytes object -- is now about 2% slower. This may be insignificant, because the 3.13 results lie between the old and new results and the ranges almost intersect.

For other tested cases I have not detected any difference.

> There may be more (up to double-ish) RAM consumption temporarily for actual huge data

This is only possible in an extreme case -- loading a large bytes object with extremely fragmented memory (so that realloc() is inefficient). I did not reproduce such a case, so I cannot say whether it is practically possible. In all other cases this is insignificant. In a more realistic scenario you use FRAME, and the total size of the objects loaded from the frame data exceeds the size of that data.


@serhiy-storchaka serhiy-storchaka requested a review from a team as a code owner April 8, 2025 14:22
Labels: 3.14 (new features, bugs and security fixes), awaiting merge