
Memory keeps increasing with fixed-size dict during multi-threaded set/delete in 3.13.3t #133136


Open
Yiling-J opened this issue Apr 29, 2025 · 9 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage), topic-free-threading, type-bug (An unexpected behavior, bug, or error)

Comments


Yiling-J commented Apr 29, 2025

Bug report

Bug description:

import random
import time
from threading import Lock, Thread

import psutil


class Cache:
    def __init__(self, cap):
        self.m = {}
        self.mu = Lock()
        self.cap = cap

    def get(self, key):
        with self.mu:
            try:
                v = self.m[key]
            except KeyError:
                return 0, False
            return v, True

    def set(self, key, value):
        with self.mu:
            self.m[key] = value
            if len(self.m) > self.cap:
                self.m.popitem()

def set_key(cache):
    c = 0
    while True:
        key = random.randint(0, 1000000000)
        cache.set(key, key)
        c += 1
        if c == 100000:
            c = 0
            print(".", end="", flush=True)

def run():
    cache = Cache(500000)

    for _ in range(4):
        thread = Thread(target=set_key, args=[cache])
        thread.start()

    process = psutil.Process()
    while True:
        time.sleep(10)
        with cache.mu:
            print(f"rss: {process.memory_info().rss}, dict len: {len(cache.m)}")

if __name__ == "__main__":
    run()

Platform: macOS-14.6.1-x86_64-i386-64bit-Mach-O

Python 3.13.3t

uv run --python 3.13.3t mem.py

............................................................rss: 384696320, dict len: 500000
.......................................................rss: 636743680, dict len: 500000
.......................................................rss: 888782848, dict len: 500000
.........................................................rss: 1140793344, dict len: 500000
......................................................rss: 1392799744, dict len: 500000

Python 3.13.3

uv run --python 3.13.3 mem.py

....................................................................................................rss: 159633408, dict len: 500000
...............................................................................................................rss: 159813632, dict len: 500000
.........................................................................................................rss: 159981568, dict len: 500000
...........................................................................................................rss: 160149504, dict len: 500000
......................................................................................................rss: 160301056, dict len: 500000

CPython versions tested on:

3.13

Operating systems tested on:

macOS

Linked PRs

Yiling-J added the type-bug (An unexpected behavior, bug, or error) label on Apr 29, 2025

Zheaoli commented Apr 29, 2025

This bug can be reproduced in my Linux environment. I think it may take some time to figure out.

picnixz added the performance (Performance or resource usage) and interpreter-core (Objects, Python, Grammar, and Parser dirs) labels on May 1, 2025

Zheaoli commented May 6, 2025

For now, I think this may be related to GC timing:

import random
import time
from threading import Lock, Thread

import resource
import gc

class Cache:
    def __init__(self, cap):
        self.m = {}
        self.mu = Lock()
        self.cap = cap

    def get(self, key):
        with self.mu:
            try:
                v = self.m[key]
            except KeyError:
                return 0, False
            return v, True

    def set(self, key, value):
        with self.mu:
            self.m[key] = value
            if len(self.m) > self.cap:
                self.m.popitem()

def set_key(cache):
    c = 0
    while True:
        key = random.randint(0, 1000000000)
        cache.set(key, key)
        c += 1
        if c == 100000:
            c = 0
            print(".", end="", flush=True)

def run():
    cache = Cache(500000)

    for _ in range(4):
        thread = Thread(target=set_key, args=[cache])
        thread.start()

    while True:
        time.sleep(1)
        with cache.mu:
            usage = resource.getrusage(resource.RUSAGE_SELF)

            rss_kb = usage.ru_maxrss
            print(f"Process RSS: {rss_kb:.2f} KB")
            # gc.collect()

if __name__ == "__main__":
    run()

If we run gc.collect() manually, the memory usage stays normal. Otherwise, it grows by about 40 MB at regular intervals.


Zheaoli commented May 6, 2025

It seems to run normally in v3.13.0a3.

After running the loop for 10 minutes, memory stays below 200 MB.

This is a good signal. I think I can bisect to find the bad commit.


Yiling-J commented May 7, 2025

@Zheaoli Thanks for investigating the issue. I tested by manually triggering GC every 10 seconds and every 1 second. With the 10-second interval, the RSS stabilized around rss: 893140992, dict len: 500000; with the 1-second interval, it stabilized around rss: 490344448, dict len: 500000. I also tested 3.13.0t without manually triggering GC; the memory usage still increases, though more slowly than on 3.13.3t.
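
For reference, this is roughly what I mean by manually triggering GC: a background thread that calls gc.collect() on a fixed interval (a minimal sketch; the interval value and thread wiring are illustrative, not the exact code I ran):

import gc
import time
from threading import Thread

def gc_trigger(interval):
    # Force a full collection on a schedule; in my tests this keeps the
    # retired dict-keys memory from accumulating on the free-threaded build.
    while True:
        time.sleep(interval)
        gc.collect()

# Started alongside the set_key threads in run(), e.g.:
# Thread(target=gc_trigger, args=[10], daemon=True).start()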


Zheaoli commented May 7, 2025

Got it. I think this bug was introduced into the codebase in #116336.

But I don't have a clear idea of how to fix it. Would you mind lending a hand? @colesbury


Zheaoli commented May 12, 2025

@colesbury PTAL when you get time


Zheaoli commented May 20, 2025

I think this issue is still related to the GC. I counted how many times free_keys_object is called: the count on the normal build is much higher than on the free-threaded build.

@Yiling-J
Author

I think this issue is still related to the GC. I counted how many times free_keys_object is called: the count on the normal build is much higher than on the free-threaded build.

This seems like a reasonable explanation for the issue. Hopefully, @colesbury can take a look when he has time, or if he's too busy, perhaps someone else familiar with free-threading could also help investigate.


colesbury commented Jun 3, 2025

I think we need better heuristics for QSBR here. The repeated insertions and popitem() calls lead to the dictionary keys object getting reallocated every so often. The PyDictKeysObjects here are large, so keeping a bunch of them around means a lot of extra memory usage in this example.

We "buffer" QSBR frees in two ways:

  1. We only advance the global QSBR counter every 10 calls to _Py_qsbr_deferred_advance. This is to reduce contention on the shared counter for frequent calls, but that's probably too infrequent here given the size of the keys objects.

uint64_t seq = _Py_qsbr_deferred_advance(tstate->qsbr);

  2. We only process the delayed free requests every WORK_ITEMS_PER_CHUNK (254) calls, again to limit work for frequent calls.

cpython/Objects/obmalloc.c

Lines 1208 to 1210 in 1ffe913

if (buf->wr_idx == WORK_ITEMS_PER_CHUNK) {
    _PyMem_ProcessDelayed((PyThreadState *)tstate);
}

In addition to periodic processing every 10 or 254 calls, I think we probably want to take into account the amount of memory that the QSBR subsystem is holding.
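
A rough model of what that could look like (Python pseudocode only, not the actual obmalloc.c logic; QSBR_DEFERRED_MEMORY_LIMIT and the other bookkeeping names are hypothetical):

WORK_ITEMS_PER_CHUNK = 254
QSBR_DEFERRED_MEMORY_LIMIT = 1 << 20  # hypothetical byte threshold, not a real CPython constant

class DelayedFreeBuffer:
    def __init__(self):
        self.items = []          # memory blocks waiting for a safe reclamation point
        self.deferred_bytes = 0  # total size of the memory currently held back

    def add(self, block, size):
        self.items.append((block, size))
        self.deferred_bytes += size
        # Flush on item count (current behavior) *or* on total held memory (proposed).
        if (len(self.items) >= WORK_ITEMS_PER_CHUNK
                or self.deferred_bytes >= QSBR_DEFERRED_MEMORY_LIMIT):
            self.process_delayed()

    def process_delayed(self):
        # Stand-in for _PyMem_ProcessDelayed(): free every block whose QSBR
        # goal sequence has been observed by all threads.
        self.items.clear()
        self.deferred_bytes = 0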

cc @mpage, @Yhg1s, @nascheme, @DinoV

colesbury added a commit to colesbury/cpython that referenced this issue Jun 3, 2025
The free threading build uses QSBR to delay the freeing of dictionary
keys and list arrays when the objects are accessed by multiple threads,
in order to allow concurrent reads to proceed without holding the object
lock. The requests are processed in batches to reduce execution
overhead, but for large memory blocks this can lead to excess memory
usage.

Take into account the size of the memory block when deciding when to
process QSBR requests.