Skip to content

nogil segmentation fault on ordered dict operations #125996

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
luisggpina opened this issue Oct 25, 2024 · 7 comments
Open

nogil segmentation fault on ordered dict operations #125996

luisggpina opened this issue Oct 25, 2024 · 7 comments
Assignees
Labels
extension-modules C modules in the Modules dir topic-free-threading type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@luisggpina
Copy link

luisggpina commented Oct 25, 2024

Crash report

What happened?

Hi,

We're a research group focused on testing concurrent runtimes. Our work-in-progress prototype found a segmentation fault on the current nogil build when using concurrent operations on the same ordered dict. The program below shows the crash.

from threading import Barrier, Thread
import sys
from collections import OrderedDict

d = OrderedDict({"k1": "v1", "k2": "v2", 1: 2})

def t1(b1):
    b1.wait()
    try:
        d.move_to_end("k1")
    except:
        None
def t2(b1):
    b1.wait()
    d.clear()
    

def test(): 
    
    global d
    barrier = Barrier(2) 
    threads = [ Thread(target= t1, args=(barrier,)) , Thread(target= t2, args=(barrier,)) ]
    
    d = OrderedDict({"k1": "v1", "k2": "v2", 1: 2})

    for t in threads:
        t.start()
    for t in threads:
        t.join()

print("Test begin...")

for i in range(0,1000):
    test()

print("Test Done")

We ran this on our build from source configured with ./configure --disable-gil.

Here's some details about the OS, compiler, binutils: Linux version 5.15.0-124-generic (buildd@lcy02-amd64-118) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024

@flypoodles and @overlorde are part of the team, adding them so they get notified about further discussion.

We're happy to provide more details about the crash.

CPython versions tested on:

3.13, 3.14, CPython main branch

Operating systems tested on:

Linux

Output from running 'python -VV' on the command line:

Python 3.14.0a1+ experimental free-threading build (heads/main:c5b99f5c2c5, Oct 25 2024, 17:31:30) [GCC 11.4.0]

@luisggpina luisggpina added the type-crash A hard crash of the interpreter, possibly with a core dump label Oct 25, 2024
@rruuaanng

This comment has been minimized.

@luisggpina
Copy link
Author

gdb uses ptrace to attach to the process, our experience is that ptrace changes timings and thread schedulings. Can you confirm it crashes outside of gdb?

We'll try to get a stacktrace and more info on this, we'll also try your version.

@rruuaanng

This comment has been minimized.

@mpage
Copy link
Contributor

mpage commented Oct 26, 2024

I can reproduce this if I run the script provided by @luisggpina in a loop. Here is a sample stacktrace: https://gist.github.com/mpage/60b274f14c26cca1429fdfe98cc74ef9.

Built with ./configure --disable-gil --with-pydebug:

> ./python -VV
Python 3.14.0a1+ experimental free-threading build (heads/main:13844094609, Oct 25 2024, 17:01:23) [GCC 11.5.0 20240719 (Red Hat 11.5.0-2)]

I don't think anyone has looked at making ordereddict thread-safe yet, so I'm not surprised to see a crash here.

@rruuaanng
Copy link
Contributor

rruuaanng commented Oct 26, 2024

cpython/Objects/odictobject.c

Lines 1300 to 1322 in c5b99f5

node = last ? _odict_LAST(self) : _odict_FIRST(self);
if (key != _odictnode_KEY(node)) {
node = _odict_find_node(self, key);
if (node == NULL) {
if (!PyErr_Occurred())
PyErr_SetObject(PyExc_KeyError, key);
return NULL;
}
if (last) {
/* Only move if not already the last one. */
if (node != _odict_LAST(self)) {
_odict_remove_node(self, node);
_odict_add_tail(self, node);
}
}
else {
/* Only move if not already the first one. */
if (node != _odict_FIRST(self)) {
_odict_remove_node(self, node);
_odict_add_head(self, node);
}
}
}

I successfully triggered the error, and it turns out that the underlying operation of move_to_end is not atomic. If this is not considered from the beginning of the design, and the author does not intend to take over this task, I request to fix the issue.

Edit:
Actually, you only need to make the following changes, for example

 static void
 _odict_add_head(PyODictObject *od, _ODictNode *node)
 {
+#ifdef Py_GIL_DISABLED
+   _Py_atomic_store_ptr_release(_odictnode_PREV(node), NULL);
+   _Py_atomic_store_ptr_release(_odictnode_NEXT(node), _odict_FIRST(od));
+    if (_Py_atomic_load_ptr_acquire(_odict_FIRST(od)) == NULL) {
+        _Py_atomic_store_ptr_release(_odict_LAST(od), node);
+    }
+    else {
+        _Py_atomic_store_ptr_release(_odictnode_PREV(_odict_FIRST(od)), node);
+    }
+    _Py_atomic_add_int64(&od->od_state, 1);
+#endif
     _odictnode_PREV(node) = NULL;
     _odictnode_NEXT(node) = _odict_FIRST(od);
     if (_odict_FIRST(od) == NULL)

It will be consistent with the append operation of the list.

#ifdef Py_GIL_DISABLED
        _Py_atomic_store_ptr_release(&self->ob_item[len], newitem);
#else
        PyList_SET_ITEM(self, len, newitem);
#endif

@luisggpina
Copy link
Author

luisggpina commented Oct 26, 2024

Thanks for the proposed fix!

Here's a case that might be problematic:

  1. The current code starts by checking if the dict is empty. If so, the method returns:

    cpython/Objects/odictobject.c

    Lines 1296 to 1299 in c5b99f5

    if (_odict_EMPTY(self)) {
    PyErr_SetObject(PyExc_KeyError, key);
    return NULL;
    }
  2. After the check, other threads mutate the dict such that it becomes empty (e.g., calling clear, removing the last element from a dict with a single element, N threads concurrently removing N elements from a dict with N elements, etc)
  3. This line assumes that node will never be NULL due to an empty dict, but now that assumption fails:
    node = last ? _odict_LAST(self) : _odict_FIRST(self);
  4. The next line dereferences NULL:
    if (key != _odictnode_KEY(node)) {

To increase the odds of finding this behavior, you can add calls to yield (maybe it's _Py_yield()) after the NULL check I mentioned above in (1).

Apologies if I misunderstood your proposed fix, but I'm curious about the case where the dict changes in a meaningful way (e.g., non empty to empty) while the method is active.

@Eclips4 Eclips4 added the extension-modules C modules in the Modules dir label Oct 26, 2024
@colesbury
Copy link
Contributor

Thanks for the bug report.

I don't think anyone has looked at making ordereddict thread-safe yet, so I'm not surprised to see a crash here.

It's slightly more embarrassing than that. Dino started to look at OrderedDict back in January, and I suggested we defer it at the time and then forgot to revisit it.

Given that 3.13.0 is already released, my preference would be:

@kumaraditya303 kumaraditya303 self-assigned this Apr 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
extension-modules C modules in the Modules dir topic-free-threading type-crash A hard crash of the interpreter, possibly with a core dump
Projects
None yet
Development

No branches or pull requests

6 participants