Python 3.14+: "python: Objects/unicodeobject.c:10387: _PyUnicode_JoinArray: Assertion `res_data == PyUnicode_1BYTE_DATA(res) + kind * PyUnicode_GET_LENGTH(res)' failed." in sqlglot #134889

Open
mgorny opened this issue May 29, 2025 · 7 comments
Labels
3.14 bugs and security fixes 3.15 new features, bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@mgorny
Contributor

mgorny commented May 29, 2025

Crash report

What happened?

The pure Python code in sqlglot package manages to trigger an assertion in CPython:

python: Objects/unicodeobject.c:10387: _PyUnicode_JoinArray: Assertion `res_data == PyUnicode_1BYTE_DATA(res) + kind * PyUnicode_GET_LENGTH(res)' failed.
Aborted (core dumped)

Unfortunately, due to limited time, this is as far as I've been able to reduce it:

from sqlglot import parse_one

parse_one("SELECT * FROM taxi ORDER BY 1 OFFSET 0 ROWS FETCH NEXT 3 ROWS ONLY").sql()

I can reproduce with 3.14.0b2 and 4109a9c, built with --with-assertions (but for some reason, it doesn't happen if I build with --with-pydebug), against sqlglot 26.23.0, i.e.:

CFLAGS='-O0 -g' ./configure -C --with-assertions
make -j$(nproc)
./python -m venv .venv
.venv/bin/pip install sqlglot
.venv/bin/python -c 'from sqlglot import parse_one; parse_one("SELECT * FROM taxi ORDER BY 1 OFFSET 0 ROWS FETCH NEXT 3 ROWS ONLY").sql()'
(gdb) bt
#0  0x00007feb24e84dbc in ?? () from /usr/lib64/libc.so.6
#1  0x00007feb24e2c8e6 in raise () from /usr/lib64/libc.so.6
#2  0x00007feb24e1434b in abort () from /usr/lib64/libc.so.6
#3  0x00007feb24e142b5 in ?? () from /usr/lib64/libc.so.6
#4  0x000055e03a36ecf3 in _PyUnicode_JoinArray (separator=0x55e03a899bc8 <_PyRuntime+35496>, items=0x7ffeef0e9278, seqlen=4)
    at Objects/unicodeobject.c:10387
#5  0x000055e03a423f4a in _PyEval_EvalFrameDefault (tstate=0x55e03a8de070 <_PyRuntime+315216>, frame=0x7feb250e9690, throwflag=0)
    at Python/generated_cases.c.h:1414
#6  0x000055e03a41a1f2 in _PyEval_EvalFrame (tstate=0x55e03a8de070 <_PyRuntime+315216>, frame=0x7feb250e9120, throwflag=0)
    at ./Include/internal/pycore_ceval.h:119
#7  0x000055e03a45c7d0 in _PyEval_Vector (tstate=0x55e03a8de070 <_PyRuntime+315216>, func=0x7feb237f54e0, locals=0x0, 
    args=0x7ffeef0e95f0, argcount=2, kwnames=0x0) at Python/ceval.c:1975
#8  0x000055e03a21d64d in _PyFunction_Vectorcall (func=0x7feb237f54e0, stack=0x7ffeef0e95f0, nargsf=2, kwnames=0x0)
    at Objects/call.c:413
#9  0x000055e03a2213da in _PyObject_VectorcallTstate (tstate=0x55e03a8de070 <_PyRuntime+315216>, callable=0x7feb237f54e0, 
    args=0x7ffeef0e95f0, nargsf=2, kwnames=0x0) at ./Include/internal/pycore_call.h:169
#10 0x000055e03a221f4e in method_vectorcall (method=0x7feb237a7c80, args=0x7feb240d5220, nargsf=1, kwnames=0x0)
    at Objects/classobject.c:94
#11 0x000055e03a21cfc4 in _PyVectorcall_Call (tstate=0x55e03a8de070 <_PyRuntime+315216>, func=0x55e03a221c5a <method_vectorcall>, 
    callable=0x7feb237a7c80, tuple=0x7feb240d5200, kwargs=0x7feb23810ac0) at Objects/call.c:273
#12 0x000055e03a21d36b in _PyObject_Call (tstate=0x55e03a8de070 <_PyRuntime+315216>, callable=0x7feb237a7c80, args=0x7feb240d5200, 
    kwargs=0x7feb23810ac0) at Objects/call.c:348
#13 0x000055e03a21d446 in PyObject_Call (callable=0x7feb237a7c80, args=0x7feb240d5200, kwargs=0x7feb23810ac0) at Objects/call.c:373
#14 0x000055e03a42a262 in _PyEval_EvalFrameDefault (tstate=0x55e03a8de070 <_PyRuntime+315216>, frame=0x7feb250e9088, throwflag=0)
    at Python/generated_cases.c.h:2654
#15 0x000055e03a41a1f2 in _PyEval_EvalFrame (tstate=0x55e03a8de070 <_PyRuntime+315216>, frame=0x7feb250e9020, throwflag=0)
    at ./Include/internal/pycore_ceval.h:119
#16 0x000055e03a45c7d0 in _PyEval_Vector (tstate=0x55e03a8de070 <_PyRuntime+315216>, func=0x7feb240e73d0, locals=0x7feb240f0300, 
    args=0x0, argcount=0, kwnames=0x0) at Python/ceval.c:1975
#17 0x000055e03a41d431 in PyEval_EvalCode (co=0x7feb24112780, globals=0x7feb240f0300, locals=0x7feb240f0300) at Python/ceval.c:866
#18 0x000055e03a51b066 in run_eval_code_obj (tstate=0x55e03a8de070 <_PyRuntime+315216>, co=0x7feb24112780, globals=0x7feb240f0300, 
    locals=0x7feb240f0300) at Python/pythonrun.c:1365
#19 0x000055e03a51b5dc in run_mod (mod=0x55e05c7c3728, filename=0x7feb240f0370, globals=0x7feb240f0300, locals=0x7feb240f0300, 
    flags=0x7ffeef0ed160, arena=0x7feb24d07cb0, interactive_src=0x7feb241259d0, generate_new_source=0) at Python/pythonrun.c:1436
#20 0x000055e03a51ac55 in _PyRun_StringFlagsWithName (
    str=0x7feb24125a90 "from sqlglot import parse_one; parse_one(\"SELECT * FROM taxi ORDER BY 1 OFFSET 0 ROWS FETCH NEXT 3 ROWS ONLY\").sql()\n", name=0x7feb240f0370, start=257, globals=0x7feb240f0300, locals=0x7feb240f0300, flags=0x7ffeef0ed160, 
    generate_new_source=0) at Python/pythonrun.c:1259
#21 0x000055e03a518cf4 in _PyRun_SimpleStringFlagsWithName (
    command=0x7feb24125a90 "from sqlglot import parse_one; parse_one(\"SELECT * FROM taxi ORDER BY 1 OFFSET 0 ROWS FETCH NEXT 3 ROWS ONLY\").sql()\n", name=0x55e03a6ce96e "<string>", flags=0x7ffeef0ed160) at Python/pythonrun.c:578
#22 0x000055e03a55dec7 in pymain_run_command (
    command=0x55e05c6afa60 L"from sqlglot import parse_one; parse_one(\"SELECT * FROM taxi ORDER BY 1 OFFSET 0 ROWS FETCH NEXT 3 ROWS ONLY\").sql()\n") at Modules/main.c:261
#23 0x000055e03a55f2ad in pymain_run_python (exitcode=0x7ffeef0ed254) at Modules/main.c:682
#24 0x000055e03a55f4ae in Py_RunMain () at Modules/main.c:772
#25 0x000055e03a55f569 in pymain_main (args=0x7ffeef0ed2d0) at Modules/main.c:802
#26 0x000055e03a55f631 in Py_BytesMain (argc=3, argv=0x7ffeef0ed438) at Modules/main.c:826
#27 0x000055e03a1849bd in main (argc=3, argv=0x7ffeef0ed438) at ./Programs/python.c:15
(gdb) up 4
#4  0x000055e03a36ecf3 in _PyUnicode_JoinArray (separator=0x55e03a899bc8 <_PyRuntime+35496>, items=0x7ffeef0e9278, seqlen=4)
    at Objects/unicodeobject.c:10387
10387	        assert(res_data == PyUnicode_1BYTE_DATA(res)
(gdb) p res_data
$1 = (unsigned char *) 0x7feb2381b97c "\340U"
(gdb) p * (PyASCIIObject*) res
$5 = {ob_base = {{ob_refcnt_full = 1, {ob_refcnt = 1, ob_overflow = 0, ob_flags = 0}}, ob_type = 0x55e03a872040 <PyUnicode_Type>}, 
  length = 23, hash = -1, state = {interned = 0, kind = 1, compact = 1, ascii = 1, statically_allocated = 0}}
(gdb) p * (PyUnicodeObject*) res
$6 = {_base = {_base = {ob_base = {{ob_refcnt_full = 1, {ob_refcnt = 1, ob_overflow = 0, ob_flags = 0}}, 
        ob_type = 0x55e03a872040 <PyUnicode_Type>}, length = 23, hash = -1, state = {interned = 0, kind = 1, compact = 1, ascii = 1, 
        statically_allocated = 0}}, utf8_length = 5629578988226954784, 
    utf8 = 0x4546203320545845 <error: Cannot access memory at address 0x4546203320545845>}, data = {any = 0x5458454e20484354, 
    latin1 = 0x5458454e20484354 <error: Cannot access memory at address 0x5458454e20484354>, ucs2 = 0x5458454e20484354, 
    ucs4 = 0x5458454e20484354}}
(gdb) p * (PyCompactUnicodeObject*) res
$7 = {_base = {ob_base = {{ob_refcnt_full = 1, {ob_refcnt = 1, ob_overflow = 0, ob_flags = 0}}, 
      ob_type = 0x55e03a872040 <PyUnicode_Type>}, length = 23, hash = -1, state = {interned = 0, kind = 1, compact = 1, ascii = 1, 
      statically_allocated = 0}}, utf8_length = 5629578988226954784, 
  utf8 = 0x4546203320545845 <error: Cannot access memory at address 0x4546203320545845>}

(that's just my guesswork of what to print)

CPython versions tested on:

3.14, CPython main branch

Operating systems tested on:

Linux

Output from running 'python -VV' on the command line:

Python 3.15.0a0 (heads/main:51910dc5620, May 29 2025, 16:12:29) [GCC 14.3.0]

@mgorny mgorny added the type-crash A hard crash of the interpreter, possibly with a core dump label May 29, 2025
@Zheaoli
Contributor

Zheaoli commented May 29, 2025

Confirmed. This bug was introduced in 053c285.

I need a little time to dig deeper. BTW, I may need some help here. cc @mpage

I think we need a smaller script to reproduce this bug.

@mgorny
Contributor Author

mgorny commented May 29, 2025

I can try to reduce it further, but I really need to work on $dayjob right now, so probably no earlier than the weekend.

@chilaxan
Contributor

After quite a lot of tracing through the sqlglot code base, I believe this is a minimal reproducer of the root cause of this issue:

def broken():
    variable = f"{1}"
    variable = f"{variable}"
    return variable

ASAN output:
Python 3.15.0a0 (heads/main-dirty:d96343679fd, May 30 2025, 01:50:46) [Clang 15.0.0 (clang-1500.1.0.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def broken():
...     variable = f"{1}"
...     variable = f"{variable}"
...     return variable
...     
>>> broken()
=================================================================
==3741==ERROR: AddressSanitizer: heap-use-after-free on address 0x00010a8489d0 at pc 0x000102b588dc bp 0x00016d6e8910 sp 0x00016d6e8908
READ of size 4 at 0x00010a8489d0 thread T0
    #0 0x102b588d8 in _PyEval_EvalFrameDefault generated_cases.c.h:10576
    #1 0x102b2d24c in PyEval_EvalCode ceval.c:866
    #2 0x102b23968 in builtin_exec bltinmodule.c.h:568
    #3 0x102b45ffc in _PyEval_EvalFrameDefault generated_cases.c.h:2383
    #4 0x102b2d90c in _PyEval_Vector ceval.c:1975
    #5 0x1027f27ec in _PyVectorcall_Call call.c:285
    #6 0x102ce0924 in pymain_start_pyrepl main.c:310
    #7 0x102cde040 in Py_RunMain main.c:772
    #8 0x102cdf010 in pymain_main main.c:802
    #9 0x102cdf53c in Py_BytesMain main.c:826
    #10 0x1881820dc  (<unknown module>)

0x00010a8489d0 is located 0 bytes inside of 42-byte region [0x00010a8489d0,0x00010a8489fa)
freed by thread T0 here:
    #0 0x103e3f380 in wrap_free+0x98 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x53380)
    #1 0x1029ee280 in unicode_dealloc unicodeobject.c:1801
    #2 0x1028ee0c0 in _Py_Dealloc object.c:3194
    #3 0x102b3b098 in _PyEval_EvalFrameDefault generated_cases.c.h:11209
    #4 0x102b2d24c in PyEval_EvalCode ceval.c:866
    #5 0x102b23968 in builtin_exec bltinmodule.c.h:568
    #6 0x102b45ffc in _PyEval_EvalFrameDefault generated_cases.c.h:2383
    #7 0x102b2d90c in _PyEval_Vector ceval.c:1975
    #8 0x1027f27ec in _PyVectorcall_Call call.c:285
    #9 0x102ce0924 in pymain_start_pyrepl main.c:310
    #10 0x102cde040 in Py_RunMain main.c:772
    #11 0x102cdf010 in pymain_main main.c:802
    #12 0x102cdf53c in Py_BytesMain main.c:826
    #13 0x1881820dc  (<unknown module>)

previously allocated by thread T0 here:
    #0 0x103e3f244 in wrap_malloc+0x94 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x53244)
    #1 0x10299423c in PyUnicode_New unicodeobject.c:1417
    #2 0x10287c178 in long_to_decimal_string_internal longobject.c:2157
    #3 0x102885dfc in long_to_decimal_string longobject.c:2247
    #4 0x1028dfe54 in PyObject_Str object.c:822
    #5 0x102b2fe0c in _PyEval_EvalFrameDefault generated_cases.c.h:5664
    #6 0x102b2d24c in PyEval_EvalCode ceval.c:866
    #7 0x102b23968 in builtin_exec bltinmodule.c.h:568
    #8 0x102b45ffc in _PyEval_EvalFrameDefault generated_cases.c.h:2383
    #9 0x102b2d90c in _PyEval_Vector ceval.c:1975
    #10 0x1027f27ec in _PyVectorcall_Call call.c:285
    #11 0x102ce0924 in pymain_start_pyrepl main.c:310
    #12 0x102cde040 in Py_RunMain main.c:772
    #13 0x102cdf010 in pymain_main main.c:802
    #14 0x102cdf53c in Py_BytesMain main.c:826
    #15 0x1881820dc  (<unknown module>)

SUMMARY: AddressSanitizer: heap-use-after-free generated_cases.c.h:10576 in _PyEval_EvalFrameDefault
Shadow bytes around the buggy address:
  0x00010a848700: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x00010a848780: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x00010a848800: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x00010a848880: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x00010a848900: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x00010a848980: fa fa fa fa fa fa fa fa fa fa[fd]fd fd fd fd fd
  0x00010a848a00: fa fa 00 00 00 00 00 03 fa fa fd fd fd fd fd fa
  0x00010a848a80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x00010a848b00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x00010a848b80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x00010a848c00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==3741==ABORTING
[1]    3741 abort      ./python.exe

As far as I can tell, this reproducer does not trigger the assertion itself, because the failed assert was only a side effect of the underlying use-after-free.
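To see which instructions the compiler emits for the reproducer above without running it (calling it may crash an affected build), one can disassemble the function. This is a minimal sketch: the helper name `opnames` is made up for illustration, and the exact opcode names depend on the interpreter version, so no particular listing is claimed here.

```python
import dis


def broken():
    variable = f"{1}"
    variable = f"{variable}"
    return variable


def opnames(func):
    """Return the opcode names of func in order, without executing it."""
    return [ins.opname for ins in dis.get_instructions(func)]


names = opnames(broken)
print(names)

# Opcodes discussed in this issue; they only exist on newer interpreters,
# so this list may legitimately be empty on older versions.
suspect = [n for n in names if n in ("LOAD_FAST_BORROW", "FORMAT_SIMPLE")]
print("opcodes of interest:", suspect)
```

If a borrowing load of `variable` feeds directly into the formatting instruction and the result is stored back into `variable`, that matches the pattern described in this thread.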

@ZeroIntensity
Member

I suspect something is being borrowed where it shouldn't be.

@ZeroIntensity ZeroIntensity added interpreter-core (Objects, Python, Grammar, and Parser dirs) 3.14 bugs and security fixes 3.15 new features, bugs and security fixes labels May 30, 2025
@Zheaoli
Contributor

Zheaoli commented May 30, 2025

I suspect something is being borrowed where it shouldn't be.

Same idea, I'm trying to trace it.

@chilaxan
Contributor

chilaxan commented May 30, 2025

Looks like this block of code is the root cause:

cpython/Python/bytecodes.c

Lines 4854 to 4868 in 053c285

inst(FORMAT_SIMPLE, (value -- res)) {
    PyObject *value_o = PyStackRef_AsPyObjectBorrow(value);
    /* If value is a unicode object, then we know the result
     * of format(value) is value itself. */
    if (!PyUnicode_CheckExact(value_o)) {
        PyObject *res_o = PyObject_Format(value_o, NULL);
        PyStackRef_CLOSE(value);
        ERROR_IF(res_o == NULL, error);
        res = PyStackRef_FromPyObjectSteal(res_o);
    }
    else {
        res = value;
        DEAD(value);
    }
}

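When the value argument is a borrowed reference to a local and is an exact unicode (str) object, the else branch passes the borrowed reference through as the result (res = value), which results in an invalid decref.

The reason sqlglot triggers this use-after-free is this block of code: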

  def fetch_sql(self, expression: exp.Fetch) -> str:
      direction = expression.args.get("direction")
      direction = f" {direction}" if direction else ""
      count = self.sql(expression, "count")
      count = f" {count}" if count else ""
      limit_options = self.sql(expression, "limit_options")
      limit_options = f"{limit_options}" if limit_options else " ROWS ONLY"
      return f"{self.seg('FETCH')}{direction}{count}{limit_options}"

limit_options = self.sql(expression, "limit_options") creates a new string instance with a reference count of one.
That reference is then borrowed on the following line and passed into FORMAT_SIMPLE, which causes the reference count of the string held by limit_options to drop.
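The sequence described above can be sketched with a toy reference-counting model. This is only an illustration under simplifying assumptions: `Obj` and `run` are hypothetical names, and the three steps are a rough stand-in for a load of the local, FORMAT_SIMPLE on an exact str, and a store back into the same local. It is not how CPython represents stack references.

```python
class Obj:
    """Toy heap object with a manual reference count."""

    def __init__(self, value):
        self.value = value
        self.refcnt = 1       # the reference held by whoever created it
        self.freed = False

    def decref(self):
        self.refcnt -= 1
        if self.refcnt == 0:
            self.freed = True  # stands in for deallocation


def run(borrow):
    """Model `variable = f"{variable}"` where variable holds a str.

    borrow=True models a borrowing load (no incref);
    borrow=False models an owning load (incref).
    """
    local = Obj("1")           # the local owns the only reference

    # Load the local onto the stack.
    top = local
    if not borrow:
        top.refcnt += 1

    # FORMAT_SIMPLE on an exact str: the result is the operand itself.
    res = top

    # Store into the same local: the local takes over the stack
    # reference, then the old value of the local is released.
    old = local
    local = res
    old.decref()
    return local


safe = run(borrow=False)
unsafe = run(borrow=True)
print(safe.refcnt, safe.freed)      # 1 False
print(unsafe.refcnt, unsafe.freed)  # 0 True
```

In the borrowing case the local ends up referring to an object that the model considers freed, which corresponds to the heap-use-after-free reported by ASAN.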

@mpage
Contributor

mpage commented May 30, 2025

The issue is that FORMAT_SIMPLE leaves its operand on the stack when it's a unicode value, but the analysis pass for LOAD_FAST_BORROW assumes that it always consumes its operand. The fix is pretty simple: we need to treat FORMAT_SIMPLE conservatively in the analysis and assume that it always leaves the operand on the stack. I'll put up a PR to fix it later today and also check for other cases where operands may be left conditionally on the stack.
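The difference between the two assumptions can be illustrated with a toy analysis over a short instruction list. This is a sketch loosely modeled on the description above, not CPython's actual optimization pass: the function `borrowable_loads`, its flag `format_simple_consumes`, and the safety rules it applies are all assumptions made for illustration.

```python
def borrowable_loads(instructions, format_simple_consumes):
    """Return indices of LOAD_FAST instructions a toy analysis deems safe to borrow.

    Each stack entry is the index of the LOAD_FAST that produced it,
    or None for a freshly created (owned) value.
    """
    stack = []
    loads = {}      # load index -> name of the local it read
    unsafe = set()
    for i, (op, arg) in enumerate(instructions):
        if op == "LOAD_FAST":
            loads[i] = arg
            stack.append(i)
        elif op == "LOAD_CONST":
            stack.append(None)
        elif op == "FORMAT_SIMPLE":
            if format_simple_consumes:
                # Optimistic assumption: operand consumed, fresh result pushed.
                stack.pop()
                stack.append(None)
            # Conservative assumption: the operand stays on the stack,
            # so the entry keeps pointing at the load that produced it.
        elif op == "STORE_FAST":
            ref = stack.pop()
            if ref is not None:
                # Toy rule: a value loaded by borrowing must not become a local.
                unsafe.add(ref)
            # Toy rule: entries still on the stack that came from the
            # overwritten local lose the reference that kept them alive.
            unsafe.update(r for r in stack if r is not None and loads[r] == arg)
        else:
            raise ValueError(f"unsupported toy opcode: {op}")
    # Toy rule: anything left on the stack at the end is not safe to borrow.
    unsafe.update(r for r in stack if r is not None)
    return sorted(set(loads) - unsafe)


# Rough shape of: variable = f"{1}"; variable = f"{variable}"
program = [
    ("LOAD_CONST", 1),
    ("FORMAT_SIMPLE", None),
    ("STORE_FAST", "variable"),
    ("LOAD_FAST", "variable"),
    ("FORMAT_SIMPLE", None),
    ("STORE_FAST", "variable"),
]

buggy = borrowable_loads(program, format_simple_consumes=True)
fixed = borrowable_loads(program, format_simple_consumes=False)
print(buggy)  # [3]
print(fixed)  # []
```

Under the optimistic assumption the load at index 3 is judged safe to borrow; under the conservative one it is not, so it would keep an owning load.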

mpage added a commit to mpage/cpython that referenced this issue Jun 4, 2025
mpage added a commit that referenced this issue Jun 4, 2025
…134958)

We were incorrectly handling a few opcodes that leave their operands on the stack. Treat all of these conservatively; assume that they always leave operands on the stack.
mpage added a commit to mpage/cpython that referenced this issue Jun 5, 2025
…FAST` (python#134958)

We were incorrectly handling a few opcodes that leave their operands on the stack. Treat all of these conservatively; assume that they always leave operands on the stack.

(cherry picked from commit 6b77af2)
mpage added a commit that referenced this issue Jun 5, 2025
…_FAST` (#134958) (#135187)

We were incorrectly handling a few opcodes that leave their operands on the stack. Treat all of these conservatively; assume that they always leave operands on the stack.

(cherry picked from commit 6b77af2)