3.13.0b1 Performance Issue with Free Threading build #120040

# Bug report

### Bug description:

Hello, I'm writing a thesis on free-threaded Python, so I'm testing 3.13.0b1 built with `--disable-gil`.
I installed it with pyenv using this command:
```bash
env PYTHON_CONFIGURE_OPTS='--disable-gil' pyenv install 3.13.0b1
```
I didn't specify `--enable-optimizations` and `--with-lto` because the build fails with them.
I'm writing a benchmark to compare free-threaded Python with earlier regular (GIL) Python versions and with nogil Python 3.9.10.
Here's the problem. The benchmark is a simple matrix-matrix multiplication script that splits the matrix into rows and distributes the rows across a specified number of threads. This is the complete code:
```python
import threading
import time
import random

def multiply_row(A, B, row_index, result):
    # Compute the row result
    num_columns_B = len(B[0])
    num_columns_A = len(A[0])
    for j in range(num_columns_B):
        sum = 0
        for k in range(num_columns_A):
            sum += A[row_index][k] * B[k][j]
        result[row_index][j] = sum

def parallel_matrix_multiplication(a, b, result, row_indices):
    for row_index in row_indices:
        multiply_row(a, b, row_index, result)

def multi_threaded_matrix_multiplication(a, b, num_threads):
    num_rows = len(a)
    result = [[0] * len(b[0]) for _ in range(num_rows)]
    row_chunk = num_rows // num_threads

    threads = []
    for i in range(num_threads):
        start_row = i * row_chunk
        end_row = (i + 1) * row_chunk if i != num_threads - 1 else num_rows
        thread = threading.Thread(target=parallel_matrix_multiplication, args=(a, b, result, range(start_row, end_row)))
        threads.append(thread)
        thread.start()

    for thread in threads:
        thread.join()

    return result

# Helper function to create a random matrix
def create_random_matrix(rows, cols):
    return [[random.random() for _ in range(cols)] for _ in range(rows)]

def main():
    size = 500  # Define matrix size
    a = create_random_matrix(size, size)
    b = create_random_matrix(size, size)
    num_threads = 8  # Define number of threads

    start = time.perf_counter()
    
    result = multi_threaded_matrix_multiplication(a, b, num_threads)
    print("Matrix multiplication completed.", time.perf_counter() - start, "seconds.")

if __name__ == "__main__":
    main()
```
When I run this code with Python 3.9.10, nogil-3.9.10, 3.10.13, 3.11.8, and 3.12.2, the longest running time is ~13 seconds (regular 3.9.10) and the shortest is ~5 seconds (nogil 3.9.10).
With 3.13.0b1, the time skyrockets to ~48 seconds.
I tried profiling the code with cProfile, but on 3.13 it freezes and never outputs anything (it works on the other versions). CPU usage sits at 100% and the process never stops unless I kill it. That makes me think it isn't using multiple cores, since nogil 3.9 goes above 600% usage.
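
The core-usage observation can be put into numbers without relying on an external CPU monitor. The following is a minimal sketch, not part of the original benchmark: `cpu_to_wall_ratio` and `busy_loop` are hypothetical helper names, and `busy_loop` is only a stand-in workload. It compares process CPU time with wall-clock time; a ratio near 1 suggests the threads ran mostly one at a time, while a ratio near the thread count suggests they ran in parallel.

```python
import threading
import time


def cpu_to_wall_ratio(target, num_threads):
    """Run `target` in `num_threads` threads; return (wall, cpu, cpu / wall)."""
    threads = [threading.Thread(target=target) for _ in range(num_threads)]
    wall_start = time.perf_counter()
    # process_time() is process-wide, so it sums CPU time over all threads.
    cpu_start = time.process_time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu, (cpu / wall if wall > 0 else 0.0)


def busy_loop(n=200_000):
    # Hypothetical stand-in workload; swap in the real per-thread work.
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    wall, cpu, ratio = cpu_to_wall_ratio(busy_loop, num_threads=4)
    print(f"wall={wall:.3f}s cpu={cpu:.3f}s cpu/wall={ratio:.2f}")
```

To apply this to the benchmark above, the same two timers could wrap the call to `multi_threaded_matrix_multiplication` in `main()`. Note that a high ratio only shows that several cores were busy, not that they were doing useful work: threads that spin while contending for shared data also accumulate CPU time.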

The basic Fibonacci test works like a charm, so I know the `--disable-gil` build succeeded.
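
A more direct check than a benchmark is to ask the interpreter itself. The sketch below (with a hypothetical helper name, `gil_status`) reads the `Py_GIL_DISABLED` build variable and calls `sys._is_gil_enabled()`, which exists from 3.13 onward. Interpreters without that function are assumed here to run with the GIL, which is not true for the nogil-3.9 fork, so the result is only meaningful on mainline CPython.

```python
import sys
import sysconfig


def gil_status():
    """Report whether this is a free-threaded build and whether the GIL is active."""
    # Py_GIL_DISABLED is 1 on builds configured with --disable-gil,
    # and 0 or None on regular builds.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() was added in 3.13. Assumption: if it is missing,
    # the interpreter has a GIL (this does not hold for the nogil-3.9 fork).
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check is not None else True
    return {
        "version": sys.version.split()[0],
        "free_threaded_build": free_threaded_build,
        "gil_enabled": gil_enabled,
    }


if __name__ == "__main__":
    print(gil_status())
```

The two values can differ: on a free-threaded build the GIL can still be turned back on at runtime, for example with the `PYTHON_GIL=1` environment variable, so checking both is more reliable than checking the build flag alone.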

All of this is done on a MacBook Air M1 with 16 GB of RAM and 8 CPU cores.

### CPython versions tested on:

3.9, 3.10, 3.11, 3.12, 3.13

### Operating systems tested on:

macOS
