free-threading story #1339

stonebig · 2024-05-12T18:12:37Z

external progress follow-up:

progress in cpython: 3.13 'free-threading' progress
progress out of cpython: https://github.com/Quansight-Labs/free-threaded-compatibility/issues

WinPython internal progress follow-up

WinPython-3.13.0b1:
- moving to wppm command line makes it compatible
- for the build, we have to patch tempfile.py and copy python-3.13t.exe as python.exe
- pip-23.2 works for source packages
- problems:
  - no IDLE/TKINTER: free-threading implementation unfinished
  - slow:
    - only 40% more performant in total with 4 threads
    - python-3.13.0b1 free-threading doesn't activate deferred reference counting yet
  - binary wheels toolchain, and pip-24.1 not yet there
WinPython-3.13.0b1b:
- add ptpython to have in "WinPython Interpreter.exe" the equivalent of cpython-3.13 REPL on linux, and compensate IDLE loss: with jedi and latest parso-0.8.4
- add interpreters_pep_734 and examples (the channel example is missing)
- add pyperf-2.7.0 that may be useful to run examples
- tweak examples from @FeldrinH remarks

stonebig · 2024-05-18T14:30:55Z

is looking ok:

ptpython, added to compensate for IDLE not working and the new REPL not being available on Windows
flit
jedi
sympy

missing for free-threading experience:

cython-3.1 for compatibility
pip-24.1

stonebig · 2024-05-19T09:04:22Z

WinPython-3.13.0b1b results with sudoku_thread_perf_comparison_typed.py:
3.13 free-threading:

there is 8 logical processors, 4.0 physical processors
Solved 40 of 40 hard2 puzzles (avg 0.03 secs (30 Hz), max 0.03 secs).
solved 40 tests with 1 threads in 1.30 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.04 secs (26 Hz), max 0.04 secs).
solved 40 tests with 2 threads in 0.76 seconds, 1.71 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.05 secs (20 Hz), max 0.06 secs).
solved 40 tests with 4 threads in 0.52 seconds, 2.51 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.09 secs (11 Hz), max 0.09 secs).
solved 40 tests with 8 threads in 0.45 seconds, 2.87 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.16 secs (6 Hz), max 0.24 secs).
solved 40 tests with 16 threads in 0.46 seconds, 2.82 speed-up

3.13 no free-threading

there is 8 logical processors, 4.0 physical processors
Solved 40 of 40 hard2 puzzles (avg 0.02 secs (49 Hz), max 0.02 secs).
solved 40 tests with 1 threads in 0.82 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.04 secs (25 Hz), max 0.08 secs).
solved 40 tests with 2 threads in 0.82 seconds, 0.99 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.07 secs (15 Hz), max 0.17 secs).
solved 40 tests with 4 threads in 0.84 seconds, 0.98 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.07 secs (13 Hz), max 0.22 secs).
solved 40 tests with 8 threads in 0.84 seconds, 0.98 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.05 secs (19 Hz), max 0.35 secs).
solved 40 tests with 16 threads in 0.84 seconds, 0.97 speed-up
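
A note on reading these logs: the "speed-up" column is relative to the 1-thread run of the same build, so it does not show the single-thread slowdown of the free-threaded build. A minimal sketch of the net comparison, using the timings copied from the two runs above (the `net_speedup` helper is a name made up for this illustration, not part of the benchmark script):

```python
# Wall-clock seconds copied from the two 3.13.0b1 runs above (40 puzzles each).
free_threaded = {1: 1.30, 2: 0.76, 4: 0.52, 8: 0.45, 16: 0.46}
standard = {1: 0.82, 2: 0.82, 4: 0.84, 8: 0.84, 16: 0.84}


def net_speedup(threads: int) -> float:
    """Hypothetical helper: free-threaded run vs the standard 1-thread time."""
    return standard[1] / free_threaded[threads]


# How much slower the free-threaded build is when only one thread is used.
single_thread_penalty = free_threaded[1] / standard[1]
print(f"single-thread penalty: {single_thread_penalty:.2f}x slower")

for n in sorted(free_threaded):
    scaling = free_threaded[1] / free_threaded[n]  # what the log reports
    print(f"{n:2d} threads: scaling {scaling:.2f}x, net {net_speedup(n):.2f}x")
```

Against the standard build, the 4-thread free-threaded run is therefore closer to a 1.6x gain than to the 2.51x scaling figure printed in the log.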

FeldrinH · 2024-05-19T11:42:24Z

I would like to try this benchmark myself for comparison. Where can I find sudoku_thread_perf_comparison_typed.py?

stonebig · 2024-05-19T12:01:30Z

> I would like to try this benchmark myself for comparison. Where can I find sudoku_thread_perf_comparison_typed.py?

https://github.com/winpython/winpython_afterdoc/tree/master/docs/free-threading_test

or a non single-file variant from "the master"

https://github.com/colesbury/sudopy-python3

adding thread1-4-20 basic test

FeldrinH · 2024-05-19T12:24:25Z

I used https://github.com/winpython/winpython_afterdoc/blob/master/docs/free-threading_test/sudoku_thread_perf_comparison_typed.py with three small modifications:

I replaced time.time() with time.perf_counter(), because time.time() is AFAIK fairly inaccurate on Windows and this script measures comparatively short durations.
I added more intermediate thread counts to thread_list, because I discovered that for my system peak speedup is somewhere between 10 and 14 threads.
I increased nbsudoku to 100 to make the measured durations a little longer.
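
For readers who want to apply the first modification elsewhere, here is a minimal sketch of timing a short call with time.perf_counter(). The `timed` helper is a hypothetical name used only for this illustration; it is not taken from the benchmark script.

```python
import time


def timed(func, *args):
    """Run func(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()  # monotonic clock meant for measuring intervals
    result = func(*args)
    return result, time.perf_counter() - start


result, elapsed = timed(sum, range(100_000))
print(f"sum={result} computed in {elapsed:.6f} s")

# Compare the advertised resolution of the two clocks on the current platform.
print(time.get_clock_info("perf_counter"))
print(time.get_clock_info("time"))
```

The two get_clock_info() lines print what the running platform reports for each clock, which is a quick way to check the resolution concern on a given machine.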

Results on an Intel i5-13600KF (20 logical processors, 14 cores):

3.13.0b1 (free-threading):

there is 20 logical processors, 10.0 physical processors
Solved 100 of 100 hard2 puzzles (avg 0.02 secs (49 Hz), max 0.02 secs).
solved 100 tests with 1 threads in 2.02 seconds, 1.00 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.02 secs (42 Hz), max 0.03 secs).
solved 100 tests with 2 threads in 1.19 seconds, 1.69 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.03 secs (37 Hz), max 0.04 secs).
solved 100 tests with 4 threads in 0.68 seconds, 2.95 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.03 secs (34 Hz), max 0.04 secs).
solved 100 tests with 6 threads in 0.50 seconds, 4.02 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.04 secs (28 Hz), max 0.04 secs).
solved 100 tests with 8 threads in 0.45 seconds, 4.45 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.04 secs (24 Hz), max 0.05 secs).
solved 100 tests with 10 threads in 0.42 seconds, 4.79 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.05 secs (21 Hz), max 0.05 secs).
solved 100 tests with 12 threads in 0.41 seconds, 4.90 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.06 secs (17 Hz), max 0.07 secs).
solved 100 tests with 14 threads in 0.43 seconds, 4.69 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.07 secs (14 Hz), max 0.08 secs).
solved 100 tests with 16 threads in 0.47 seconds, 4.33 speed-up

3.13.0b1 (no free-threading):

there is 20 logical processors, 10.0 physical processors
Solved 100 of 100 hard2 puzzles (avg 0.01 secs (85 Hz), max 0.01 secs).
solved 100 tests with 1 threads in 1.18 seconds, 1.00 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.02 secs (43 Hz), max 0.04 secs).
solved 100 tests with 2 threads in 1.17 seconds, 1.01 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.04 secs (23 Hz), max 0.27 secs).
solved 100 tests with 4 threads in 1.18 seconds, 1.00 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.06 secs (17 Hz), max 0.38 secs).
solved 100 tests with 6 threads in 1.18 seconds, 1.00 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.06 secs (17 Hz), max 0.52 secs).
solved 100 tests with 8 threads in 1.19 seconds, 0.99 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.06 secs (17 Hz), max 0.58 secs).
solved 100 tests with 10 threads in 1.18 seconds, 1.00 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.04 secs (22 Hz), max 0.35 secs).
solved 100 tests with 12 threads in 1.18 seconds, 1.00 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.03 secs (29 Hz), max 0.25 secs).
solved 100 tests with 14 threads in 1.18 seconds, 1.00 speed-up

FeldrinH · 2024-05-19T12:41:42Z

Another thing to note: I increased nbsudoku to a large value and observed that, even when adding more threads, I could not get the CPU usage to go above 80% (according to Task Manager).

stonebig · 2024-05-19T13:13:50Z

Deferred reference counting is maybe not yet activated in beta 1; it looks more like beta 3 for the 10x speed-up experience

see top link on '3.13 Free-threading progress':

stonebig · 2024-05-19T13:50:12Z

added time.perf_counter() and this:

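import sys
try:
    # sys._is_gil_enabled() only exists from Python 3.13 onwards
    _is_gil_enabled = sys._is_gil_enabled()
except AttributeError:
    # older Python: the GIL is always enabled
    _is_gil_enabled = True
print(f"Gil Enabled = {_is_gil_enabled}")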
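
A possible complement, as a sketch only: a free-threaded build can still run with the GIL re-enabled at runtime (for instance through the PYTHON_GIL environment variable), so it can help to report both how the interpreter was built and what it is doing now. The `gil_status` function name is an assumption made for this example.

```python
import sys
import sysconfig


def gil_status() -> tuple[bool, bool]:
    """Return (built_free_threaded, gil_enabled_now) for the running interpreter."""
    # Py_GIL_DISABLED is set to 1 in free-threaded builds, 0 or None otherwise.
    built_free_threaded = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() is missing before 3.13, where the GIL is always on.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled_now = check() if check is not None else True
    return built_free_threaded, gil_enabled_now


built, enabled = gil_status()
print(f"free-threaded build = {built}, Gil Enabled = {enabled}")
```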

stonebig · 2024-05-19T14:10:02Z

most awaited talk on free-threading and subinterpreters from @tonybaloney our advanced multi-threading experimenter is late today... but hey, a repository just appeared : https://github.com/tonybaloney/subinterpreter-web

stonebig · 2024-05-20T13:17:19Z

tried pypy and cythonise.... but both failed.... created tickets

stonebig · 2024-06-07T18:48:45Z

so for b2, we shall have won 2%*4.4 speed-up, 10%...

stonebig · 2024-06-08T07:58:34Z

python-3.13.0b2 free-threading: no really significant change from b1 (Windows testing is still noisy)

there is 8 logical processors, 4.0 physical processors, Gil Enabled = False
Solved 40 of 40 hard2 puzzles (avg 0.03 secs (30 Hz), max 0.04 secs).
solved 40 tests with 1 threads in 1.33 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.04 secs (25 Hz), max 0.04 secs).
solved 40 tests with 2 threads in 0.79 seconds, 1.69 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.05 secs (20 Hz), max 0.05 secs).
solved 40 tests with 4 threads in 0.49 seconds, 2.69 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.09 secs (11 Hz), max 0.09 secs).
solved 40 tests with 8 threads in 0.43 seconds, 3.07 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.15 secs (6 Hz), max 0.21 secs).
solved 40 tests with 16 threads in 0.42 seconds, 3.13 speed-up

no free-threading 3.13.0b2:

there is 8 logical processors, 4.0 physical processors, Gil Enabled = True
Solved 40 of 40 hard2 puzzles (avg 0.02 secs (49 Hz), max 0.02 secs).
solved 40 tests with 1 threads in 0.82 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.04 secs (25 Hz), max 0.08 secs).
solved 40 tests with 2 threads in 0.81 seconds, 1.01 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.07 secs (13 Hz), max 0.21 secs).
solved 40 tests with 4 threads in 0.82 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.09 secs (10 Hz), max 0.50 secs).
solved 40 tests with 8 threads in 0.82 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.06 secs (15 Hz), max 0.35 secs).
solved 40 tests with 16 threads in 0.83 seconds, 0.99 speed-up

stonebig · 2024-06-29T12:22:27Z

As the faster-cpython GitHub site would suggest, python-3.13.0b3 doesn't improve free-threading scaling.

there is 8 logical processors, 4.0 physical processors, Gil Enabled = False
Solved 40 of 40 hard2 puzzles (avg 0.04 secs (28 Hz), max 0.04 secs).
solved 40 tests with 1 threads in 1.41 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.04 secs (24 Hz), max 0.05 secs).
solved 40 tests with 2 threads in 0.83 seconds, 1.70 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.05 secs (18 Hz), max 0.08 secs).
solved 40 tests with 4 threads in 0.57 seconds, 2.48 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.09 secs (11 Hz), max 0.10 secs).
solved 40 tests with 8 threads in 0.45 seconds, 3.14 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.16 secs (6 Hz), max 0.22 secs).
solved 40 tests with 16 threads in 0.46 seconds, 3.08 speed-up

stonebig · 2024-09-09T18:29:54Z

JIT and maybe free-threading shall become performant only in 3.14 cycle

stonebig · 2024-12-24T17:43:05Z

2024-12-24:

free-threading was pre-alpha in 3.13, for Windows
it's not even certain it will reach beta on Windows for 3.14
so dropping the idea until free-threading gets closer to ready for Windows, probably in the 3.15 cycle

stonebig · 2025-03-17T14:34:19Z

2025-03-17: let's retry with 3.14, as there was hope of a 20% improvement
.... just 2x faster with a 4+4 cpu, so nope on alpha6, on windows

with normal python

python sudoku_thread_perf_comparison_typed.py
there is 8 logical processors, 4.0 physical processors, Gil Enabled = True
Solved 40 of 40 hard2 puzzles (avg 0.02 secs (44 Hz), max 0.02 secs).
solved 40 tests with 1 threads in 0.90 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.04 secs (22 Hz), max 0.07 secs).
solved 40 tests with 2 threads in 0.90 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.08 secs (12 Hz), max 0.24 secs).
solved 40 tests with 4 threads in 0.91 seconds, 0.99 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.07 secs (14 Hz), max 0.22 secs).
solved 40 tests with 8 threads in 0.95 seconds, 0.94 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.06 secs (17 Hz), max 0.15 secs).
solved 40 tests with 16 threads in 1.02 seconds, 0.88 speed-up

with free-threading

python3.14t sudoku_thread_perf_comparison_typed.py
there is 8 logical processors, 4.0 physical processors, Gil Enabled = False
Solved 40 of 40 hard2 puzzles (avg 0.03 secs (29 Hz), max 0.04 secs).
solved 40 tests with 1 threads in 1.35 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.04 secs (25 Hz), max 0.05 secs).
solved 40 tests with 2 threads in 0.81 seconds, 1.66 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.05 secs (20 Hz), max 0.07 secs).
solved 40 tests with 4 threads in 0.51 seconds, 2.63 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.08 secs (11 Hz), max 0.09 secs).
solved 40 tests with 8 threads in 0.43 seconds, 3.12 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.15 secs (6 Hz), max 0.27 secs).
solved 40 tests with 16 threads in 0.44 seconds, 3.04 speed-up

stonebig · 2025-04-10T19:13:21Z

with python-3.14.0a7, the mono-thread free-threading sees a notable improvement vs standard cpython:

only 15% slower now in mono-thread
x2.48 faster on that multi-thread test vs x2.09 for a6
... not yet the 3x dream, but a significant half-way jump

standard

python C:\WinP\bdDocs\docs.nogil\free-threading_test\sudoku_thread_perf_comparison_typed.py
there is 8 logical processors, 4.0 physical processors, Gil Enabled = True
Solved 40 of 40 hard2 puzzles (avg 0.02 secs (43 Hz), max 0.04 secs).
solved 40 tests with 1 threads in 0.92 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.04 secs (24 Hz), max 0.10 secs).
solved 40 tests with 2 threads in 0.90 seconds, 1.02 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.08 secs (13 Hz), max 0.24 secs).
solved 40 tests with 4 threads in 0.90 seconds, 1.02 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.08 secs (13 Hz), max 0.32 secs).
solved 40 tests with 8 threads in 0.94 seconds, 0.98 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.08 secs (12 Hz), max 0.29 secs).
solved 40 tests with 16 threads in 0.96 seconds, 0.96 speed-up

free-threading

python3.14t C:\WinP\bdDocs\docs.nogil\free-threading_test\sudoku_thread_perf_comparison_typed.py
there is 8 logical processors, 4.0 physical processors, Gil Enabled = False
Solved 40 of 40 hard2 puzzles (avg 0.03 secs (38 Hz), max 0.03 secs).
solved 40 tests with 1 threads in 1.05 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.03 secs (31 Hz), max 0.04 secs).
solved 40 tests with 2 threads in 0.63 seconds, 1.66 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.04 secs (23 Hz), max 0.06 secs).
solved 40 tests with 4 threads in 0.43 seconds, 2.42 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.07 secs (14 Hz), max 0.07 secs).
solved 40 tests with 8 threads in 0.34 seconds, 3.05 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.12 secs (8 Hz), max 0.20 secs).
solved 40 tests with 16 threads in 0.37 seconds, 2.88 speed-up
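
To compare two such logs without retyping numbers, the "solved ... tests with ... threads" lines can be parsed directly. This is a rough sketch: the regular expression is written against the log format shown in this thread, `parse_run` is a hypothetical helper, and the sample text is a subset of the a7 lines above.

```python
import re

# Matches lines such as:
# "solved 40 tests with 4 threads in 0.43 seconds, 2.42 speed-up"
LINE = re.compile(
    r"solved (\d+) tests with (\d+) threads in ([\d.]+) seconds, ([\d.]+) speed-up"
)


def parse_run(log: str) -> dict[int, float]:
    """Map thread count -> wall-clock seconds for one benchmark run."""
    return {int(m.group(2)): float(m.group(3)) for m in LINE.finditer(log)}


standard_log = """
solved 40 tests with 1 threads in 0.92 seconds, 1.00 speed-up
solved 40 tests with 16 threads in 0.96 seconds, 0.96 speed-up
"""
free_threaded_log = """
solved 40 tests with 1 threads in 1.05 seconds, 1.00 speed-up
solved 40 tests with 8 threads in 0.34 seconds, 3.05 speed-up
solved 40 tests with 16 threads in 0.37 seconds, 2.88 speed-up
"""

std = parse_run(standard_log)
ft = parse_run(free_threaded_log)
print(f"mono-thread overhead: {ft[1] / std[1] - 1:.0%}")
print(f"16 threads vs standard mono-thread: x{std[1] / ft[16]:.2f}")
```

The results land close to the "15% slower" and "x2.48" figures quoted above; that the x2.48 figure was derived from the 16-thread line is my reading of the numbers, not something stated in the post.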

stonebig · 2025-04-12T08:43:52Z

interesting on the mythical fib test:

free-threading is same speed as standard python in mono-thread
x2.94 faster on 8 threads (with 4 cpus with multi-threading)
x2.5 faster on 4 threads (with 4 cpus with multi-threading)

.... there may be a bit of "skill" needed to unlock the scaling part.

old intel laptop cpus may struggle to use SMT CPUs: small cache + small bandwidth + thermals.
the thermal limit shall remain true on modern laptop cpus, biasing the free-threading "benchmark" benefit: more threads ==> throttling

thread-fib.py

import time
from itertools import repeat
from multiprocessing.pool import ThreadPool

threads = 8  # number of fib(30) calls, also used as the pool size
reference_delta = 0

def fib(n):
    # returns n for n < 3, so this sequence runs 1, 2, 3, 5, 8, ...
    if n < 3:
        return n
    return fib(n-1) + fib(n-2)

# calculate fib(30) `threads` times
# sequentially
startall = time.perf_counter()
for i in repeat(30, threads):
    print(fib(i))
new_delta = time.perf_counter()-startall
if reference_delta == 0:
    reference_delta = new_delta
ratio = reference_delta/new_delta
print(f'solved {threads} tests with {1} threads in {new_delta:.2f} seconds, {ratio:.2f} speed-up' + '\n')


# calculate fib(30) `threads` times
# in parallel, with one worker thread per call
startall = time.perf_counter()
with ThreadPool(threads) as pool:
    for result in pool.imap_unordered(fib, repeat(30, threads)):
        print(result)
new_delta = time.perf_counter()-startall
ratio = reference_delta/new_delta
print(f'solved {threads} tests with {threads} threads in {new_delta:.2f} seconds, {ratio:.2f} speed-up' + '\n')
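
To see where scaling flattens out on a given machine, the same test can be swept over several pool sizes, in the spirit of the thread_list used by the sudoku script. A minimal sketch, assuming concurrent.futures instead of multiprocessing.pool and a much smaller `n` so it runs quickly; the `sweep` function and its parameters are names invented for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fib(n: int) -> int:
    # Same recurrence as thread-fib.py: returns n for n < 3.
    return n if n < 3 else fib(n - 1) + fib(n - 2)


def sweep(n: int, tasks: int, thread_list: list[int]) -> dict[int, float]:
    """Time `tasks` calls of fib(n) for each pool size; return seconds per size."""
    timings = {}
    for threads in thread_list:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=threads) as pool:
            results = list(pool.map(fib, [n] * tasks))
        timings[threads] = time.perf_counter() - start
        assert len(results) == tasks  # every task produced a result
    return timings


# Small n keeps the sketch fast; raise it (e.g. to 30) for a meaningful measurement.
timings = sweep(n=20, tasks=8, thread_list=[1, 2, 4, 8])
for threads, seconds in timings.items():
    print(f"{threads} threads: {seconds:.3f} s, {timings[1] / seconds:.2f} speed-up")
```

On a standard (GIL) build the speed-up column should stay near 1.00, as in the logs in this thread; with such a small workload the numbers are noisy either way.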

stonebig · 2025-05-09T12:26:33Z

no perf change with b1.... let's reconsider when jupyterlab becomes possible

stonebig · 2025-05-11T17:10:28Z

remember ? https://www.theregister.com/2021/05/19/faster_python_mark_shannon_author/

"the "Shannon plan" here as the basis for achieving a "2x speedup in 3.11" with a hope for 5x in four years"

Haaa, the optimism of youth ....
As we had about 1 year of cold start, that would still mean about:

50% speed-up for 3.15
50% speed-up for 3.16

25% seems the best hope in mono-thread, 3x total.
...yet, it will be true partially for:

our cloud sponsors: ruff / uv / ty
free-threaded applications
... modular ?

... and the goal of Python not becoming irrelevant has been reached, according to the TIOBE Index for May 2025: https://www.tiobe.com/tiobe-index/

"The only reason other languages still have a reason for existing is because of Python's low performance"

... hold my beard

stonebig · 2025-05-29T17:05:04Z

b2: no real change

stonebig added this to the 2024-03 Numpy-2 / Jupyterlab-4.2 milestone May 12, 2024

FeldrinH mentioned this issue May 19, 2024

nogil multi-threading is slower than multi-threading with gil for CPU bound python/cpython#118749

Closed

stonebig closed this as completed May 20, 2024

stonebig reopened this May 20, 2024

stonebig modified the milestones: 2024-03 Jupyterlab-4.2, 2024-04 Numpy-2 Jun 22, 2024

stonebig modified the milestones: 2024-04 Numpy-2 / .7z, 2024-05 Python-3.13.0 / Spyder-6 Sep 9, 2024

stonebig closed this as completed Dec 24, 2024

stonebig reopened this Apr 10, 2025

stonebig modified the milestones: 2024-05 Jupyterlab-4.3, 2025-03 PEP 751 / AI local May 4, 2025

stonebig added the Experiment Experiment label May 10, 2025
