free-threading story #1339

Open
stonebig opened this issue May 12, 2024 · 21 comments
Labels
Experiment

Comments

@stonebig
Contributor

stonebig commented May 12, 2024

external progress follow-up:

WinPython internal progress follow-up

  • WinPython-3.13.0b1
    • moving to the wppm command line makes it compatible
    • for the build, we have to patch tempfile.py and copy python-3.13t.exe as python.exe
    • pip-23.2 works for source packages
    • problems:
      • no IDLE/Tkinter: the free-threading implementation is unfinished
      • slow:
        • only 40% more performant in total with 4 threads
        • python-3.13.0b1 free-threading doesn't activate the weak reference counting yet
      • binary wheels toolchain and pip-24.1 are not there yet
  • WinPython-3.13.0b1b:
    • add ptpython to get, in "WinPython Interpreter.exe", the equivalent of the cpython-3.13 REPL on Linux, and to compensate for the loss of IDLE: with jedi and the latest parso-0.8.4
    • add interpreters_pep_734 and examples (the channel example is missing)
    • add pyperf-2.7.0, which may be useful to run examples
    • tweak examples following @FeldrinH's remarks
@stonebig
Contributor Author

stonebig commented May 18, 2024

is looking ok:

  • ptpython, added to compensate for IDLE not working and the new REPL not being available on Windows
  • flit
  • jedi
  • sympy

missing for free-threading experience:

  • cython-3.1 for compatibility
  • pip-24.1

@stonebig
Contributor Author

WinPython-3.13.0b1b results with sudoku_thread_perf_comparison_typed.py:
3.13 free-threading:

there is 8 logical processors, 4.0 physical processors
Solved 40 of 40 hard2 puzzles (avg 0.03 secs (30 Hz), max 0.03 secs).
solved 40 tests with 1 threads in 1.30 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.04 secs (26 Hz), max 0.04 secs).
solved 40 tests with 2 threads in 0.76 seconds, 1.71 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.05 secs (20 Hz), max 0.06 secs).
solved 40 tests with 4 threads in 0.52 seconds, 2.51 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.09 secs (11 Hz), max 0.09 secs).
solved 40 tests with 8 threads in 0.45 seconds, 2.87 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.16 secs (6 Hz), max 0.24 secs).
solved 40 tests with 16 threads in 0.46 seconds, 2.82 speed-up

3.13 no free-threading

there is 8 logical processors, 4.0 physical processors
Solved 40 of 40 hard2 puzzles (avg 0.02 secs (49 Hz), max 0.02 secs).
solved 40 tests with 1 threads in 0.82 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.04 secs (25 Hz), max 0.08 secs).
solved 40 tests with 2 threads in 0.82 seconds, 0.99 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.07 secs (15 Hz), max 0.17 secs).
solved 40 tests with 4 threads in 0.84 seconds, 0.98 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.07 secs (13 Hz), max 0.22 secs).
solved 40 tests with 8 threads in 0.84 seconds, 0.98 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.05 secs (19 Hz), max 0.35 secs).
solved 40 tests with 16 threads in 0.84 seconds, 0.97 speed-up

@FeldrinH

I would like to try this benchmark myself for comparison. Where can I find sudoku_thread_perf_comparison_typed.py?

@stonebig
Contributor Author

stonebig commented May 19, 2024

"I would like to try this benchmark myself for comparison. Where can I find sudoku_thread_perf_comparison_typed.py?"

https://github.com/winpython/winpython_afterdoc/tree/master/docs/free-threading_test

or a non single-file variant from "the master"

https://github.com/colesbury/sudopy-python3

adding thread1-4-20 basic test

@FeldrinH

FeldrinH commented May 19, 2024

I used https://github.com/winpython/winpython_afterdoc/blob/master/docs/free-threading_test/sudoku_thread_perf_comparison_typed.py with three small modifications:

  • I replaced time.time() with time.perf_counter(), because time.time() is AFAIK fairly inaccurate on Windows and this script measures comparatively short durations.
  • I added more intermediate thread counts to thread_list, because I discovered that for my system peak speedup is somewhere between 10 and 14 threads.
  • I increased nbsudoku to 100 to make the measured durations a little longer.
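
The three modifications above can be sketched in isolation as follows. This is a minimal sketch, not the actual sudoku script: `busy_work`, `run_with_threads`, `scaling_report`, and the `size` parameter are hypothetical stand-ins, and only `time.perf_counter()`, `thread_list`, and `nbsudoku` are names taken from the discussion.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def busy_work(n):
    # Hypothetical CPU-bound stand-in for solving one sudoku puzzle.
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_with_threads(nb_threads, nbsudoku=100, size=20_000):
    # Time nbsudoku tasks on a pool of nb_threads workers.
    # perf_counter() is a high-resolution clock meant for short durations.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=nb_threads) as pool:
        results = list(pool.map(busy_work, [size] * nbsudoku))
    return time.perf_counter() - start, results


def scaling_report(thread_list=(1, 2, 4, 6, 8, 10, 12, 14, 16), nbsudoku=100):
    # Speed-up is relative to the first entry of thread_list.
    report = {}
    reference = None
    for nb_threads in thread_list:
        elapsed, _ = run_with_threads(nb_threads, nbsudoku)
        if reference is None:
            reference = elapsed
        report[nb_threads] = reference / elapsed
    return report


if __name__ == "__main__":
    for nb_threads, speedup in scaling_report().items():
        print(f"{nb_threads} threads: {speedup:.2f} speed-up")
```

With a standard (GIL) build the reported speed-up should stay close to 1.00 for every thread count; any scaling only shows up on a free-threaded build.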

Results on an Intel i5-13600KF (20 logical processors, 14 cores):

3.13.0b1 (free-threading):

there is 20 logical processors, 10.0 physical processors
Solved 100 of 100 hard2 puzzles (avg 0.02 secs (49 Hz), max 0.02 secs).
solved 100 tests with 1 threads in 2.02 seconds, 1.00 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.02 secs (42 Hz), max 0.03 secs).
solved 100 tests with 2 threads in 1.19 seconds, 1.69 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.03 secs (37 Hz), max 0.04 secs).
solved 100 tests with 4 threads in 0.68 seconds, 2.95 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.03 secs (34 Hz), max 0.04 secs).
solved 100 tests with 6 threads in 0.50 seconds, 4.02 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.04 secs (28 Hz), max 0.04 secs).
solved 100 tests with 8 threads in 0.45 seconds, 4.45 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.04 secs (24 Hz), max 0.05 secs).
solved 100 tests with 10 threads in 0.42 seconds, 4.79 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.05 secs (21 Hz), max 0.05 secs).
solved 100 tests with 12 threads in 0.41 seconds, 4.90 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.06 secs (17 Hz), max 0.07 secs).
solved 100 tests with 14 threads in 0.43 seconds, 4.69 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.07 secs (14 Hz), max 0.08 secs).
solved 100 tests with 16 threads in 0.47 seconds, 4.33 speed-up

3.13.0b1 (no free-threading):

there is 20 logical processors, 10.0 physical processors
Solved 100 of 100 hard2 puzzles (avg 0.01 secs (85 Hz), max 0.01 secs).
solved 100 tests with 1 threads in 1.18 seconds, 1.00 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.02 secs (43 Hz), max 0.04 secs).
solved 100 tests with 2 threads in 1.17 seconds, 1.01 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.04 secs (23 Hz), max 0.27 secs).
solved 100 tests with 4 threads in 1.18 seconds, 1.00 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.06 secs (17 Hz), max 0.38 secs).
solved 100 tests with 6 threads in 1.18 seconds, 1.00 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.06 secs (17 Hz), max 0.52 secs).
solved 100 tests with 8 threads in 1.19 seconds, 0.99 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.06 secs (17 Hz), max 0.58 secs).
solved 100 tests with 10 threads in 1.18 seconds, 1.00 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.04 secs (22 Hz), max 0.35 secs).
solved 100 tests with 12 threads in 1.18 seconds, 1.00 speed-up

Solved 100 of 100 hard2 puzzles (avg 0.03 secs (29 Hz), max 0.25 secs).
solved 100 tests with 14 threads in 1.18 seconds, 1.00 speed-up

@FeldrinH

Another thing to note: I increased nbsudoku to a large value and observed that, even when adding more threads, I could not get the CPU usage to go above 80% (according to Task Manager).

@stonebig
Contributor Author

stonebig commented May 19, 2024

Deferred reference counting may not be activated yet in beta 1; the 10x speed-up experience smells more like beta 3

see top link on '3.13 Free-threading progress':

[image]

@stonebig
Contributor Author

added time.perf_counter() and this:

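import sys

# sys._is_gil_enabled() only exists on Python 3.13+; before that the GIL is always on
try:
    _is_gil_enabled = sys._is_gil_enabled()
except AttributeError:
    _is_gil_enabled = True
print(f"Gil Enabled = {_is_gil_enabled}")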
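
A related check, offered as a sketch rather than as what the benchmark script does: `sysconfig.get_config_var("Py_GIL_DISABLED")` reports whether the interpreter was built as free-threaded, which is distinct from whether the GIL is enabled at runtime (a free-threaded build can still run with the GIL re-enabled). `gil_status` is a hypothetical helper name.

```python
import sys
import sysconfig


def gil_status():
    # Build-time flag: 1 on free-threaded builds, 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # Runtime state: sys._is_gil_enabled() exists from Python 3.13 on;
    # on older versions the GIL is always enabled.
    is_gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()
    return free_threaded_build, is_gil_enabled


if __name__ == "__main__":
    build, enabled = gil_status()
    print(f"Free-threaded build = {build}, Gil Enabled = {enabled}")
```

Printing both values makes it easier to spot a free-threaded interpreter that is silently running with the GIL on.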

@stonebig
Contributor Author

stonebig commented May 19, 2024

the most awaited talk on free-threading and subinterpreters, from @tonybaloney, our advanced multi-threading experimenter, is late today... but hey, a repository just appeared: https://github.com/tonybaloney/subinterpreter-web

@stonebig
Contributor Author

tried PyPy and cythonizing... both failed, so I created tickets

@stonebig stonebig reopened this May 20, 2024
@stonebig
Contributor Author

stonebig commented Jun 7, 2024

so for b2, we should have gained 2% * 4.4 speed-up, about 10%...
[image]

@stonebig
Contributor Author

stonebig commented Jun 8, 2024

python-3.13.0b2 free-threading: no really significant change from b1 (Windows testing is still noisy)

there is 8 logical processors, 4.0 physical processors, Gil Enabled = False
Solved 40 of 40 hard2 puzzles (avg 0.03 secs (30 Hz), max 0.04 secs).
solved 40 tests with 1 threads in 1.33 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.04 secs (25 Hz), max 0.04 secs).
solved 40 tests with 2 threads in 0.79 seconds, 1.69 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.05 secs (20 Hz), max 0.05 secs).
solved 40 tests with 4 threads in 0.49 seconds, 2.69 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.09 secs (11 Hz), max 0.09 secs).
solved 40 tests with 8 threads in 0.43 seconds, 3.07 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.15 secs (6 Hz), max 0.21 secs).
solved 40 tests with 16 threads in 0.42 seconds, 3.13 speed-up

no free-threading 3.13.0b2:

there is 8 logical processors, 4.0 physical processors, Gil Enabled = True
Solved 40 of 40 hard2 puzzles (avg 0.02 secs (49 Hz), max 0.02 secs).
solved 40 tests with 1 threads in 0.82 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.04 secs (25 Hz), max 0.08 secs).
solved 40 tests with 2 threads in 0.81 seconds, 1.01 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.07 secs (13 Hz), max 0.21 secs).
solved 40 tests with 4 threads in 0.82 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.09 secs (10 Hz), max 0.50 secs).
solved 40 tests with 8 threads in 0.82 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.06 secs (15 Hz), max 0.35 secs).
solved 40 tests with 16 threads in 0.83 seconds, 0.99 speed-up

@stonebig
Contributor Author

stonebig commented Jun 29, 2024

As the faster-cpython GitHub site suggests, python-3.13.0b3 doesn't improve on free-threading scaling.

there is 8 logical processors, 4.0 physical processors, Gil Enabled = False
Solved 40 of 40 hard2 puzzles (avg 0.04 secs (28 Hz), max 0.04 secs).
solved 40 tests with 1 threads in 1.41 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.04 secs (24 Hz), max 0.05 secs).
solved 40 tests with 2 threads in 0.83 seconds, 1.70 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.05 secs (18 Hz), max 0.08 secs).
solved 40 tests with 4 threads in 0.57 seconds, 2.48 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.09 secs (11 Hz), max 0.10 secs).
solved 40 tests with 8 threads in 0.45 seconds, 3.14 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.16 secs (6 Hz), max 0.22 secs).
solved 40 tests with 16 threads in 0.46 seconds, 3.08 speed-up

@stonebig
Contributor Author

stonebig commented Sep 9, 2024

JIT and maybe free-threading should become performant only in the 3.14 cycle

@stonebig
Contributor Author

2024-12-24:

  • free-threading was pre-alpha in 3.13, for Windows
  • it's not even certain that it will be beta on Windows for 3.14
  • so dropping the idea until free-threading gets closer to ready for Windows, probably in the 3.15 cycle

@stonebig
Contributor Author

stonebig commented Mar 17, 2025

2025-03-17: let's retry with 3.14, as there was hope of a 20% improvement
.... just 2x faster with a 4+4 cpu, so nope on alpha6, on Windows

with normal python

python sudoku_thread_perf_comparison_typed.py
there is 8 logical processors, 4.0 physical processors, Gil Enabled = True
Solved 40 of 40 hard2 puzzles (avg 0.02 secs (44 Hz), max 0.02 secs).
solved 40 tests with 1 threads in 0.90 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.04 secs (22 Hz), max 0.07 secs).
solved 40 tests with 2 threads in 0.90 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.08 secs (12 Hz), max 0.24 secs).
solved 40 tests with 4 threads in 0.91 seconds, 0.99 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.07 secs (14 Hz), max 0.22 secs).
solved 40 tests with 8 threads in 0.95 seconds, 0.94 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.06 secs (17 Hz), max 0.15 secs).
solved 40 tests with 16 threads in 1.02 seconds, 0.88 speed-up

with free-threading

python3.14t sudoku_thread_perf_comparison_typed.py
there is 8 logical processors, 4.0 physical processors, Gil Enabled = False
Solved 40 of 40 hard2 puzzles (avg 0.03 secs (29 Hz), max 0.04 secs).
solved 40 tests with 1 threads in 1.35 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.04 secs (25 Hz), max 0.05 secs).
solved 40 tests with 2 threads in 0.81 seconds, 1.66 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.05 secs (20 Hz), max 0.07 secs).
solved 40 tests with 4 threads in 0.51 seconds, 2.63 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.08 secs (11 Hz), max 0.09 secs).
solved 40 tests with 8 threads in 0.43 seconds, 3.12 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.15 secs (6 Hz), max 0.27 secs).
solved 40 tests with 16 threads in 0.44 seconds, 3.04 speed-up

@stonebig
Contributor Author

stonebig commented Apr 10, 2025

with python-3.14.0a7, mono-thread free-threading sees a notable improvement vs standard cpython:

  • only 15% slower now in mono-thread
  • x2.48 faster on that multi-thread test vs x2.09 for a6
    ... not yet the 3x dream, but a significant half-way jump
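
These ratios compare across builds (free-threading time vs standard single-thread time), unlike the per-build "speed-up" column in the logs. As a worked check, the sketch below recomputes them from timings copied from the a6 and a7 logs in this thread. Which thread count was used for each quoted ratio is an assumption inferred from the numbers (16 threads for a7, 8 threads for a6), and `ratio` is a hypothetical helper.

```python
# Timings in seconds, copied from the logs in this thread.
a7_standard_1_thread = 0.92
a7_free_1_thread = 1.05
a7_free_16_threads = 0.37  # assumption: the run behind the quoted x2.48
a6_standard_1_thread = 0.90
a6_free_8_threads = 0.43   # assumption: the run behind the quoted x2.09


def ratio(reference, measured):
    # How many times faster `measured` is than `reference`.
    return reference / measured


# Extra single-thread cost of the free-threaded build vs the standard build.
mono_slowdown = a7_free_1_thread / a7_standard_1_thread - 1

print(f"a7 mono-thread slowdown: {mono_slowdown:.0%}")
print(f"a7 free-threading vs standard: x{ratio(a7_standard_1_thread, a7_free_16_threads):.2f}")
print(f"a6 free-threading vs standard: x{ratio(a6_standard_1_thread, a6_free_8_threads):.2f}")
```

The results land within rounding of the figures quoted above; with timings this short on a noisy Windows laptop, differences in the second decimal are not meaningful.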

standard

python C:\WinP\bdDocs\docs.nogil\free-threading_test\sudoku_thread_perf_comparison_typed.py
there is 8 logical processors, 4.0 physical processors, Gil Enabled = True
Solved 40 of 40 hard2 puzzles (avg 0.02 secs (43 Hz), max 0.04 secs).
solved 40 tests with 1 threads in 0.92 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.04 secs (24 Hz), max 0.10 secs).
solved 40 tests with 2 threads in 0.90 seconds, 1.02 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.08 secs (13 Hz), max 0.24 secs).
solved 40 tests with 4 threads in 0.90 seconds, 1.02 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.08 secs (13 Hz), max 0.32 secs).
solved 40 tests with 8 threads in 0.94 seconds, 0.98 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.08 secs (12 Hz), max 0.29 secs).
solved 40 tests with 16 threads in 0.96 seconds, 0.96 speed-up

free-threading

python3.14t C:\WinP\bdDocs\docs.nogil\free-threading_test\sudoku_thread_perf_comparison_typed.py
there is 8 logical processors, 4.0 physical processors, Gil Enabled = False
Solved 40 of 40 hard2 puzzles (avg 0.03 secs (38 Hz), max 0.03 secs).
solved 40 tests with 1 threads in 1.05 seconds, 1.00 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.03 secs (31 Hz), max 0.04 secs).
solved 40 tests with 2 threads in 0.63 seconds, 1.66 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.04 secs (23 Hz), max 0.06 secs).
solved 40 tests with 4 threads in 0.43 seconds, 2.42 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.07 secs (14 Hz), max 0.07 secs).
solved 40 tests with 8 threads in 0.34 seconds, 3.05 speed-up

Solved 40 of 40 hard2 puzzles (avg 0.12 secs (8 Hz), max 0.20 secs).
solved 40 tests with 16 threads in 0.37 seconds, 2.88 speed-up

@stonebig stonebig reopened this Apr 10, 2025
@stonebig
Contributor Author

stonebig commented Apr 12, 2025

interesting on the mythical fib test:

  • free-threading is same speed as standard python in mono-thread
  • x2.94 faster on 8 threads (with 4 cpus with multi-threading)
  • x2.5 faster on 4 threads (with 4 cpus with multi-threading)

.... there may be a bit of "skill" needed to unlock the scaling part.

old Intel laptop CPUs may struggle to benefit from SMT: small cache + small bandwidth + thermals.
the thermal limit should remain true on modern laptop CPUs, biasing the free-threading "benchmark" benefit: more threads ==> throttling

thread-fib.py

import time
from itertools import repeat
from multiprocessing.pool import ThreadPool

threads = 8  # number of fib(30) computations, and size of the thread pool


def fib(n):
    # naive recursive Fibonacci: a deliberately CPU-bound workload
    if n < 3:
        return n
    return fib(n-1) + fib(n-2)


# calculate the 30th fibonacci number `threads` times
# sequentially
startall = time.perf_counter()
for i in repeat(30, threads):
    print(fib(i))
reference_delta = time.perf_counter() - startall
print(f'solved {threads} tests with 1 threads in {reference_delta:.2f} seconds, 1.00 speed-up\n')

# calculate the 30th fibonacci number `threads` times
# in parallel, with an explicit pool size instead of the cpu-count default
startall = time.perf_counter()
with ThreadPool(threads) as pool:
    for result in pool.imap_unordered(fib, repeat(30, threads)):
        print(result)
new_delta = time.perf_counter() - startall
ratio = reference_delta / new_delta
print(f'solved {threads} tests with {threads} threads in {new_delta:.2f} seconds, {ratio:.2f} speed-up\n')

@stonebig
Contributor Author

stonebig commented May 9, 2025

no perf change with b1.... let's reconsider when jupyterlab becomes possible

@stonebig stonebig added the Experiment label May 10, 2025
@stonebig
Contributor Author

stonebig commented May 11, 2025

remember ? https://www.theregister.com/2021/05/19/faster_python_mark_shannon_author/

"the "Shannon plan" here as the basis for achieving a "2x speedup in 3.11" with a hope for 5x in four years"

Haaa, the optimism of youth ....
As we had about 1 year of cold start, that would still mean about:

  • 50% speed-up for 3.15
  • 50% speed-up for 3.16

25% seems the best hope in mono-thread, 3x in total.
... yet it will be partially true for:

  • our cloud sponsors: ruff / uv / ty
  • free-threaded applications
  • ... modular ?

... and the goal of Python not becoming irrelevant has been reached, according to the TIOBE Index for May 2025: https://www.tiobe.com/tiobe-index/

"The only reason other languages still have a reason for existing is because of Python's low performance"

... hold my beard

@stonebig
Contributor Author

b2: no real change

2 participants