py/scheduler: De-inline and fix race with pending exception / scheduler (v2) #8869

jimmo · 2022-07-06T02:10:57Z

This is an alternative to #8845 that applies the same fix but also de-inlines this code from the VM.

Justification for de-inlining is that it improves performance (likely by reducing overall size of mp_execute_bytecode), reduces code size (-72 bytes overall), and reduces complexity. There should be no impact on scheduled task execution, but for the case of raising a pending exception the C-stack usage is slightly higher as there's now a function call (and a full NLR jump) rather than the inlined code before. (I think this is worth it as it only applies when there's a pending exception and doesn't impact the common case).

Performance results for pybv11 (non-threading)

$ ./run-perfbench.py -s ~/sched-fix-baseline ~/sched-fix-v1
diff of scores (higher is better)
N=100 M=100                /home/jimmo/sched-fix-baseline -> /home/jimmo/sched-fix-v1         diff      diff% (error%)
bm_chaos.py                    359.45 ->     365.76 :      +6.31 =  +1.755% (+/-0.00%)
bm_fannkuch.py                  77.39 ->      78.22 :      +0.83 =  +1.072% (+/-0.00%)
bm_fft.py                     2517.50 ->    2556.19 :     +38.69 =  +1.537% (+/-0.00%)
bm_float.py                   5720.43 ->    5872.82 :    +152.39 =  +2.664% (+/-0.01%)
bm_hexiom.py                    46.21 ->      46.66 :      +0.45 =  +0.974% (+/-0.00%)
bm_nqueens.py                 4393.94 ->    4468.55 :     +74.61 =  +1.698% (+/-0.00%)
bm_pidigits.py                 633.47 ->     640.35 :      +6.88 =  +1.086% (+/-0.24%)
misc_aes.py                    406.41 ->     417.71 :     +11.30 =  +2.780% (+/-0.00%)
misc_mandel.py                3478.31 ->    3520.85 :     +42.54 =  +1.223% (+/-0.01%)
misc_pystone.py               2424.83 ->    2467.74 :     +42.91 =  +1.770% (+/-0.01%)
misc_raytrace.py               371.74 ->     379.20 :      +7.46 =  +2.007% (+/-0.01%)

For pybv11 (with threading) (note: the previous code was incorrect on non-GIL builds)

$ ./run-perfbench.py -s ~/sched-fix-baseline-thread ~/sched-fix-v1-thread 
diff of scores (higher is better)
N=100 M=100                /home/jimmo/sched-fix-baseline-thread -> /home/jimmo/sched-fix-v1-thread         diff      diff% (error%)
bm_chaos.py                    352.09 ->     347.38 :      -4.71 =  -1.338% (+/-0.06%)
bm_fannkuch.py                  76.03 ->      75.32 :      -0.71 =  -0.934% (+/-0.01%)
bm_fft.py                     2488.77 ->    2457.25 :     -31.52 =  -1.266% (+/-0.00%)
bm_float.py                   5731.78 ->    5640.76 :     -91.02 =  -1.588% (+/-0.00%)
bm_hexiom.py                    45.78 ->      45.42 :      -0.36 =  -0.786% (+/-0.00%)
bm_nqueens.py                 4201.08 ->    4161.82 :     -39.26 =  -0.935% (+/-0.00%)
bm_pidigits.py                 633.47 ->     630.27 :      -3.20 =  -0.505% (+/-0.26%)
misc_aes.py                    400.97 ->     405.59 :      +4.62 =  +1.152% (+/-0.01%)
misc_mandel.py                3397.27 ->    3349.65 :     -47.62 =  -1.402% (+/-0.00%)
misc_pystone.py               2305.62 ->    2308.08 :      +2.46 =  +0.107% (+/-0.01%)
misc_raytrace.py               367.85 ->     363.13 :      -4.72 =  -1.283% (+/-0.00%)

The optimisation that allows a single check in the VM for either a pending
exception or non-empty scheduler queue doesn't work when threading is
enabled, as one thread can clear the sched_state if it has no pending
exception, meaning the thread with the pending exception will never see it.

This removes that optimisation for threaded builds.

Also fixes a race in non-scheduler builds where get-and-clear of the
pending exception is not protected by the atomic section.

Also removes the bulk of the inlining of pending exceptions and scheduler
handling from the VM. This just costs code size and complexity at no
performance benefit.

jimmo · 2022-07-06T05:05:14Z

Size and performance diffs (arm-none-eabi-gcc & gcc 12.1)

$ ./tools/metrics.py diff ../results/07cae9178/size ../results/c05ef433a/size
   bare-arm:   +12 +0.021% 
minimal x86:   +10 +0.006% 
   unix x64:   +39 +0.007% 
      stm32:   -72 -0.018% PYBV10
        rp2:  -136 -0.027% PICO

Linux:

$ ./run-perfbench.py -s ../../results/07cae9178/perf_unix ../../results/c05ef433a/perf_unix
diff of scores (higher is better)
N=2000 M=2000              ../../results/07cae9178/perf_unix -> ../../results/c05ef433a/perf_unix         diff      diff% (error%)
bm_chaos.py                   6066.92 ->    6028.54 :     -38.38 =  -0.633% (+/-3.25%)
bm_fannkuch.py                  26.58 ->      26.33 :      -0.25 =  -0.941% (+/-0.83%)
bm_fft.py                    29593.69 ->   29522.35 :     -71.34 =  -0.241% (+/-1.68%)
bm_float.py                  66202.86 ->   66772.51 :    +569.65 =  +0.860% (+/-4.43%)
bm_hexiom.py                   186.49 ->     186.30 :      -0.19 =  -0.102% (+/-1.44%)
bm_nqueens.py                98153.19 ->   97100.91 :   -1052.28 =  -1.072% (+/-1.60%)
bm_pidigits.py                9072.19 ->    9143.81 :     +71.62 =  +0.789% (+/-1.36%)
misc_aes.py                    119.46 ->     117.49 :      -1.97 =  -1.649% (+/-4.22%)
misc_mandel.py               75294.67 ->   74531.23 :    -763.44 =  -1.014% (+/-2.31%)
misc_pystone.py              26560.88 ->   28047.73 :   +1486.85 =  +5.598% (+/-1.89%)
misc_raytrace.py             10729.20 ->   10589.82 :    -139.38 =  -1.299% (+/-1.82%)

pybv11:

$ ./run-perfbench.py -s ../../results/07cae9178/perf_pybv11 ../../results/c05ef433a/perf_pybv11
diff of scores (higher is better)
N=100 M=100                ../../results/07cae9178/perf_pybv11 -> ../../results/c05ef433a/perf_pybv11         diff      diff% (error%)
bm_chaos.py                    354.30 ->     360.44 :      +6.14 =  +1.733% (+/-0.00%)
bm_fannkuch.py                  77.39 ->      78.22 :      +0.83 =  +1.072% (+/-0.01%)
bm_fft.py                     2516.90 ->    2555.46 :     +38.56 =  +1.532% (+/-0.00%)
bm_float.py                   5817.34 ->    5976.02 :    +158.68 =  +2.728% (+/-0.01%)
bm_hexiom.py                    42.35 ->      42.74 :      +0.39 =  +0.921% (+/-0.00%)
bm_nqueens.py                 4366.19 ->    4439.68 :     +73.49 =  +1.683% (+/-0.00%)
bm_pidigits.py                 630.87 ->     636.39 :      +5.52 =  +0.875% (+/-0.21%)
misc_aes.py                    406.33 ->     417.62 :     +11.29 =  +2.779% (+/-0.00%)
misc_mandel.py                3469.92 ->    3513.08 :     +43.16 =  +1.244% (+/-0.01%)
misc_pystone.py               2312.20 ->    2353.02 :     +40.82 =  +1.765% (+/-0.00%)
misc_raytrace.py               368.00 ->     375.35 :      +7.35 =  +1.997% (+/-0.01%)

Just as an interesting aside into compiler versions... Originally I ran this with gcc 11.2 and got quite different results...

$ ./tools/metrics.py diff ../results/07cae9178/size ../results/c05ef433a/size
   bare-arm:   +32 +0.056% 
minimal x86:   +10 +0.006% 
   unix x64:   +39 +0.007% 
      stm32:  -128 -0.032% PYBV10
        rp2:  -120 -0.024% PICO

pybv11 (gcc 11.2)

$ ./run-perfbench.py -s ../../results/07cae9178/perf_pybv11 ../../results/c05ef433a/perf_pybv11
diff of scores (higher is better)
N=100 M=100                ../../results/07cae9178/perf_pybv11 -> ../../results/c05ef433a/perf_pybv11         diff      diff% (error%)
bm_chaos.py                    358.87 ->     359.73 :      +0.86 =  +0.240% (+/-0.01%)
bm_fannkuch.py                  76.84 ->      77.30 :      +0.46 =  +0.599% (+/-0.00%)
bm_fft.py                     2544.49 ->    2541.73 :      -2.76 =  -0.108% (+/-0.00%)
bm_float.py                   5864.83 ->    5881.52 :     +16.69 =  +0.285% (+/-0.03%)
bm_hexiom.py                    43.36 ->      43.48 :      +0.12 =  +0.277% (+/-0.00%)
bm_nqueens.py                 4402.39 ->    4383.47 :     -18.92 =  -0.430% (+/-0.00%)
bm_pidigits.py                 626.93 ->     627.53 :      +0.60 =  +0.096% (+/-0.16%)
misc_aes.py                    413.35 ->     415.54 :      +2.19 =  +0.530% (+/-0.01%)
misc_mandel.py                3484.64 ->    3501.76 :     +17.12 =  +0.491% (+/-0.01%)
misc_pystone.py               2332.62 ->    2341.21 :      +8.59 =  +0.368% (+/-0.00%)
misc_raytrace.py               372.24 ->     374.14 :      +1.90 =  +0.510% (+/-0.00%)

The optimisation that allows a single check in the VM for either a pending exception or non-empty scheduler queue doesn't work when threading is enabled, as one thread can clear the sched_state if it has no pending exception, meaning the thread with the pending exception will never see it. This removes that optimisation for threaded builds. Also fixes a race in non-scheduler builds where get-and-clear of the pending exception is not protected by the atomic section. Also removes the bulk of the inlining of pending exceptions and scheduler handling from the VM. This just costs code size and complexity at no performance benefit. Signed-off-by: Jim Mussared <jim.mussared@gmail.com>

dlech · 2022-07-06T14:57:55Z

py/scheduler.c

-        } else {
-            MICROPY_END_ATOMIC_SECTION(atomic_state);
-        }
+static inline void mp_sched_run_pending(void) {


inline seems unnecessary here. The compiler probably ignores it anyway.

Are you saying the compiler ignores it and doesn't inline? Or that it's best to leave the decision to the compiler?

In this case I definitely want it to inline (to reduce stack usage and code size), I am hoping that -Os does the right thing here and would be very surprised if the compiler wasn't inlining. The alternative is to manually inline it.

I did some disassembly and it looks like it gets inlined even without the inline keyword (gcc 10, default stm32 build).

That's good, thanks for checking. I don't see the harm in leaving the inline keyword there though (if nothing else than as a form of documentation), and there are other functions in this file that use inline too.

I wonder if perhaps we should introduce a MP_INLINE macro (matching the existing MP_NOINLINE) to use the __attribute__((always_inline)) attribute. (After all, when we put inline on a function we're doing it on purpose, we may as well force it).

(While we're at it, other functions in this file use static when they really should be using STATIC...)

dpgeorge · 2022-07-12T06:08:43Z

Thanks, this is a good fix and clean up. Merged in 8db99f1

Disable i2ctarget on ESP

jimmo mentioned this pull request Jul 6, 2022

py/scheduler: Fix race with scheduler and pending exceptions. #8845

Closed

jimmo force-pushed the sched-main-thread-v2 branch from c05ef43 to 3fea7d7 Compare July 6, 2022 06:07

jimmo mentioned this pull request Jul 6, 2022

py/vm.c: Document the SELECTIVE_EXC_IP optimisation. #8870

Closed

dlech reviewed Jul 6, 2022

View reviewed changes

dpgeorge added the py-core Relates to py/ directory in source label Jul 12, 2022

dpgeorge closed this Jul 12, 2022

tannewt pushed a commit to tannewt/circuitpython that referenced this pull request Feb 7, 2024

Merge pull request micropython#8869 from tannewt/disable_esp_i2ctarget

ed46b4c

Disable i2ctarget on ESP

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

py/scheduler: De-inline and fix race with pending exception / scheduler (v2) #8869

py/scheduler: De-inline and fix race with pending exception / scheduler (v2) #8869

jimmo commented Jul 6, 2022

Uh oh!

jimmo commented Jul 6, 2022

Uh oh!

dlech Jul 6, 2022

Uh oh!

jimmo Jul 6, 2022

Uh oh!

dlech Jul 6, 2022 •

edited

Loading

Uh oh!

jimmo Jul 6, 2022

Uh oh!

dpgeorge commented Jul 12, 2022

Uh oh!

Uh oh!

Uh oh!

py/scheduler: De-inline and fix race with pending exception / scheduler (v2) #8869

py/scheduler: De-inline and fix race with pending exception / scheduler (v2) #8869

Conversation

jimmo commented Jul 6, 2022

Uh oh!

jimmo commented Jul 6, 2022

Uh oh!

dlech Jul 6, 2022

Choose a reason for hiding this comment

Uh oh!

jimmo Jul 6, 2022

Choose a reason for hiding this comment

Uh oh!

dlech Jul 6, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jimmo Jul 6, 2022

Choose a reason for hiding this comment

Uh oh!

dpgeorge commented Jul 12, 2022

Uh oh!

Uh oh!

dlech Jul 6, 2022 •

edited

Loading