Skip to content

py/scheduler: De-inline and fix race with pending exception / scheduler (v2) #8869

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

jimmo
Copy link
Member

@jimmo jimmo commented Jul 6, 2022

This is an alternative to #8845 that applies the same fix but also de-inlines this code from the VM.

Justification for de-inlining is that it improves performance (likely by reducing overall size of mp_execute_bytecode), reduces code size (-72 bytes overall), and reduces complexity. There should be no impact on scheduled task execution, but for the case of raising a pending exception the C-stack usage is slightly higher as there's now a function call (and a full NLR jump) rather than the inlined code before. (I think this is worth it as it only applies when there's a pending exception and doesn't impact the common case).

Performance results for pybv11 (non-threading)

$ ./run-perfbench.py -s ~/sched-fix-baseline ~/sched-fix-v1
diff of scores (higher is better)
N=100 M=100                /home/jimmo/sched-fix-baseline -> /home/jimmo/sched-fix-v1         diff      diff% (error%)
bm_chaos.py                    359.45 ->     365.76 :      +6.31 =  +1.755% (+/-0.00%)
bm_fannkuch.py                  77.39 ->      78.22 :      +0.83 =  +1.072% (+/-0.00%)
bm_fft.py                     2517.50 ->    2556.19 :     +38.69 =  +1.537% (+/-0.00%)
bm_float.py                   5720.43 ->    5872.82 :    +152.39 =  +2.664% (+/-0.01%)
bm_hexiom.py                    46.21 ->      46.66 :      +0.45 =  +0.974% (+/-0.00%)
bm_nqueens.py                 4393.94 ->    4468.55 :     +74.61 =  +1.698% (+/-0.00%)
bm_pidigits.py                 633.47 ->     640.35 :      +6.88 =  +1.086% (+/-0.24%)
misc_aes.py                    406.41 ->     417.71 :     +11.30 =  +2.780% (+/-0.00%)
misc_mandel.py                3478.31 ->    3520.85 :     +42.54 =  +1.223% (+/-0.01%)
misc_pystone.py               2424.83 ->    2467.74 :     +42.91 =  +1.770% (+/-0.01%)
misc_raytrace.py               371.74 ->     379.20 :      +7.46 =  +2.007% (+/-0.01%)

For pybv11 (with threading) (note: the previous code was incorrect on non-GIL builds)

$ ./run-perfbench.py -s ~/sched-fix-baseline-thread ~/sched-fix-v1-thread 
diff of scores (higher is better)
N=100 M=100                /home/jimmo/sched-fix-baseline-thread -> /home/jimmo/sched-fix-v1-thread         diff      diff% (error%)
bm_chaos.py                    352.09 ->     347.38 :      -4.71 =  -1.338% (+/-0.06%)
bm_fannkuch.py                  76.03 ->      75.32 :      -0.71 =  -0.934% (+/-0.01%)
bm_fft.py                     2488.77 ->    2457.25 :     -31.52 =  -1.266% (+/-0.00%)
bm_float.py                   5731.78 ->    5640.76 :     -91.02 =  -1.588% (+/-0.00%)
bm_hexiom.py                    45.78 ->      45.42 :      -0.36 =  -0.786% (+/-0.00%)
bm_nqueens.py                 4201.08 ->    4161.82 :     -39.26 =  -0.935% (+/-0.00%)
bm_pidigits.py                 633.47 ->     630.27 :      -3.20 =  -0.505% (+/-0.26%)
misc_aes.py                    400.97 ->     405.59 :      +4.62 =  +1.152% (+/-0.01%)
misc_mandel.py                3397.27 ->    3349.65 :     -47.62 =  -1.402% (+/-0.00%)
misc_pystone.py               2305.62 ->    2308.08 :      +2.46 =  +0.107% (+/-0.01%)
misc_raytrace.py               367.85 ->     363.13 :      -4.72 =  -1.283% (+/-0.00%)

The optimisation that allows a single check in the VM for either a pending
exception or non-empty scheduler queue doesn't work when threading is
enabled, as one thread can clear the sched_state if it has no pending
exception, meaning the thread with the pending exception will never see it.

This removes that optimisation for threaded builds.

Also fixes a race in non-scheduler builds where get-and-clear of the
pending exception is not protected by the atomic section.

Also removes the bulk of the inlining of pending exceptions and scheduler
handling from the VM. This just costs code size and complexity at no
performance benefit.

@jimmo
Copy link
Member Author

jimmo commented Jul 6, 2022

Size and performance diffs (arm-none-eabi-gcc & gcc 12.1)

$ ./tools/metrics.py diff ../results/07cae9178/size ../results/c05ef433a/size
   bare-arm:   +12 +0.021% 
minimal x86:   +10 +0.006% 
   unix x64:   +39 +0.007% 
      stm32:   -72 -0.018% PYBV10
        rp2:  -136 -0.027% PICO

Linux:

$ ./run-perfbench.py -s ../../results/07cae9178/perf_unix ../../results/c05ef433a/perf_unix
diff of scores (higher is better)
N=2000 M=2000              ../../results/07cae9178/perf_unix -> ../../results/c05ef433a/perf_unix         diff      diff% (error%)
bm_chaos.py                   6066.92 ->    6028.54 :     -38.38 =  -0.633% (+/-3.25%)
bm_fannkuch.py                  26.58 ->      26.33 :      -0.25 =  -0.941% (+/-0.83%)
bm_fft.py                    29593.69 ->   29522.35 :     -71.34 =  -0.241% (+/-1.68%)
bm_float.py                  66202.86 ->   66772.51 :    +569.65 =  +0.860% (+/-4.43%)
bm_hexiom.py                   186.49 ->     186.30 :      -0.19 =  -0.102% (+/-1.44%)
bm_nqueens.py                98153.19 ->   97100.91 :   -1052.28 =  -1.072% (+/-1.60%)
bm_pidigits.py                9072.19 ->    9143.81 :     +71.62 =  +0.789% (+/-1.36%)
misc_aes.py                    119.46 ->     117.49 :      -1.97 =  -1.649% (+/-4.22%)
misc_mandel.py               75294.67 ->   74531.23 :    -763.44 =  -1.014% (+/-2.31%)
misc_pystone.py              26560.88 ->   28047.73 :   +1486.85 =  +5.598% (+/-1.89%)
misc_raytrace.py             10729.20 ->   10589.82 :    -139.38 =  -1.299% (+/-1.82%)

pybv11:

$ ./run-perfbench.py -s ../../results/07cae9178/perf_pybv11 ../../results/c05ef433a/perf_pybv11
diff of scores (higher is better)
N=100 M=100                ../../results/07cae9178/perf_pybv11 -> ../../results/c05ef433a/perf_pybv11         diff      diff% (error%)
bm_chaos.py                    354.30 ->     360.44 :      +6.14 =  +1.733% (+/-0.00%)
bm_fannkuch.py                  77.39 ->      78.22 :      +0.83 =  +1.072% (+/-0.01%)
bm_fft.py                     2516.90 ->    2555.46 :     +38.56 =  +1.532% (+/-0.00%)
bm_float.py                   5817.34 ->    5976.02 :    +158.68 =  +2.728% (+/-0.01%)
bm_hexiom.py                    42.35 ->      42.74 :      +0.39 =  +0.921% (+/-0.00%)
bm_nqueens.py                 4366.19 ->    4439.68 :     +73.49 =  +1.683% (+/-0.00%)
bm_pidigits.py                 630.87 ->     636.39 :      +5.52 =  +0.875% (+/-0.21%)
misc_aes.py                    406.33 ->     417.62 :     +11.29 =  +2.779% (+/-0.00%)
misc_mandel.py                3469.92 ->    3513.08 :     +43.16 =  +1.244% (+/-0.01%)
misc_pystone.py               2312.20 ->    2353.02 :     +40.82 =  +1.765% (+/-0.00%)
misc_raytrace.py               368.00 ->     375.35 :      +7.35 =  +1.997% (+/-0.01%)

Just as an interesting aside into compiler versions... Originally I ran this with gcc 11.2 and got quite different results...

$ ./tools/metrics.py diff ../results/07cae9178/size ../results/c05ef433a/size
   bare-arm:   +32 +0.056% 
minimal x86:   +10 +0.006% 
   unix x64:   +39 +0.007% 
      stm32:  -128 -0.032% PYBV10
        rp2:  -120 -0.024% PICO

pybv11 (gcc 11.2)

$ ./run-perfbench.py -s ../../results/07cae9178/perf_pybv11 ../../results/c05ef433a/perf_pybv11
diff of scores (higher is better)
N=100 M=100                ../../results/07cae9178/perf_pybv11 -> ../../results/c05ef433a/perf_pybv11         diff      diff% (error%)
bm_chaos.py                    358.87 ->     359.73 :      +0.86 =  +0.240% (+/-0.01%)
bm_fannkuch.py                  76.84 ->      77.30 :      +0.46 =  +0.599% (+/-0.00%)
bm_fft.py                     2544.49 ->    2541.73 :      -2.76 =  -0.108% (+/-0.00%)
bm_float.py                   5864.83 ->    5881.52 :     +16.69 =  +0.285% (+/-0.03%)
bm_hexiom.py                    43.36 ->      43.48 :      +0.12 =  +0.277% (+/-0.00%)
bm_nqueens.py                 4402.39 ->    4383.47 :     -18.92 =  -0.430% (+/-0.00%)
bm_pidigits.py                 626.93 ->     627.53 :      +0.60 =  +0.096% (+/-0.16%)
misc_aes.py                    413.35 ->     415.54 :      +2.19 =  +0.530% (+/-0.01%)
misc_mandel.py                3484.64 ->    3501.76 :     +17.12 =  +0.491% (+/-0.01%)
misc_pystone.py               2332.62 ->    2341.21 :      +8.59 =  +0.368% (+/-0.00%)
misc_raytrace.py               372.24 ->     374.14 :      +1.90 =  +0.510% (+/-0.00%)

The optimisation that allows a single check in the VM for either a pending
exception or non-empty scheduler queue doesn't work when threading is
enabled, as one thread can clear the sched_state if it has no pending
exception, meaning the thread with the pending exception will never see it.

This removes that optimisation for threaded builds.

Also fixes a race in non-scheduler builds where get-and-clear of the
pending exception is not protected by the atomic section.

Also removes the bulk of the inlining of pending exceptions and scheduler
handling from the VM. This just costs code size and complexity at no
performance benefit.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
} else {
MICROPY_END_ATOMIC_SECTION(atomic_state);
}
static inline void mp_sched_run_pending(void) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inline seems unnecessary here. The compiler probably ignores it anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you saying the compiler ignores it and doesn't inline? Or that it's best to leave the decision to the compiler?

In this case I definitely want it to inline (to reduce stack usage and code size), I am hoping that -Os does the right thing here and would be very surprised if the compiler wasn't inlining. The alternative is to manually inline it.

Copy link
Contributor

@dlech dlech Jul 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some disassembly and it looks like it gets inlined even without the inline keyword (gcc 10, default stm32 build).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's good, thanks for checking. I don't see the harm in leaving the inline keyword there though (if nothing else than as a form of documentation), and there are other functions in this file that use inline too.

I wonder if perhaps we should introduce a MP_INLINE macro (matching the existing MP_NOINLINE) to use the __attribute__((always_inline)) attribute. (After all, when we put inline on a function we're doing it on purpose, we may as well force it).

(While we're at it, other functions in this file use static when they really should be using STATIC...)

@dpgeorge dpgeorge added the py-core Relates to py/ directory in source label Jul 12, 2022
@dpgeorge
Copy link
Member

Thanks, this is a good fix and clean up. Merged in 8db99f1

@dpgeorge dpgeorge closed this Jul 12, 2022
tannewt pushed a commit to tannewt/circuitpython that referenced this pull request Feb 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
py-core Relates to py/ directory in source
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants