
py/gc: Speed up incremental GC cycles by tracking the high watermark. #10235


Merged
merged 1 commit on Aug 4, 2023

Conversation

damz
Contributor

@damz damz commented Dec 15, 2022

Why?

In applications that use little memory and run GC regularly, the cost of the sweep phase quickly becomes prohibitive as the amount of RAM increases.

On an ESP32-S3 with 2 MB of external SPIRAM, for example, a trivial GC cycle takes a minimum of 20 ms, a big part of it in the sweep phase.

Similarly, on the UNIX port with 1 GB of heap, a trivial GC takes 47 ms, virtually all of it in the sweep phase.

What?

Speed up the sweep phase, in the case where most of the heap is empty, by keeping track of the ID of the highest block allocated in each area since the last GC.
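
To picture the change, here is a self-contained toy model in C (not the actual py/gc.c code: the area_t struct, the simplified block states and the toy_* helpers are invented for illustration; the real collector packs two bits per block into allocation-table bytes and distinguishes head/tail/mark states):

#include <stddef.h>
#include <stdio.h>

enum { AT_FREE = 0, AT_USED = 1, AT_MARKED = 2 };  // simplified block states

typedef struct {
    unsigned char atb[1024];  // one state per block (toy allocation table)
    size_t high_watermark;    // highest block index allocated since the last sweep
} area_t;

// Allocation: mark the blocks used and raise the watermark if needed.
static void toy_alloc(area_t *area, size_t first_block, size_t n_blocks) {
    for (size_t b = first_block; b < first_block + n_blocks; b++) {
        area->atb[b] = AT_USED;
    }
    size_t end_block = first_block + n_blocks - 1;
    if (area->high_watermark < end_block) {
        area->high_watermark = end_block;
    }
}

// Sweep: only blocks up to the watermark need to be visited, because everything
// above it is known to still be free.  Afterwards the watermark is lowered to the
// last surviving block, so an almost-empty heap stays cheap to sweep.
static void toy_sweep(area_t *area) {
    size_t last_used_block = 0;
    for (size_t b = 0; b <= area->high_watermark; b++) {
        if (area->atb[b] == AT_MARKED) {
            area->atb[b] = AT_USED;   // reachable: keep it
            last_used_block = b;
        } else {
            area->atb[b] = AT_FREE;   // unreachable: reclaim it
        }
    }
    area->high_watermark = last_used_block;
}

int main(void) {
    area_t area = {0};
    toy_alloc(&area, 0, 4);    // small object low in the heap
    area.atb[2] = AT_MARKED;   // pretend only block 2 is still reachable
    toy_alloc(&area, 900, 8);  // short-lived allocation high in the heap
    toy_sweep(&area);          // this sweep still scans up to block 907...
    printf("watermark after sweep: %zu\n", area.high_watermark);  // ...but prints 2
    return 0;
}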

@github-actions

Code size report:

   bare-arm:    +8 +0.014% [incl +8(bss)]
minimal x86:   +56 +0.034% [incl +8(bss)]

@damz
Contributor Author

damz commented Dec 15, 2022

Some performance measurements. Using this trivial script:

import gc
import time

for i in range(30):
    start = time.ticks_ms(); gc.collect(); print(time.ticks_diff(time.ticks_ms(), start))

On the ESP32-S3 with 2 MB of SPIRAM, right after boot:

  • Before: ~20ms
  • After: <1ms

On the Unix port with a 1 GB heap:

  • Before: ~47ms
  • After: <1ms

In our real-life application (the Ribbit Network frog sensor on the same ESP32-S3 with 2 MB of SPIRAM), we run GC every minute and use ~43 kB of memory after the GC completes; this change cuts the GC duration as well:

  • Before: 41-42ms
  • After: 22-23ms

@github-actions

github-actions bot commented Dec 15, 2022

Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:   +88 +0.047% [incl +8(bss)]
   unix x64:  +120 +0.015% standard[incl +8(bss)]
      stm32:   +52 +0.013% PYBV10[incl +4(bss)]
     mimxrt:   +64 +0.018% TEENSY40[incl +8(bss)]
        rp2:   +36 +0.011% PICO[incl +4(bss)]
       samd:   +52 +0.020% ADAFRUIT_ITSYBITSY_M4_EXPRESS[incl +4(bss)]

@jimmo
Member

jimmo commented Dec 16, 2022

This is +56 bytes on PYBV11. Performance tests look positive too. Note that the performance benchmarks don't explicitly test the GC, but many tests will trigger collections.

$ ./run-perfbench.py -s ~/gc-hwm-pybv11-baseline ~/gc-hwm-pybv11-damz 
diff of scores (higher is better)
N=100 M=100                /home/jimmo/gc-hwm-pybv11-baseline -> /home/jimmo/gc-hwm-pybv11-damz         diff      diff% (error%)
bm_chaos.py                    345.65 ->     352.86 :      +7.21 =  +2.086% (+/-0.01%)
bm_fannkuch.py                  75.36 ->      75.71 :      +0.35 =  +0.464% (+/-0.00%)
bm_fft.py                     2291.09 ->    2333.62 :     +42.53 =  +1.856% (+/-0.00%)
bm_float.py                   5672.70 ->    5709.64 :     +36.94 =  +0.651% (+/-0.02%)
bm_hexiom.py                    46.20 ->      46.46 :      +0.26 =  +0.563% (+/-0.00%)
bm_nqueens.py                 4218.15 ->    4246.60 :     +28.45 =  +0.674% (+/-0.00%)
bm_pidigits.py                 628.75 ->     621.14 :      -7.61 =  -1.210% (+/-0.27%)
core_import_mpy_multi.py       513.34 ->     512.49 :      -0.85 =  -0.166% (+/-0.00%)
core_import_mpy_single.py       64.71 ->      64.64 :      -0.07 =  -0.108% (+/-0.01%)
core_qstr.py                    64.39 ->      64.31 :      -0.08 =  -0.124% (+/-0.00%)
core_yield_from.py             353.02 ->     354.95 :      +1.93 =  +0.547% (+/-0.00%)
misc_aes.py                    406.43 ->     409.95 :      +3.52 =  +0.866% (+/-0.01%)
misc_mandel.py                3035.28 ->    3125.21 :     +89.93 =  +2.963% (+/-0.01%)
misc_pystone.py               2324.45 ->    2349.28 :     +24.83 =  +1.068% (+/-0.01%)
misc_raytrace.py               362.16 ->     367.64 :      +5.48 =  +1.513% (+/-0.00%)

@dpgeorge dpgeorge added the py-core Relates to py/ directory in source label Dec 16, 2022
@damz damz marked this pull request as ready for review December 16, 2022 01:25
@jimmo
Member

jimmo commented Dec 16, 2022

As a very rudimentary attempt to make the benchmarks more sensitive to GC performance, I added a gc.threshold(30000) (which is ~1/3 of the available heap on PYBV11).

First, this diff shows the effect that has on the tests (without the diff from this PR).

$ ./run-perfbench.py -s ~/gc-hwm-pybv11-baseline ~/gc-hwm-pybv11-threshold30-baseline 
diff of scores (higher is better)
N=100 M=100                /home/jimmo/gc-hwm-pybv11-baseline -> /home/jimmo/gc-hwm-pybv11-threshold-baseline         diff      diff% (error%)
bm_chaos.py                    345.65 ->     324.05 :     -21.60 =  -6.249% (+/-0.00%)
bm_fannkuch.py                  75.36 ->      71.60 :      -3.76 =  -4.989% (+/-0.00%)
bm_fft.py                     2291.09 ->    2149.10 :    -141.99 =  -6.197% (+/-0.01%)
bm_float.py                   5672.70 ->    4892.95 :    -779.75 = -13.746% (+/-0.00%)
bm_hexiom.py                    46.20 ->      45.51 :      -0.69 =  -1.494% (+/-0.00%)
bm_nqueens.py                 4218.15 ->    3995.01 :    -223.14 =  -5.290% (+/-0.00%)
bm_pidigits.py                 628.75 ->     431.93 :    -196.82 = -31.303% (+/-0.30%)
core_import_mpy_multi.py       513.34 ->     497.46 :     -15.88 =  -3.093% (+/-0.01%)
core_import_mpy_single.py       64.71 ->      60.42 :      -4.29 =  -6.630% (+/-0.01%)
core_qstr.py                    64.39 ->      64.37 :      -0.02 =  -0.031% (+/-0.00%)
core_yield_from.py             353.02 ->     353.02 :      +0.00 =  +0.000% (+/-0.00%)
misc_aes.py                    406.43 ->     406.45 :      +0.02 =  +0.005% (+/-0.01%)
misc_mandel.py                3035.28 ->    2903.97 :    -131.31 =  -4.326% (+/-0.00%)
misc_pystone.py               2324.45 ->    2239.00 :     -85.45 =  -3.676% (+/-0.01%)
misc_raytrace.py               362.16 ->     339.32 :     -22.84 =  -6.307% (+/-0.00%)

Now with this PR added:

$ ./run-perfbench.py -s ~/gc-hwm-pybv11-threshold30-baseline ~/gc-hwm-pybv11-threshold30-damz 
diff of scores (higher is better)
N=100 M=100                /home/jimmo/gc-hwm-pybv11-threshold-baseline -> /home/jimmo/gc-hwm-pybv11-threshold-damz         diff      diff% (error%)
bm_chaos.py                    324.05 ->     340.00 :     +15.95 =  +4.922% (+/-0.01%)
bm_fannkuch.py                  71.60 ->      74.22 :      +2.62 =  +3.659% (+/-0.00%)
bm_fft.py                     2149.10 ->    2207.07 :     +57.97 =  +2.697% (+/-0.01%)
bm_float.py                   4892.95 ->    5026.46 :    +133.51 =  +2.729% (+/-0.01%)
bm_hexiom.py                    45.51 ->      45.83 :      +0.32 =  +0.703% (+/-0.00%)
bm_nqueens.py                 3995.01 ->    4164.99 :    +169.98 =  +4.255% (+/-0.00%)
bm_pidigits.py                 431.93 ->     455.80 :     +23.87 =  +5.526% (+/-0.59%)
core_import_mpy_multi.py       497.46 ->     507.83 :     +10.37 =  +2.085% (+/-0.01%)
core_import_mpy_single.py       60.42 ->      62.65 :      +2.23 =  +3.691% (+/-0.00%)
core_qstr.py                    64.37 ->      64.30 :      -0.07 =  -0.109% (+/-0.00%)
core_yield_from.py             353.02 ->     354.93 :      +1.91 =  +0.541% (+/-0.01%)
misc_aes.py                    406.45 ->     411.28 :      +4.83 =  +1.188% (+/-0.01%)
misc_mandel.py                2903.97 ->    3083.66 :    +179.69 =  +6.188% (+/-0.00%)
misc_pystone.py               2239.00 ->    2271.89 :     +32.89 =  +1.469% (+/-0.02%)
misc_raytrace.py               339.32 ->     354.83 :     +15.51 =  +4.571% (+/-0.01%)

@AutoPilotjn

Hi.
Looks very interesting. Has the PR been integrated into the nightly builds?

@dpgeorge
Member

No, this PR has not been merged yet.

@jkhax0r

jkhax0r commented Mar 26, 2023

I'm on an ESP32-S3 WROOM module with 8 MB octal SPIRAM on custom hardware. I found this PR because I was struggling with ~105 ms gc.collect() times when I'm only actually using about 80 KB of RAM (after GC).

Running gc.collect() every 250 ms I was seeing ~100 ms, and ~105 ms when running it only every 1-2 minutes.

I manually merged this PR and went from ~100 ms down to ~46 ms when running gc.collect() from a 250 ms loop, and from 105 ms down to 50 ms when running it every 1-2 minutes.

Thanks for the PR... so simple for such a huge improvement!

Do these numbers seem right to you guys?

Prior to this PR I was seeing ~60 ms (IIRC) with just the loop below run from the REPL with no code running. Now it's at 0 to 1 ms. Nice.

for i in range(0,100):
    t = utime.ticks_ms()
    gc.collect()
    print('t=', utime.ticks_ms() - t)

@peterhinch
Contributor

For comparison, an RP2 Pico with standard firmware shows 2-3 ms. From past experience a Pyboard would show 1-2 ms.

Incidentally, while OK in quick tests, in production code subtracting ticks values is unsafe because of rollover: use ticks_diff().

@jkhax0r

jkhax0r commented Mar 29, 2023

@peterhinch Thanks for the comment on ticks_diff()... yeah, I was being lazy, but that had also lazily made its way into some of my other, more useful debug code too.

I've also now reviewed this PR a bit more carefully, to try to understand the GC better. The code seems to check out.

@damz I had a thought while reviewing that is sort of related to this watermark: I was trying to figure out how to deal with the case where a few bursty periods of e.g. 4-byte allocations end up high in RAM and completely neuter the improvements from this PR...

What if I were to make a bitmap and cache of e.g. every 1024 bytes of RAM that is 100% free? Call that the free page map. For an 8 MB chip that would only be 1 KB in SRAM with instant access. If I'm understanding things, gc_sweep() previously touched every bit of RAM, so with this "free page map" it would now only read pages that are known to be used, clearing them as free as it goes, and the allocator would unconditionally dirty new pages. Do you think this would have a benefit?
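
A rough sketch of that idea, assuming a 1024-byte page granularity and an 8 MB heap (everything here, including the used_page_map name and the mark/test helpers, is hypothetical and not code from this PR):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 1024u
#define HEAP_SIZE (8u * 1024u * 1024u)          // e.g. 8 MB of SPIRAM
#define NUM_PAGES (HEAP_SIZE / PAGE_SIZE)       // 8192 pages
static uint32_t used_page_map[NUM_PAGES / 32];  // 1 KB of SRAM, one bit per page

// The allocator unconditionally dirties every page an allocation touches.
static void mark_pages_used(size_t offset, size_t len) {
    for (size_t page = offset / PAGE_SIZE; page <= (offset + len - 1) / PAGE_SIZE; page++) {
        used_page_map[page / 32] |= 1u << (page % 32);
    }
}

// The sweep only inspects pages whose bit is set; a whole word of zeros lets it
// skip 32 pages (32 KB of heap) with a single comparison.
static bool page_may_contain_used_blocks(size_t page) {
    return (used_page_map[page / 32] >> (page % 32)) & 1u;
}

int main(void) {
    mark_pages_used(5 * PAGE_SIZE + 100, 2000);  // an allocation spanning pages 5-7
    for (size_t page = 0; page < 10; page++) {
        printf("page %zu: %s\n", page, page_may_contain_used_blocks(page) ? "scan" : "skip");
    }
    return 0;
}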

Regardless, the current PR is very beneficial for these fairly new, large, and slow ESP32-S3 SPIRAM chips! Thanks!

@jkhax0r

jkhax0r commented Apr 7, 2023

I've implemented what I discussed above, plus did a bunch of timing profiling and related optimization/refactoring. Basically the core "used map" code is a way to track ranges of free GC memory so they can be quickly skipped over. This made some SIGNIFICANT improvements to GC timings on ESP32 with large/slow SPIRAM.

I was at 80ms to 100ms before finding this thread a week ago, and am now down to about 6ms running the same code, using ~170KB of RAM (8MB SPIRAM). My gc_info() (or Python gc.mem_free()/gc.mem_alloc()) calls dropped from 77ms down to 1.5ms!

jkhax0r@2d1c3d6

SHA-1: 2d1c3d6

Commit comments to describe more...

  • Various optimizations for garbage collection to drastically improve the timings on ESP32-S3 with 8 MB SPIRAM. This builds on @damz's impressive yet simple gc_high_watermark code.
  • Implemented a "usedmap" garbage-collector cache, a bitmap indicating which blocks are free. This creates a cached lookup of ranges of free blocks (the usedmap), which lets the sweep routines quickly skip over whole ranges of blocks that are known to be free. This gives drastic speed improvements on systems with a lot of unused SPIRAM (and should help on any system that has free memory ranges), because it avoids having to check every block to know it's free. It should also keep working when there is moderate fragmentation, keeping the speed relatively constant for a given memory-usage level. A lower blocks-per-bit setting will use more RAM and the scanning will take slightly longer, but fragmentation should become less of a factor.

    Memory required is relative to heap size: mem_required = total_heap_size / (MICROPY_GC_USEDMAP_BLOCKS_PER_BIT * 8 * BYTES_PER_BLOCK)

    As an example, a setting of 32 blocks per bit is 512 bytes per bit, so the sweep will skip 512 bytes at a time when they are free. There is also an optimization for when an entire 32-bit word of the map (16 KB of RAM) is free, skipping it all at once. Very useful with large, unused and slow SPIRAM; likely also helpful on SRAM systems, and it shouldn't hurt much on low-SRAM / highly utilized systems.

  • Implemented various optimizations of the more common loop paths. Specifically, moved the AT_FREE checks to the start so the code does a single check in the typical case, instead of going through the switch where AT_FREE is presumably checked last. Also added the __builtin_expect()-based likely() GCC hint in places where it may be relevant, although I have no idea whether this actually changes the generated code on the ESP32 Xtensa core (it does on ARM to my knowledge). These simple optimizations also resulted in a very measurable performance increase.

  • Added #pragma GCC unroll 4, 8 or 16 at the top of a few loops, depending on their relative size vs the tradeoff (see the short sketch after this list). Again another big improvement, but this may need a setting for non-ESP32 systems that care more about flash size. The results are good, but my SPECULATION is that perhaps it allows the memory and pipeline to queue e.g. 4 or 8 reads of RAM and then write them back more efficiently without a write block. Also, I believe the ESP32 SPIRAM reads in 32-byte chunks, so maybe this helps if the cache performs poorly.

  • Applied the usedmap as well as @damz's gc_high_watermark code to gc_info_end_of_sequence(). Then rewrote it, mostly to avoid the logic that checks the next ATB, instead tracking the last one to avoid the redundant reads. Also did all the AT_FREE and likely() optimizations there too.

    // Some benchmarks... WOW!!...
    // ESP32S3 use case with 8MB SPIRAM and about ~170KB used
    // gc_info() old is 77.7ms
    // gc_info() new is 1.53ms!
    // gc_info() new, but with full iterations, is 24.4ms. This was with the gc_high_watermark and the quick usedmap_ndx "continue" exits removed, to show just what the other code optimization/refactoring did.

  • Also, @damz missed the free-block range skipping in the gc_deal_with_stack_overflow() function. This was actually the root of the problems that got me started on this code. The default GC nest/stack depth is 32 and I was unknowingly exceeding it after some time, which then triggered the full memory scan for the MARK phase.
    And unfortunately this code was not very optimized, so I'd see gc.collect() times go from e.g. 50 to 100ms at random times. Adding the gc_high_watermark checks brought this down in proportion to my usage (about 200KB of 8000KB... so A LOT). The usedmap work will hopefully keep it somewhat stable, and the loop unrolling, and even shuffling the AT_FREE check, helped a lot too by giving the quick loop path a hint.

  • Added a gc.mem_info() library function. Added because I previously had to call gc.mem_free() and then gc.mem_alloc(), each taking 50ms+, for my webserver to retrieve the total, free and allocated memory. Now it's only 1.5ms in my scenario, but it's still worth putting all the info in one call.
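
For reference, a hypothetical, self-contained illustration of the unrolling hint mentioned in the list above (count_free_words and the map layout are invented for the example; #pragma GCC unroll N is accepted by GCC 8 and later on the loop that immediately follows it):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

// Count map words that are entirely zero, i.e. runs of pages known to be free.
static size_t count_free_words(const uint32_t *map, size_t n_words) {
    size_t n_free = 0;
    #pragma GCC unroll 8
    for (size_t i = 0; i < n_words; i++) {
        if (map[i] == 0) {
            n_free++;
        }
    }
    return n_free;
}

int main(void) {
    uint32_t map[64] = {0};
    map[3] = 0x10;  // pretend one word has a used page in it
    printf("free words: %zu\n", count_free_words(map, 64));  // prints 63
    return 0;
}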

@raphick99

Any updates on when this will be merged?

py/gc.c Outdated
@@ -421,14 +429,18 @@ STATIC void gc_sweep(void) {
memset((void *)PTR_FROM_BLOCK(area, block), 0, BYTES_PER_BLOCK);
#endif
}
last_block = block;
Member


Why is this line needed? In this case the sweep is collecting the head/tail block(s) and converting them to free blocks, so why should the block be recorded as the last block?

Member

@jimmo jimmo Aug 4, 2023


I think this line is necessary but should be inside an else. So if we're freeing the tail, then don't record it as the last block, but if we're not then this is an (effectively unmarked) tail, so it does need to be recorded.

Also, I think we should rename last_block to last_used_block.
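
A sketch of the control flow that suggestion implies, based on the diff context quoted above (simplified; not necessarily the exact code that was merged):

case AT_TAIL:
    if (free_tail) {
        // The head of this allocation was unmarked, so its tail blocks are
        // being freed too and must not update the tracking.
        ATB_ANY_TO_FREE(area, block);
        #if CLEAR_ON_SWEEP
        memset((void *)PTR_FROM_BLOCK(area, block), 0, BYTES_PER_BLOCK);
        #endif
    } else {
        // Tail of a live allocation: record it as the last used block.
        last_used_block = block;
    }
    break;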

py/mpstate.h Outdated
@@ -87,6 +87,7 @@ typedef struct _mp_state_mem_area_t {
byte *gc_pool_end;

size_t gc_last_free_atb_index;
size_t gc_high_watermark; // The block ID of the highest block allocated in the area
Member


I suggest renaming this variable to gc_last_used_block or gc_last_allocated_block or gc_last_non_free_block (because watermark could mean a mark for other statistics/counters).

Member


+1 to gc_last_used_block.

@dpgeorge
Member

This is a very nice improvement! Combined with appropriate use of gc.threshold() it effectively allows a program to act as though it has a small and fast heap.

py/gc.c Outdated
@@ -680,6 +692,10 @@ void *gc_alloc(size_t n_bytes, unsigned int alloc_flags) {
area->gc_last_free_atb_index = (i + 1) / BLOCKS_PER_ATB;
}

if (area->gc_high_watermark < end_block) {
Member


Prefer to write as area->gc_high_watermark = MAX(area->gc_high_watermark, end_block); as I feel that conveys the intent better.

py/gc.c Outdated
assert(ATB_GET_KIND(area, bl) == AT_FREE);
ATB_FREE_TO_TAIL(area, bl);
}

if (area->gc_high_watermark < end_block) {
Member


As above.

In applications that use little memory and run GC regularly, the cost of
the sweep phase quickly becomes prohibitive as the amount of RAM
increases.

On an ESP32-S3 with 2 MB of external SPIRAM, for example, a trivial GC
cycle takes a minimum of 40ms, virtually all of it in the sweep phase.

Similarly, on the UNIX port with 1 GB of heap, a trivial GC takes 47 ms,
again virtually all of it in the sweep phase.

This commit speeds up the sweep phase in the case most of the heap is empty
by keeping track of the ID of the highest block we allocated in an area
since the last GC.

The performance benchmark run on PYBV10 shows between +0 and +2%
improvement across the existing performance tests.  These tests don't
really stress the GC, so they were also run with gc.threshold(30000) and
gc.threshold(10000).  For the 30000 case, performance improved by up to
+10% with this commit.  For the 10000 case, performance improved by at
least +10% on 6 tests, and up to +25%.

Signed-off-by: Damien George <damien@micropython.org>
@codecov

codecov bot commented Aug 4, 2023

Codecov Report

Merging #10235 (f841525) into master (70c5643) will increase coverage by 0.09%.
The diff coverage is 100.00%.

❗ Current head f841525 differs from pull request most recent head 2dcd745. Consider uploading reports for the commit 2dcd745 to get more accurate results

@@            Coverage Diff             @@
##           master   #10235      +/-   ##
==========================================
+ Coverage   98.38%   98.47%   +0.09%     
==========================================
  Files         157      155       -2     
  Lines       20704    20508     -196     
==========================================
- Hits        20369    20195     -174     
+ Misses        335      313      -22     
Files Changed Coverage Δ
py/gc.c 99.21% <100.00%> (+0.01%) ⬆️

... and 78 files with indirect coverage changes

@dpgeorge
Member

dpgeorge commented Aug 4, 2023

I updated the PR based on the above comments.

I ran some performance tests on PYBv1.0. The tests (as they are) are between +0 and +2% improved.

Adding gc.threshold(10000) at the start of each performance test (as suggested by @jimmo above) gives the following improvement for this PR (baseline with gc.threshold, compared to this PR with gc.threshold):

diff of scores (higher is better)
N=100 M=100                   p-10k-0 ->    p-10k-1         diff      diff% (error%)
bm_chaos.py                    282.52 ->     315.70 :     +33.18 = +11.744% (+/-0.00%)
bm_fannkuch.py                  62.40 ->      71.18 :      +8.78 = +14.071% (+/-0.00%)
bm_fft.py                     1819.01 ->    1920.81 :    +101.80 =  +5.596% (+/-0.00%)
bm_float.py                   3676.06 ->    3973.67 :    +297.61 =  +8.096% (+/-0.01%)
bm_hexiom.py                    44.13 ->      44.92 :      +0.79 =  +1.790% (+/-0.00%)
bm_nqueens.py                 3468.26 ->    4010.27 :    +542.01 = +15.628% (+/-0.00%)
bm_pidigits.py                 328.60 ->     416.59 :     +87.99 = +26.777% (+/-0.66%)
core_import_mpy_multi.py       391.22 ->     420.58 :     +29.36 =  +7.505% (+/-0.00%)
core_import_mpy_single.py       60.31 ->      63.44 :      +3.13 =  +5.190% (+/-0.00%)
core_qstr.py                    61.66 ->      63.32 :      +1.66 =  +2.692% (+/-0.00%)
core_yield_from.py             355.04 ->     353.97 :      -1.07 =  -0.301% (+/-0.00%)
misc_aes.py                    403.14 ->     404.25 :      +1.11 =  +0.275% (+/-0.00%)
misc_mandel.py                2545.45 ->    2923.04 :    +377.59 = +14.834% (+/-0.01%)
misc_pystone.py               2072.82 ->    2130.02 :     +57.20 =  +2.760% (+/-0.02%)
misc_raytrace.py               289.02 ->     323.55 :     +34.53 = +11.947% (+/-0.00%)
viper_call0.py                 502.48 ->     502.51 :      +0.03 =  +0.006% (+/-0.00%)
viper_call1a.py                526.02 ->     526.06 :      +0.04 =  +0.008% (+/-0.00%)
viper_call1b.py                384.11 ->     392.19 :      +8.08 =  +2.104% (+/-0.00%)
viper_call1c.py                388.55 ->     396.83 :      +8.28 =  +2.131% (+/-0.00%)
viper_call2a.py                513.15 ->     513.20 :      +0.05 =  +0.010% (+/-0.00%)
viper_call2b.py                339.13 ->     344.70 :      +5.57 =  +1.642% (+/-0.00%)
