py/gc: Speed up incremental GC cycles by tracking the high watermark. #10235
Conversation
Code size report:
Some performance measurements. Using this trivial script:

    import gc
    import time

    for i in range(30):
        start = time.ticks_ms(); gc.collect(); print(time.ticks_diff(time.ticks_ms(), start))

On the ESP32-S3 with 2 MB of SPIRAM, right after boot:
On the Unix port with a 1 GB heap:
In our real-life application (the Ribbit Network frog sensor, on the same ESP32-S3 with 2 MB of SPIRAM), we run GC every minute and use ~43 kB of memory after the GC completes; this cuts the GC duration as well:
Force-pushed from 9fa5629 to f841525.
Code size report:
This is +56 bytes on PYBV11. Performance tests look positive too. Note that the performance benchmarks don't explicitly test the GC, but many tests will trigger collections.
As a very rudimentary attempt to make the benchmarks more sensitive to GC performance, I added a gc.threshold() call to the benchmark runs. First, this diff shows the effect that has on the tests (without the diff from this PR).
Now with this PR added:
Hi. Has this PR been merged yet?
No, this PR has not been merged yet.
I'm on an ESP32-S3 WROOM module with 8 MB octal SPIRAM on custom hardware. I found this PR because I was struggling with ~105 ms gc.collect() times when I'm only actually using about 80 KB of RAM (after GC). Running gc.collect() every 250 ms I was seeing ~100 ms, and ~105 ms when I run it every 1-2 minutes. I manually merged this PR and went from that ~100 ms down to ~46 ms when running gc.collect() from a 250 ms loop, and from 105 ms down to 50 ms when run every 1-2 minutes. Thanks for the PR... so simple for such a huge improvement! Do these numbers seem right to you guys? Prior to this PR I was seeing ~60 ms IIRC with just the below loop from the REPL with no code running; now it's at 0 to 1 ms. Nice.
For comparison, an RP2 Pico with standard firmware shows 2-3 ms. From past experience a Pyboard would show 1-2 ms. Incidentally, while OK in quick tests, in production code subtracting raw ticks_ms() values directly will eventually give wrong results when the tick counter wraps; time.ticks_diff() should be used instead.
@peterhinch Thanks for the comment on ticks_diff()... yeah, I was being lazy, but that had also lazily made its way into some of my other, more useful debug code too. I've also reviewed this PR a bit more carefully now to try to understand the GC better. The code seems to check out. @damz I had a thought while reviewing that is sort of related to this watermark: I was trying to figure out how to deal with the case of a few bursty periods where e.g. 4-byte allocations end up high in RAM and completely neuter the improvements from this PR... What if I made a bitmap cache of every 1024 bytes of RAM that is 100% free? Call that the free page map. For an 8 MB chip that would only be 1 KB in SRAM, with instant access. If I'm understanding things, gc_sweep() previously touched every bit of RAM, so with this "free page map" it would only read pages that are known to be used, and clear them to free as it comes across them; similarly, allocation would unconditionally dirty new ones. Do you think this would have benefit? Regardless, the current PR is very beneficial for these fairly new, large, and slow ESP32-S3 SPIRAM chips! Thanks!
I've implemented what I discussed above, plus did a bunch of timing profiling and related optimization/refactoring. Basically, the core "used map" code is a way to track ranges of free GC memory so they can be quickly skipped over. This made some significant improvements to GC timings on ESP32 with large/slow SPIRAM: I was at 80 ms to 100 ms before finding this thread a week ago, and am now down to about 6 ms running the same code, using ~170 KB of RAM (8 MB SPIRAM). My gc_info() or Python gc.mem_free()/gc.mem_alloc() calls dropped from 77 ms down to 1.5 ms! SHA-1: 2d1c3d6 (see the commit comments for more detail).
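For illustration, here is a rough C sketch of the free-page-map idea described in the two comments above. The `free_page_map` field, the helper names, and the 1024-byte page size are assumptions made for the sketch; they are not part of this PR or of the commit referenced above, and the code presumes it lives inside py/gc.c where `mp_state_mem_area_t` and `BYTES_PER_BLOCK` are already defined.

```c
// Sketch only: one bit per 1024-byte "page" of a GC area, set while the
// page is known to contain no allocated blocks. For an 8 MB area this
// map is 1 KB and can live in fast internal SRAM.
#define FREE_PAGE_SIZE       (1024)
#define BLOCKS_PER_FREE_PAGE (FREE_PAGE_SIZE / BYTES_PER_BLOCK)

static inline bool page_is_free(mp_state_mem_area_t *area, size_t page) {
    return (area->free_page_map[page / 8] >> (page % 8)) & 1; // hypothetical field
}

static inline void page_mark_used(mp_state_mem_area_t *area, size_t page) {
    // gc_alloc() would clear the bit unconditionally for every page it touches.
    area->free_page_map[page / 8] &= (uint8_t) ~(1u << (page % 8));
}

// In gc_sweep(), whole pages known to be free could then be skipped:
//
//     if (page_is_free(area, block / BLOCKS_PER_FREE_PAGE)) {
//         block += BLOCKS_PER_FREE_PAGE - 1; // the loop increment adds the last one
//         continue;
//     }
```

The bookkeeping cost sits on the other side: the sweep (or the allocator) has to set a page's bit again once every block in it is free, which is where most of the extra complexity of such a scheme would live.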
Any updates on when this will be merged?
py/gc.c
Outdated
@@ -421,14 +429,18 @@ STATIC void gc_sweep(void) {
memset((void *)PTR_FROM_BLOCK(area, block), 0, BYTES_PER_BLOCK);
#endif
}
last_block = block;
Why is this line needed? In this case the sweep is collecting the head/tail block(s) and converting them to free blocks, so why should the block be recorded as the last block?
I think this line is necessary but should be inside an `else`. So if we're freeing the tail, then don't record it as the last block, but if we're not then this is an (effectively unmarked) tail, so it does need to be recorded.

Also, I think we should rename `last_block` to `last_used_block`.
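To make that concrete, here is a sketch of how the AT_TAIL case of gc_sweep() might look with the suggested `else` and the suggested `last_used_block` name. It is an illustration based on the surrounding code in py/gc.c, not the exact final patch.

```c
case AT_TAIL:
    if (free_tail) {
        // The head of this allocation was freed just above, so this tail
        // becomes free as well and must not be recorded.
        ATB_ANY_TO_FREE(area, block);
        #if CLEAR_ON_SWEEP
        memset((void *)PTR_FROM_BLOCK(area, block), 0, BYTES_PER_BLOCK);
        #endif
    } else {
        // Tail of a live (marked) allocation: it is still in use, so it is
        // the highest used block seen so far in this sweep.
        last_used_block = block;
    }
    break;
```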
py/mpstate.h
Outdated
@@ -87,6 +87,7 @@ typedef struct _mp_state_mem_area_t {
byte *gc_pool_end;

size_t gc_last_free_atb_index;
size_t gc_high_watermark; // The block ID of the highest block allocated in the area |
I suggest renaming this variable to `gc_last_used_block` or `gc_last_allocated_block` or `gc_last_non_free_block` (because watermark could mean a mark for other statistics/counters).
+1 to `gc_last_used_block`.
This is a very nice improvement! Combined with appropriate use of …
py/gc.c
Outdated
@@ -680,6 +692,10 @@ void *gc_alloc(size_t n_bytes, unsigned int alloc_flags) {
area->gc_last_free_atb_index = (i + 1) / BLOCKS_PER_ATB;
}

if (area->gc_high_watermark < end_block) {
Prefer to write this as `area->gc_high_watermark = MAX(area->gc_high_watermark, end_block);`, as I feel that conveys the intent better.
py/gc.c
Outdated
assert(ATB_GET_KIND(area, bl) == AT_FREE);
ATB_FREE_TO_TAIL(area, bl);
}

if (area->gc_high_watermark < end_block) {
As above.
In applications that use little memory and run GC regularly, the cost of the sweep phase quickly becomes prohibitive as the amount of RAM increases. On an ESP32-S3 with 2 MB of external SPIRAM, for example, a trivial GC cycle takes a minimum of 40 ms, virtually all of it in the sweep phase. Similarly, on the UNIX port with 1 GB of heap, a trivial GC takes 47 ms, again virtually all of it in the sweep phase.

This commit speeds up the sweep phase in the case most of the heap is empty by keeping track of the ID of the highest block we allocated in an area since the last GC.

The performance benchmark run on PYBV10 shows between +0 and +2% improvement across the existing performance tests. These tests don't really stress the GC, so they were also run with gc.threshold(30000) and gc.threshold(10000). For the 30000 case, performance improved by up to +10% with this commit. For the 10000 case, performance improved by at least +10% on 6 tests, and up to +25%.

Signed-off-by: Damien George <damien@micropython.org>
Force-pushed from f841525 to 2dcd745.
Codecov Report
@@ Coverage Diff @@
## master #10235 +/- ##
==========================================
+ Coverage 98.38% 98.47% +0.09%
==========================================
Files 157 155 -2
Lines 20704 20508 -196
==========================================
- Hits 20369 20195 -174
+ Misses 335 313 -22
I updated the PR based on the above comments. I ran some performance tests on PYBv1.0. The tests (as they are) are between +0 and +2% improved. Adding gc.threshold(30000) or gc.threshold(10000) to stress the GC shows larger gains, up to +10% and +25% respectively (see the commit message for the details).
Why?
In applications that use little memory and run GC regularly, the cost of the sweep phase quickly becomes prohibitive as the amount of RAM increases.
On an ESP32-S3 with 2 MB of external SPIRAM, for example, a trivial GC cycle takes a minimum of 20 ms, a big part of it in the sweep phase.
Similarly, on the UNIX port with 1 GB of heap, a trivial GC takes 47 ms, virtually all of it in the sweep phase.
What?
Speed up the sweep phase in the case most of the heap is empty by keeping track of the ID of the highest block we allocated in an area since the last GC.
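A minimal sketch of the change to the sweep loop, using the `gc_last_used_block` name suggested in review (the merged code may differ in details such as how the value is reset during the sweep):

```c
// Before: gc_sweep() visited every block in the area.
//     for (size_t block = 0; block < area->gc_alloc_table_byte_len * BLOCKS_PER_ATB; block++)

// After: only blocks up to the highest one allocated since the last
// collection need to be visited; gc_alloc()/gc_realloc() bump this value
// whenever they hand out a block above it.
for (size_t block = 0; block <= area->gc_last_used_block; block++) {
    // ... existing head/tail/mark handling ...
    // The sweep can also lower gc_last_used_block again, since freed
    // blocks at the top of the area no longer need to be scanned next time.
}
```

On a mostly empty heap this shrinks the swept range from the whole area to the small prefix that has actually held data since the previous collection, which is where the drops reported above (from roughly 40-105 ms down to a few milliseconds) come from.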