py/modmicropython: Add micropython.memmove() and micropython.memset(). #12487

projectgus · 2023-09-20T08:32:30Z

This was based on a discussion about providing a faster way to copy data between buffers. However, benchmarks so far suggest it might not be worth it compared to optimising the "copy to/from slice" code paths used by idiomatic Python.

Summary

Adds two functions to the micropython module, gated behind a new config option:

micropython.memmove(dest, dest_idx, src, src_idx, [len]) - an optimised equivalent of dest[dest_idx:dest_idx+len] = src[src_idx:src_idx+len]. Copies memory contents with the semantics of C memmove, hence the name. The len argument is optional; it defaults to the smaller of the lengths of the source and destination regions.
micropython.memset(dest, dest_idx=0, c=0, len=len(dest)-dest_idx) - an optimised equivalent of dest[dest_idx:dest_idx+len] = bytes([c]*len). Modelled on C's memset.

Unlike assigning to a slice, the destination buffer size never changes as a result of calling either of these functions. Out of bounds assignment raises an exception.
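To make the semantics above concrete, here is a minimal pure-Python model of the two functions. It is only a sketch of the behaviour described in this summary, not the C implementation from the PR. It assumes byte-oriented buffers (bytes, bytearray), and the choice of ValueError for out-of-bounds arguments is an assumption, since the summary does not name the exception type.

```python
def memmove(dest, dest_idx, src, src_idx, length=None):
    """Model of the proposed memmove(): copy without ever resizing dest."""
    dest_view = memoryview(dest)
    src_view = memoryview(src)
    if not (0 <= dest_idx <= len(dest_view)) or not (0 <= src_idx <= len(src_view)):
        raise ValueError("index out of bounds")  # exception type is an assumption
    dest_room = len(dest_view) - dest_idx
    src_room = len(src_view) - src_idx
    if length is None:
        # Default: the smaller of the source and destination regions.
        length = min(dest_room, src_room)
    elif length < 0 or length > dest_room or length > src_room:
        raise ValueError("length out of bounds")
    # Snapshot the source bytes first so overlapping regions behave like C memmove.
    snapshot = bytes(src_view[src_idx:src_idx + length])
    dest_view[dest_idx:dest_idx + length] = snapshot


def memset(dest, dest_idx=0, c=0, length=None):
    """Model of the proposed memset(): fill part of dest with byte value c."""
    dest_view = memoryview(dest)
    if not (0 <= dest_idx <= len(dest_view)):
        raise ValueError("index out of bounds")  # exception type is an assumption
    room = len(dest_view) - dest_idx
    if length is None:
        length = room
    elif length < 0 or length > room:
        raise ValueError("length out of bounds")
    dest_view[dest_idx:dest_idx + length] = bytes([c]) * length
```

Writing through a memoryview means the assignment can never grow or shrink the destination, which mirrors the "size never changes" guarantee described above.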

Benchmarks - memmove

Comparing memmove to current MicroPython "best practices" (unix port, i5-1248P CPU):

❯ ./run-internalbench.py --user-time internal_bench/slice_copy*.py
internal_bench/slice_copy:
    1.143s (+00.00%) internal_bench/slice_copy-1-lvalue_start.py
    1.248s (+09.19%) internal_bench/slice_copy-2-lvalue_start_end.py
    2.607s (+128.08%) internal_bench/slice_copy-3-lvalue_rvalue_start.py
    1.144s (+00.09%) internal_bench/slice_copy-4-lvalue_memoryview.py
    2.047s (+79.09%) internal_bench/slice_copy-5-lvalue_rvalue_memoryview.py
    1.072s (-06.21%) internal_bench/slice_copy-6-memmove.py
1 tests performed (6 individual testcases)

Honestly I found this a little underwhelming! Admittedly, slice_copy-6-memmove.py can do the equivalent of slice_copy-5-lvalue_rvalue_memoryview.py (slices on both sides of the assignment) and it's almost twice as fast, but it's only twice as fast (in a tight loop that does nothing else, working with pretty short buffers.)
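For readers who have not seen the benchmark files, the kinds of copy being compared look roughly like the sketch below. The function names and arguments here are hypothetical and only inferred from the test file names; the actual benchmark sources are not shown in this conversation.

```python
def copy_lvalue_slice(dest, chunk, start):
    # Slice on the left-hand side only: the source is already the right size.
    dest[start:start + len(chunk)] = chunk


def copy_both_slices(dest, src, start, end):
    # Slices on both sides: src[start:end] builds a temporary copy of the data.
    dest[start:end] = src[start:end]


def copy_both_memoryviews(dest, src, start, end):
    # memoryview slices avoid copying the source data, but each slice
    # expression still creates a new (small) view object.
    memoryview(dest)[start:end] = memoryview(src)[start:end]


src = bytes(range(16))
a, b, c = bytearray(16), bytearray(16), bytearray(16)
copy_lvalue_slice(a, src[4:12], 4)
copy_both_slices(b, src, 4, 12)
copy_both_memoryviews(c, src, 4, 12)
```

All three leave the destination with the same contents; the benchmark differences come from how much temporary allocation each form needs.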

Maybe the C implementation of memmove() needs some tweaks to streamline the error checking 🤷 .

When rebased against PR #10160 things get even closer:

❯ ./run-internalbench.py --user-time internal_bench/slice_copy*.py
internal_bench/slice_copy:
    0.969s (+00.00%) internal_bench/slice_copy-1-lvalue_start.py
    1.062s (+09.60%) internal_bench/slice_copy-2-lvalue_start_end.py
    2.218s (+128.90%) internal_bench/slice_copy-3-lvalue_rvalue_start.py
    0.968s (-00.10%) internal_bench/slice_copy-4-lvalue_memoryview.py
    1.743s (+79.88%) internal_bench/slice_copy-5-lvalue_rvalue_memoryview.py
    1.092s (+12.69%) internal_bench/slice_copy-6-memmove.py
1 tests performed (6 individual testcases)

Now slice_copy-6-memmove.py is only 1.6x faster than slice_copy-5-lvalue_rvalue_memoryview.py, and no faster than assigning a buffer to an lvalue slice...
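The percentages and speed-up factors quoted above can be reproduced from the raw timings. A small sketch (the helper names are made up for illustration):

```python
def relative_change(time_s, baseline_s):
    """Percentage change of time_s relative to the first (baseline) testcase."""
    return (time_s / baseline_s - 1.0) * 100.0


def speedup(slow_s, fast_s):
    """How many times faster the fast case is than the slow case."""
    return slow_s / fast_s


# First run: memoryview slices on both sides vs memmove.
before = speedup(2.047, 1.072)
# After rebasing on PR #10160.
after = speedup(1.743, 1.092)
print(round(before, 2), round(after, 2))
```

This prints 1.91 and 1.6, matching "almost twice as fast" and "1.6x faster" in the text.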

Benchmarks - memset

❯ ./run-internalbench.py internal_bench/set_bytearray*
internal_bench/set_bytearray:
    10.591s (+00.00%) internal_bench/set_bytearray-1-naive.py
    9.405s (-11.20%) internal_bench/set_bytearray-2-naive-while.py
    1.110s (-89.52%) internal_bench/set_bytearray-3-copy_bytes.py
    0.935s (-91.17%) internal_bench/set_bytearray-4-memset.py
1 tests performed (4 individual testcases)

Kind of the same story with memset: copying from a pre-built bytes object (which can be frozen to flash) is basically as fast as using the memset() function. The naive versions of this are a lot slower, though!
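As a rough illustration of the approaches being compared, the sketch below fills a bytearray three ways. The function names are hypothetical and only inferred from the test file names; the real benchmark files may differ.

```python
def fill_naive(buf, value):
    # One bytecode-level store per byte, driven by a for loop.
    for i in range(len(buf)):
        buf[i] = value


def fill_naive_while(buf, value):
    # Same work, with an explicit while loop instead of range().
    i = 0
    n = len(buf)
    while i < n:
        buf[i] = value
        i += 1


def fill_copy_bytes(buf, pattern):
    # Copy from a pre-built bytes object of the same length as buf.
    # Writing through a memoryview guarantees buf is never resized.
    memoryview(buf)[:] = pattern


SIZE = 64
ZEROS = bytes(SIZE)  # built once; on MicroPython this could live in frozen code

x = bytearray(b"\xff" * SIZE)
y = bytearray(b"\xff" * SIZE)
z = bytearray(b"\xff" * SIZE)
fill_naive(x, 0)
fill_naive_while(y, 0)
fill_copy_bytes(z, ZEROS)
```

The copy-from-bytes form does the whole fill in a single buffer copy, which is why it lands so close to a dedicated memset().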

Disclaimer: The new test file names take some liberties with the meaning of lvalue and rvalue; happy to take suggestions for more accurate terms to use.

This work was funded through GitHub Sponsors.

github-actions · 2023-09-20T08:44:15Z

Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:  +968 +0.121% standard[incl +96(data)]
      stm32:  +348 +0.089% PYBV10
     mimxrt:  +328 +0.091% TEENSY40
        rp2:  +376 +0.115% RPI_PICO
       samd:  +348 +0.134% ADAFRUIT_ITSYBITSY_M4_EXPRESS

codecov · 2023-09-20T08:54:07Z

Codecov Report

Attention: Patch coverage is 0% with 43 lines in your changes missing coverage. Please review.

Project coverage is 98.18%. Comparing base (00930b2) to head (b55ac53).
Report is 1266 commits behind head on master.

Files with missing lines	Patch %	Lines
py/modmicropython.c	0.00%	43 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #12487      +/-   ##
==========================================
- Coverage   98.38%   98.18%   -0.20%     
==========================================
  Files         158      158              
  Lines       20940    20981      +41     
==========================================
- Hits        20602    20601       -1     
- Misses        338      380      +42


This was based on a discussion about providing a more optimal way to copy data between buffers, however based on the benchmarking so far it seems like it might not be worth the overhead. Signed-off-by: Angus Gratton <angus@redyak.com.au>

projectgus · 2023-09-20T09:32:42Z

Now slice_copy-6-memmove.py is only 1.6x faster than slice_copy-5-lvalue_rvalue_memoryview.py, and no faster than assigning a buffer to an lvalue slice...

After applying the thread-local slice optimisation and poking around with flamegraph and gdb, all of the time is spent allocating the "load" side slice when the LOAD_SUBSCR opcode is executed. This ends up at objarray.c:531 which needs to heap allocate another mp_obj_array_t (memoryview copy) in order to hold the slice parameters wrapped around the underlying memoryview.

This is another very short-lived heap allocation, but it looks like it would be much harder to optimise than the thread-local slice case.
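The allocation described above is visible from Python: every slice of a memoryview is a new object wrapping the same underlying buffer. The sketch below shows this, plus one way user code can sidestep the per-iteration allocation when the source window is fixed. It demonstrates object identity and buffer sharing only; the heap behaviour itself is as described in the comment above and is not measured here.

```python
buf = bytearray(b"0123456789")
mv = memoryview(buf)

# Each slice expression yields a distinct view object...
first = mv[2:6]
second = mv[2:6]
distinct = first is not second

# ...but both views share the same underlying memory.
first[0] = ord("X")
shared = bytes(second)

# If the source window does not change, slice once outside the loop
# so the loop body only pays for the store-side slice.
dest = bytearray(4)
dest_mv = memoryview(dest)
window = mv[2:6]
for _ in range(3):
    dest_mv[0:4] = window
```

Hoisting only helps when the indices are loop-invariant, which is the case the proposed memmove() was meant to cover more generally.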

projectgus · 2024-03-07T23:49:31Z

This is an automated heads-up that we've just merged a Pull Request
that removes the STATIC macro from MicroPython's C API.

See #13763

A search suggests this PR might apply the STATIC macro to some C code. If it
does, then next time you rebase the PR (or merge from master) then you should
please replace all the STATIC keywords with static.

Although this is an automated message, feel free to @-reply to me directly if
you have any questions about this.

projectgus force-pushed the feature/memmove_memset branch from bbc7eaf to b55ac53 Compare September 20, 2023 09:05

projectgus mentioned this pull request Oct 5, 2023

usb: Add high-level USB device support packages micropython/micropython-lib#558

Merged

3 tasks

dpgeorge added the py-core Relates to py/ directory in source label Oct 11, 2023

Gadgetoid mentioned this pull request Feb 29, 2024

global: Remove the STATIC macro. #13763

Merged

projectgus closed this Nov 1, 2024

projectgus deleted the feature/memmove_memset branch November 1, 2024 05:11
