py/objslice: Avoid heap-allocating slices for built-ins. #10160


Closed
jimmo wants to merge 1 commit into master from thread-local-slice

Conversation

@jimmo
Member

jimmo commented Dec 5, 2022

Assume that a slice object (created by the BUILD_SLICE op-code or the native emitter via mp_obj_new_slice) will be used to slice a built-in type. This assumes that all built-in type->subscr functions will:

  1. Not cause another slice to be allocated in the meantime.
  2. Not save a reference to the slice.

This prevents a heap allocation, which for some cases (e.g. memoryview assignment) prevents allocation altogether.

Only when this assumption is incorrect (i.e. when slicing a user-defined type via the __{get,set,del}item__ methods) is the slice object heap-allocated.
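
For reference, a minimal sketch of the idea in C (the thread_slice field and slice_to_heap helper are hypothetical illustrations, not this PR's actual code):

#include "py/runtime.h"
#include "py/objslice.h"

// Hypothetical: a single mp_obj_slice_t instance ("thread_slice") lives in
// mp_state_thread_t, and mp_obj_new_slice() fills it in instead of calling
// m_new_obj(), so the fast path never touches the heap.
mp_obj_t mp_obj_new_slice(mp_obj_t ostart, mp_obj_t ostop, mp_obj_t ostep) {
    mp_obj_slice_t *s = &MP_STATE_THREAD(thread_slice);
    s->base.type = &mp_type_slice;
    s->start = ostart;
    s->stop = ostop;
    s->step = ostep;
    return MP_OBJ_FROM_PTR(s);
}

// Before dispatching to a user-defined __{get,set,del}item__ (objtype.c),
// the slice must be copied to the heap, because user code may keep a
// reference to it beyond the subscript operation.
static mp_obj_t slice_to_heap(mp_obj_t index) {
    if (MP_OBJ_TO_PTR(index) == (void *)&MP_STATE_THREAD(thread_slice)) {
        mp_obj_slice_t *s = m_new_obj(mp_obj_slice_t);
        *s = MP_STATE_THREAD(thread_slice);
        return MP_OBJ_FROM_PTR(s);
    }
    return index;
}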

This adds +156 bytes on PYBV11 (and +16 bytes .bss). Much of this is for handling the tuple-slice case -- at slice creation you don't know whether the slice is going to be passed directly to {LOAD,STORE}_SUBSCR or whether it's one of several slices about to be placed into a tuple.

This was based on a suggestion from @damz on Discord, which I think was along the lines of combining the BUILD_SLICE op with {LOAD,STORE}_SUBSCR, but I think the approach in this PR achieves the same goal with smaller code size and without having to change the bytecode.

This work was funded through GitHub Sponsors.

@jimmo
Member Author

jimmo commented Dec 5, 2022

@v923z It occurs to me that of all the places where this assumption about built-in subscr methods might not hold, ulab might be the most likely. Does ulab require heap-allocated slices?

@v923z
Contributor

v923z commented Dec 5, 2022

@v923z It occurs to me that of all the places where this assumption about built-in subscr methods might not hold, ulab might be the most likely. Does ulab require heap-allocated slices?

No, I don't think so. Slices pop up only here: https://github.com/v923z/micropython-ulab/blob/1a440d7d124a20a3dbaa0909000741eb28a623b3/code/ndarray.c#L1111-L1178. But perhaps I misunderstood your question.

@stinos
Contributor

stinos commented Dec 5, 2022

This feels kind of tricky, although for the current state of the code I can't immediately think of any issues either. It would, however, probably make it pretty hard to effectively implement slice() construction, should we want that.

Would there be a way to verify the assumption even more, like adding a flag which double-checks that mp_obj_new_slice only gets called once the previous slice it created has been consumed by an indexing operation (which presumably means it's no longer needed)? Or some variation thereof?

@jimmo
Member Author

jimmo commented Dec 5, 2022

Thanks @v923z -- that looks fine.

@stinos I think if we implement slice construction then the make_new just needs to unconditionally use the heap.
I'll have a think about your suggestion on how to verify the assumption. Another idea might be to null out the thread-local slice immediately after invoking the subscr function in the {LOAD,STORE}_SUBSCR bytecode ops.

One case I did think of that might need some thought is if a hard IRQ handler now attempts to slice a built-in. Previously this wouldn't have been supported as the ISR would have failed the heap allocation for the slice, but will now work (which is nice). However, if the ISR itself preempts a slice operation right between the BUILD_SLICE and the SUBSCR, then it will clobber the thread-local slice state. This could be solved by having mp_obj_new_slice check the GC lock depth.
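
A sketch of that guard, assuming gc_lock_depth lives in the thread state (as it does on recent builds); in a hard IRQ the GC is locked, so the heap path raises MemoryError exactly as ISR slicing did before this change:

mp_obj_t mp_obj_new_slice(mp_obj_t ostart, mp_obj_t ostop, mp_obj_t ostep) {
    mp_obj_slice_t *s;
    if (MP_STATE_THREAD(gc_lock_depth) != 0) {
        // GC locked (e.g. hard IRQ): don't risk clobbering a preempted
        // BUILD_SLICE/SUBSCR pair; the allocation below fails with
        // MemoryError, matching the pre-PR behaviour.
        s = m_new_obj(mp_obj_slice_t);
    } else {
        s = &MP_STATE_THREAD(thread_slice);
    }
    s->base.type = &mp_type_slice;
    s->start = ostart;
    s->stop = ostop;
    s->step = ostep;
    return MP_OBJ_FROM_PTR(s);
}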

@dpgeorge
Member

dpgeorge commented Dec 5, 2022

This is very neat! Makes me wonder if there are other calls that make objects which could instead return a reference to something in MP_STATE_THREAD(...)...

This will fail (although not sure it's ever used in real code):

a[b:c[d:e]]

I do like the alternative idea of a new opcode, like BUILD_SLICE_ON_STACK. That kind of parallels LOAD_METHOD. Then the slice object (or maybe even just its indices) is stored on the Python stack. That eliminates all issues with threading, IRQs, etc.
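
A rough sketch of what that could look like in the VM loop (py/vm.c); the opcode and the compiler-side reservation of two extra stack slots are hypothetical:

// Hypothetical opcode: the compiler guarantees the next opcode is a
// SUBSCR and reserves two extra value-stack slots, so the slice object
// (4 machine words) can live in the stack itself.
case MP_BC_BUILD_SLICE_ON_STACK: {
    mp_obj_t step = *sp--;
    mp_obj_t stop = *sp--;
    mp_obj_t start = *sp;
    mp_obj_slice_t *slice = (mp_obj_slice_t *)sp; // occupies sp[0..3]
    slice->base.type = &mp_type_slice;
    slice->start = start;
    slice->stop = stop;
    slice->step = step;
    sp += 4; // step over the in-place object body
    *sp = MP_OBJ_FROM_PTR(slice); // reference consumed by the SUBSCR
    DISPATCH();
}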

@dpgeorge added the py-core (Relates to py/ directory in source) label Dec 5, 2022
@dpgeorge
Member

dpgeorge commented Dec 5, 2022

This will fail (although not sure it's ever used in real code):

a[b:c[d:e]]

We could get around this by just disallowing such syntax at the compiler level, e.g. raising a "nested slices not allowed" exception.

However, if the ISR itself preempts a slice operation right between the BUILD_SLICE and the SUBSCR, then it will clobber the thread-local slice state.

A solution for this would be for a hard IRQ handler to have its own mp_state_thread_t instance (basically by making a backup of the existing one, then restoring the backup at the end). That might anyway be a good idea, because hard IRQ handlers can be thought of as a separate (preemptive) thread.
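
A sketch of that backup/restore idea on a non-threaded build, where MP_STATE_THREAD resolves to the global mp_state_ctx.thread (nlr exception protection omitted for brevity):

// Give a hard IRQ handler a private thread state by saving the
// interrupted thread's state and restoring it afterwards.
void run_hard_irq_handler(mp_obj_t handler, mp_obj_t arg) {
    mp_state_thread_t saved = mp_state_ctx.thread; // back up interrupted state
    // (re-initialise any fields the handler needs here, e.g. stack limit)
    mp_call_function_1(handler, arg);
    mp_state_ctx.thread = saved; // restore the interrupted thread's state
}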

@jimmo
Member Author

jimmo commented Dec 5, 2022

This will fail (although not sure it's ever used in real code): a[b:c[d:e]]

I thought this would be OK because the inner slice is created and used on c before the outer slice is created... e.g.

>>> a[0:b[1:2]]
File <stdin>, code block '<module>' (descriptor: 7f58bd427b60, bytecode @7f58bd427ac0 23 bytes)
Raw bytecode (code_info_size=3, bytecode_size=20):
 28 02 01 11 02 11 03 80 11 04 81 82 2e 02 55 2e
 02 55 34 01 59 51 63
arg names:
(N_STATE 6)
(N_EXC_STACK 0)
  bc=0 line=1
00 LOAD_NAME __repl_print__
02 LOAD_NAME a
04 LOAD_CONST_SMALL_INT 0
05 LOAD_NAME b
07 LOAD_CONST_SMALL_INT 1
08 LOAD_CONST_SMALL_INT 2
09 BUILD_SLICE 2
11 LOAD_SUBSCR
12 BUILD_SLICE 2
14 LOAD_SUBSCR
15 CALL_FUNCTION n=1 nkw=0
17 POP_TOP
18 LOAD_CONST_NONE
19 RETURN_VALUE

@jimmo force-pushed the thread-local-slice branch from 6c1776f to a5c7179 on December 6, 2022 01:01
@jimmo
Member Author

jimmo commented Dec 6, 2022

Added a test case for the nested slicing.

@github-actions

github-actions bot commented Dec 6, 2022

Code size report:

   bare-arm:   +28 +0.049% [incl +16(bss)]
minimal x86:   +48 +0.026% [incl +32(bss)]
   unix x64:  +248 +0.031% standard[incl +32(bss)]
      stm32:  +156 +0.040% PYBV10[incl +16(bss)]
     mimxrt:  +152 +0.042% TEENSY40[incl +16(bss)]
        rp2:  +208 +0.064% RPI_PICO[incl +16(bss)]
       samd:  +164 +0.063% ADAFRUIT_ITSYBITSY_M4_EXPRESS[incl +16(bss)]

@dpgeorge
Member

dpgeorge commented Dec 6, 2022

You are right! That case is OK.

But I think this one (and its generalisations) is problematic:

>>> a[1:2,3:4]
File <stdin>, code block '<module>' (descriptor: 7f0a8c89c220, bytecode @7f0a8c89c140 23 bytes)
Raw bytecode (code_info_size=3, bytecode_size=20):
 20 02 01 11 02 11 03 81 82 2e 02 83 84 2e 02 2a
 02 55 34 01 59 51 63
arg names:
(N_STATE 5)
(N_EXC_STACK 0)
  bc=0 line=1
00 LOAD_NAME __repl_print__
02 LOAD_NAME a
04 LOAD_CONST_SMALL_INT 1
05 LOAD_CONST_SMALL_INT 2
06 BUILD_SLICE 2
08 LOAD_CONST_SMALL_INT 3
09 LOAD_CONST_SMALL_INT 4
10 BUILD_SLICE 2
12 BUILD_TUPLE 2
14 LOAD_SUBSCR
15 CALL_FUNCTION n=1 nkw=0
17 POP_TOP
18 LOAD_CONST_NONE
19 RETURN_VALUE

@jimmo
Member Author

jimmo commented Dec 6, 2022

Argh, yeah, that's quite the show-stopper... even with other approaches like a bytecode change (unless we want to make tuples detect that they're being initialised with non-heap-allocated slices). More thought required. I'll close this for now.

@jimmo closed this Dec 6, 2022
@dpgeorge
Member

dpgeorge commented Dec 6, 2022

I think this general idea is worth pursuing. It's pretty rare to use a tuple-of-slices like above (maybe only in numpy?). If you're not using a tuple-of-slices then the compiler guarantees that BUILD_SLICE is followed immediately by LOAD_SUBSCR. So we could make the compiler special case this pair so it doesn't use the heap (instead allocating from the Python stack, like mp_obj_iter_buf_t works).

Also, I don't think it'd be that crazy to do what's done in this PR, but have BUILD_SLICE set a flag that the singleton-slice is used up, and then LOAD_SUBSCR clear that flag (and if BUILD_SLICE is called again before LOAD_SUBSCR then it allocates the slice on the heap).
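
A sketch of that flag scheme (the slice_in_use flag and thread_slice field are hypothetical names):

mp_obj_t mp_obj_new_slice(mp_obj_t ostart, mp_obj_t ostop, mp_obj_t ostep) {
    mp_obj_slice_t *s;
    if (MP_STATE_THREAD(slice_in_use)) {
        // A second BUILD_SLICE before any SUBSCR consumed the first one
        // (e.g. a tuple of slices): fall back to the heap.
        s = m_new_obj(mp_obj_slice_t);
    } else {
        MP_STATE_THREAD(slice_in_use) = true;
        s = &MP_STATE_THREAD(thread_slice);
    }
    s->base.type = &mp_type_slice;
    s->start = ostart;
    s->stop = ostop;
    s->step = ostep;
    return MP_OBJ_FROM_PTR(s);
}

// ...with LOAD_SUBSCR/STORE_SUBSCR clearing the flag after the subscript
// operation has consumed the slice:
//     MP_STATE_THREAD(slice_in_use) = false;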

@v923z
Contributor

v923z commented Dec 6, 2022

I think this general idea is worth pursuing. It's pretty rare to use a tuple-of-slices like above (maybe only in numpy?).

It would be quite bad if slice tuples no longer worked. That is probably the most numpythonic way of working with views of higher-dimensional tensors.

Unfortunately, since this is really at the very core of MicroPython, I couldn't even patch it up on the ulab side.

@dpgeorge
Member

dpgeorge commented Dec 6, 2022

It would be quite bad if slice tuples no longer worked

Oh! They would definitely continue to work; we would make sure of that. But they wouldn't be optimised and would continue to allocate on the heap.

@stinos
Contributor

stinos commented Dec 7, 2022

Unless the number of stack-allocated slices could be made configurable, which wouldn't even add code size for the default case (i.e. 1 slice available on the stack), because the flag mentioned could be used as an index.

On the other hand, using a new opcode would probably remove the need for any flag, work automatically with any number of slices, and probably give smaller code size as well?

@dpgeorge
Member

dpgeorge commented Dec 8, 2022

Unless the number of stack-allocated slices could be made configurable, which wouldn't even add code size for the default case (i.e. 1 slice available on the stack), because the flag mentioned could be used as an index.

Yes, that could be a good generalisation, to have N slice instances in mp_state_t.

On the other hand using a new opcode would probably remove the need for any flag and work automatically with any number of slices, and probably smaller code size as well?

Yes, although maybe not smaller code size (the compiler needs to do some work to determine when to use this new opcode).
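
A sketch of that N-instance generalisation (configuration name and fields hypothetical), where the flag becomes a depth counter indexing a small per-thread array:

mp_obj_t mp_obj_new_slice(mp_obj_t ostart, mp_obj_t ostop, mp_obj_t ostep) {
    mp_obj_slice_t *s;
    if (MP_STATE_THREAD(slice_depth) < MICROPY_NUM_LOCAL_SLICES) {
        s = &MP_STATE_THREAD(slices)[MP_STATE_THREAD(slice_depth)++];
    } else {
        s = m_new_obj(mp_obj_slice_t); // all static instances busy
    }
    s->base.type = &mp_type_slice;
    s->start = ostart;
    s->stop = ostop;
    s->step = ostep;
    return MP_OBJ_FROM_PTR(s);
}

// ...with the SUBSCR ops decrementing slice_depth after consuming a
// thread-local slice.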

@jimmo
Member Author

jimmo commented Sep 13, 2023

I updated this with a fix for the a[1:2,3:4] case, by detecting attempts to create additional thread-local slices and heap-allocating those instead. This makes the logic that heap-allocates before passing to Python a bit more complex (it now has to handle the thread-local slice possibly being inside a tuple).
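
A sketch of that extra handling, reusing the hypothetical slice_to_heap() helper from the first sketch above (mp_obj_tuple_get is the real API; the rest is illustrative):

static mp_obj_t index_to_heap(mp_obj_t index) {
    if (mp_obj_is_type(index, &mp_type_tuple)) {
        // The thread-local slice may be an item of the tuple, as in
        // a[1:2,3:4]; copy any such item to the heap before user code
        // can see it.
        size_t len;
        mp_obj_t *items;
        mp_obj_tuple_get(index, &len, &items);
        for (size_t i = 0; i < len; ++i) {
            items[i] = slice_to_heap(items[i]);
        }
        return index;
    }
    return slice_to_heap(index);
}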

Assume that a slice object (created by the `BUILD_SLICE` op-code or the
native emitter via `mp_obj_new_slice`) will be used to slice a built-in
type. This assumes that all built-in type->subscr functions will:
1. Not cause another slice to be allocated in the meantime.
2. Not save a reference to the slice.

This prevents a heap allocation, which for some cases (e.g. memoryview
assignment) prevents allocation altogether.

Only when this assumption is incorrect (i.e. when slicing a
user-defined type via the `__{get,set,del}item__` methods) is the
slice object heap-allocated.

Also detect the case where multiple slices may be created before the
subscr op, e.g. when doing `a[1:2,3:4]`, and force the subsequent
slices to also be heap-allocated.

This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
@jimmo reopened this Sep 13, 2023
@jimmo force-pushed the thread-local-slice branch from a5c7179 to b9179ec on September 13, 2023 06:17
@codecov

codecov bot commented Sep 13, 2023

Codecov Report

Merging #10160 (b9179ec) into master (ff70bd1) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master   #10160   +/-   ##
=======================================
  Coverage   98.38%   98.38%           
=======================================
  Files         158      158           
  Lines       20933    20946   +13     
=======================================
+ Hits        20595    20608   +13     
  Misses        338      338           
Files Changed    Coverage Δ
py/obj.c         98.18% <100.00%> (+0.01%) ⬆️
py/objslice.c    100.00% <100.00%> (ø)
py/objtype.c     100.00% <100.00%> (ø)

@dpgeorge
Member

The following case does not work:

a[x:y, c[d:e], f:g]

This can be fixed (we discussed this with @jimmo and @projectgus).

But also there are problems if an exception is raised while a thread-local slice is still active, eg if the loading of c in the above expression raises.

@projectgus
Contributor

This can be fixed

To avoid xkcd 979-related mysteries in the future, we think the fix for the first one is to check whether the consumed slice was the thread-local slice before NULLing out type in mp_obj_subscr.
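
A sketch of that check, continuing the hypothetical names from the earlier sketches:

// Only release the thread-local slice when the index just consumed is
// actually the thread-local one; the inner subscript in
// a[x:y, c[d:e], f:g] must not release the outer, still-pending slices.
if (MP_OBJ_TO_PTR(index) == (void *)&MP_STATE_THREAD(thread_slice)) {
    MP_STATE_THREAD(thread_slice).base.type = NULL; // mark it free again
}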

@jimmo
Member Author

jimmo commented Sep 26, 2023

OK I give up on this idea 😢

Trying a different approach in #12518

@jimmo closed this Sep 26, 2023
@massimosala

hard IRQ handler to have its own mp_state_thread_t

It seems like a good idea. One thought: does this increase the latency of the interrupt service routine? And by how much?

tannewt pushed a commit to tannewt/circuitpython that referenced this pull request Mar 27, 2025