py/objslice: Avoid heap-allocating slices for built-ins. #10160


Closed
jimmo wants to merge 1 commit into master from thread-local-slice

Conversation

@jimmo
Member

jimmo commented Dec 5, 2022

Assume that a slice object (created by the BUILD_SLICE op-code or the native emitter via mp_obj_new_slice) will be used to slice a built-in type. This assumes that all built-in type->subscr functions will:

  1. Not cause another slice to be allocated in the meantime.
  2. Not save a reference to the slice.

This prevents a heap allocation, which for some cases (e.g. memoryview assignment) prevents allocation altogether.

Only when this assumption is incorrect (i.e. when slicing a user-defined type via the __{get,set,del}item__ methods) is the slice object heap-allocated.
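
For reference, a minimal sketch of the idea in C (the thread_slice field and slice_to_heap helper are hypothetical illustrations, not this PR's actual code):

#include "py/runtime.h"
#include "py/objslice.h"

// Hypothetical: a single mp_obj_slice_t instance ("thread_slice") lives in
// mp_state_thread_t, and mp_obj_new_slice() fills it in instead of calling
// m_new_obj(), so the fast path never touches the heap.
mp_obj_t mp_obj_new_slice(mp_obj_t ostart, mp_obj_t ostop, mp_obj_t ostep) {
    mp_obj_slice_t *s = &MP_STATE_THREAD(thread_slice);
    s->base.type = &mp_type_slice;
    s->start = ostart;
    s->stop = ostop;
    s->step = ostep;
    return MP_OBJ_FROM_PTR(s);
}

// Before dispatching to a user-defined __{get,set,del}item__ (objtype.c),
// the slice must be copied to the heap, because user code may keep a
// reference to it beyond the subscript operation.
static mp_obj_t slice_to_heap(mp_obj_t index) {
    if (MP_OBJ_TO_PTR(index) == (void *)&MP_STATE_THREAD(thread_slice)) {
        mp_obj_slice_t *s = m_new_obj(mp_obj_slice_t);
        *s = MP_STATE_THREAD(thread_slice);
        return MP_OBJ_FROM_PTR(s);
    }
    return index;
}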

This adds +156 bytes on PYBV11 (and +16 bytes .bss). Much of this is for handling the tuple-slice case -- at slice creation you don't know whether the slice is going to be passed directly to {LOAD,STORE}_SUBSCR or whether it's one of several slices about to be placed into a tuple.

This was based on a suggestion from @damz on Discord, which I think was along the lines of combining the BUILD_SLICE op with {LOAD,STORE}_SUBSCR, but I think the approach in this PR achieves the same goal with smaller code size and without having to change the bytecode.

This work was funded through GitHub Sponsors.

@jimmo
Member Author

jimmo commented Dec 5, 2022

@v923z It occurs to me that of all the places where this assumption about built-in subscr methods might not hold, ulab might be the most likely. Does ulab require heap-allocated slices?

@v923z
Contributor

v923z commented Dec 5, 2022

@v923z It occurs to me that of all the places where this assumption about built-in subscr methods might not hold, ulab might be the most likely. Does ulab require heap-allocated slices?

No, I don't think so. Slices pop up only here: https://github.com/v923z/micropython-ulab/blob/1a440d7d124a20a3dbaa0909000741eb28a623b3/code/ndarray.c#L1111-L1178. But perhaps I misunderstood your question.

@stinos
Contributor

stinos commented Dec 5, 2022

This feels kind of tricky, although for the current state of the code I can't immediately think of any issues either. It would, however, probably make it pretty hard to effectively implement slice() construction, should we want that.

Would there be a way to verify the assumption even more, like adding a flag which double-checks that mp_obj_new_slice only gets called once the previous slice it created has been consumed by an indexing operation (which presumably means it's no longer needed)? Or some variation thereof?

@jimmo
Member Author

jimmo commented Dec 5, 2022

Thanks @v923z -- that looks fine.

@stinos I think if we implement slice construction then the make_new just needs to unconditionally use the heap.
I'll have a think about your suggestion on how to verify the assumption. Another idea might be to null out the thread-local slice immediately after invoking the subscr function in the {LOAD,STORE}_SUBSCR bytecode ops.

One case I did think of that might need some thought is if a hard IRQ handler now attempts to slice a built-in. Previously this wouldn't have been supported as the ISR would have failed the heap allocation for the slice, but will now work (which is nice). However, if the ISR itself preempts a slice operation right between the BUILD_SLICE and the SUBSCR, then it will clobber the thread-local slice state. This could be solved by having mp_obj_new_slice check the GC lock depth.
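
A sketch of that guard, assuming gc_lock_depth lives in the thread state (as it does on recent builds); in a hard IRQ the GC is locked, so the heap path raises MemoryError exactly as ISR slicing did before this change:

mp_obj_t mp_obj_new_slice(mp_obj_t ostart, mp_obj_t ostop, mp_obj_t ostep) {
    mp_obj_slice_t *s;
    if (MP_STATE_THREAD(gc_lock_depth) != 0) {
        // GC locked (e.g. hard IRQ): don't risk clobbering a preempted
        // BUILD_SLICE/SUBSCR pair; the allocation below fails with
        // MemoryError, matching the pre-PR behaviour.
        s = m_new_obj(mp_obj_slice_t);
    } else {
        s = &MP_STATE_THREAD(thread_slice);
    }
    s->base.type = &mp_type_slice;
    s->start = ostart;
    s->stop = ostop;
    s->step = ostep;
    return MP_OBJ_FROM_PTR(s);
}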

@dpgeorge
Member

dpgeorge commented Dec 5, 2022

This is very neat! Makes me wonder if there are other calls that make objects which could instead return a reference to something in MP_STATE_THREAD(...)...

This will fail (although not sure it's ever used in real code):

a[b:c[d:e]]

I do like the alternative idea of a new opcode, like BUILD_SLICE_ON_STACK. That kind of parallels LOAD_METHOD. Then the slice object (or maybe even just its indices) is stored on the Python stack. That eliminates all issues with threading, IRQs, etc.
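
A rough sketch of what that could look like in the VM loop (py/vm.c); the opcode and the compiler-side reservation of two extra stack slots are hypothetical:

// Hypothetical opcode: the compiler guarantees the next opcode is a
// SUBSCR and reserves two extra value-stack slots, so the slice object
// (4 machine words) can live in the stack itself.
case MP_BC_BUILD_SLICE_ON_STACK: {
    mp_obj_t step = *sp--;
    mp_obj_t stop = *sp--;
    mp_obj_t start = *sp;
    mp_obj_slice_t *slice = (mp_obj_slice_t *)sp; // occupies sp[0..3]
    slice->base.type = &mp_type_slice;
    slice->start = start;
    slice->stop = stop;
    slice->step = step;
    sp += 4; // step over the in-place object body
    *sp = MP_OBJ_FROM_PTR(slice); // reference consumed by the SUBSCR
    DISPATCH();
}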

@dpgeorge added the py-core (Relates to py/ directory in source) label Dec 5, 2022
@dpgeorge
Member

dpgeorge commented Dec 5, 2022

This will fail (although not sure it's ever used in real code):

a[b:c[d:e]]

We could get around this by just disallowing such syntax at the compiler level, e.g. raising a "nested slices not allowed" exception.

However, if the ISR itself preempts a slice operation right between the BUILD_SLICE and the SUBSCR, then it will clobber the thread-local slice state.

A solution for this would be for a hard IRQ handler to have its own mp_state_thread_t instance (basically by making a backup of the existing one, then restoring the backup at the end). That might anyway be a good idea, because hard IRQ handlers can be thought of as a separate (preemptive) thread.
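
A sketch of that backup/restore idea on a non-threaded build, where MP_STATE_THREAD resolves to the global mp_state_ctx.thread (nlr exception protection omitted for brevity):

// Give a hard IRQ handler a private thread state by saving the
// interrupted thread's state and restoring it afterwards.
void run_hard_irq_handler(mp_obj_t handler, mp_obj_t arg) {
    mp_state_thread_t saved = mp_state_ctx.thread; // back up interrupted state
    // (re-initialise any fields the handler needs here, e.g. stack limit)
    mp_call_function_1(handler, arg);
    mp_state_ctx.thread = saved; // restore the interrupted thread's state
}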

@jimmo
Member Author

jimmo commented Dec 5, 2022

This will fail (although not sure it's ever used in real code): a[b:c[d:e]]

I thought this would be OK because the inner slice is created and used on c before the outer slice is created... e.g.

>>> a[0:b[1:2]]
File <stdin>, code block '<module>' (descriptor: 7f58bd427b60, bytecode @7f58bd427ac0 23 bytes)
Raw bytecode (code_info_size=3, bytecode_size=20):
 28 02 01 11 02 11 03 80 11 04 81 82 2e 02 55 2e
 02 55 34 01 59 51 63
arg names:
(N_STATE 6)
(N_EXC_STACK 0)
  bc=0 line=1
00 LOAD_NAME __repl_print__
02 LOAD_NAME a
04 LOAD_CONST_SMALL_INT 0
05 LOAD_NAME b
07 LOAD_CONST_SMALL_INT 1
08 LOAD_CONST_SMALL_INT 2
09 BUILD_SLICE 2
11 LOAD_SUBSCR
12 BUILD_SLICE 2
14 LOAD_SUBSCR
15 CALL_FUNCTION n=1 nkw=0
17 POP_TOP
18 LOAD_CONST_NONE
19 RETURN_VALUE

@jimmo force-pushed the thread-local-slice branch from 6c1776f to a5c7179 on December 6, 2022 01:01
@jimmo
Member Author

jimmo commented Dec 6, 2022

Added a test case for the nested slicing.

@github-actions

github-actions bot commented Dec 6, 2022

Code size report:

   bare-arm:   +28 +0.049% [incl +16(bss)]
minimal x86:   +48 +0.026% [incl +32(bss)]
   unix x64:  +248 +0.031% standard[incl +32(bss)]
      stm32:  +156 +0.040% PYBV10[incl +16(bss)]
     mimxrt:  +152 +0.042% TEENSY40[incl +16(bss)]
        rp2:  +208 +0.064% RPI_PICO[incl +16(bss)]
       samd:  +164 +0.063% ADAFRUIT_ITSYBITSY_M4_EXPRESS[incl +16(bss)]

@dpgeorge
Member

dpgeorge commented Dec 6, 2022

You are right! That case is OK.

But I think this one (and its generalisations) is problematic:

>>> a[1:2,3:4]
File <stdin>, code block '<module>' (descriptor: 7f0a8c89c220, bytecode @7f0a8c89c140 23 bytes)
Raw bytecode (code_info_size=3, bytecode_size=20):
 20 02 01 11 02 11 03 81 82 2e 02 83 84 2e 02 2a
 02 55 34 01 59 51 63
arg names:
(N_STATE 5)
(N_EXC_STACK 0)
  bc=0 line=1
00 LOAD_NAME __repl_print__
02 LOAD_NAME a
04 LOAD_CONST_SMALL_INT 1
05 LOAD_CONST_SMALL_INT 2
06 BUILD_SLICE 2
08 LOAD_CONST_SMALL_INT 3
09 LOAD_CONST_SMALL_INT 4
10 BUILD_SLICE 2
12 BUILD_TUPLE 2
14 LOAD_SUBSCR
15 CALL_FUNCTION n=1 nkw=0
17 POP_TOP
18 LOAD_CONST_NONE
19 RETURN_VALUE

@jimmo
Member Author

jimmo commented Dec 6, 2022

Argh, yeah, that's quite the show-stopper... even with other approaches like a bytecode change (unless we want to make tuples detect that they're being initialised with non-heap-allocated slices). More thought required. I'll close this for now.

@jimmo closed this Dec 6, 2022
@dpgeorge
Member

dpgeorge commented Dec 6, 2022

I think this general idea is worth pursuing. It's pretty rare to use a tuple-of-slices like above (maybe only in numpy?). If you're not using a tuple-of-slices then the compiler guarantees that BUILD_SLICE is followed immediately by LOAD_SUBSCR. So we could make the compiler special case this pair so it doesn't use the heap (instead allocating from the Python stack, like mp_obj_iter_buf_t works).

Also, I don't think it'd be that crazy to do what's done in this PR, but have BUILD_SLICE set a flag that the singleton-slice is used up, and then LOAD_SUBSCR clear that flag (and if BUILD_SLICE is called again before LOAD_SUBSCR then it allocates the slice on the heap).
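
A sketch of that flag scheme (the slice_in_use flag and thread_slice field are hypothetical names):

mp_obj_t mp_obj_new_slice(mp_obj_t ostart, mp_obj_t ostop, mp_obj_t ostep) {
    mp_obj_slice_t *s;
    if (MP_STATE_THREAD(slice_in_use)) {
        // A second BUILD_SLICE before any SUBSCR consumed the first one
        // (e.g. a tuple of slices): fall back to the heap.
        s = m_new_obj(mp_obj_slice_t);
    } else {
        MP_STATE_THREAD(slice_in_use) = true;
        s = &MP_STATE_THREAD(thread_slice);
    }
    s->base.type = &mp_type_slice;
    s->start = ostart;
    s->stop = ostop;
    s->step = ostep;
    return MP_OBJ_FROM_PTR(s);
}

// ...with LOAD_SUBSCR/STORE_SUBSCR clearing the flag after the subscript
// operation has consumed the slice:
//     MP_STATE_THREAD(slice_in_use) = false;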

@v923z
Contributor

v923z commented Dec 6, 2022

I think this general idea is worth pursuing. It's pretty rare to use a tuple-of-slices like above (maybe only in numpy?).

It would be quite bad if slice tuples no longer worked. That is probably the most numpythonic way of working with views of higher-dimensional tensors.

Unfortunately, since this is really at the very core of MicroPython, I couldn't even patch it up on the ulab side.

@dpgeorge
Member

dpgeorge commented Dec 6, 2022

It would be quite bad if slice tuples no longer worked

Oh! They would definitely continue to work; we would make sure of that. But they wouldn't be optimised and would continue to allocate on the heap.

@stinos
Contributor

stinos commented Dec 7, 2022

Unless the number of stack-allocated slices could be made configurable, which wouldn't even add code size for the default case (i.e. 1 slice available on the stack), because the flag mentioned could be used as an index.

On the other hand, using a new opcode would probably remove the need for any flag, work automatically with any number of slices, and probably give smaller code size as well?

@dpgeorge
Member

dpgeorge commented Dec 8, 2022

Unless the number of stack-allocated slices could be made configurable, which wouldn't even add code size for the default case (i.e. 1 slice available on the stack), because the flag mentioned could be used as an index.

Yes, that could be a good generalisation, to have N slice instances in mp_state_t.

On the other hand using a new opcode would probably remove the need for any flag and work automatically with any number of slices, and probably smaller code size as well?

Yes, although maybe not smaller code size (the compiler needs to do some work to determine when to use this new opcode).
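
A sketch of that N-instance generalisation (configuration name and fields hypothetical), where the flag becomes a depth counter indexing a small per-thread array:

mp_obj_t mp_obj_new_slice(mp_obj_t ostart, mp_obj_t ostop, mp_obj_t ostep) {
    mp_obj_slice_t *s;
    if (MP_STATE_THREAD(slice_depth) < MICROPY_NUM_LOCAL_SLICES) {
        s = &MP_STATE_THREAD(slices)[MP_STATE_THREAD(slice_depth)++];
    } else {
        s = m_new_obj(mp_obj_slice_t); // all static instances busy
    }
    s->base.type = &mp_type_slice;
    s->start = ostart;
    s->stop = ostop;
    s->step = ostep;
    return MP_OBJ_FROM_PTR(s);
}

// ...with the SUBSCR ops decrementing slice_depth after consuming a
// thread-local slice.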

@jimmo
Member Author

jimmo commented Sep 13, 2023

I updated this with a fix for the a[1:2,3:4] case, by detecting attempts to create additional thread-local slices and heap-allocating those instead. This makes the logic that heap-allocates before passing to Python a bit more complex (it now has to handle the thread-local slice possibly being inside a tuple).
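
A sketch of that extra handling, reusing the hypothetical slice_to_heap() helper from the first sketch above (mp_obj_tuple_get is the real API; the rest is illustrative):

static mp_obj_t index_to_heap(mp_obj_t index) {
    if (mp_obj_is_type(index, &mp_type_tuple)) {
        // The thread-local slice may be an item of the tuple, as in
        // a[1:2,3:4]; copy any such item to the heap before user code
        // can see it.
        size_t len;
        mp_obj_t *items;
        mp_obj_tuple_get(index, &len, &items);
        for (size_t i = 0; i < len; ++i) {
            items[i] = slice_to_heap(items[i]);
        }
        return index;
    }
    return slice_to_heap(index);
}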

Assume that a slice object (created by the `BUILD_SLICE` op-code or the
native emitter via `mp_obj_new_slice`) will be used to slice a built-in
type. This assumes that all built-in type->subscr functions will:
1. Not cause another slice to be allocated in the meantime.
2. Not save a reference to the slice.

This prevents a heap allocation, which for some cases (e.g. memoryview
assignment) prevents allocation altogether.

Only when this assumption is incorrect (i.e. when slicing a
user-defined type via the `__{get,set,del}item__` methods) is the
slice object heap-allocated.

Also detect the case where multiple slices may be created before the
subscr op, e.g. when doing `a[1:2,3:4]`, and force the subsequent
slices to also be heap-allocated.

This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
@jimmo reopened this Sep 13, 2023
@jimmo force-pushed the thread-local-slice branch from a5c7179 to b9179ec on September 13, 2023 06:17
@codecov

codecov bot commented Sep 13, 2023

Codecov Report

Merging #10160 (b9179ec) into master (ff70bd1) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master   #10160   +/-   ##
=======================================
  Coverage   98.38%   98.38%           
=======================================
  Files         158      158           
  Lines       20933    20946   +13     
=======================================
+ Hits        20595    20608   +13     
  Misses        338      338           
Files Changed    Coverage Δ
py/obj.c         98.18% <100.00%> (+0.01%) ⬆️
py/objslice.c    100.00% <100.00%> (ø)
py/objtype.c     100.00% <100.00%> (ø)

@dpgeorge
Member

The following case does not work:

a[x:y, c[d:e], f:g]

This can be fixed (we discussed this with @jimmo and @projectgus).

But also there are problems if an exception is raised while a thread-local slice is still active, eg if the loading of c in the above expression raises.

@projectgus
Contributor

This can be fixed

To avoid xkcd 979-related mysteries in the future, we think the fix for the first one is to check whether the consumed slice was the thread-local slice before NULLing out type in mp_obj_subscr.
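
A sketch of that check, continuing the hypothetical names from the earlier sketches:

// Only release the thread-local slice when the index just consumed is
// actually the thread-local one; the inner subscript in
// a[x:y, c[d:e], f:g] must not release the outer, still-pending slices.
if (MP_OBJ_TO_PTR(index) == (void *)&MP_STATE_THREAD(thread_slice)) {
    MP_STATE_THREAD(thread_slice).base.type = NULL; // mark it free again
}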

@jimmo
Member Author

jimmo commented Sep 26, 2023

OK I give up on this idea 😢

Trying a different approach in #12518

@jimmo closed this Sep 26, 2023
@massimosala

hard IRQ handler to have its own mp_state_thread_t

It seems like a good idea. One thought: does this increase the latency of the interrupt service routine? And by how much?

tannewt pushed a commit to tannewt/circuitpython that referenced this pull request Mar 27, 2025