py/objslice: Avoid heap-allocating slices for built-ins. #10160
Conversation
@v923z It occurs to me that of all the places where this assumption about built-in subscr methods might not be true, ulab might be the most likely. Does ulab require heap-allocated slices?
No, I don't think so. Slices pop up only here https://github.com/v923z/micropython-ulab/blob/1a440d7d124a20a3dbaa0909000741eb28a623b3/code/ndarray.c#L1111-L1178. But, perhaps, I misunderstood your question.
This feels kind of tricky, although for the current state of the code I also cannot immediately think of issues. It would probably make it pretty hard to effectively implement … Would there be a way to verify the assumption even more, like adding a flag which double-checks …?
Thanks @v923z -- that looks fine.

@stinos I think if we implement slice construction then the …

One case I did think of that might need some thought is if a hard IRQ handler now attempts to slice a built-in. Previously this wouldn't have been supported, as the ISR would have failed the heap allocation for the slice, but it will now work (which is nice). However, if the ISR itself preempts a slice operation right between the BUILD_SLICE and the SUBSCR, then it will clobber the thread-local slice state. This could be solved by having …
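For illustration only, a rough pyboard-style sketch of that scenario, assuming `pyb.Timer` hard-IRQ callbacks (the handler, buffer sizes and timer settings here are made up):

```python
import pyb

buf = bytearray(16)
ZEROS = b"\x00\x00\x00\x00"   # pre-allocated so the ISR itself needs no heap

def isr(t):
    # Previously this slice assignment raised MemoryError inside a hard IRQ
    # (the slice object needed a heap allocation). With a shared thread-local
    # slice it now works, but it could clobber a slice that the interrupted
    # code built with BUILD_SLICE and has not yet consumed with SUBSCR.
    buf[0:4] = ZEROS

tim = pyb.Timer(4, freq=1000)
tim.callback(isr)   # pyb.Timer callbacks run as hard IRQs
```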
This is very neat! Makes me wonder if there are other calls to make objects that can instead return a reference to something in …

This will fail (although not sure it's ever used in real code): a[b:c[d:e]]

I do like the alternative idea of a new opcode, like …
We could get around this by just disallowing such syntax at the compiler level, eg raise a "nested slices not allowed" exception.
A solution for this would be for a hard IRQ handler to have its own …
I thought this would be OK because the inner slice is created and used on …
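As a concrete illustration of that ordering, here is a nested-slice variant built only from built-ins (the values are arbitrary):

```python
a = list(range(10))
c = [0, 3, 6, 9]

# Rough op ordering: BUILD_SLICE (inner) -> LOAD_SUBSCR on c -> the shared
# slice is free again -> BUILD_SLICE (outer) -> LOAD_SUBSCR on a, so the two
# slices never need to exist at the same time.
print(a[1:c[1:3][1]])   # c[1:3] is [3, 6], so this is a[1:6] -> [1, 2, 3, 4, 5]
```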
Force-pushed from 6c1776f to a5c7179.
Added a test case for the nested slicing.
Code size report:
You are right! That case is OK. But I think this one (and its generalisations) is problematic:

>>> a[1:2,3:4]
File <stdin>, code block '<module>' (descriptor: 7f0a8c89c220, bytecode @7f0a8c89c140 23 bytes)
Raw bytecode (code_info_size=3, bytecode_size=20):
20 02 01 11 02 11 03 81 82 2e 02 83 84 2e 02 2a
02 55 34 01 59 51 63
arg names:
(N_STATE 5)
(N_EXC_STACK 0)
bc=0 line=1
00 LOAD_NAME __repl_print__
02 LOAD_NAME a
04 LOAD_CONST_SMALL_INT 1
05 LOAD_CONST_SMALL_INT 2
06 BUILD_SLICE 2
08 LOAD_CONST_SMALL_INT 3
09 LOAD_CONST_SMALL_INT 4
10 BUILD_SLICE 2
12 BUILD_TUPLE 2
14 LOAD_SUBSCR
15 CALL_FUNCTION n=1 nkw=0
17 POP_TOP
18 LOAD_CONST_NONE
19 RETURN_VALUE
Argh, yeah that's quite the show-stopper... even with other approaches like a bytecode change (unless we want to make tuples detect that they're being initialised with non-heap-allocated slices). More thought required. I'll close this for now.
I think this general idea is worth pursuing. It's pretty rare to use a tuple-of-slices like above (maybe only in numpy?). If you're not using a tuple-of-slices then the compiler guarantees that …

Also, I don't think it'd be that crazy to do what's done in this PR, but have BUILD_SLICE set a flag that the singleton slice is used up, and then LOAD_SUBSCR clear that flag (and if BUILD_SLICE is called again before LOAD_SUBSCR then it allocates the slice on the heap).
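A minimal Python model of that flag idea (this is an illustrative simulation only, not MicroPython's actual C implementation; `SliceSlot`, `build_slice` and `load_subscr` are hypothetical names):

```python
class SliceSlot:
    """Models the control flow of the flag idea above. Allocation behaviour
    is only simulated: plain Python still creates real slice objects."""

    def __init__(self):
        self.in_use = False
        self.slot = None   # stands in for the thread-local slice object

    def build_slice(self, start, stop, step=None):
        s = slice(start, stop, step)
        if self.in_use:
            # A second BUILD_SLICE before a subscript consumed the first one
            # (e.g. a[1:2, 3:4]): this slice would be heap-allocated.
            return s
        # Otherwise this slice would live in the reusable thread-local slot.
        self.in_use = True
        self.slot = s
        return s

    def load_subscr(self, obj, index):
        result = obj[index]
        if index is self.slot:
            self.in_use = False   # the shared slot is free again
            self.slot = None
        return result


# The common single-slice case: the slice is built, consumed, and the slot freed.
vm = SliceSlot()
data = list(range(10))
print(vm.load_subscr(data, vm.build_slice(2, 5)))   # [2, 3, 4]
```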
It would be quite bad if the slice tuples no longer worked. That is probably the most numpythonic way of working with views of higher-dimensional tensors. Unfortunately, since this is really at the very core of …
Oh! They would definitely continue to work; we would make sure of that. But they won't be optimised and will continue to allocate on the heap.
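For reference, this is the kind of tuple-of-slices indexing being discussed (assuming a ulab build with 2-D arrays; the shapes are arbitrary):

```python
from ulab import numpy as np

a = np.zeros((4, 4))
# A tuple of slices, numpy-style. This keeps working with the change, but
# each slice in the tuple falls back to an ordinary heap allocation.
view = a[1:3, 0:2]
print(view)
```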
Unless the number of stack-allocated slices could be configurable, which wouldn't even add code size for the default case (i.e. 1 slice available on the stack), because the flag mentioned could be used as an index. On the other hand, using a new opcode would probably remove the need for any flag, work automatically with any number of slices, and probably give smaller code size as well?
Yes, that could be a good generalisation, to have N slice instances in …
Yes, although maybe not smaller code size (the compiler needs to do some work to determine when to use this new opcode).
I updated this with a fix for the …
Assume that a slice object (created by the `BUILD_SLICE` op-code or the native emitter via `mp_obj_new_slice`) will be used to slice a built-in type. This assumes that all built-in type->subscr functions will:

1. Not cause another slice to be allocated in the meantime.
2. Not save a reference to the slice.

This prevents a heap allocation, which for some cases (e.g. memoryview assignment) prevents allocation altogether.

Only when this assumption is incorrect (i.e. slicing a user-defined type via the `__{get,set,del}item__` methods), then heap-allocate the slice object.

Also detect the case where multiple slices may be created before the subscr op, e.g. when doing `a[1:2,3:4]`, and force the subsequent slices to also be heap allocated.

This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
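As an example of the memoryview-assignment case mentioned in the commit message (buffer sizes here are arbitrary):

```python
buf = bytearray(8)
mv = memoryview(buf)
# With the change, the slice used here no longer needs the heap, so per the
# commit message this assignment can complete without any allocation at all.
mv[2:6] = b"\x01\x02\x03\x04"
print(buf)
```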
Force-pushed from a5c7179 to b9179ec.
Codecov Report
@@ Coverage Diff @@
## master #10160 +/- ##
=======================================
Coverage 98.38% 98.38%
=======================================
Files 158 158
Lines 20933 20946 +13
=======================================
+ Hits 20595 20608 +13
Misses 338 338
The following case does not work: a[x:y, c[d:e], f:g]

This can be fixed (we discussed this with @jimmo and @projectgus). But also there are problems if an exception is raised while a thread-local slice is still active, eg if the loading of …
To avoid xkcd 979-related mysteries for the future, we think the fix for the first one is to check whether the consumed slice was the thread-local slice before NULLing out …
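To spell out why that check matters, here is an annotated trace of the failing expression using the op names above (my reconstruction from the discussion, not taken from the PR):

```python
# a[x:y, c[d:e], f:g]
#
#   BUILD_SLICE x:y     -> occupies the thread-local slice
#   BUILD_SLICE d:e     -> slot busy, so this slice is heap-allocated
#   LOAD_SUBSCR c[...]  -> consumes the *heap* slice; naively clearing the
#                          thread-local state here would be wrong
#   BUILD_SLICE f:g     -> would then overwrite x:y, which is still pending
#
# Hence the fix: only clear the thread-local state when the slice just
# consumed is the thread-local slice itself.
```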
OK I give up on this idea 😢 Trying a different approach in #12518
It seems a good idea. One thought: does this increase the latency of the interrupt service routine? How much?
Assume that a slice object (created by the `BUILD_SLICE` op-code or the native emitter via `mp_obj_new_slice`) will be used to slice a built-in type. This assumes that all built-in type->subscr functions will:

1. Not cause another slice to be allocated in the meantime.
2. Not save a reference to the slice.

This prevents a heap allocation, which for some cases (e.g. memoryview assignment) prevents allocation altogether.

Only when this assumption is incorrect (i.e. slicing a user-defined type via the `__{get,set,del}item__` methods), then heap-allocate the slice object.

This adds +156 bytes on PYBV11 (and +16 bytes .bss). Much of this is handling the tuple slice case -- you don't know at slice creation whether it's going to be directly passed to `{LOAD,STORE}_SUBSCR` or whether it's potentially one of multiple slices about to be placed into a tuple.

This was based on a suggestion from @damz in Discord which I think was along the lines of combining the `BUILD_SLICE` op with `{LOAD,STORE}_SUBSCR`, but I think the approach in this PR achieves the same goal with smaller code size and without having to change the bytecode.

This work was funded through GitHub Sponsors.
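To illustrate the user-defined case where the slice must still be heap-allocated, here is a made-up class for demonstration:

```python
class Recorder:
    def __init__(self):
        self.last_index = None

    def __getitem__(self, index):
        # Keeps a reference to the slice beyond the subscript operation,
        # which is exactly why it cannot live in a reusable slot.
        self.last_index = index
        return index

r = Recorder()
r[1:5]
print(r.last_index)   # slice(1, 5, None) must still be valid here
```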