alif/alif.mk: Add MPY_CROSS_FLAGS setting. #17908
Conversation
@dpgeorge - Awesome!
@@ -22,6 +22,8 @@ include $(TOP)/extmod/extmod.mk
################################################################################
# Project specific settings and compiler/linker flags

MPY_CROSS_FLAGS += -march=armv7emdp
Should this be set based on the float implementation, or is it unrelated?
Suggested change:
-MPY_CROSS_FLAGS += -march=armv7emdp
+ifeq ($(MICROPY_FLOAT_IMPL),float)
+MPY_CROSS_FLAGS += -march=armv7emdp
+else
+MPY_CROSS_FLAGS += -march=armv7emsp # ?
+endif
Note that we use MICROPY_FLOAT_IMPL=float when building. Would that affect the loaded frozen code?
This setting is unrelated to MICROPY_FLOAT_IMPL. It's about the hardware capabilities, not any API/ABI (at least for frozen code, which is what matters here). It means you can use floating-point assembly instructions in @micropython.asm_thumb functions.
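For illustration, here is a minimal sketch of what that enables, modelled on the floating-point example in MicroPython's inline-assembler docs (the function name and array layout are just for this example): a frozen @micropython.asm_thumb function that uses single-precision VFP instructions to compute the length of a 2-vector stored in an array of floats.

import array
import micropython

@micropython.asm_thumb
def vec2_len(r0):          # r0: address of array('f', [x, y, out])
    vldr(s0, [r0, 0])      # s0 = x
    vldr(s1, [r0, 4])      # s1 = y
    vmul(s0, s0, s0)       # s0 = x*x
    vmul(s1, s1, s1)       # s1 = y*y
    vadd(s0, s0, s1)       # s0 = x*x + y*y
    vsqrt(s0, s0)          # s0 = sqrt(x*x + y*y)
    vstr(s0, [r0, 8])      # store the result in the third element

buf = array.array('f', [3.0, 4.0, 0.0])
vec2_len(buf)
print(buf[2])              # 5.0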
Note that we use MICROPY_FLOAT_IMPL=float when building. Would that affect the loaded frozen code?
No, it won't matter.
(I'm curious why you don't use double though? Float objects still use the same amount of heap as doubles (16 bytes), and you'd get more precision with doubles.)
We use the object representation that stores floats as 4 bytes. That matters a lot for large arrays; you'd run out of memory quickly.
You can still make 32-bit float arrays (in C and Python) while still using double for MicroPython's float object.
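For example (a minimal Python-side sketch; the names are illustrative), array('f') gives 32-bit storage regardless of how MICROPY_FLOAT_IMPL is configured:

from array import array

# 1024 values stored as 32-bit floats: 4 bytes per element, regardless of
# whether MicroPython's float object is single or double precision.
samples = array('f', [0.0] * 1024)

samples[0] = 1.2345678   # stored truncated to 32-bit precision
x = samples[0] * 2.0     # reading it back produces a float object (mp_float_t)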
It means you can use floating-point assembly instructions in @micropython.asm_thumb functions.
But wouldn't armv7emdp emit double-precision floating-point instructions?
I'm curious why you don't use double though?
Historically we only had single-precision FPUs, so we hard-coded float in many places (not mp_float_t); switching would break a lot of code, but that's easily fixable. However, I still don't think we should use double because it uses more bandwidth and more cycles. I couldn't easily find a reference for this, but I think double precision has to be slower; at the very least it's wider, so more memory bandwidth on loads/stores. The CM55 has a performance monitoring unit (PMU); it would be interesting to use it to benchmark float vs double.
You can still make 32-bit float arrays (in C and Python) while still using double for MicroPython's float object.
You mean by casting back and forth? We don't have control over all modules, for example ulab arrays would double in size, and its object files would use double-precision instructions. Also, I think the same goes for MicroPython for any float operations performed in Python.
EDIT: Not sure if I'm using it correctly:
#include "pmu_armv8.h"
uint32_t test_float(size_t iterations) {
ARM_PMU_CYCCNT_Reset();
volatile float x = 1.234f, y = 5.678f, z = 0.0f;
for (size_t i = 0; i < iterations; i++) {
z += x * y; // FP32 multiply-add
}
return ARM_PMU_Get_CCNTR();
}
uint32_t test_double(size_t iterations) {
ARM_PMU_CYCCNT_Reset();
volatile double x = 1.234, y = 5.678, z = 0.0;
for (size_t i = 0; i < iterations; i++) {
z += x * y; // FP64 multiply-add
}
return ARM_PMU_Get_CCNTR();
}
int benchmark_float(void) {
const size_t N = 1000000;
__disable_irq();
ARM_PMU_Enable();
ARM_PMU_CNTR_Enable(PMU_CNTENSET_CCNTR_ENABLE_Msk);
uint32_t flt_cycles = test_float(N);
uint32_t dbl_cycles = test_double(N);
__enable_irq();
printf("Float cycles: %lu\n", (unsigned long)flt_cycles);
printf("Double cycles: %lu\n", (unsigned long)dbl_cycles);
while (1);
}
Output with -Og:
Float cycles: 1000358
Double cycles: 4300843
Output with -O2:
Float cycles: 800347
Double cycles: 3000302
Yeah, I think I ran the test the first time with fewer iterations. I changed the test a bit so it doesn't get optimized out, and ran it again with 1_000_000 iterations and -O2:
With volatile:
Float cycles: 9000756
Double cycles: 31001274
Without volatile:
Float cycles: 6000008
Double cycles: 18000003
In either case you see it takes about 3-4x more cycles.
In either case you see it takes about 3-4x more cycles.
OK, that's good information to have.
But my point still stands: for Python code this won't make much of a difference. It will, however, make a big difference for things like ulab, and if you have custom C code that is floating-point heavy and uses mp_float_t instead of explicitly using float.
It will however make a big difference for things like ulab and if you have custom C code that is floating-point heavy and uses mp_float_t instead of explicitly using float.
ulab uses mp_float_t, while our code still uses hard-coded float. However, it still won't build with IMPL=double because doubles would need to be cast back to float. I'd rather refactor the code to use mp_float_t instead of adding the casts; that way, if a port/board uses double, it builds. For now, all we need is single precision.
our code still uses hard-coded float. However, it still won't build with IMPL=double because doubles would need to be cast back to float. I'd rather refactor the code to use mp_float_t instead of adding the casts; that way, if a port/board uses double, it builds
I'd actually suggest sticking with hard-coded float everywhere, because that's what you've designed your algorithms around and optimised for. Then use the provided functions to interoperate with MicroPython objects: mp_obj_get_float_to_f, mp_obj_get_float_to_d, mp_obj_new_float_from_f, mp_obj_new_float_from_d. They adjust themselves based on the single/double setting.
(You could also define your own float type, e.g. omv_float_t, and use that in all your code; at least then it's configurable.)
interoperate with MicroPython objects: mp_obj_get_float_to_f, mp_obj_get_float_to_d, mp_obj_new_float_from_f
Didn't know about those, thanks! Yes, I'll do that instead.
Summary
The HP and HE CPUs have double-precision hardware floating point, so can use the armv7emdp architecture.
This allows frozen code to use native/viper/asm_thumb decorators.
Fixes issue #17896.
Testing
Tested on OPENMV_AE3, putting native/viper/asm_thumb code in a frozen module. It works.