
alif/alif.mk: Add MPY_CROSS_FLAGS setting. #17908


Merged: 1 commit into micropython:master on Aug 15, 2025

Conversation

dpgeorge
Member

Summary

The HP and HE CPUs have double-precision hardware floating point, so can use the armv7emdp architecture.

This allows frozen code to use native/viper/asm_thumb decorators.

Fixes issue #17896.
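As background, a frozen module using the native emitter looks roughly like this (a minimal sketch; the `dot()` helper is hypothetical and not from the PR, and the `ImportError` fallback is only there so the sketch also runs under CPython):

```python
# Sketch of a function in a frozen module using the native emitter.
try:
    import micropython
    native = micropython.native  # compiles the function to machine code
except ImportError:
    native = lambda f: f  # fallback so the sketch runs under CPython too

@native
def dot(a, b):
    # A simple float kernel that benefits from native code generation.
    s = 0.0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s

print(dot([1.0, 2.0], [3.0, 4.0]))  # 11.0
```

With this PR, such a decorated function can live in a frozen module on the HP/HE CPUs because mpy-cross now targets armv7emdp.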

Testing

Tested on OPENMV_AE3, putting native/viper/asm_thumb code in a frozen module. It works.

@kwagyeman
Contributor

@dpgeorge - Awesome!

@@ -22,6 +22,8 @@ include $(TOP)/extmod/extmod.mk
################################################################################
# Project specific settings and compiler/linker flags

MPY_CROSS_FLAGS += -march=armv7emdp
Contributor

@iabdalkader iabdalkader Aug 13, 2025

Should this be set for float implementation or is it unrelated?

Suggested change
MPY_CROSS_FLAGS += -march=armv7emdp
ifeq ($(MICROPY_FLOAT_IMPL),float)
MPY_CROSS_FLAGS += -march=armv7emdp
else
MPY_CROSS_FLAGS += -march=armv7emsp # ?
endif

Note that we use MICROPY_FLOAT_IMPL=float when building. Would that affect the loaded frozen code?

Member Author

This setting is unrelated to the MICROPY_FLOAT_IMPL setting.

This setting is about the hardware capabilities, not any API/ABI (at least for frozen code, which is what matters here). It means you can use floating-point assembly instructions in @micropython.asm_thumb functions.

Note that we use MICROPY_FLOAT_IMPL=float when building. Would that affect the loaded frozen code?

No, it won't matter.

(I'm curious why you don't use double though? Float objects still take the same amount of heap as double (16 bytes), and you'd get more precision with doubles.)

Contributor

@kwagyeman kwagyeman Aug 14, 2025

We use the object representation that has floats as 4 bytes.

That matters a lot for large arrays. You'd run out of memory quickly.

Member Author

You can still make 32-bit float arrays (in C and Python) while still using double for MicroPython's float object.
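For illustration, Python's `array` module already gives 32-bit float storage independent of the float object size (a minimal sketch; this works the same way in MicroPython and CPython):

```python
from array import array

# Typecode 'f' stores IEEE-754 single-precision values, 4 bytes each,
# regardless of whether the port's float objects are float or double.
a = array('f', [1.0, 2.5, 3.25])
print(a.itemsize)   # 4
print(a[1] + a[2])  # 5.75 (elements come back as ordinary float objects)
```

So large numeric buffers can stay at 4 bytes per element even on a build whose float objects are double precision.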

Contributor

@iabdalkader iabdalkader Aug 14, 2025

It means you can use floating-point assembly instructions in @micropython.asm_thumb functions.

But wouldn't armv7emdp emit double-precision floating point instructions?

I'm curious why you don't use double though?

Historically we only had single-precision FPUs, so we hard-coded float in many places (not mp_float_t); switching would break a lot of code, but that's easily fixable. However, I still don't think we should use double, because it uses more bandwidth and more cycles. I couldn't easily find a reference for this, but I think double precision has to be slower; at the very least it's wider, so more memory bandwidth on loads/stores. The CM55 has a performance monitoring unit (PMU); it would be interesting to use it to benchmark float vs double.

You can still make 32-bit float arrays (in C and Python) while still using double for MicroPython's float object.

You mean by casting back and forth? We don't have control over all modules, for example ulab arrays would double in size, and its object files would use double-precision instructions. Also, I think the same goes for MicroPython for any float operations performed in Python.

EDIT: Not sure if I'm using it correctly:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include "pmu_armv8.h"  // CMSIS Armv8-M PMU helpers

uint32_t test_float(size_t iterations) {
    ARM_PMU_CYCCNT_Reset();
    // volatile keeps the loop body from being optimized away
    volatile float x = 1.234f, y = 5.678f, z = 0.0f;
    for (size_t i = 0; i < iterations; i++) {
        z += x * y;   // FP32 multiply-add
    }
    return ARM_PMU_Get_CCNTR();
}

uint32_t test_double(size_t iterations) {
    ARM_PMU_CYCCNT_Reset();
    volatile double x = 1.234, y = 5.678, z = 0.0;
    for (size_t i = 0; i < iterations; i++) {
        z += x * y;   // FP64 multiply-add
    }
    return ARM_PMU_Get_CCNTR();
}

int benchmark_float(void) {
    const size_t N = 1000000;

    __disable_irq();
    ARM_PMU_Enable();
    ARM_PMU_CNTR_Enable(PMU_CNTENSET_CCNTR_ENABLE_Msk);

    uint32_t flt_cycles = test_float(N);
    uint32_t dbl_cycles = test_double(N);

    __enable_irq();
    printf("Float cycles: %lu\n", (unsigned long)flt_cycles);
    printf("Double cycles: %lu\n", (unsigned long)dbl_cycles);

    while (1) {
        // Spin so the results stay visible on the console.
    }
}

Output with -Og:

Float cycles: 1000358
Double cycles: 4300843

Output with -O2:

Float cycles: 800347
Double cycles: 3000302

Contributor

Yeah, I think I ran the test the first time with fewer iterations. I changed the test a bit so it doesn't get optimized out, and ran it again with 1_000_000 iterations and -O2:

With volatile:

Float cycles: 9000756
Double cycles: 31001274

Without volatile:

Float cycles: 6000008
Double cycles: 18000003

In either case you see it takes about 3-4x more cycles.

Member Author

In either case you see it takes about 3-4x more cycles.

OK, that's good information to have.

But my point still stands: for Python code this won't make much of a difference. It will however make a big difference for things like ulab and if you have custom C code that is floating-point heavy and uses mp_float_t instead of explicitly using float.

Contributor

It will however make a big difference for things like ulab and if you have custom C code that is floating-point heavy and uses mp_float_t instead of explicitly using float.

ulab uses mp_float_t, while our code still uses hard-coded float. However, it still won't build with IMPL=double, because double would need to be cast back to float. I'd rather refactor the code to use mp_float_t instead of adding the casts; that way, if a port/board uses double, it builds. For now, all we need is single precision.

Member Author

our code still uses hard-coded float. However, it still won't build with IMPL=double because double will need to be cast back to float. I'd rather refactor the code to use mp_float_t instead of adding the casts, this way if a port/board uses double it builds

I'd actually suggest sticking with hard-coded float everywhere, because that's what you've designed your algorithms around and optimised for. Then use the following provided functions to interoperate with MicroPython objects: mp_obj_get_float_to_f, mp_obj_get_float_to_d, mp_obj_new_float_from_f, mp_obj_new_float_from_d. They adjust themselves based on the single/double setting.

(You could also define your own float type, eg omv_float_t, and use that in all your code, at least then it's configurable.)
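The single-precision rounding those conversion helpers imply (when the port's float object is double but the C code wants float) can be illustrated in plain Python with struct (a sketch; `to_f32` is a hypothetical name, not a MicroPython API):

```python
import struct

def to_f32(x):
    # Round a Python float (a C double) to the nearest IEEE-754 single
    # by packing it as 32 bits and unpacking it again.
    return struct.unpack('f', struct.pack('f', x))[0]

print(to_f32(0.5) == 0.5)  # True: 0.5 is exactly representable in 32 bits
print(to_f32(0.1) == 0.1)  # False: 0.1 gets rounded, so the round trip differs
```

This is why code that hard-codes float keeps working on a double build: values just lose precision at the boundary rather than failing to convert.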

Contributor

interoperate with MicroPython objects: mp_obj_get_float_to_f, mp_obj_get_float_to_d, mp_obj_new_float_from_f

Didn't know about those, thanks! Yes, I'll do that instead.

The HP and HE CPUs have double-precision hardware floating point, so can
use the armv7emdp architecture.

This allows frozen code to use native/viper/asm_thumb decorators.

Fixes issue micropython#17896.

Signed-off-by: Damien George <damien@micropython.org>
@dpgeorge dpgeorge force-pushed the alif-add-mpy-cross-flags branch from 3076d19 to b7cfafc on August 15, 2025 02:45
@dpgeorge dpgeorge merged commit b7cfafc into micropython:master Aug 15, 2025
7 checks passed
@dpgeorge dpgeorge deleted the alif-add-mpy-cross-flags branch August 15, 2025 02:46