Skip to content

ports/rp2: Mark gc_heap NOLOAD for faster boot. #9017

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

Gadgetoid
Copy link
Contributor

@Gadgetoid Gadgetoid commented Aug 5, 2022

I've been doing a fair bit of prodding and poking with MicroPython's startup, and there are a few speed gains to be had.

Notably the zero fill for BSS takes a long time, and the SRAM copy for initialized data even longer, since these both happen before ROSC is configured to any appreciable speed.

Editing Pico SDK's crt0.S entry point and configuring ROSC to ~48MHz dramatically reduces the time these copies take- reducing a MicroPython Pico W cold boot (including our libraries) from ~200ms to ~50ms. This is not something that's easy to do for MicroPython and for our own builds I'm hot patching crt0.S with some extra features...

Instead we can skip zeroing of gc_heap. This change doesn't get us as far as the clock config, but saves ~30ms by marking gc_heap as .noinit which, in turn, ensures it is not zero-filled at startup.

This takes startup from a measured ~156ms down to ~120ms.

Before:
image

After:
image

This change was suggested by @jimmo on raspberrypi/pico-sdk#959

@Gadgetoid
Copy link
Contributor Author

@jimmo I think:

It should be pretty straightforward to put it in a different section (e.g. .noinit), or better do what we do on stm32 where it's just defined entirely in the linker config as a start and end symbol used directly by gc_init().

Might depend upon: #8761

@dpgeorge
Copy link
Member

dpgeorge commented Aug 9, 2022

It would definitely be good to improve rp2 start up speed, and this looks like a nice and simple thing to do to begin with.

@@ -180,6 +180,10 @@ SECTIONS
*(.uninitialized_data*)
} > RAM

.noinit (NOLOAD) : {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you just use the uninitialized_data section? It should do the same thing, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I thought, but in my tests it did not seem to make a difference. I might have to test again lest my assumptions have failed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generated .map file should be enough to tell you if it works or not.

If it does require a new section, I'd suggest calling it uninitialized_bss.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it should probably go after the bss section, and be 4-aligned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think uninitialized_bss is a good shout, since it isn't strictly data (it's not initialized with any known value) that makes it clearer about intent.

@Gadgetoid
Copy link
Contributor Author

This has made me wonder what uninitialized_data is actually for, since I can't find any instance of it being used or even a passing reference to it at all.

I've raised an issue against Pico SDK since there's either something amiss or I'm very confused -
raspberrypi/pico-sdk#966

@Gadgetoid
Copy link
Contributor Author

Just to go a little further with my unitialized_bss is correct hypothesis (I ... can't really speak authoritatively on this stuff so I'm relying on Google and experimentation)-

Normal build:

   text	   data	    bss	    dec	    hex	filename
 702976	    124	 205064	 908164	  ddb84	firmware.elf

With unitialized_bss:

   text	   data	    bss	    dec	    hex	filename
 702976	    124	 205064	 908164	  ddb84	firmware.elf

With unitialized_data:

   text	   data	    bss	    dec	    hex	filename
 702976	 170108	  35080	 908164	  ddb84	firmware.elf

The linker script NOLOAD directive is - I believe - what makes the difference here and pending some clarification for why unitialized_data even exists and what it's intended to achieve I think it's worth an unitialized_bss.

In terms of the final result- you were correct, simply marking gc_heap as uninitialized_data prevented it from being loaded.

Create a new linker section .unitialized_bss for bss
that does not need zero-initialising.

Move gc_heap to .unitialized_bss section.

Saves ~30ms from rising edge of RESET to setting a pin HIGH
in MicroPython.

Zero fill happens in Pico SDK crt0.S before ROSC is configured.
It's very, very slow.

Signed-off-by: Phil Howard <phil@gadgetoid.com>
@Gadgetoid Gadgetoid force-pushed the patch-rp2-noinit-gc_heap branch from 4600547 to eedc9f9 Compare August 10, 2022 09:17
@dpgeorge
Copy link
Member

Rebased and merged in 71f6eb5

@dpgeorge dpgeorge closed this Aug 11, 2022
@dpgeorge
Copy link
Member

Parts of the VM/runtime are relocated to RAM at start up, and this could probably also benefit from improvements. Eg only copy this data after switching to a higher clock frequency.

@Gadgetoid
Copy link
Contributor Author

Parts of the VM/runtime are relocated to RAM at start up, and this could probably also benefit from improvements. Eg only copy this data after switching to a higher clock frequency.

This is the main thrust of my speedup hacks which are applied as a patch to crt0.S and runtime.c before building in our CI.

The Pico SDK maintainers have shown some interest in making this easier, but I don't know how imminent that will be.

You can currently use SKIP_PICO_STANDARD_LINK and SKIP_PICO_RUNTIME flags to remove those altogether and supply your own, but that's pretty drastic.

There's some discussion and links out to relevant issues here - raspberrypi/pico-sdk#959

As I mentioned above, this could potentially bring the stock Pico MicroPython startup to ~30-50ms. A dramatic improvement over even this change, since all copies to RAM (VM, runtime and initialized data) are sped up.

Definitely something to keep in mind, since I think early clock setup - via whatever solution Pico SDK canonicalizes upon - would likely be the easiest way to accomplish what you mention.

@dpgeorge
Copy link
Member

Let's keep an eye on that pico-sdk ticket. There are definitely gains to be made here in terms of improving startup time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants