Description
I'm writing code for multiple ESP32 devices that communicate with each other over BLE, acting as both central and peripheral to each other in a kind of simple mesh network over the Nordic BLE UART protocol. The code makes heavy use of uasyncio, with everything running as tasks. I sync the BLE IRQ callback to the task that handles events through the simple mechanism of pumping the events into a list queue (copying all of the arguments) and then having the Bluetooth task wake every few milliseconds and check this list.
Mostly this is working completely fine, but I'm seeing regular situations where I start a connection to a peripheral and go back to the main loop, which can run away happily for several minutes without any sign that the connection has gone anywhere. If I interrupt execution and return to the prompt I'll then see a PERIPHERAL_CONNECT, a subsequent PERIPHERAL_DISCONNECT and maybe a few other BLE events depending on what else was going on, like a CENTRAL_CONNECT and CENTRAL_DISCONNECT from another one of my units trying to query this one.
I've seen some worrying delays occasionally in central connections and GATT write/notify operations, but nothing of the same magnitude as these peripheral connections that just disappear completely.
With some additional debug prints, I'm certain that the error is my BLE.irq() registered method not being called rather than any problem with the transfer of events from callback land to async world. I previously used a more complicated mechanism for doing this using a custom IRQ-safe stream reader / ioctl, but I was seeing the same errors there and so I went with the simpler approach.
Trawling through the code, I can see only two ways this could be occurring: either the scheduler has stopped processing pending callbacks (which seems unlikely, since calls to mp_handle_pending() seem to be everywhere in the MicroPython code) or the NimBLE stack has stopped reporting them. The fact that the events appear immediately on interrupting my code is super-suspicious, however.
Next step is that I'm going to put something in to keep scheduling a dummy callback so that I can see if the scheduler is doing its work.
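A rough sketch of what such a probe could look like, assuming `micropython.schedule()` is used to queue the dummy callback. The names `counts`, `poke_scheduler` and `pending` are hypothetical, and the fallback `schedule` is only a stand-in so the logic can be tried off-device.

```python
try:
    from micropython import schedule
except ImportError:
    # Off-device stand-in (assumption): run the callback immediately.
    def schedule(func, arg):
        func(arg)

counts = {"requested": 0, "ran": 0, "rejected": 0}


def _heartbeat(_arg):
    # Runs only when the scheduler actually services its queue.
    counts["ran"] += 1


def poke_scheduler():
    # Queue one dummy callback; call this periodically, e.g. from a task.
    try:
        schedule(_heartbeat, None)
        counts["requested"] += 1
    except RuntimeError:
        # micropython.schedule raises RuntimeError when its queue is full.
        counts["rejected"] += 1


def pending():
    # Callbacks accepted by the scheduler that have not yet run.
    return counts["requested"] - counts["ran"]
```

If `pending()` keeps growing, or `rejected` climbs, while BLE events are missing, that would suggest the scheduler is not being serviced; if it stays near zero, the stack not reporting events looks more likely.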
I'm running my code against a fairly recent mainline build. The only addition is a peripheral connection cancellation method (as per PR #6584), which I added when I thought the problem was network timeouts.