esp32: sockets become unresponsive after several larger transactions #12819
Does that problem also happen when you use IDF 5.0.2? I'm asking because MicroPython uses v5.0.2. |
I will try it later or tomorrow. But using the latest ESP32 binary from the MicroPython website yields the same behaviour. |
You can try disabling power management:
import socket, select, time, network
network.WLAN(network.AP_IF).active(False)
sta = network.WLAN(network.STA_IF)
sta.active(True)
sta.config(pm=sta.PM_NONE) |
Doing this after each hard reset seems to give a stable WiFi for me:
import socket, select, time, network
network.WLAN(network.AP_IF).active(False)
sta = network.WLAN(network.STA_IF)
sta.active(True)
sta.config(pm=sta.PM_NONE)
sta.config(txpower=18)
Give it a try. |
Have you tried it with the demo I provided, and can you replicate the issue? Does it go away with your proposed settings? |
No, I didn't try your crash demo. I'll leave that to you. What I have noticed is that with the above-mentioned settings I see less of the
in the debugging log. |
There is one more thing you can try. Build your own custom firmware with these HW_ENABLE_MDNS disabled in "mpconfigport.h".
I have not seen |
What are the implications of the ESP not printing esp_netif_get_ip_info in the first place? |
As I understand it
The esp_netif process will start executing the esp_netif_get_ip_info function when the TCPIP stack detects that it is losing its IP. This is good if it works as it should, but it seems that enabling hardware DNS query/responder makes the wifi system slow and unstable. There must be a good reason why MicroPython has hardware DNS lookup/responder enabled by default, but I cannot answer that. My test telnet server running on an ESP32 is more stable now that I have disabled the HW_ENABLE_MDNS in the firmware. |
Summary: Full text: And of course I tried Board: ESP32 / WROOM Response times as logged per request: |
Thanks for your contribution. Have you tried increasing the transmitted size to roughly 10 KB in at least one direction? With that size I'm having trouble with the first few transactions, and it doesn't respond to new sockets immediately. |
No, I didn't run your code, nor did I modify my code to test it with bigger requests. I just tested my setup (which works on the old firmware) with the newest firmware. |
I ran some tests where an ESP32 runs a simple http server and MicroPython on a LinuxMint runs a simple http client. The logs were from the client script on the Linux PC. The first test was done on an ESP32 flashed with a firmware build using ESP-IDFv5.0.2. At some point during the test, the ESP32 became unresponsive for 63361ms. The second test was on the same ESP32 flashed with firmware based on ESP-IDFv4.4.5. The Wifi/TCPIP stack behaves better with the IDFv4.4.5 based firmware. There was a time when we had to wait 342ms. I think this is a problem. We need a stable wifi running on the ESP32. The ESP-IDFv4xx is no longer supported for future MicroPython versions. The v1.20.0-206-g33b403dfb was the copy I had that was based on ESP-IDFv4xx. The ESP-IDFv5.0.2 based firmware has broken WiFi. Different build settings and different wlan.config settings did not help much (or at all). The problems may be caused by ESP-IDFv5xx and not necessarily by the MicroPython core.
|
Thanks for the report. I'm able to reproduce the problem using the microdot example in the original post here. With v1.20.0 responses are returned regularly, within 1 second for the vast majority of requests. With current master there are responses that take a long time, up to 100 seconds. |
@Tobinator32 can you please try the firmware that is broken, and at the start of each HTTP request print out the available memory, like this:
From my tests it seems that the IDF is slowly running out of heap memory, and when there's none left it just blocks the TCP traffic. |
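A minimal sketch of this kind of per-request memory printout, not necessarily the snippet the comment refers to. It assumes the standard MicroPython `gc.mem_free()` and the ESP32 port's `esp32.idf_heap_info()`; the helper name `heap_report` is hypothetical, and the guards let it fall back gracefully when those modules are missing.

```python
import gc


def heap_report():
    """Return a one-line summary of free memory, suitable for logging per request."""
    parts = []
    # gc.mem_free() reports free bytes on the MicroPython heap (not present on CPython).
    if hasattr(gc, "mem_free"):
        parts.append("gc free=%d" % gc.mem_free())
    try:
        import esp32  # only available on the ESP32 port

        # Each region is a 4-tuple: (total bytes, free bytes,
        # largest free block, minimum free seen over time).
        regions = esp32.idf_heap_info(esp32.HEAP_DATA)
        parts.append("idf free=%d" % sum(r[1] for r in regions))
    except ImportError:
        pass
    return ", ".join(parts) or "heap info unavailable"


# Call this at the start of each HTTP request handler.
print(heap_report())
```

If the IDF free figure falls steadily across requests while the MicroPython figure recovers after collection, that would be consistent with the IDF heap being the one that runs out.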
Thanks for the answer. Yes, I can try it in the evening or tomorrow. |
crash_log.txt |
Actually I am experiencing a similar issue with the Raspberry Pico W with Micropython and Microdot. At some point, the web sockets become unresponsive with hard reset being the only solution. |
I get good results with this setting in the "boards/ESP32_GENERIC/sdkconfig.debug" file:
The firmware was compiled with the board variant
Perhaps others can take the time to try and verify for themselves. |
@maxi07 the issue here was first reported on esp32, and likely the cause is specific to that platform. For the problem that you see on Pico W, can you please open a separate issue and describe the bug there, and also provide code that can be run to reproduce your issue. Then it can be investigated. |
EDIT: Sorry, some of this is wrong but unclear which part. Will debug some more and post again. |
Using @dpgeorge's reproduction code (esp32 port server, unix port client), I can confirm at least one problem is related to port reuse in the client and the TIME_WAIT state on the server: Recall that each IPv4 TCP connection is uniquely identified by a 4-tuple: source IP address, destination IP address, source port, destination port.
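To make the 4-tuple point concrete, here is a toy model of a server that holds closed connections in TIME_WAIT for 2 * MSL and refuses a new connection that reuses the same 4-tuple during that window. This is a simplification for illustration only: the class name and its methods are hypothetical, and real TCP stacks (LWIP included) apply more nuanced rules than a plain lookup.

```python
MSL_SECONDS = 60  # default MSL mentioned in this thread


class TimeWaitTable:
    """Toy model: remember closed 4-tuples for 2 * MSL seconds."""

    def __init__(self, msl=MSL_SECONDS):
        self.hold = 2 * msl
        self.entries = {}  # 4-tuple -> time at which TIME_WAIT expires

    def close(self, conn, now):
        # The connection enters TIME_WAIT when it is closed.
        self.entries[conn] = now + self.hold

    def accepts(self, conn, now):
        expiry = self.entries.get(conn)
        if expiry is not None and now < expiry:
            return False  # same 4-tuple still in TIME_WAIT
        self.entries.pop(conn, None)  # expired entry, forget it
        return True


table = TimeWaitTable()
# (source IP, destination IP, source port, destination port)
first = ("192.168.4.2", "192.168.4.1", 50000, 80)
table.close(first, now=0)
print(table.accepts(first, now=30))   # client reused port 50000 too soon
print(table.accepts(("192.168.4.2", "192.168.4.1", 50001, 80), now=30))
print(table.accepts(first, now=120))  # TIME_WAIT has expired
```

Only the source port differs between the first two checks, which is why a client that cycles quickly through a small port range can collide with its own earlier connections.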
On MicroPython V1.20 this doesn't seem to happen. I've seen duplicate ports within 30 seconds of each other, and the server accepts them. In theory LWIP is configured with the same MSL period for TIME_WAIT. I think this could be one of multiple things:
Workarounds
@Tobinator32 it's not 100% clear that this is even the root cause for the issue you initially reported. Are you able to test one of these workarounds (suggest the first one of MSL) and see if the problem goes away? |
Do you mean the suggestion from @shariltumin only? The thing is, my webserver demo doesn't have exactly the same issue: it starts freezing in the middle of a file transfer, not after establishing a new connection. Yes, it refuses new connections, but only after the file transfer has crashed. I'm slowly coming to think that the described issues are part of the problem I'm seeing but not the whole picture. Maybe you can try the demo I uploaded on GitHub. Just use the latest MicroPython binary and upload the few files. |
I did mean that suggestion (sorry for the vague wording), but I had a feeling this might be a different issue. I'll try your reproduction code soon. |
Hi @Tobinator32, Thanks for explaining again. I was able to reproduce most of what you describe with your demo project. In my case, although I saw some transfers fail (hang part way), a subsequent transfer would always succeed if retried after a short time (but it might take a minute or so to recover in the worst case). My reproduction strategy was very simple, running two concurrent terminals with:
while wget --tries=1 --timeout=10 -O/dev/null http://192.168.4.1/new/images/200_1.jpg; do :; done
and
while wget --tries=1 --timeout=10 -O/dev/null http://192.168.4.1/new/images/200.jpg; do :; done
I think with more concurrent transfers it'd fail sooner, but I could reproduce within a minute or so with just two. There are two different failure modes:
Mostly I think the problems you're seeing are (1) not (2), but I saw both. Sockets in TIME_WAIT seem to be the main culprit. I hadn't realised until today, LWIP doesn't place any hard limit on the number of these in the system - if you make requests fast enough then they'll keep growing and taking up RAM until they time out. In my tests I could see over 250 (the structure is 208 bytes plus heap overhead, so this would be over 50KB of RAM!) Until there's a patch for that, the best workaround is to rebuild MicroPython with @Tobinator32 are you able to please give that a try, and see if the problem still appears for you? |
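The RAM figure above can be checked with a quick calculation. The 208-byte structure size and the count of 250 come from the comment; the per-allocation heap overhead is not stated there, so the 16-byte value used in the second call is purely an assumed placeholder, and the function name is hypothetical.

```python
PCB_BYTES = 208  # size of one TIME_WAIT structure, as quoted in the comment


def time_wait_ram(count, pcb_bytes=PCB_BYTES, heap_overhead=0):
    """Bytes of RAM held by `count` sockets in TIME_WAIT."""
    return count * (pcb_bytes + heap_overhead)


print(time_wait_ram(250))                    # structures only: 52000 bytes
print(time_wait_ram(250, heap_overhead=16))  # with an assumed 16-byte overhead each
```

Even without any allocator overhead, 250 entries already exceed 50 KB, which is a large share of the free heap on an ESP32 without external RAM.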
Thanks for the great test project, btw. Much appreciated.
I don't know what is calling this, I think possibly it's coming from wifi library event being processed inside ESP-IDF. "Debug" level logs tend to be a little spammy, unfortunately. Note that once #12900 is merged it'll be straightforward to turn these on and off from Python again. (I'm not sure if that will work with the firmware you're using now.) |
The frequency of the esp_netif_get_ip_info() calls can be reduced by changing the value of I got a good result by changing the value from 5000 to 50000.
|
Maybe I can finally loop back in. @projectgus yes, please upload the binary somewhere so that it's easier for me to test. I guess I'll run my initial webserver test with http cat and see whether it gets stuck or crashes. I can do this around the weekend. |
Default MSL is 60 seconds. The main purpose of this is to shorten time closed sockets spend in the TIME_WAIT state. This is (2 * MSL) so previously 2 minutes, with this change becomes 20 seconds. Without this, a system which makes many short-lived TCP connections can grow to be holding hundreds of sockets in TIME_WAIT. Possible fix for micropython#12819 This work was funded through GitHub Sponsors. Signed-off-by: Angus Gratton <angus@redyak.com.au>
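The effect of the MSL change can be estimated with a little arithmetic. The sketch below assumes a constant connection rate and that every closed connection spends the full 2 * MSL in TIME_WAIT on the server. The function names and the example rate of 2 connections per second are illustrative, and the MSL is given in milliseconds, matching the CONFIG_LWIP_TCP_MSL values quoted in this thread.

```python
def time_wait_seconds(msl_ms):
    """TIME_WAIT duration is 2 * MSL; convert from milliseconds to seconds."""
    return 2 * msl_ms / 1000


def steady_state_time_wait(conn_per_second, msl_ms):
    """Approximate number of sockets sitting in TIME_WAIT at a constant rate."""
    return conn_per_second * time_wait_seconds(msl_ms)


for msl_ms in (60000, 10000):  # default MSL vs. the shortened 10 s MSL
    print(msl_ms, time_wait_seconds(msl_ms), steady_state_time_wait(2, msl_ms))
```

At the same request rate, shortening the MSL from 60 s to 10 s cuts the expected number of lingering sockets by a factor of six.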
@Tobinator32 Great, thank you. There are two ESP32_GENERIC firmwares in this zip file: issue12819_esp32_generic_firmwares.zip Please try flashing If "10s" version doesn't work, please test again with The firmware is built from this branch: |
Hello together, |
Thank you for posting this thread, it was very helpful to know a prior version 1.19 that works well. I am using ESP32-WROOM-32D DevKitC for a small project that I need to update remotely. Due to WebREPL instability, I setup an API using Microdot to upload changed .MPY files (up to 15 ranging from 400-8500 bytes) and restart.
With 1.22, it regularly stopped responding, sometimes after only 3 or 4 requests. Connected to USB, I could see these exceptions for OSError ECONNABORTED.
Most of the time the client would never respond, but sometimes would show:
Additional requests could still be made from a different client. USB console shows the request received which was enough to remotely reset but responses rarely worked. I have tried this workaround CONFIG_LWIP_TCP_MSL=6000 from above without any luck. I was looking into options to try and catch the exceptions properly and restart microdot or the Wifi, but changing to 1.19 (IDF 4.4.4) has fixed this completely even though there is less memory available (only 20k compared to 65k). A load test that repeatedly uploaded all files, ran for over 12 hours with an occasional MemoryError. I have a spare board to experiment with and a repeatable test, so would be happy to try any other ideas and report back. The issues with WebREPL could be related, but not solved by changing to older version. Best regards, Daniel. |
Hi! Thank you so much for the precompiled firmwares! In my case ESP32_GENERIC_MSL_10s.bin works great! None of the official ones work as expected. I have tried to compile your branch on my own, and also tried the official master branch with CONFIG_LWIP_TCP_MSL=10000, but both behave like the official builds. Can you please tell me how to compile that firmware? I have tried both IDF v5.0.2 and IDF v5.1.2. I see 'with-newlib4.1.0' in the name; maybe the way you compile is the key to fixing the unresponsive sockets? Thanks! |
I was encouraged by the most recent comment about ESP32_GENERIC_MSL_10s.bin and just downloaded and tested it on ESP32-WROOM-32D DevKitC. Unfortunately I still see the same ECONNABORTED error as reported earlier. This is after running for only 1 min and uploading the same 16kb file just 3 times, with the following 2 exceptions seen.
1.19 is still working perfectly. Regards, |
Hello, I was suffering from the same bug. I tested the 10s version and it improves things a lot. I'm adding myself to this issue to follow up. Thanks |
Hello, how did you manage to get 1.19 running? I tried installing it, but asyncio is not present there and is needed by Microdot. I got the binary from here: https://micropython.org/download/ESP32_GENERIC/ I'm asking because the 10s build has ended up aborting connections again after a minimal application code change. Thanks |
Hi, |
@Tobinator32 Actually I got 1.19 from the MicroPython download page. |
The only way I had success was to build 1.19 with IDF 4.4.4 myself. This way it's rock solid. asyncio was not an issue in my case and was running perfectly. As stated, though, I haven't used the "official" binaries. |
Can you share your binary? None of the options here fix the issues for me in the end. |
I can look into that tomorrow. I have to build the image without some customer specific sources. |
To save memory I also compile Micropython 1.19 with a manifest file to freeze all of my dependencies. My application has been stable on this version. I have been trying newer versions and combinations without any luck, as shown above. Microdot works reliably on the downloaded ESP32_GENERIC.bin if you copy the uasyncio files from Micropython source.
Copy the whole uasyncio subdirectory (from extmod/) to your ESP32 filesystem. |
Thanks for this info. I tried it, and the error is that it is effectively running out of RAM. My application is not memory intensive, so I ask: what could I do to preserve more RAM? |
I did not realize I was running Microdot with a lot of extensions by default. Doing some more inspecting, it appears the garbage collector is not freeing the RAM fast enough.
Right before Microdot fails (each line is 1 second apart). In contrast, forcing a collect:
|
Some good news: this issue is resolved for me running ESP32-WROOM-32D with 1.23.0 and IDF 5.0.5. A test runs for hours on end with no errors, but failed within 2 minutes using MicroPython 1.22.0 or 1.23.0 with IDF 5.0.4. By setting gc.threshold to 45k, garbage collection happens automatically and reliably. I no longer need to call gc.collect before, during and after API calls. I was unable to build and test an image with IDF 5.2.1 due to a separate issue "region `iram0_0_seg' overflowed by 84 bytes". WebREPL also works properly now. From the release notes I can't see exactly what would have fixed the problem. |
@dcox761 that's interesting that IDF 5.0.5 works for you. I just tried MicroPython commit ee10360 with board ESP32_GENERIC, and the following IDF versions: 5.0.4, 5.0.5, 5.0.6. The problem is still there for all of these IDF versions: out of 500 requests at least one of those requests takes a long time to complete (e.g. 36 seconds, 68 seconds). So, not sure what to make of that. |
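A minimal sketch of the gc.threshold setting described above, assuming "45k" means 45 * 1024 bytes (the comment does not give the exact value). `gc.threshold()` is a MicroPython extension that triggers a collection after that many bytes have been allocated, so the helper below (a hypothetical name) does nothing on CPython.

```python
import gc

# "45k" from the comment; the exact byte value is an assumption.
THRESHOLD_BYTES = 45 * 1024


def configure_gc(threshold=THRESHOLD_BYTES):
    """Ask MicroPython to collect after `threshold` bytes have been allocated."""
    if hasattr(gc, "threshold"):  # MicroPython only
        gc.threshold(threshold)
        return True
    return False  # CPython: no gc.threshold(), leave the collector alone


configure_gc()
```

Calling this once at startup, before the web server begins accepting requests, would replace the manual gc.collect() calls around each API call.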
I did notice some of the requests took much longer, up to 1 min but they did complete and it seemed to recover and continue normally. On previous versions I either noticed memory errors or OSError: [Errno 113] ECONNABORTED. |
Have been debugging issue #15844 which looks very similar to this one. Some more detail in that thread.
(If anyone needs a build for a different board, let me know.) |
Candidate fix in the linked PR, and I've posted some builds there if anyone is able to test: #15952 (comment) |
hello,
I'm encountering massive issues when using microdot or microwebsrv2 to serve ~5 requests of ~10 KB each on an ESP32 WROOM-32 (no additional memory). The worst case is immediate unresponsiveness to any new connection. It stops serving the open connections and doesn't accept any new ones, which forces me to do a hard reset to recover functionality. Reducing the backlog of the async web_server to 2 doesn't really help.
This issue is consistently reproducible with the latest MicroPython build with IDF 5.1 or the latest µP binary from the website.
When using 1.19.1 with IDF 4.4.4 it runs for hours, serving 100 KB per website with two clients connected and requests sent every 10 s, without any issue.
I already made an example for the microdot dev available here: https://github.com/Tobinator32/microdot_crash_demo
Any advice on how to get it up and running with the latest µP and IDF would be greatly appreciated.