esp32: Apply the LWIP active TCP socket limit. #15952
Conversation
Uploading some builds for generic boards here if anyone would like to test:
EDIT: instructions for flashing below: #15952 (comment)
Force-pushed from a7d85e6 to 5a19a8d
Ahh, our good old friend TIME-WAIT. This looks pretty good to me, a relatively small and self-contained workaround. I will test it.

Yeah 😅. I originally thought I could do this by calling an internal LWIP API, but it was a lot fiddlier than this.

As this TIME-WAIT phenomenon seems to have haunted me on and off for the past decade, I've decided that I'll put some of my own time into it if the lwIP developers are amenable: https://lists.nongnu.org/archive/html/lwip-devel/2024-10/msg00000.html

Very good!
Tried the C3 image on two different ESP32-C3 boards; both of them got stuck in a boot loop:

Flashing https://micropython.org/resources/firmware/ESP32_GENERIC_C3-20241003-v1.24.0-preview.378.gca6723b14.bin brought both of them back to life.

And the same goes for the ESP32-S3 image.
Testing

I have tested this PR. My test setup is:
I ran the following tests:
Current master with ESP32_GENERIC

I ran this for 10 minutes and got 9 failures, all of which were the same error. That's about 54 failures per hour.

This PR with ESP32_GENERIC

I ran this test for 37 hours. There were 400297 iterations of the client loop, ie that many attempted requests. Out of those, 49 had an error, and all of those were of this form:

The first such error was at connection number 1338, after 466 seconds. The rest were more or less evenly distributed over the 37 hours. The failure rate here is 1.32 failures per hour, about 40 times less than the failure rate on master.

Current master on PYBD-SF6

As a comparison to this PR, I also ran the same test in parallel on master on PYBD-SF6 (using the same PC as the client, and the same access point). This board did nearly twice as many connections in the same time: there were 715801 iterations of the client loop, and 9 failures. One failure (the first, at iteration 36012) was:

The other 8 failures (the first at iteration 53639, the rest more or less distributed throughout the 37 hours) were the same error. The failure rate here is 0.24 failures per hour. But because this board did more iterations, the equivalent failure rate for comparison with ESP32_GENERIC would be 0.14, which is about 10 times less than ESP32_GENERIC with this PR.
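For readers checking the numbers, the failure-rate arithmetic above reproduces directly (an illustrative addition, using only the figures from this comment):

```python
print(9 / (10 / 60))   # master, ESP32_GENERIC: 54.0 failures/hour
print(49 / 37)         # this PR, ESP32_GENERIC: ~1.32 failures/hour
print(9 / 37)          # master, PYBD-SF6: ~0.24 failures/hour
# PYBD-SF6 rate scaled to ESP32_GENERIC's iteration count: ~0.14/hour
print((9 / 37) * (400297 / 715801))
```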
Summary

It would be interesting to see if implementing "accept an out-of-window SYN on a TIME-WAIT socket" in lwIP would really reduce the failure rate to zero. But that's a lot of work, and probably impossible for us to integrate into MicroPython without forking lwIP and ESP-IDF.

Client test code

Modified version of @projectgus's client code above to track errors:

```python
import sys
import time
import requests

IP = sys.argv[1]

start = time.time()
errors = []
try:
    for i in range(1000_000):
        try:
            r = requests.get(f'http://{IP}/*JOY;{i};0;0;0;0;0', timeout=20)
            r.close()
        except Exception as er:
            print("Exception:", repr(er))
            errors.append((i, time.time() - start, er))
        print(i, time.time() - start, len(errors))
        time.sleep(0.15)
except KeyboardInterrupt:
    for er in errors:
        print(er)
```
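For reference, this client takes the target device's IP address as its only command-line argument, e.g. `python3 client.py 192.168.4.1` (the filename and address here are illustrative placeholders).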
Sorry @TRadigk for the missing instructions. The .bin files attached above are only the app, while the .bin files published on the website are the full flash contents, which incorporate some other binary files into one. To test: first flash a "full" MicroPython firmware .bin from the website, and then do:
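A hedged sketch of what that app-only flash might look like, assuming esptool's Python entry point, a placeholder serial port, and the typical 0x10000 MicroPython app-partition offset (this is an illustrative addition, not the exact command from the comment; check your board's partition table before flashing):

```python
# Hypothetical sketch: write just the application binary over an
# already-flashed full firmware. The serial port and the 0x10000
# app-partition offset are assumptions.
import esptool

esptool.main([
    "--port", "/dev/ttyUSB0",
    "write_flash",
    "0x10000", "micropython.bin",
])
```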
Very thorough testing, @dpgeorge! Nice.
FWIW I think most of them probably do; the ones I tested with a running packet capture did 100% (on an otherwise quiet Wi-Fi network with good signal strength). Port reuse was sometimes really rapid (within a few subsequent connections); I don't fully understand why, but I guess I have a lot of browser tabs open! However, I fully agree with your other point that you can't ever assume a robust network, and code should be prepared for some failures.

I think so too. Just want to make the point for anyone reading along that, similar to reducing MSL, reducing the maximum number of active TCP PCBs is a trade-off.
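To make the "code should be prepared for some failures" advice concrete, here is a minimal sketch (an illustrative addition, assuming the same `requests` package as the test client) of a client that retries transient failures with a short backoff:

```python
import time
import requests


def get_with_retries(url, attempts=3, timeout=20):
    # Retry transient network failures (resets, timeouts) a few times
    # before giving up, with a short linear backoff between attempts.
    for attempt in range(attempts):
        try:
            return requests.get(url, timeout=timeout)
        except requests.RequestException as exc:
            print(f"attempt {attempt + 1} failed: {exc!r}")
            time.sleep(0.5 * (attempt + 1))
    raise RuntimeError(f"giving up after {attempts} attempts: {url}")
```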
Thank you so much, @projectgus, for clearing this up. Now I was able to fully test and compare ESP32_GENERIC_C3-20241003-v1.24.0-preview.378.gca6723b14 and the provided "app". The result is pretty clear to me: on the previous version I could execute (at best) 73 consecutive requests before Wi-Fi broke down.
This is a workaround for a bug in ESP-IDF where the configuration setting for maximum active TCP sockets (PCBs) is not applied.

Fixes cases where a lot of short-lived TCP connections can cause:

- Excessive memory usage (unbounded number of sockets in TIME-WAIT).
- Much higher risk of stalled connections due to repeated port numbers.

The maximum number of active TCP PCBs is reduced from 16 to 12 to further reduce this risk (trade-off against possibility of TIME-WAIT Assassination as described in RFC1337).

This is not a watertight fix for the second point: a peer can still reuse a port number while a previous socket is in TIME-WAIT, and LWIP will reject that connection (in an RFC-compliant way), causing the peer to stall.

This work was funded through GitHub Sponsors.

Signed-off-by: Angus Gratton <angus@redyak.com.au>
Force-pushed from 5a19a8d to 82e69df
Although not a complete fix, this PR improves things dramatically. Merged.
Summary
This is a workaround for a bug in ESP-IDF where the configuration setting for LWIP maximum active TCP sockets (PCBs) is not applied. See espressif/esp-idf#9670
Fixes cases where a lot of short-lived TCP connections can cause:

- Excessive memory usage (unbounded number of sockets in TIME-WAIT).
- Much higher risk of stalled connections due to repeated port numbers.
This is not a 100% fix for the second point: a peer can still reuse a port number while a previous socket is in TIME-WAIT, and LWIP will reject that connection (in an RFC-compliant way), causing the peer to stall.

(Note that this may not be a complete fix for every failure reported in those issues, but it should fix most of them. We will need additional info to reproduce any remaining problems, so it's probably worth opening a new issue.)
Testing
Using the test program supplied in the issue report for #15844, and the test client shown (in modified form) in the testing comment above.
The frequency of TCP local port reuse depends on the client system (in this case desktop Linux), so results may vary depending on the host OS and other network usage.
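Local port reuse can be observed directly from the client side. A small sketch (an illustrative addition, with the device IP and HTTP port 80 as assumptions) that records the ephemeral source port of each short-lived connection:

```python
import socket
import sys
import time

IP = sys.argv[1]  # device IP, as in the test client above

last_seen = {}  # local port -> iteration at which it was last used
for i in range(1000):
    s = socket.create_connection((IP, 80), timeout=20)
    port = s.getsockname()[1]  # ephemeral local port picked by the OS
    if port in last_seen:
        print(f"iteration {i}: port {port} reused (last used at {last_seen[port]})")
    last_seen[port] = i
    s.close()
    time.sleep(0.15)
```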
Trade-offs and Alternatives
This work was funded through GitHub Sponsors.