-
-
Notifications
You must be signed in to change notification settings - Fork 4.2k
Lambda: fix transient connection errors on first container invoke with retry logic #12522
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
LocalStack Community integration with Pro 2 files ± 0 2 suites ±0 1h 33m 42s ⏱️ - 19m 24s Results for commit a7fa7f1. ± Comparison against base commit 1724451. This pull request removes 1183 and adds 20 tests. Note that renamed tests count towards both.
This pull request removes 102 skipped tests and adds 11 skipped tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results. |
for attempt_count in range(max_retry_on_connection_error + 1): # 1 initial + n retries | ||
try: | ||
response = requests.post(url=invocation_url, json=payload, proxies=proxies) | ||
response.raise_for_status() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The response is checked on line 200 as well - this will prevent the other error handling, and leak request errors to the callers (instead of the expected InvokeSendError). We should not raise for the status here, I think the connection error should still be raised, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes indeed I could confirm; I removed this raise in the invoke routine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the change!
Motivation
This change improves the robustness of the container invocation logic by addressing a timing issue observed during Step Functions integration tests #12512, where transient ConnectionErrors occurred despite the container signaling readiness via
startup_future
. Although the startup flag may be marked as FINISHED, this does not guarantee that the container is immediately ready to accept incoming connections. To handle this transitional state, the _perform_invoke method has been updated to automatically retry on requests.exceptions.ConnectionError up to_INVOCATION_CONNECTION_ERROR_MAX_RETRY
times, using exponential backoff defined by_INVOCATION_CONNECTION_ERROR_BACKOFF_FACTOR_SECONDS
.Alternative solutions were considered:
startup_future
within thestatus_read
logic was explored, but we currently lack a reliable mechanism to verify full container readiness without issuing an actual invocationstartup_future
resolution until the first successful invoke was also evaluated, but since transient network errors may occur beyond the initial request, a retry mechanism on every invoke call was deemed both simpler and more robust.Although this retry logic is primarily expected to benefit the first invocation following container startup, it applies to all invokes to ensure resilience under intermittent network conditions.
Changes
total_max_attempts
back to 1, as this issue appears now resolved