Lambda: fix transient connection errors on first container invoke with retry logic #12522

MEPalma · 2025-04-14T10:25:03Z

Motivation

This change improves the robustness of the container invocation logic by addressing a timing issue observed during Step Functions integration tests #12512, where transient ConnectionErrors occurred despite the container signaling readiness via startup_future. Although the startup flag may be marked as FINISHED, this does not guarantee that the container is immediately ready to accept incoming connections. To handle this transitional state, the _perform_invoke method has been updated to automatically retry on requests.exceptions.ConnectionError up to _INVOCATION_CONNECTION_ERROR_MAX_RETRY times, using exponential backoff defined by _INVOCATION_CONNECTION_ERROR_BACKOFF_FACTOR_SECONDS.

Alternative solutions were considered:

Awaiting the startup_future within the status_read logic was explored, but we currently lack a reliable mechanism to verify full container readiness without issuing an actual invocation.
Similarly, deferring the startup_future resolution until the first successful invoke was also evaluated, but since transient network errors may occur beyond the initial request, a retry mechanism on every invoke call was deemed both simpler and more robust.
Although this retry logic is primarily expected to benefit the first invocation following container startup, it applies to all invokes to ensure resilience under intermittent network conditions.

Changes

Implemented retry logic in _perform_invoke for requests.exceptions.ConnectionError
Configured retries using _INVOCATION_CONNECTION_ERROR_MAX_RETRY with exponential backoff via _INVOCATION_CONNECTION_ERROR_BACKOFF_FACTOR_SECONDS
Reverted Step Functions' total_max_attempts back to 1, as this issue now appears to be resolved
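
The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the function name `invoke_with_retry`, the constant values, and the injected `send` and `sleep` callables are hypothetical, and Python's built-in `ConnectionError` stands in for `requests.exceptions.ConnectionError`. The commit list mentions a backoff utility, whose API is not shown here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Illustrative values only; the PR does not state the actual constants.
MAX_RETRY = 3
BACKOFF_FACTOR_SECONDS = 0.1


def invoke_with_retry(
    send: Callable[[], T],
    max_retry: int = MAX_RETRY,
    backoff_factor: float = BACKOFF_FACTOR_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(max_retry + 1):  # 1 initial attempt + max_retry retries
        try:
            return send()
        except ConnectionError:
            if attempt == max_retry:
                raise  # retries exhausted: surface the last connection error
            # Wait backoff_factor * 2^attempt seconds before the next attempt.
            sleep(backoff_factor * (2 ** attempt))
    raise AssertionError("unreachable for max_retry >= 0")


# Usage: a sender that fails twice before the "container" accepts connections.
calls = {"n": 0}


def flaky_send() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("container not accepting connections yet")
    return "ok"


delays = []  # record the requested waits instead of actually sleeping
result = invoke_with_retry(flaky_send, sleep=delays.append)
```

Injecting `sleep` keeps the sketch testable without real waiting; in this usage the call succeeds on the third attempt after two recorded waits.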

github-actions · 2025-04-14T11:21:35Z

LocalStack Community integration with Pro

2 files ±0 · 2 suites ±0 · 1h 33m 42s ⏱️ -19m 24s
3 189 tests -1 163 · 2 934 ✅ -1 072 · 255 💤 -91 · 0 ❌ ±0
3 191 runs -1 163 · 2 934 ✅ -1 072 · 257 💤 -91 · 0 ❌ ±0

Results for commit a7fa7f1. ± Comparison against base commit 1724451.

This pull request removes 1183 and adds 20 tests. Note that renamed tests count towards both.

tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_lambda_dynamodb
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_opensearch_crud
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_search_books
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_setup
tests.aws.scenario.kinesis_firehose.test_kinesis_firehose.TestKinesisFirehoseScenario ‑ test_kinesis_firehose_s3
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_destination_sns
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_infra
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_prefill_dynamodb_table
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input0-SUCCEEDED]
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input1-SUCCEEDED]
…

tests.aws.services.cloudformation.api.test_changesets.TestCaptureUpdateProcess ‑ test_execute_with_ref
tests.aws.services.cloudformation.v2.test_change_set_conditions.TestChangeSetConditions ‑ test_condition_add_new_negative_condition_to_existent_resource
tests.aws.services.cloudformation.v2.test_change_set_conditions.TestChangeSetConditions ‑ test_condition_add_new_positive_condition_to_existent_resource
tests.aws.services.cloudformation.v2.test_change_set_conditions.TestChangeSetConditions ‑ test_condition_update_adds_resource
tests.aws.services.cloudformation.v2.test_change_set_conditions.TestChangeSetConditions ‑ test_condition_update_removes_resource
tests.aws.services.cloudformation.v2.test_change_set_mappings.TestChangeSetMappings ‑ test_mapping_addition_with_resource
tests.aws.services.cloudformation.v2.test_change_set_mappings.TestChangeSetMappings ‑ test_mapping_deletion_with_resource_remap
tests.aws.services.cloudformation.v2.test_change_set_mappings.TestChangeSetMappings ‑ test_mapping_key_addition_with_resource
tests.aws.services.cloudformation.v2.test_change_set_mappings.TestChangeSetMappings ‑ test_mapping_key_deletion_with_resource_remap
tests.aws.services.cloudformation.v2.test_change_set_mappings.TestChangeSetMappings ‑ test_mapping_key_update
…

This pull request removes 102 skipped tests and adds 11 skipped tests. Note that renamed tests count towards both.

tests.aws.scenario.kinesis_firehose.test_kinesis_firehose.TestKinesisFirehoseScenario ‑ test_kinesis_firehose_s3
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input4-FAILED]
tests.aws.scenario.mythical_mysfits.test_mythical_misfits.TestMythicalMisfitsScenario ‑ test_deployed_infra_state
tests.aws.scenario.mythical_mysfits.test_mythical_misfits.TestMythicalMisfitsScenario ‑ test_populate_data
tests.aws.scenario.mythical_mysfits.test_mythical_misfits.TestMythicalMisfitsScenario ‑ test_user_clicks_are_stored
tests.aws.services.cloudcontrol.test_cloudcontrol_api.TestCloudControlResourceApi ‑ test_api_exceptions
tests.aws.services.cloudcontrol.test_cloudcontrol_api.TestCloudControlResourceApi ‑ test_create_exceptions
tests.aws.services.cloudcontrol.test_cloudcontrol_api.TestCloudControlResourceApi ‑ test_create_invalid_desiredstate
tests.aws.services.cloudcontrol.test_cloudcontrol_api.TestCloudControlResourceApi ‑ test_double_create_with_client_token
tests.aws.services.cloudcontrol.test_cloudcontrol_api.TestCloudControlResourceApi ‑ test_lifecycle
…

tests.aws.services.cloudformation.api.test_changesets.TestCaptureUpdateProcess ‑ test_execute_with_ref
tests.aws.services.cloudformation.v2.test_change_set_conditions.TestChangeSetConditions ‑ test_condition_add_new_negative_condition_to_existent_resource
tests.aws.services.cloudformation.v2.test_change_set_conditions.TestChangeSetConditions ‑ test_condition_add_new_positive_condition_to_existent_resource
tests.aws.services.cloudformation.v2.test_change_set_conditions.TestChangeSetConditions ‑ test_condition_update_adds_resource
tests.aws.services.cloudformation.v2.test_change_set_conditions.TestChangeSetConditions ‑ test_condition_update_removes_resource
tests.aws.services.cloudformation.v2.test_change_set_mappings.TestChangeSetMappings ‑ test_mapping_addition_with_resource
tests.aws.services.cloudformation.v2.test_change_set_mappings.TestChangeSetMappings ‑ test_mapping_deletion_with_resource_remap
tests.aws.services.cloudformation.v2.test_change_set_mappings.TestChangeSetMappings ‑ test_mapping_key_addition_with_resource
tests.aws.services.cloudformation.v2.test_change_set_mappings.TestChangeSetMappings ‑ test_mapping_key_deletion_with_resource_remap
tests.aws.services.cloudformation.v2.test_change_set_mappings.TestChangeSetMappings ‑ test_mapping_key_update
…

♻️ This comment has been updated with latest results.

dfangl · 2025-04-18T10:25:07Z

localstack-core/localstack/services/lambda_/invocation/executor_endpoint.py

+        for attempt_count in range(max_retry_on_connection_error + 1):  # 1 initial + n retries
+            try:
+                response = requests.post(url=invocation_url, json=payload, proxies=proxies)
+                response.raise_for_status()


The response is also checked on line 200, so raising here bypasses that error handling and leaks request errors to the callers (instead of the expected InvokeSendError). We should not raise for the status here; I think the connection error should still be raised, right?

MEPalma · 2025-04-22

Yes, I can confirm; I removed this raise from the invoke routine.
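
The concern raised in this thread can be illustrated with a small sketch. Everything here is a stand-in written for illustration: `FakeResponse`, `HTTPError`, the local `InvokeSendError` class, and both `invoke_*` functions are hypothetical, and the "status below 400 means success" check is an assumption about what the later status check does, not a copy of the code in executor_endpoint.py.

```python
from dataclasses import dataclass


class InvokeSendError(Exception):
    """Stand-in for the error callers expect when an invoke cannot be sent."""


class HTTPError(Exception):
    """Stand-in for requests.exceptions.HTTPError."""


@dataclass
class FakeResponse:
    status_code: int

    def raise_for_status(self) -> None:
        # Mirrors the behaviour of requests: raise on 4xx and 5xx responses.
        if 400 <= self.status_code < 600:
            raise HTTPError(f"status {self.status_code}")


def invoke_with_early_raise(response: FakeResponse) -> None:
    # Raising here means the check below is never reached for error
    # responses, so callers see HTTPError instead of InvokeSendError.
    response.raise_for_status()
    if response.status_code >= 400:
        raise InvokeSendError(f"Error Code: {response.status_code}")


def invoke_without_early_raise(response: FakeResponse) -> None:
    # Only the status check decides the outcome; callers see InvokeSendError.
    if response.status_code >= 400:
        raise InvokeSendError(f"Error Code: {response.status_code}")


def error_type(func, response: FakeResponse) -> str:
    """Return the name of the exception func raises, or 'none'."""
    try:
        func(response)
    except Exception as exc:
        return type(exc).__name__
    return "none"


leaked = error_type(invoke_with_early_raise, FakeResponse(500))
expected = error_type(invoke_without_early_raise, FakeResponse(500))
```

With the early raise, the error type reaching the caller changes, which is why the raise was dropped while connection errors are still handled by the retry loop.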

dfangl

Thanks for making the change!

retry on connection error

30eff03

MEPalma added the "semver: minor" label (non-breaking changes which can be included in minor releases, but not in patch releases) Apr 14, 2025

MEPalma added this to the 4.4 milestone Apr 14, 2025

MEPalma self-assigned this Apr 14, 2025

MEPalma requested review from joe4dev, gregfurman, dominikschubert and dfangl as code owners April 14, 2025 10:25

MEPalma added 2 commits April 14, 2025 13:46

backoff rate reduce

2fe8c91

use backoff util

c37d37d

dfangl reviewed Apr 18, 2025

rm raise in invoke routine

a7fa7f1

dfangl approved these changes Apr 23, 2025

MEPalma merged commit bebff5e into master Apr 23, 2025
31 checks passed

MEPalma deleted the MEP-Lambda-fix_connection_error branch April 23, 2025 12:17
