Skip to content

Lambda: fix transient connection errors on first container invoke with retry logic #12522

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 23, 2025

Conversation

MEPalma
Copy link
Contributor

@MEPalma MEPalma commented Apr 14, 2025

Motivation

This change improves the robustness of the container invocation logic by addressing a timing issue observed during Step Functions integration tests #12512, where transient ConnectionErrors occurred despite the container signaling readiness via startup_future. Although the startup flag may be marked as FINISHED, this does not guarantee that the container is immediately ready to accept incoming connections. To handle this transitional state, the _perform_invoke method has been updated to automatically retry on requests.exceptions.ConnectionError up to _INVOCATION_CONNECTION_ERROR_MAX_RETRY times, using exponential backoff defined by _INVOCATION_CONNECTION_ERROR_BACKOFF_FACTOR_SECONDS.

Alternative solutions were considered:

  • awaiting the startup_future within the status_read logic was explored, but we currently lack a reliable mechanism to verify full container readiness without issuing an actual invocation
  • similarly, deferring the startup_future resolution until the first successful invoke was also evaluated, but since transient network errors may occur beyond the initial request, a retry mechanism on every invoke call was deemed both simpler and more robust.
    Although this retry logic is primarily expected to benefit the first invocation following container startup, it applies to all invokes to ensure resilience under intermittent network conditions.

Changes

  • Implemented retry logic in _perform_invoke for requests.exceptions.ConnectionError
  • Configured retries using _INVOCATION_CONNECTION_ERROR_MAX_RETRY with exponential backoff via _INVOCATION_CONNECTION_ERROR_BACKOFF_FACTOR_SECONDS
  • Reverted Step Functions' total_max_attempts back to 1, as this issue appears now resolved

@MEPalma MEPalma added the semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases label Apr 14, 2025
@MEPalma MEPalma added this to the 4.4 milestone Apr 14, 2025
@MEPalma MEPalma self-assigned this Apr 14, 2025
Copy link

github-actions bot commented Apr 14, 2025

LocalStack Community integration with Pro

    2 files  ±    0      2 suites  ±0   1h 33m 42s ⏱️ - 19m 24s
3 189 tests  - 1 163  2 934 ✅  - 1 072  255 💤  - 91  0 ❌ ±0 
3 191 runs   - 1 163  2 934 ✅  - 1 072  257 💤  - 91  0 ❌ ±0 

Results for commit a7fa7f1. ± Comparison against base commit 1724451.

This pull request removes 1183 and adds 20 tests. Note that renamed tests count towards both.
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_lambda_dynamodb
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_opensearch_crud
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_search_books
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_setup
tests.aws.scenario.kinesis_firehose.test_kinesis_firehose.TestKinesisFirehoseScenario ‑ test_kinesis_firehose_s3
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_destination_sns
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_infra
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_prefill_dynamodb_table
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input0-SUCCEEDED]
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input1-SUCCEEDED]
…
tests.aws.services.cloudformation.api.test_changesets.TestCaptureUpdateProcess ‑ test_execute_with_ref
tests.aws.services.cloudformation.v2.test_change_set_conditions.TestChangeSetConditions ‑ test_condition_add_new_negative_condition_to_existent_resource
tests.aws.services.cloudformation.v2.test_change_set_conditions.TestChangeSetConditions ‑ test_condition_add_new_positive_condition_to_existent_resource
tests.aws.services.cloudformation.v2.test_change_set_conditions.TestChangeSetConditions ‑ test_condition_update_adds_resource
tests.aws.services.cloudformation.v2.test_change_set_conditions.TestChangeSetConditions ‑ test_condition_update_removes_resource
tests.aws.services.cloudformation.v2.test_change_set_mappings.TestChangeSetMappings ‑ test_mapping_addition_with_resource
tests.aws.services.cloudformation.v2.test_change_set_mappings.TestChangeSetMappings ‑ test_mapping_deletion_with_resource_remap
tests.aws.services.cloudformation.v2.test_change_set_mappings.TestChangeSetMappings ‑ test_mapping_key_addition_with_resource
tests.aws.services.cloudformation.v2.test_change_set_mappings.TestChangeSetMappings ‑ test_mapping_key_deletion_with_resource_remap
tests.aws.services.cloudformation.v2.test_change_set_mappings.TestChangeSetMappings ‑ test_mapping_key_update
…
This pull request removes 102 skipped tests and adds 11 skipped tests. Note that renamed tests count towards both.
tests.aws.scenario.kinesis_firehose.test_kinesis_firehose.TestKinesisFirehoseScenario ‑ test_kinesis_firehose_s3
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input4-FAILED]
tests.aws.scenario.mythical_mysfits.test_mythical_misfits.TestMythicalMisfitsScenario ‑ test_deployed_infra_state
tests.aws.scenario.mythical_mysfits.test_mythical_misfits.TestMythicalMisfitsScenario ‑ test_populate_data
tests.aws.scenario.mythical_mysfits.test_mythical_misfits.TestMythicalMisfitsScenario ‑ test_user_clicks_are_stored
tests.aws.services.cloudcontrol.test_cloudcontrol_api.TestCloudControlResourceApi ‑ test_api_exceptions
tests.aws.services.cloudcontrol.test_cloudcontrol_api.TestCloudControlResourceApi ‑ test_create_exceptions
tests.aws.services.cloudcontrol.test_cloudcontrol_api.TestCloudControlResourceApi ‑ test_create_invalid_desiredstate
tests.aws.services.cloudcontrol.test_cloudcontrol_api.TestCloudControlResourceApi ‑ test_double_create_with_client_token
tests.aws.services.cloudcontrol.test_cloudcontrol_api.TestCloudControlResourceApi ‑ test_lifecycle
…
tests.aws.services.cloudformation.api.test_changesets.TestCaptureUpdateProcess ‑ test_execute_with_ref
tests.aws.services.cloudformation.v2.test_change_set_conditions.TestChangeSetConditions ‑ test_condition_add_new_negative_condition_to_existent_resource
tests.aws.services.cloudformation.v2.test_change_set_conditions.TestChangeSetConditions ‑ test_condition_add_new_positive_condition_to_existent_resource
tests.aws.services.cloudformation.v2.test_change_set_conditions.TestChangeSetConditions ‑ test_condition_update_adds_resource
tests.aws.services.cloudformation.v2.test_change_set_conditions.TestChangeSetConditions ‑ test_condition_update_removes_resource
tests.aws.services.cloudformation.v2.test_change_set_mappings.TestChangeSetMappings ‑ test_mapping_addition_with_resource
tests.aws.services.cloudformation.v2.test_change_set_mappings.TestChangeSetMappings ‑ test_mapping_deletion_with_resource_remap
tests.aws.services.cloudformation.v2.test_change_set_mappings.TestChangeSetMappings ‑ test_mapping_key_addition_with_resource
tests.aws.services.cloudformation.v2.test_change_set_mappings.TestChangeSetMappings ‑ test_mapping_key_deletion_with_resource_remap
tests.aws.services.cloudformation.v2.test_change_set_mappings.TestChangeSetMappings ‑ test_mapping_key_update
…

♻️ This comment has been updated with latest results.

for attempt_count in range(max_retry_on_connection_error + 1): # 1 initial + n retries
try:
response = requests.post(url=invocation_url, json=payload, proxies=proxies)
response.raise_for_status()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The response is checked on line 200 as well - this will prevent the other error handling, and leak request errors to the callers (instead of the expected InvokeSendError). We should not raise for the status here, I think the connection error should still be raised, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes indeed I could confirm; I removed this raise in the invoke routine

Copy link
Member

@dfangl dfangl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the change!

@MEPalma MEPalma merged commit bebff5e into master Apr 23, 2025
31 checks passed
@MEPalma MEPalma deleted the MEP-Lambda-fix_connection_error branch April 23, 2025 12:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants