Step Functions: Increase Retry Attempts on Service Integrations for Resilience Against Transient Network Errors #12512
Motivation
To temporarily reduce the impact of transient network instability in concurrent Lambda executions, this PR increases the total number of retry attempts for the boto client from 1 to 5. Previously, an occasional “connection refused” error would immediately trigger the retry workflows in the state machine, which could significantly increase the program's runtime when large backoff rates or wait settings were configured for failures (#12399). This change does not affect the semantics of evaluation: Catch and Retry blocks still function as intended. It only makes the state machine slightly more resilient against non-service errors such as transient connection failures.
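For illustration, a minimal sketch of how a client-level retry budget like the one described above can be expressed with botocore's `Config`. The names `RETRY_SETTINGS`, `build_client_config`, and `create_lambda_client` are hypothetical and not taken from this PR, and the choice of `total_max_attempts` is an assumption about how the "1 to 5" setting maps onto botocore; the actual diff may differ.

```python
# Hypothetical sketch, not the code from this PR.
# Assumption: the "total number of retry attempts" maps to botocore's
# `total_max_attempts`, which counts the initial request plus retries.
RETRY_SETTINGS = {"total_max_attempts": 5}


def build_client_config():
    """Build a botocore Config carrying the client-level retry budget."""
    # Imported lazily so this module can be loaded without botocore installed.
    from botocore.config import Config

    return Config(retries=RETRY_SETTINGS)


def create_lambda_client(region_name="us-east-1"):
    """Create a Lambda client that retries transient connection failures."""
    import boto3

    return boto3.client(
        "lambda",
        region_name=region_name,
        config=build_client_config(),
    )
```

With a configuration along these lines, a transient connection error is retried inside the client first. Only if the client exhausts its attempts does the error surface to the state machine, where the state's own Retry and Catch definitions apply as before.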
Changes