[ESM] Handle Lambda TCP socket connection timeouts #11977
Conversation
LocalStack Community integration with Pro: 2 files ±0, 2 suites ±0, 1h 27m 53s ⏱️ -24m 9s. Results for commit 0894a3b. ± Comparison against base commit 78fdf45. This pull request removes 946 tests and adds 10 tests. Note that renamed tests count towards both.
read_timeout=900,  # 900s is the maximum amount of time a Lambda can run for
connect_timeout=900,
Not sure if we should make the actual timeouts configurable -- since we set the Lambda debug limit to 3600 at `DEFAULT_LAMBDA_DEBUG_MODE_TIMEOUT_SECONDS`, allowing a number higher than this seems confusing:
localstack/localstack-core/localstack/utils/lambda_debug_mode/lambda_debug_mode.py (line 10 in 414a968):
DEFAULT_LAMBDA_DEBUG_MODE_TIMEOUT_SECONDS: int = 3_600
We could check if debug mode is enabled and pass in that default timeout as the timeout value, similar to how we configure the boto client's config here:
localstack/localstack-core/localstack/services/lambda_/invocation/executor_endpoint.py (lines 208 to 213 in 414a968):

if is_lambda_debug_mode():
    # The value is set to a default high value to ensure eventual termination.
    timeout_seconds = DEFAULT_LAMBDA_DEBUG_MODE_TIMEOUT_SECONDS
else:
    # Do not wait longer for an invoke than the maximum lambda timeout plus a buffer
    lambda_max_timeout_seconds = 900
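A minimal sketch of how the ESM sender's client config could adopt the same pattern -- the `build_sender_config` helper and the constant's import path are assumptions based on the files referenced above, not code from this PR:

```python
from botocore.config import Config

# Assumed import location, per the module referenced earlier in this thread.
from localstack.utils.lambda_debug_mode.lambda_debug_mode import (
    DEFAULT_LAMBDA_DEBUG_MODE_TIMEOUT_SECONDS,
)

LAMBDA_MAX_TIMEOUT_SECONDS = 900  # hard AWS limit for a single Lambda invocation


def build_sender_config(debug_mode_enabled: bool) -> Config:
    # Mirror the executor endpoint: use the large debug-mode timeout when
    # Lambda debug mode is enabled, otherwise cap at the Lambda maximum.
    if debug_mode_enabled:
        read_timeout = DEFAULT_LAMBDA_DEBUG_MODE_TIMEOUT_SECONDS
    else:
        read_timeout = LAMBDA_MAX_TIMEOUT_SECONDS
    return Config(
        read_timeout=read_timeout,
        tcp_keepalive=True,
        retries={"max_attempts": 0, "total_max_attempts": 1},
    )
```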
@gregfurman Adopting the debug mode is a great suggestion here 👍
I suggest adding a buffer similar to the executor endpoint so that we don't terminate Lambdas running for exactly 900s. This buffer should account for potentially slow container startup and small LS processing delays.
/cc @MEPalma
OK awesome. Not sure the buffer is necessary here though, since we set the `connect_timeout` to 900s -- giving our internal poller client up to 900s to establish a connection to the container (which IMO is more than enough time for container startup + processing delays).
Assuming we manage to establish a connection within that 900s timeframe, the synchronous invoke we triggered will then have an additional 900s from the `read_timeout` while awaiting a response.
So, in actuality, this approach allows us a window of 1800s (or 30 minutes) from invoke -> response.
Perhaps I am misunderstanding these timeouts though, so some clarification here could be useful.
From the boto docs:
From the boto docs:
connect_timeout (float or int) – The time in seconds till a timeout exception is thrown when attempting to make a connection. The default is 60 seconds.
read_timeout (float or int) – The time in seconds till a timeout exception is thrown when attempting to read from a connection. The default is 60 seconds.
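As a rough illustration of how the two timeouts from the current diff stack up (a standalone botocore sketch, not the actual ESM sender code):

```python
from botocore.config import Config

# With both values at 900s, a single synchronous invoke gets up to 900s to
# establish the TCP connection (connect_timeout) and then up to a further
# 900s to receive the response (read_timeout) -- roughly an 1800s worst-case
# window before the client raises a timeout.
config = Config(
    connect_timeout=900,
    read_timeout=900,
)
```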
good point to distinguish these two:
- `connect_timeout`: I think that can be low (e.g., 5s) because it only involves connecting to our gateway running on the same machine.
- `read_timeout`: I think that's the relevant one, which includes LS gateway processing, container startup (up to `LAMBDA_RUNTIME_ENVIRONMENT_TIMEOUT` plus small Docker client overhead), and function execution time (up to 900s).
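A sketch of that suggested split; the buffer constant below is illustrative and not an actual LocalStack setting:

```python
from botocore.config import Config

LAMBDA_MAX_TIMEOUT_SECONDS = 900  # maximum function execution time
# Assumed allowance for gateway processing, container startup, and Docker
# client overhead; not a real LocalStack constant.
STARTUP_AND_PROCESSING_BUFFER_SECONDS = 60

config = Config(
    # Connecting to the local gateway on the same machine should be quick.
    connect_timeout=5,
    # Must cover container startup plus the full function execution time.
    read_timeout=LAMBDA_MAX_TIMEOUT_SECONDS + STARTUP_AND_PROCESSING_BUFFER_SECONDS,
)
```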
I don't think we should be retrying, since that could cause unintended side-effects, where each retry will invoke the Lambda.
Currently, the config I added, `retries={"max_attempts": 0, "total_max_attempts": 1}`, will disable retries on failing requests (albeit redundantly, since `total_max_attempts` takes precedence over `max_attempts`).
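For reference, a minimal standalone sketch of that retry setting and why it disables retries:

```python
from botocore.config import Config

# total_max_attempts counts the initial request itself, so 1 means "no
# retries"; when both keys are set, total_max_attempts takes precedence
# over max_attempts.
no_retry_config = Config(
    retries={"max_attempts": 0, "total_max_attempts": 1},
)
```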
According to the ESM docs, there are some cases where we'll want to retry failed invocations (with backoff).
From the docs:
When an invocation fails, Lambda attempts to retry the invocation while implementing a backoff strategy. The backoff strategy differs slightly depending on whether Lambda encountered the failure due to an error in your function code, or due to throttling.
If your function code caused the error, Lambda will stop processing and retrying the invocation. In the meantime, Lambda gradually backs off, reducing the amount of concurrency allocated to your Amazon SQS event source mapping. After your queue's visibility timeout runs out, the message will again reappear in the queue.
If the invocation fails due to throttling, Lambda gradually backs off retries by reducing the amount of concurrency allocated to your Amazon SQS event source mapping. Lambda continues to retry the message until the message's timestamp exceeds your queue's visibility timeout, at which point Lambda drops the message.
We don't currently handle throttling, and AFAIK our internal SQS implementation does not either. We probably should though, since it seems we could start leveraging boto's `adaptive` rate-limiting functionality.
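If that gets picked up, the boto knob in question is the `adaptive` retry mode -- a hedged sketch, since this PR does not enable it:

```python
from botocore.config import Config

# Adaptive mode layers client-side rate limiting on top of the standard retry
# behaviour, backing off when the service signals throttling.
adaptive_config = Config(
    retries={"mode": "adaptive", "max_attempts": 3},
)
```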
yes, +1 on no retries here 👍
+1 for extra error handling -> backlog
I'm adding this config only to the ESM worker when the client gets created. We can see how it performs and perhaps consider setting some of these values by default in the `get_internal_client` function?
Also, I removed setting the `connect_timeout` to allow it to default to 60s 🙂
Great catch and kudos for also considering the Lambda debug mode 💯
Let's get some customer feedback here and consider the lessons learned in Pipes as well.
Motivation
We need to handle the case where a synchronous Lambda invocation runs for an extended period of time -- ensuring the TCP socket remains open and alive.
See boto/boto3#2424
Changes
- Set `connect_timeout`, `read_timeout`, `tcp_keepalive`, and `retries` in the client for the ESM Lambda sender.
- `tcp_keepalive=True` ensures the TCP connection does not drop prior to timeout.
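Taken together, the resulting client configuration looks roughly like the following. This is a hedged sketch using a plain boto3 client; the actual change wires the config into LocalStack's internal client factory for the ESM worker, and the endpoint URL is only a placeholder:

```python
import boto3
from botocore.config import Config

# Illustrative stand-in for the ESM worker's Lambda sender client.
lambda_client = boto3.client(
    "lambda",
    endpoint_url="http://localhost:4566",  # placeholder LocalStack endpoint
    config=Config(
        read_timeout=900,  # cover the maximum Lambda execution time
        tcp_keepalive=True,  # keep the socket alive during long-running invokes
        retries={"max_attempts": 0, "total_max_attempts": 1},  # no client-side retries
        # connect_timeout is left at its 60s default, per the discussion above.
    ),
)
```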