Conversation

gregfurman
Contributor
Motivation

We need to handle the case where a synchronous Lambda invocation runs for an extended period of time -- ensuring the TCP socket remains open and alive.

See boto/boto3#2424

Changes

  • Adds defaults for connect_timeout, read_timeout, tcp_keepalive, and retries in the client for the ESM lambda sender
  • Since a Lambda can run for 900s, we will only throw an exception once this timeout is reached.
  • In addition, we specify tcp_keepalive=True which ensures the TCP connection does not drop prior to timeout.

@gregfurman gregfurman self-assigned this Dec 3, 2024
@gregfurman gregfurman added semver: patch Non-breaking changes which can be included in patch releases aws:lambda:event-source-mapping AWS Lambda Event Source Mapping (ESM) labels Dec 3, 2024
@gregfurman gregfurman added this to the Playground milestone Dec 3, 2024

github-actions bot commented Dec 3, 2024

LocalStack Community integration with Pro

2 files (±0)   2 suites (±0)   1h 27m 53s ⏱️ (-24m 9s)
2 880 tests (-936): 2 659 ✅ (-839)   221 💤 (-97)   0 ❌ (±0)
2 882 runs  (-936): 2 659 ✅ (-839)   223 💤 (-97)   0 ❌ (±0)

Results for commit 0894a3b. ± Comparison against base commit 78fdf45.

This pull request removes 946 and adds 10 tests. Note that renamed tests count towards both.
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_lambda_dynamodb
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_opensearch_crud
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_search_books
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_setup
tests.aws.scenario.kinesis_firehose.test_kinesis_firehose.TestKinesisFirehoseScenario ‑ test_kinesis_firehose_s3
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_destination_sns
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_infra
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_prefill_dynamodb_table
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input0-SUCCEEDED]
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input1-SUCCEEDED]
…
tests.aws.services.events.test_events.TestEvents ‑ test_put_event_with_too_big_detail
tests.aws.services.events.test_events_patterns.TestEventPattern ‑ test_event_pattern[content_ip_address_bad_ip_EXC]
tests.aws.services.events.test_events_patterns.TestEventPattern ‑ test_event_pattern[content_ip_address_bad_mask_EXC]
tests.aws.services.events.test_events_patterns.TestEventPattern ‑ test_event_pattern[content_ip_address_type_EXC]
tests.aws.services.events.test_events_patterns.TestEventPattern ‑ test_event_pattern[content_ip_address_v6]
tests.aws.services.events.test_events_patterns.TestEventPattern ‑ test_event_pattern[content_ip_address_v6_NEG]
tests.aws.services.events.test_events_patterns.TestEventPattern ‑ test_event_pattern[content_ip_address_v6_bad_ip_EXC]
tests.aws.services.events.test_events_patterns.TestEventPattern ‑ test_event_pattern[exists_list_empty_NEG]
tests.aws.services.sns.test_sns_filter_policy.TestSNSFilterPolicyBody ‑ test_filter_policy_empty_array_payload
tests.aws.services.sns.test_sns_filter_policy.TestSNSFilterPolicyBody ‑ test_filter_policy_ip_address_condition


@gregfurman gregfurman marked this pull request as ready for review December 4, 2024 08:03
Comment on lines 64 to 65
read_timeout=900, # 900s is the maximum amount of time a Lambda can run for
connect_timeout=900,
Member

suggestion: I'd say we should make this configurable. While debugging for example it would be quite nice to not have it drop the connection.

WDYT @joe4dev @dfangl

Contributor Author

Not sure if we should make the actual timeouts configurable -- since we set the Lambda debug limit to 3600s via DEFAULT_LAMBDA_DEBUG_MODE_TIMEOUT_SECONDS, allowing a number higher than this seems confusing:

DEFAULT_LAMBDA_DEBUG_MODE_TIMEOUT_SECONDS: int = 3_600

We could check if debug mode is enabled and pass in that default timeout as the timeout value, similar to how we configure the boto client's config here:

if is_lambda_debug_mode():
    # The value is set to a default high value to ensure eventual termination.
    timeout_seconds = DEFAULT_LAMBDA_DEBUG_MODE_TIMEOUT_SECONDS
else:
    # Do not wait longer for an invoke than the maximum lambda timeout plus a buffer
    lambda_max_timeout_seconds = 900

Member

@joe4dev joe4dev Dec 4, 2024

@gregfurman Adopting the debug mode is a great suggestion here 👍
I suggest adding a buffer similar to the executor endpoint so that we don't terminate Lambdas running for exactly 900s. This buffer should account for potentially slow container startup and small LS processing delays.
/cc @MEPalma

Contributor Author

OK awesome. I'm not sure the buffer is necessary here though, since we set the connect_timeout to 900s -- giving our internal poller client up to 900s to establish a connection to the container (which IMO is more than enough time for container startup plus processing delays).

Assuming we manage to establish a connection within that 900s timeframe, the synchronous invoke we triggered will then have an additional 900s from the read_timeout while awaiting a response.

So, in actuality, this approach allows us a window of 1800s (or 30 minutes) from invoke to response.

Perhaps I am misunderstanding these timeouts though, so some clarification here could be useful.

From the boto docs:

connect_timeout (float or int) – The time in seconds till a timeout exception is thrown when attempting to make a connection. The default is 60 seconds.

read_timeout (float or int) – The time in seconds till a timeout exception is thrown when attempting to read from a connection. The default is 60 seconds.
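Since each timeout governs a separate phase of the request, the worst-case window discussed above is their sum; a trivial illustration using the figures from this thread:

```python
# connect_timeout bounds connection establishment; read_timeout bounds
# waiting for the response once connected. In the worst case they add up.
connect_timeout = 900  # seconds allowed to establish the connection
read_timeout = 900     # seconds allowed to wait for the response

worst_case_seconds = connect_timeout + read_timeout
print(worst_case_seconds // 60)  # -> 30 (minutes from invoke to response)
```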

Member

good point to distinguish these two:

  • connect_timeout: I think that can be low (e.g., 5s) because it only involves connecting to our gateway running on the same machine.
  • read_timeout: I think that's the relevant one, which includes LS gateway processing, container startup (up to LAMBDA_RUNTIME_ENVIRONMENT_TIMEOUT plus small Docker client overhead), and function execution time (up to 900s).

Contributor Author

@gregfurman gregfurman Dec 4, 2024

I don't think we should be retrying, since that could cause unintended side-effects where each retry would invoke the Lambda again.

Currently, the config I added, retries={"max_attempts": 0, "total_max_attempts": 1}, will disable retries on failing requests (albeit redundantly, since total_max_attempts takes precedence over max_attempts).

Contributor Author

According to the ESM docs, there are some cases where we'll want to do retries on failed invocation (with backoff).

From the docs:

When an invocation fails, Lambda attempts to retry the invocation while implementing a backoff strategy. The backoff strategy differs slightly depending on whether Lambda encountered the failure due to an error in your function code, or due to throttling.

  • If your function code caused the error, Lambda will stop processing and retrying the invocation. In the meantime, Lambda gradually backs off, reducing the amount of concurrency allocated to your Amazon SQS event source mapping. After your queue's visibility timeout runs out, the message will again reappear in the queue.

  • If the invocation fails due to throttling, Lambda gradually backs off retries by reducing the amount of concurrency allocated to your Amazon SQS event source mapping. Lambda continues to retry the message until the message's timestamp exceeds your queue's visibility timeout, at which point Lambda drops the message.

We don't currently handle throttling and AFAIK our internal SQS implementation does not either. We probably should though, since it seems we could start leveraging boto's adaptive rate-limiting functionality.

Member

yes, +1 on no retries here 👍

+1 for extra error handling -> backlog

Contributor Author

I'm adding this config to only the ESM worker when the client gets created. We can see how it performs and perhaps consider setting some of these values by default in the get_internal_client function?

Contributor Author

Also, I removed setting the connect_timeout to allow it to default to 60s 🙂

Member

@joe4dev joe4dev left a comment

Great catch and kudos for also considering the Lambda debug mode 💯

Let's get some customer feedback here and consider the lessons learned in Pipes as well.

@gregfurman gregfurman merged commit cbe2923 into master Dec 5, 2024
31 checks passed
@gregfurman gregfurman deleted the add/esm/keepalive branch December 5, 2024 11:23