[ESM] Handle Lambda TCP socket connection timeouts #11977
Conversation
LocalStack Community integration with Pro: 2 files ±0, 2 suites ±0, 1h 27m 53s ⏱️ -24m 9s. Results for commit 0894a3b. ± Comparison against base commit 78fdf45. This pull request removes 946 tests and adds 10 tests. Note that renamed tests count towards both.
read_timeout=900,  # 900s is the maximum amount of time a Lambda can run for
connect_timeout=900,
Not sure if we should make the actual timeouts configurable -- since we set the Lambda debug limit to 3600 at `DEFAULT_LAMBDA_DEBUG_MODE_TIMEOUT_SECONDS`, allowing a number higher than this seems confusing:
localstack/localstack-core/localstack/utils/lambda_debug_mode/lambda_debug_mode.py (line 10 in 414a968):
DEFAULT_LAMBDA_DEBUG_MODE_TIMEOUT_SECONDS: int = 3_600
We could check if debug mode is enabled and pass in that default timeout as the timeout value, similar to how we configure the boto client's config here:
localstack/localstack-core/localstack/services/lambda_/invocation/executor_endpoint.py (lines 208 to 213 in 414a968):

if is_lambda_debug_mode():
    # The value is set to a default high value to ensure eventual termination.
    timeout_seconds = DEFAULT_LAMBDA_DEBUG_MODE_TIMEOUT_SECONDS
else:
    # Do not wait longer for an invoke than the maximum lambda timeout plus a buffer
    lambda_max_timeout_seconds = 900
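A minimal sketch of how the ESM sender's client config could adopt the same pattern -- the `build_sender_config` helper and the constant's import path are assumptions based on the files referenced above, not code from this PR:

```python
from botocore.config import Config

# Assumed import location, per the module referenced earlier in this thread.
from localstack.utils.lambda_debug_mode.lambda_debug_mode import (
    DEFAULT_LAMBDA_DEBUG_MODE_TIMEOUT_SECONDS,
)

LAMBDA_MAX_TIMEOUT_SECONDS = 900  # hard AWS limit for a single Lambda invocation


def build_sender_config(debug_mode_enabled: bool) -> Config:
    # Mirror the executor endpoint: use the large debug-mode timeout when
    # Lambda debug mode is enabled, otherwise cap at the Lambda maximum.
    if debug_mode_enabled:
        read_timeout = DEFAULT_LAMBDA_DEBUG_MODE_TIMEOUT_SECONDS
    else:
        read_timeout = LAMBDA_MAX_TIMEOUT_SECONDS
    return Config(
        read_timeout=read_timeout,
        tcp_keepalive=True,
        retries={"max_attempts": 0, "total_max_attempts": 1},
    )
```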
@gregfurman Adopting the debug mode is a great suggestion here 👍
I suggest adding a buffer similar to the executor endpoint so that we don't terminate Lambdas running for exactly 900s. This buffer should account for potentially slow container startup and small LS processing delays.
/cc @MEPalma
OK awesome. Not sure the buffer is necessary here though, since we set the `connect_timeout` to 900s -- giving our internal poller client up to 900s to establish a connection to the container (which IMO is more than enough time for container startup + processing delays).
Assuming we manage to establish a connection within that 900s timeframe, the synchronous invoke we triggered will then have an additional 900s from the `read_timeout` while awaiting a response.
So, in actuality, this approach allows us a window of 1800s (or 30 minutes) from invoke -> response.
Perhaps I am misunderstanding these timeouts though, so some clarification here could be useful.
From the boto docs:
From the boto docs:
connect_timeout (float or int) – The time in seconds till a timeout exception is thrown when attempting to make a connection. The default is 60 seconds.
read_timeout (float or int) – The time in seconds till a timeout exception is thrown when attempting to read from a connection. The default is 60 seconds.
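As a rough illustration of how the two timeouts from the current diff stack up (a standalone botocore sketch, not the actual ESM sender code):

```python
from botocore.config import Config

# With both values at 900s, a single synchronous invoke gets up to 900s to
# establish the TCP connection (connect_timeout) and then up to a further
# 900s to receive the response (read_timeout) -- roughly an 1800s worst-case
# window before the client raises a timeout.
config = Config(
    connect_timeout=900,
    read_timeout=900,
)
```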
good point to distinguish these two:
- `connect_timeout`: I think that can be low (e.g., 5s) because it only involves connecting to our gateway running on the same machine.
- `read_timeout`: I think that's the relevant one, which includes LS gateway processing, container startup (up to `LAMBDA_RUNTIME_ENVIRONMENT_TIMEOUT` plus small Docker client overhead), and function execution time (up to 900s).
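A sketch of that suggested split; the buffer constant below is illustrative and not an actual LocalStack setting:

```python
from botocore.config import Config

LAMBDA_MAX_TIMEOUT_SECONDS = 900  # maximum function execution time
# Assumed allowance for gateway processing, container startup, and Docker
# client overhead; not a real LocalStack constant.
STARTUP_AND_PROCESSING_BUFFER_SECONDS = 60

config = Config(
    # Connecting to the local gateway on the same machine should be quick.
    connect_timeout=5,
    # Must cover container startup plus the full function execution time.
    read_timeout=LAMBDA_MAX_TIMEOUT_SECONDS + STARTUP_AND_PROCESSING_BUFFER_SECONDS,
)
```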
I don't think we should be retrying, since that could cause unintended side-effects, where each retry will invoke the Lambda.
Currently, the config I added, `retries={"max_attempts": 0, "total_max_attempts": 1}`, will disable retries on failing requests (albeit redundantly, since `total_max_attempts` takes precedence over `max_attempts`).
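For reference, a minimal standalone sketch of that retry setting and why it disables retries:

```python
from botocore.config import Config

# total_max_attempts counts the initial request itself, so 1 means "no
# retries"; when both keys are set, total_max_attempts takes precedence
# over max_attempts.
no_retry_config = Config(
    retries={"max_attempts": 0, "total_max_attempts": 1},
)
```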
According to the ESM docs, there are some cases where we'll want to retry failed invocations (with backoff).
From the docs:
When an invocation fails, Lambda attempts to retry the invocation while implementing a backoff strategy. The backoff strategy differs slightly depending on whether Lambda encountered the failure due to an error in your function code, or due to throttling.
If your function code caused the error, Lambda will stop processing and retrying the invocation. In the meantime, Lambda gradually backs off, reducing the amount of concurrency allocated to your Amazon SQS event source mapping. After your queue's visibility timeout runs out, the message will again reappear in the queue.
If the invocation fails due to throttling, Lambda gradually backs off retries by reducing the amount of concurrency allocated to your Amazon SQS event source mapping. Lambda continues to retry the message until the message's timestamp exceeds your queue's visibility timeout, at which point Lambda drops the message.
We don't currently handle throttling, and AFAIK our internal SQS implementation does not either. We probably should though, since it seems we could start leveraging boto's `adaptive` rate-limiting functionality.
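If that gets picked up, the boto knob in question is the `adaptive` retry mode -- a hedged sketch, since this PR does not enable it:

```python
from botocore.config import Config

# Adaptive mode layers client-side rate limiting on top of the standard retry
# behaviour, backing off when the service signals throttling.
adaptive_config = Config(
    retries={"mode": "adaptive", "max_attempts": 3},
)
```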
yes, +1 on no retries here 👍
+1 for extra error handling -> backlog
I'm adding this config only to the ESM worker when the client gets created. We can see how it performs and perhaps consider setting some of these values by default in the `get_internal_client` function?
Also, I removed setting the `connect_timeout` to allow it to default to 60s 🙂
Great catch and kudos for also considering the Lambda debug mode 💯
Let's get some customer feedback here and consider the lessons learned in Pipes as well.
Motivation
We need to handle the case where a synchronous Lambda invocation runs for an extended period of time -- ensuring the TCP socket remains open and alive.
See boto/boto3#2424
Changes
- Set `connect_timeout`, `read_timeout`, `tcp_keepalive`, and `retries` in the client for the ESM Lambda sender.
- `tcp_keepalive=True` ensures the TCP connection does not drop prior to timeout.
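Taken together, the resulting client configuration looks roughly like the following. This is a hedged sketch using a plain boto3 client; the actual change wires the config into LocalStack's internal client factory for the ESM worker, and the endpoint URL is only a placeholder:

```python
import boto3
from botocore.config import Config

# Illustrative stand-in for the ESM worker's Lambda sender client.
lambda_client = boto3.client(
    "lambda",
    endpoint_url="http://localhost:4566",  # placeholder LocalStack endpoint
    config=Config(
        read_timeout=900,  # cover the maximum Lambda execution time
        tcp_keepalive=True,  # keep the socket alive during long-running invokes
        retries={"max_attempts": 0, "total_max_attempts": 1},  # no client-side retries
        # connect_timeout is left at its 60s default, per the discussion above.
    ),
)
```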