-
-
Notifications
You must be signed in to change notification settings - Fork 4.2k
[ESM] Add backoff between poll events calls #12304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c1c3cd9
to
451ce15
Compare
LocalStack Community integration with Pro 2 files ± 0 2 suites ±0 1h 29m 59s ⏱️ - 24m 3s Results for commit fc60200. ± Comparison against base commit 6f32581. This pull request removes 994 tests.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, these results look very promising 🤩 Thorough PR documentation 👏
Great and clean changes 💯
Generally, we should remember to trigger the ext Pipes tests upon ESM poller changes (to be on the safe side shortly before release). These changes should be safe for Pipes because the EmptyPollResultsException
inherits from Exception
.
@@ -148,9 +172,10 @@ def poller_loop(self, *args, **kwargs): | |||
e, | |||
exc_info=LOG.isEnabledFor(logging.DEBUG), | |||
) | |||
# TODO: implement some backoff here and stop poller upon continuous errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strictly speaking, we could parity test when pollers are shut down at AWS after continous errors. However, I think the current solution with the up to 60s max error backoff is good enough (or maybe even preferrable in a dev environment to make the error visible).
@@ -133,13 +140,30 @@ def poller_loop(self, *args, **kwargs): | |||
self.update_esm_state_in_store(EsmState.ENABLED) | |||
self.state_transition_reason = self.user_state_reason | |||
|
|||
error_boff = ExponentialBackoff(initial_interval=2, max_interval=MAX_BACKOFF_POLL_ERROR_SEC) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super nit: any particular reason why one of the two parameters is a constant and the other not :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually wanted to make these values configurable via environment variables (see this commit 451ce15)
Didn't think to make the initial interval configurable though 🫠
# Wait some time between retries to avoid running into the problem right again | ||
self._shutdown_event.wait(2) | ||
poll_interval_duration = error_boff.next_backoff() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much nicer than static waits 👏👏👏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fantastic improvement! And thanks for exhaustive PR description, made it very easy to review! 👍
Beyond this PR, I'd like to discuss in the future whether such polling results are exception or polling methods deserve a return value, let's discuss outside PR!
localstack-core/localstack/services/lambda_/event_source_mapping/pollers/sqs_poller.py
Show resolved
Hide resolved
That's a very good point @tiurin to keep in mind. @gregfurman brought up this idea as well a while ago proposing a typed return structure, which could enable passing metadata such as errors or trace context. I think it's a good idea and would most likely be needed for efficient tracing. |
Motivation
As a part of performance improvements to ESM, this PR introduces backoff in poll-frequency when an event source either returns no records or raises an exception on data retrieval.
By delaying polling on empty data sources and errors, we expect to reduce load on the LocalStack gateway and make more resources available for successful processing events.
Changes
poll_events
to better facilitate changing the poll-frequency duration to include backoff.EmptyPollResultsException
that is raised when a poll request to an event source returns no data. This is used to signal a poll-miss scenario and allows us to decrease poller frequency for empty resources.empty_boff
that returns backoff (up to 10s) when an event source does not return any data (i.e aEmptyPollResultsException
is raised).error_boff
that returns backoff (up to 60s) when a poll request to an event source raises an exception (other thanEmptyPollResultsException
).Benchmarking
Some benchmarks tests were run to assess the extent to which including backoff would improve gateway processing latency of LocalStack. Note, that each experiment was run twice and aggregated appropriately.
We tested exponential backoff with a multiplication factor
x1.5
(default) per iteration.The below table shows request count, throughput, and latency information about each experiment.
The below table shows the percentiles of request-response latency for each experiment.
The below table shows factor-level changes comparing backoff experiments to the baseline.
Expand for Baseline Performance Chart(s)
Baseline Performance Chart 1/2
Performance measurements over 5m run for baseline:

Baseline Performance Chart 2/2
Performance measurements over 5m run for baseline:

Expand for Performance With Backoff Chart(s)
With Backoff (x1.5) Performance Chart 1/2
Performance measurements over 5m baseline for run with backoff:

With Backoff (x1.5) Performance Chart 2/2
Performance measurements over 5m baseline for run with backoff:

Results & Conclusions:
No. of Requests
Adding backoff on poll-misses saw significant improvements in latency and throughput in our proxy measurement:
RPS
P(50)
P(95)
Main Findings & Conclusions: