Skip to content

[ESM] Add backoff between poll events calls #12304

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Feb 26, 2025
Merged

Conversation

gregfurman
Copy link
Contributor

@gregfurman gregfurman commented Feb 25, 2025

Motivation

As a part of performance improvements to ESM, this PR introduces backoff in poll-frequency when an event source either returns no records or raises an exception on data retrieval.

By delaying polling on empty data sources and errors, we expect to reduce load on the LocalStack gateway and make more resources available for successful processing events.

Changes

  • Small refactor of poll_events to better facilitate changing the poll-frequency duration to include backoff.
  • New EmptyPollResultsException that is raised when a poll request to an event source returns no data. This is used to signal a poll-miss scenario and allows us to decrease poller frequency for empty resources.
  • Add an empty_boff that returns backoff (up to 10s) when an event source does not return any data (i.e a EmptyPollResultsException is raised).
  • Add an error_boff that returns backoff (up to 60s) when a poll request to an event source raises an exception (other than EmptyPollResultsException).
  • Note: All backoffs are reset upon a successful poll.

Benchmarking

Some benchmarks tests were run to assess the extent to which including backoff would improve gateway processing latency of LocalStack. Note, that each experiment was run twice and aggregated appropriately.

We tested exponential backoff with a multiplication factor x1.5 (default) per iteration.

The below table shows request count, throughput, and latency information about each experiment.

Experiment # Requests Average (ms) Min (ms) Max (ms) RPS
Baseline 48096 514.275 2 2247 172.56
With Backoff (x1.5) 111858 237.655 1 1553 373.045

The below table shows the percentiles of request-response latency for each experiment.

Experiment P(50) (ms) P(90) (ms) P(95) (ms) P(99) (ms) P(100) (ms)
Baseline 480 830 935 1200 1800
With Backoff (x1.5) 225 415 485 665 1400

The below table shows factor-level changes comparing backoff experiments to the baseline.

  P(50) (ms) RPS
With Backoff (x1.5) -0.531 +1.162

Expand for Baseline Performance Chart(s)

Baseline Performance Chart 1/2

Performance measurements over 5m run for baseline:
image

Baseline Performance Chart 2/2

Performance measurements over 5m run for baseline:
image

Expand for Performance With Backoff Chart(s)

With Backoff (x1.5) Performance Chart 1/2

Performance measurements over 5m baseline for run with backoff:
total_requests_per_second_1740565012 795

With Backoff (x1.5) Performance Chart 2/2

Performance measurements over 5m baseline for run with backoff:
total_requests_per_second_1740565698 292

Results & Conclusions:

No. of Requests

Adding backoff on poll-misses saw significant improvements in latency and throughput in our proxy measurement:

  • x1.5 backoff: ~2.2x improvement compared to baseline

RPS

  • Baseline Avg: 172.56 RPS
  • With Backoff (x1.5) Avg: 373.05 RPS

P(50)

  • Baseline: 480 ms
  • With Backoff (x1.5): 225 ms

P(95)

  • Baseline: 935 ms
  • With Backoff (x1.5): 485 ms

Main Findings & Conclusions:

  • RPS and P(50) is less variable/noisy over time when backoff is added.
  • P(50) + P(95) graphs: The initial performance hit from overlapping requests lessened slightly when compared to baseline -- likely due to backoff on poll-misses reducing the likelihood of request-clustering.
  • While adding backoff on poll-misses showed performance gains, without tracking e2e throughput it is unclear how more aggressive backoff will affect the user experience

@gregfurman gregfurman added area: performance Make LocalStack go rocket-fast semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases aws:pipes Amazon EventBridge Pipes aws:lambda:event-source-mapping AWS Lambda Event Source Mapping (ESM) labels Feb 25, 2025
@gregfurman gregfurman added this to the 4.2 milestone Feb 25, 2025
@gregfurman gregfurman self-assigned this Feb 25, 2025
@gregfurman gregfurman force-pushed the add/esm/backoff-on-error branch from c1c3cd9 to 451ce15 Compare February 25, 2025 15:16
Copy link

LocalStack Community integration with Pro

    2 files  ±  0      2 suites  ±0   1h 29m 59s ⏱️ - 24m 3s
3 109 tests  - 994  2 888 ✅  - 882  221 💤  - 112  0 ❌ ±0 
3 111 runs   - 994  2 888 ✅  - 882  223 💤  - 112  0 ❌ ±0 

Results for commit fc60200. ± Comparison against base commit 6f32581.

This pull request removes 994 tests.
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_lambda_dynamodb
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_opensearch_crud
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_search_books
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_setup
tests.aws.scenario.kinesis_firehose.test_kinesis_firehose.TestKinesisFirehoseScenario ‑ test_kinesis_firehose_s3
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_destination_sns
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_infra
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_prefill_dynamodb_table
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input0-SUCCEEDED]
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input1-SUCCEEDED]
…

@gregfurman gregfurman requested a review from tiurin February 26, 2025 11:44
@gregfurman gregfurman marked this pull request as ready for review February 26, 2025 11:45
Copy link
Member

@joe4dev joe4dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, these results look very promising 🤩 Thorough PR documentation 👏

Great and clean changes 💯

Generally, we should remember to trigger the ext Pipes tests upon ESM poller changes (to be on the safe side shortly before release). These changes should be safe for Pipes because the EmptyPollResultsException inherits from Exception.

@@ -148,9 +172,10 @@ def poller_loop(self, *args, **kwargs):
e,
exc_info=LOG.isEnabledFor(logging.DEBUG),
)
# TODO: implement some backoff here and stop poller upon continuous errors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strictly speaking, we could parity test when pollers are shut down at AWS after continous errors. However, I think the current solution with the up to 60s max error backoff is good enough (or maybe even preferrable in a dev environment to make the error visible).

@@ -133,13 +140,30 @@ def poller_loop(self, *args, **kwargs):
self.update_esm_state_in_store(EsmState.ENABLED)
self.state_transition_reason = self.user_state_reason

error_boff = ExponentialBackoff(initial_interval=2, max_interval=MAX_BACKOFF_POLL_ERROR_SEC)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: any particular reason why one of the two parameters is a constant and the other not :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually wanted to make these values configurable via environment variables (see this commit 451ce15)

Didn't think to make the initial interval configurable though 🫠

# Wait some time between retries to avoid running into the problem right again
self._shutdown_event.wait(2)
poll_interval_duration = error_boff.next_backoff()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much nicer than static waits 👏👏👏

Copy link
Contributor

@tiurin tiurin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic improvement! And thanks for exhaustive PR description, made it very easy to review! 👍

Beyond this PR, I'd like to discuss in the future whether such polling results are exception or polling methods deserve a return value, let's discuss outside PR!

@joe4dev
Copy link
Member

joe4dev commented Feb 26, 2025

Beyond this PR, I'd like to discuss in the future whether such polling results are exception or polling methods deserve a return value, let's discuss outside PR!

That's a very good point @tiurin to keep in mind. @gregfurman brought up this idea as well a while ago proposing a typed return structure, which could enable passing metadata such as errors or trace context. I think it's a good idea and would most likely be needed for efficient tracing.

@tiurin tiurin merged commit ceb6140 into master Feb 26, 2025
41 checks passed
@tiurin tiurin deleted the add/esm/backoff-on-error branch February 26, 2025 15:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: performance Make LocalStack go rocket-fast aws:lambda:event-source-mapping AWS Lambda Event Source Mapping (ESM) aws:pipes Amazon EventBridge Pipes semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants