Skip to content

fix put-metric-alarm test failure rate #12598

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
May 13, 2025
Merged

Conversation

pinzon
Copy link
Member

@pinzon pinzon commented May 8, 2025

Motivation

As discussed internally, the test_put_metric_alarm test currently has a failure rate of around 14%, which makes it frustrating during full test suite runs. The root cause appears to be that the alarm-triggering value can vary in LocalStack—sometimes evaluating to 21 instead of the expected average 21.5 (which is consistently correct on AWS). This is likely due to timing differences or slight inconsistencies in how LocalStack handles metric ingestion and evaluation windows.

Changes

  • Increase the threshold of the CW Alarm to make the Alarm triggering value consistent between LS and AWS.

Testing

  • updated snapshots

@pinzon pinzon added the semver: patch Non-breaking changes which can be included in patch releases label May 8, 2025
@pinzon pinzon added this to the 4.5 milestone May 8, 2025
Copy link

github-actions bot commented May 8, 2025

LocalStack Community integration with Pro

    2 files  ±    0    2 suites  ±0   43m 16s ⏱️ - 1h 0m 22s
1 039 tests  - 3 386  980 ✅  - 3 062  59 💤  - 324  0 ❌ ±0 
1 041 runs   - 3 386  980 ✅  - 3 062  61 💤  - 324  0 ❌ ±0 

Results for commit 9c95b14. ± Comparison against base commit a7b4250.

This pull request removes 3386 tests.
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_lambda_dynamodb
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_opensearch_crud
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_search_books
tests.aws.scenario.bookstore.test_bookstore.TestBookstoreApplication ‑ test_setup
tests.aws.scenario.kinesis_firehose.test_kinesis_firehose.TestKinesisFirehoseScenario ‑ test_kinesis_firehose_s3
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_destination_sns
tests.aws.scenario.lambda_destination.test_lambda_destination_scenario.TestLambdaDestinationScenario ‑ test_infra
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_prefill_dynamodb_table
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input0-SUCCEEDED]
tests.aws.scenario.loan_broker.test_loan_broker.TestLoanBrokerScenario ‑ test_stepfunctions_input_recipient_list[step_function_input1-SUCCEEDED]
…

♻️ This comment has been updated with latest results.

@pinzon pinzon marked this pull request as ready for review May 8, 2025 18:53
@pinzon pinzon requested a review from steffyP as a code owner May 8, 2025 18:53
@pinzon pinzon marked this pull request as draft May 8, 2025 18:53
@pinzon pinzon marked this pull request as ready for review May 9, 2025 15:37
@@ -403,7 +403,7 @@
"StateUpdatedTimestamp": "timestamp",
"StateValue": "INSUFFICIENT_DATA",
"Statistic": "Average",
"Threshold": 2.0,
"Threshold": 21.0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good adjustment of threshold to be equal to one of the data points! 👍 Indeed since metric data is added sequentially in a loop and alarm evaluation happens in a separate thread it is likely that in those 14% of cases evaluation happened after 21.0 was added but before 22.0. Since 21 was already bigger than 2 alarm was triggered. Now it shouldn't be the case - alarm needs both data points to be triggered.

@@ -1338,8 +1338,8 @@ def test_put_metric_alarm(
retry(
_sqs_messages_snapshot,
retries=60,
sleep=3 if is_aws_cloud() else 1,
sleep_before=5 if is_aws_cloud() else 0,
sleep=3,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here retries are applied to receiving messages from SQS. Increasing sleep won't change the resulting alarm. Currently alarm has period of 30 seconds, 60 retries wit sleep 1 covers it more than enough. Locally test succeeds exactly after 30 seconds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, this made me think that we can reduce period to a minimum value of 10 and therefore shave 20 seconds off test execution!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed changes to the branch - reverted sleeps and reduced period to 10. Test now consistently executes successfully locally in just over 10 seconds. If you're happy with those changes please merge the PR! Otherwise happy to revert if there is any consideration.

@pinzon
Copy link
Member Author

pinzon commented May 13, 2025

The changing of the period is a better change than the edition of retry. Thanks @tiurin. 👍

@pinzon pinzon merged commit 05f6746 into master May 13, 2025
32 checks passed
@pinzon pinzon deleted the cw/fix/test-metric-alarm branch May 13, 2025 14:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver: patch Non-breaking changes which can be included in patch releases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants