Cloudwatch: add basic composite alarm support #11828

tiurin · 2024-11-11T11:39:02Z

Motivation

Add basic support for composite alarms evaluation in Cloudwatch. Currently composite alarms can be saved but there is no evaluation for them, therefore there is no way to get them triggered.

Changes

Composite alarm rules will be evaluated now whenever a metric alarm state is changed.

This PR adds limited support for alarm rules:

Alarms can be referred to by ARNs only (no alarm names support yet)
It is assumed that only OR logic is applied for all child alarms
Suppression logic for child alarms is not implemented
Composite alarm actions only support a message to an SNS topic
Only metric alarms can be child alarms, nesting of a composite ones is not supported yet

Example of supported rule: ALARM("arn:<partition>:cloudwatch:<region>:111111111111:alarm:<simple-alarm-1-name>") OR ALARM("arn:<partition>:cloudwatch:<region>:111111111111:alarm:<simple-alarm-2-name>")

Testing

Snapshot tests have been recorded for several test cases to verify that alarm rule logic applies correctly.

localstack-bot

Welcome to LocalStack! Thanks for raising your first Pull Request and landing in your contributions. Our team will reach out with any reviews or feedbacks that we have shortly. We recommend joining our Slack Community and share your PR on the #community channel to share your contributions with us. Please make sure you are following our contributing guidelines and our Code of Conduct.

localstack-bot · 2024-11-11T11:39:16Z

All contributors have signed the CLA ✍️ ✅
_{Posted by the CLA Assistant Lite bot.}

tiurin · 2024-11-11T11:41:18Z

I have read the CLA Document and I hereby sign the CLA

localstack-core/localstack/services/cloudwatch/provider_v2.py

tiurin · 2024-11-11T11:55:51Z

Haven't added PR labels as I seem to not have permissions.

localstack-core/localstack/services/cloudwatch/provider_v2.py

tiurin · 2024-11-11T13:45:47Z

localstack-core/localstack/services/cloudwatch/provider_v2.py

+            new_state_value = StateValue.OK
+            alarm_rule = composite_alarm.alarm["AlarmRule"]
+            # assuming that a rule consists only of ALARM evaluations of metric alarms, with OR logic applied
+            metric_alarm_arns = re.findall(


Alarm rule parsing and further evaluation definitely deserves more checks, like check if parsed metric alarm ARM exists in the store and actually check for OR presence. However, I wanted to first get overall feedback on PR and approach and then work on these details.

So there are some things that can break in real usage if unchecked, but I'd like to fix them after a general feedback on approach.

Added several input validations in put_composite_alarm, namely:

if each metric alarm condition starts with ALARM

if there is a metric alarm in the store that corresponds to a referenced one

Likely some edge cases are still left out. For a more robust and extensible parsing likely a different approach needs to be taken, probably with recursive descent parsing.

localstack-core/localstack/services/cloudwatch/provider_v2.py

simonrw

I have a couple of questions about the correct behaviour, but this is looking great!

simonrw · 2024-11-12T12:13:03Z

localstack-core/localstack/services/cloudwatch/models.py

-        current_time = datetime.datetime.utcnow()
+        current_time = datetime.datetime.now(timezone.utc)


Thank you for correcting this!

localstack-core/localstack/services/cloudwatch/provider_v2.py

simonrw · 2024-11-12T16:19:23Z

tests/aws/services/cloudwatch/test_cloudwatch.py

+        aws_client.cloudwatch.put_composite_alarm(
+            AlarmName=composite_alarm_name,
+            AlarmDescription=composite_alarm_description,
+            AlarmRule=composite_alarm_rule,
+            OKActions=[topic_arn_ok],
+            AlarmActions=[topic_arn_alarm],
+        )


Since you are making a change to this method, it might be nice to try snapshotting the response for extra parity coverage

Suggested change

aws_client.cloudwatch.put_composite_alarm(

AlarmName=composite_alarm_name,

AlarmDescription=composite_alarm_description,

AlarmRule=composite_alarm_rule,

OKActions=[topic_arn_ok],

AlarmActions=[topic_arn_alarm],

)

put_composite_alarm_response = aws_client.cloudwatch.put_composite_alarm(

AlarmName=composite_alarm_name,

AlarmDescription=composite_alarm_description,

AlarmRule=composite_alarm_rule,

OKActions=[topic_arn_ok],

AlarmActions=[topic_arn_alarm],

)

snapshot.match("put-composite-alarm", put_composite_alarm_response)

👍 Makes sense! Adding it in a separate commit, together with the updated snapshot file.

tests/aws/services/cloudwatch/test_cloudwatch.py

simonrw · 2024-11-12T16:26:42Z

tests/aws/services/cloudwatch/test_cloudwatch.py

+        # It is not possible to reuse the snapshot within the same test,
+        # therefore checking that alarm reason hasn't changed instead of recording new snapshot
+        state_reason_data = json.loads(
+            composite_alarm_not_changed_by_second_trigger["StateReasonData"]
+        )
+        triggering_alarms = state_reason_data["triggeringAlarms"]
+        assert len(triggering_alarms) == 1
+        assert triggering_alarms[0]["arn"] == alarm_2_arn


q: why do you think it's a problem to record new state?

I thought that because composite alarm is unaffected it should be compared against previous snapshot, but now I can't remember why I thought it is a problem to record a new one. 🙂 🤷 I think I have a better understanding of snapshot testing by now and I changed this check to a new recorded snapshot comparison.

tests/aws/services/cloudwatch/test_cloudwatch.py

localstack-core/localstack/services/cloudwatch/provider_v2.py

simonrw · 2024-11-12T16:41:55Z

localstack-core/localstack/services/cloudwatch/provider_v2.py

@@ -889,7 +970,7 @@ def create_message_response_update_state_lambda(
    return json.dumps(response, cls=JSONEncoder)


-def create_message_response_update_state_sns(alarm, old_state):
+def create_message_response_update_state_sns(alarm: LocalStackMetricAlarm, old_state):


Good stuff, typing is great!

It looks like these states are StateValue instances

Suggested change

def create_message_response_update_state_sns(alarm: LocalStackMetricAlarm, old_state):

def create_message_response_update_state_sns(alarm: LocalStackMetricAlarm, old_state: StateValue):

Good catch, will add it here as per suggestion and to a new create_message_response_update_composite_alarm_state_sns one as well. 👍

simonrw · 2024-11-12T16:45:58Z

localstack-core/localstack/services/cloudwatch/provider_v2.py

+                    raise ValidationError(
+                        f"Error in alarm rule condition '{alarm}': Only ALARM expression is supported"
+                    )


I think we should perhaps LOG.warning and ignore the rule rather than raising an error on unsupported actions. This becomes a known lack of parity rather than a user error.

... except if this is actually a parity improvement with AWS, in which case we should have a snapshot to cover it. I think this would be better as an additional test to capture any validation errors thrown by CloudWatch.

I did take inspiration from AWS response for put-composite-alarm when the alarm rule grammar has unsupported tokens.
For example, if I run the following command against AWS

aws cloudwatch put-composite-alarm --alarm-name "composite-alarm" --alarm-rule "ALARRRRM(metric-alarm)"

then the response is:

An error occurred (ValidationError) when calling the PutCompositeAlarm operation: Error in AlarmRule [Unsupported token 'ALARRRRM' at char 0]

I'm not sure if to qualify the validation I added as a parity improvement though. My motivation to add this validation was to make it very explicit to a Localstack user that as of now no expressions except ALARM one are supported. This way when they try to add a composite alarm that has expressions that are not supported yet by Localstack they will get a message and error response quite similar to what AWS provides. If there is no such validation a user could add a composite alarm with what is supported by AWS, let's say, OK(metric-alarm) in the rule. This rule would be saved and later be evaluated not as user expects, because in the current implementation every metric alarm is checked for ALARM state only.

So what I added is a similar behaviour, not full parity.

Because of that, I didn't see how I can do a shapshot, as AWS would happily accept the ALARM token.

I think when a full alarm rule grammar is implemented many things will become simpler and cleaner, validation included. I just didn't venture into implementing full grammar parser in the first iteration, which lead to having to make several workarounds.

After my explanation, do you think it is worth to keep this validation? I'm absolutely good to change to LOG.warning as you suggest if that's the go-to approach in Localstack for lack of parity.

I agree a line needs to be drawn for this first implementation. My take is: I would prefer to not fail on valid input if it is not supported by LocalStack (yet). I think it is better to not break existing infrastructure (e.g. if a user tries to deploy their app to LocalStack, we should not block them just because we don't support a feature).

The only other option we have is to raise NotImplementedError which correctly conveys the message, however we treat that exception specially to return a response similar to "The API action '...' is not supported" which is not quite what we want. The action is supported, just that this specific input is not.

Can we maybe switch to the LOG statement in this instance, and as you say: a proper rule parser will likely be a future step and we can properly validate the input at that stage.

I think it is better to not break existing infrastructure (e.g. if a user tries to deploy their app to LocalStack, we should not block them just because we don't support a feature).

Great point, now what you suggest makes a lot of sense to me! Also it made me think whether a partial rule implementation does more good than harm. 😄 I mean situations like when someone has a rule with AND in their infra and Localstack will treat it as if it was OR. Might be a rare case, yet it is probably better to not delay full alarm expression grammar support by a long time after (if) this gets merged.

Anyway, will now change raising error to logging as you suggest, thanks for explaining the details about parity concept!

pinzon

This is a really good PR and now we're closer to provide full functionality to CW logs. Please resolve the comments from @simonrw they're very insightful.

Also another test about composite alarms is failing:
FAILED tests/aws/services/cloudwatch/test_cloudwatch.py::TestCloudwatch::test_put_composite_alarm_describe_alarms - botocore.exceptions.ClientError: An error occurred (ValidationError) when calling the PutCompositeAlarm operation: Could not save the composite alarm as alarms [a-ae401a82] in the alarm rule do not exist

pinzon · 2024-11-12T16:22:06Z

tests/aws/services/cloudwatch/test_cloudwatch.py

+        topic_name_alarm = f"topic-alarm-{short_uid()}"
+        topic_name_ok = f"topic-ok-{short_uid()}"
+
+        sns_topic_alarm = sns_create_topic(Name=topic_name_alarm)
+        topic_arn_alarm = sns_topic_alarm["TopicArn"]
+        sns_topic_ok = sns_create_topic(Name=topic_name_ok)
+        topic_arn_ok = sns_topic_ok["TopicArn"]
+
+        # create queues for 'ALARM' and 'OK' of the composite alarm (will receive sns messages)
+        queue_url_alarm = sqs_create_queue(QueueName=f"AlarmQueue-{short_uid()}")
+        queue_url_ok = sqs_create_queue(QueueName=f"OKQueue-{short_uid()}")
+
+        arn_queue_alarm = aws_client.sqs.get_queue_attributes(
+            QueueUrl=queue_url_alarm, AttributeNames=["QueueArn"]
+        )["Attributes"]["QueueArn"]
+        arn_queue_ok = aws_client.sqs.get_queue_attributes(
+            QueueUrl=queue_url_ok, AttributeNames=["QueueArn"]
+        )["Attributes"]["QueueArn"]
+        aws_client.sqs.set_queue_attributes(
+            QueueUrl=queue_url_alarm,
+            Attributes={"Policy": get_sqs_policy(arn_queue_alarm, topic_arn_alarm)},
+        )
+        aws_client.sqs.set_queue_attributes(
+            QueueUrl=queue_url_ok, Attributes={"Policy": get_sqs_policy(arn_queue_ok, topic_arn_ok)}
+        )
+
+        # subscribe to SQS
+        subscription_alarm = aws_client.sns.subscribe(
+            TopicArn=topic_arn_alarm, Protocol="sqs", Endpoint=arn_queue_alarm
+        )
+        cleanups.append(
+            lambda: aws_client.sns.unsubscribe(
+                SubscriptionArn=subscription_alarm["SubscriptionArn"]
+            )
+        )
+        subscription_ok = aws_client.sns.subscribe(
+            TopicArn=topic_arn_ok, Protocol="sqs", Endpoint=arn_queue_ok
+        )
+        cleanups.append(
+            lambda: aws_client.sns.unsubscribe(SubscriptionArn=subscription_ok["SubscriptionArn"])
+        )


SNS to SQS is the default pattern we use to test alarms, maybe we could convert it into a fixture for the next iteration 👍

Great suggestion, added a TODO comment for now but definitely will address either in this or follow-up PR.

Fixed a test by changing alarm name to alarm ARN in the rule, according to scope, but probably it is better to just add support for alarm-names, especially after what @simonrw explained about parity here. Using alarm names in rules is more human-readable and is likely used a lot. Will look into adding support for names tomorrow.

Eventually reverted to the original test and moved validation from put alarm method to alarm evaluation, skipping and logging unsupported rules. This is the least intrusive approach which won't harm users who might have valid composite alarms in their infra - they should be able to deploy on Localstack regardless on whether Localstack can evaluate rules in their alarms or not.

simonrw

Thanks for keeping this PR up to date. I will ask @pinzon to do the final approval since he has a "request changes" review blocking the merge but otherwise LGTM!

simonrw · 2024-11-26T14:46:03Z

tests/aws/services/cloudwatch/test_cloudwatch.py

@@ -2713,6 +2928,34 @@ def _sqs_messages_snapshot(expected_state, sqs_client, sqs_queue, snapshot, iden
    snapshot.match(f"{identifier}-sqs-msg", found_msg)


+def check_composite_alarm_message(


I think this is a reasonable compromise, thanks!

localstack-bot · 2024-11-26T15:11:07Z

Currently, only patch changes are allowed on master. Your PR labels (aws:cloudwatch, semver: minor) indicate that it cannot be merged into the master at this time.

pinzon · 2024-11-26T15:20:43Z

Hi @tiurin. Thank you for addressing the previous comments. I'm pleased to approve your pull request ✔️.

I noticed that the PR was labeled as a PATCH, but it should be classified as a MINOR since it introduces new functionality to the CloudWatch service. Given our current code freeze, we will merge this PR next week.

tiurin · 2024-11-26T20:00:47Z

Thanks @pinzon and @simonrw, my pleasure to contribute! Yeah, good point on MINOR rather than PATCH. I will rebase my branch one more time later this week to make sure it builds well with all the latest changes to master.

WIP, didn't run yet against AWS so no snapshot yet.

Focus on making test pass in AWS mode.

Improves readability and fixes a bug with returning from all alarms evaluation if state of one of them hasn't changed - function complexity needs to be reduced as there are too many break and return statements.

Improves readability and reduces return statements from the original method.

Co-authored-by: Simon Walker <s.r.walker101@googlemail.com>

…s signature

It relied on the alarm name reference in composite alarm rule which is not supported yet. Changing it to ARN reference. Alarm names support should be added as the next step though, it is not too difficult and is very natural to use them in alarm rule description.

Co-authored-by: Simon Walker <s.r.walker101@googlemail.com>

#11828 (comment)

Avoid repetitive code.

As discussed in the (PR)[#11828 (comment)], partial behaviour should not hurt parity. In this case, because evaluation by name is not yet implemented, it is better to let user save an alarm that is not fully supported by Localstack but skip this alarm evaluation.

This reverts commit f77b418.

simonrw · 2024-12-03T14:20:17Z

FYI I re-pushed to trigger a reassessment of the PR label enforcer job

tiurin requested review from pinzon and steffyP as code owners November 11, 2024 11:39

localstack-bot reviewed Nov 11, 2024

View reviewed changes

localstack-bot added a commit that referenced this pull request Nov 11, 2024

@tiurin has signed the CLA in #11828

8e163ec

tiurin commented Nov 11, 2024

View reviewed changes

localstack-core/localstack/services/cloudwatch/provider_v2.py Show resolved Hide resolved

tiurin commented Nov 11, 2024

View reviewed changes

localstack-core/localstack/services/cloudwatch/provider_v2.py Show resolved Hide resolved

tiurin commented Nov 11, 2024

View reviewed changes

localstack-core/localstack/services/cloudwatch/provider_v2.py Show resolved Hide resolved

tiurin commented Nov 11, 2024

View reviewed changes

localstack-core/localstack/services/cloudwatch/provider_v2.py Show resolved Hide resolved

tiurin commented Nov 11, 2024

View reviewed changes

tiurin commented Nov 12, 2024

View reviewed changes

localstack-core/localstack/services/cloudwatch/provider_v2.py Show resolved Hide resolved

simonrw added aws:cloudwatch Amazon CloudWatch semver: patch Non-breaking changes which can be included in patch releases labels Nov 12, 2024

simonrw reviewed Nov 12, 2024

View reviewed changes

pinzon requested changes Nov 13, 2024

View reviewed changes

simonrw approved these changes Nov 26, 2024

View reviewed changes

pinzon approved these changes Nov 26, 2024

View reviewed changes

pinzon added semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases and removed semver: patch Non-breaking changes which can be included in patch releases labels Nov 26, 2024

tiurin added 5 commits December 3, 2024 14:19

Add draft test for composite alarm

fed139b

WIP, didn't run yet against AWS so no snapshot yet.

Comment out further test cases for now

11b6a7a

Focus on making test pass in AWS mode.

Inline metric alarm variables

5bbbd9f

Add missing comma

f9937d2

Use unique alarm names

09e83e8

tiurin and others added 27 commits December 3, 2024 14:19

Add composite alarm in ALARM because of alarm-2 test case

e1fb6a2

Add composite alarm not changed by second trigger test case

505d10d

Remove SQS message logging in test

20f97d3

Format code

e9a83cb

Remove implemented TODO

168514b

Extract method for evaluating single composite alarm

b5af73b

Improves readability and fixes a bug with returning from all alarms evaluation if state of one of them hasn't changed - function complexity needs to be reduced as there are too many break and return statements.

Extract method for for running composite alarm actions

b15caba

Improves readability and reduces return statements from the original method.

Apply format and lint

91d09da

Validate if metric alarms from alarm rule exist

c12a188

Validate if each rule operand starts with ALARM

f0afeb0

Add clarifying comment

b4a4ccf

Update limitations in provider class description

51f07f4

Add typing to create_message_response_update_state_sns signature

23b2185

Co-authored-by: Simon Walker <s.r.walker101@googlemail.com>

Add typing to create_message_response_update_composite_alarm_state_sn…

deea700

…s signature

Verify put_composite_alarm response with snapshot

f717ef1

Co-authored-by: Simon Walker <s.r.walker101@googlemail.com>

Add test fixture TODOs

4cf2275

Apply format

525fb73

Add TODO for store lock management improvements

e47fba9

Log a warning when alarm rule has unsupported expressions

6fd4f9d

#11828 (comment)

Ignore alarm rule when it has unsupported expressions

6038241

#11828 (comment)

Add helper inner methods to check alarm action result

8cc0848

Avoid repetitive code.

Add test when 2nd alarm from the rule is back to OK

346cf75

Add snapshot for both alarms from rule are triggered test case

384979c

Add TODO for describe-composite-alarm snapshot

5080f12

Revert "Fix test_put_composite_alarm_describe_alarms"

afe714a

This reverts commit f77b418.

simonrw merged commit 206be1c into localstack:master Dec 10, 2024
29 checks passed

tiurin deleted the add-basic-composite-alarm-support branch December 10, 2024 18:26

		current_time = datetime.datetime.utcnow()
		current_time = datetime.datetime.now(timezone.utc)

	def create_message_response_update_state_sns(alarm: LocalStackMetricAlarm, old_state):
	def create_message_response_update_state_sns(alarm: LocalStackMetricAlarm, old_state: StateValue):

		@@ -2713,6 +2928,34 @@ def _sqs_messages_snapshot(expected_state, sqs_client, sqs_queue, snapshot, iden
		snapshot.match(f"{identifier}-sqs-msg", found_msg)


		def check_composite_alarm_message(

Uh oh!

Cloudwatch: add basic composite alarm support #11828

Cloudwatch: add basic composite alarm support #11828

Uh oh!

Conversation

tiurin commented Nov 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Changes

Testing

Uh oh!

localstack-bot left a comment

Choose a reason for hiding this comment

Uh oh!

localstack-bot commented Nov 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tiurin commented Nov 11, 2024

Uh oh!

Uh oh!

Uh oh!

tiurin commented Nov 11, 2024

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

simonrw left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pinzon left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

simonrw left a comment

Choose a reason for hiding this comment

Uh oh!

tiurin commented Nov 11, 2024 •

edited

Loading

localstack-bot commented Nov 11, 2024 •

edited

Loading

pinzon left a comment •

edited

Loading