Skip to content

Conversation

tiurin
Copy link
Contributor

@tiurin tiurin commented Nov 11, 2024

Motivation

Add basic support for composite alarms evaluation in Cloudwatch. Currently composite alarms can be saved but there is no evaluation for them, therefore there is no way to get them triggered.

Changes

Composite alarm rules will be evaluated now whenever a metric alarm state is changed.

This PR adds limited support for alarm rules:

  • Alarms can be referred to by ARNs only (no alarm names support yet)
  • It is assumed that only OR logic is applied for all child alarms
  • Suppression logic for child alarms is not implemented
  • Composite alarm actions only support a message to an SNS topic
  • Only metric alarms can be child alarms, nesting of a composite ones is not supported yet

Example of supported rule: ALARM("arn:<partition>:cloudwatch:<region>:111111111111:alarm:<simple-alarm-1-name>") OR ALARM("arn:<partition>:cloudwatch:<region>:111111111111:alarm:<simple-alarm-2-name>")

Testing

Snapshot tests have been recorded for several test cases to verify that alarm rule logic applies correctly.

Copy link
Contributor

@localstack-bot localstack-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Welcome to LocalStack! Thanks for raising your first Pull Request and landing in your contributions. Our team will reach out with any reviews or feedbacks that we have shortly. We recommend joining our Slack Community and share your PR on the #community channel to share your contributions with us. Please make sure you are following our contributing guidelines and our Code of Conduct.

@localstack-bot
Copy link
Contributor

localstack-bot commented Nov 11, 2024

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@tiurin
Copy link
Contributor Author

tiurin commented Nov 11, 2024

I have read the CLA Document and I hereby sign the CLA

localstack-bot added a commit that referenced this pull request Nov 11, 2024
@tiurin
Copy link
Contributor Author

tiurin commented Nov 11, 2024

Haven't added PR labels as I seem to not have permissions.

new_state_value = StateValue.OK
alarm_rule = composite_alarm.alarm["AlarmRule"]
# assuming that a rule consists only of ALARM evaluations of metric alarms, with OR logic applied
metric_alarm_arns = re.findall(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alarm rule parsing and further evaluation definitely deserves more checks, like check if parsed metric alarm ARM exists in the store and actually check for OR presence. However, I wanted to first get overall feedback on PR and approach and then work on these details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there are some things that can break in real usage if unchecked, but I'd like to fix them after a general feedback on approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added several input validations in put_composite_alarm, namely:

  • if each metric alarm condition starts with ALARM
  • if there is a metric alarm in the store that corresponds to a referenced one

Likely some edge cases are still left out. For a more robust and extensible parsing likely a different approach needs to be taken, probably with recursive descent parsing.

@simonrw simonrw added aws:cloudwatch Amazon CloudWatch semver: patch Non-breaking changes which can be included in patch releases labels Nov 12, 2024
Copy link
Contributor

@simonrw simonrw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple of questions about the correct behaviour, but this is looking great!

Comment on lines -27 to +28
current_time = datetime.datetime.utcnow()
current_time = datetime.datetime.now(timezone.utc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for correcting this!

Comment on lines 1072 to 1080
aws_client.cloudwatch.put_composite_alarm(
AlarmName=composite_alarm_name,
AlarmDescription=composite_alarm_description,
AlarmRule=composite_alarm_rule,
OKActions=[topic_arn_ok],
AlarmActions=[topic_arn_alarm],
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you are making a change to this method, it might be nice to try snapshotting the response for extra parity coverage

Suggested change
aws_client.cloudwatch.put_composite_alarm(
AlarmName=composite_alarm_name,
AlarmDescription=composite_alarm_description,
AlarmRule=composite_alarm_rule,
OKActions=[topic_arn_ok],
AlarmActions=[topic_arn_alarm],
)
put_composite_alarm_response = aws_client.cloudwatch.put_composite_alarm(
AlarmName=composite_alarm_name,
AlarmDescription=composite_alarm_description,
AlarmRule=composite_alarm_rule,
OKActions=[topic_arn_ok],
AlarmActions=[topic_arn_alarm],
)
snapshot.match("put-composite-alarm", put_composite_alarm_response)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Makes sense! Adding it in a separate commit, together with the updated snapshot file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 1197 to 1240
# It is not possible to reuse the snapshot within the same test,
# therefore checking that alarm reason hasn't changed instead of recording new snapshot
state_reason_data = json.loads(
composite_alarm_not_changed_by_second_trigger["StateReasonData"]
)
triggering_alarms = state_reason_data["triggeringAlarms"]
assert len(triggering_alarms) == 1
assert triggering_alarms[0]["arn"] == alarm_2_arn
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

q: why do you think it's a problem to record new state?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that because composite alarm is unaffected it should be compared against previous snapshot, but now I can't remember why I thought it is a problem to record a new one. 🙂 🤷 I think I have a better understanding of snapshot testing by now and I changed this check to a new recorded snapshot comparison.

@@ -889,7 +970,7 @@ def create_message_response_update_state_lambda(
return json.dumps(response, cls=JSONEncoder)


def create_message_response_update_state_sns(alarm, old_state):
def create_message_response_update_state_sns(alarm: LocalStackMetricAlarm, old_state):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff, typing is great!

It looks like these states are StateValue instances

Suggested change
def create_message_response_update_state_sns(alarm: LocalStackMetricAlarm, old_state):
def create_message_response_update_state_sns(alarm: LocalStackMetricAlarm, old_state: StateValue):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, will add it here as per suggestion and to a new create_message_response_update_composite_alarm_state_sns one as well. 👍

Comment on lines 463 to 468
raise ValidationError(
f"Error in alarm rule condition '{alarm}': Only ALARM expression is supported"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should perhaps LOG.warning and ignore the rule rather than raising an error on unsupported actions. This becomes a known lack of parity rather than a user error.

... except if this is actually a parity improvement with AWS, in which case we should have a snapshot to cover it. I think this would be better as an additional test to capture any validation errors thrown by CloudWatch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did take inspiration from AWS response for put-composite-alarm when the alarm rule grammar has unsupported tokens.
For example, if I run the following command against AWS

aws cloudwatch put-composite-alarm --alarm-name "composite-alarm" --alarm-rule "ALARRRRM(metric-alarm)"

then the response is:

An error occurred (ValidationError) when calling the PutCompositeAlarm operation: Error in AlarmRule [Unsupported token 'ALARRRRM' at char 0]

I'm not sure if to qualify the validation I added as a parity improvement though. My motivation to add this validation was to make it very explicit to a Localstack user that as of now no expressions except ALARM one are supported. This way when they try to add a composite alarm that has expressions that are not supported yet by Localstack they will get a message and error response quite similar to what AWS provides. If there is no such validation a user could add a composite alarm with what is supported by AWS, let's say, OK(metric-alarm) in the rule. This rule would be saved and later be evaluated not as user expects, because in the current implementation every metric alarm is checked for ALARM state only.

So what I added is a similar behaviour, not full parity.

Because of that, I didn't see how I can do a shapshot, as AWS would happily accept the ALARM token.

I think when a full alarm rule grammar is implemented many things will become simpler and cleaner, validation included. I just didn't venture into implementing full grammar parser in the first iteration, which lead to having to make several workarounds.

After my explanation, do you think it is worth to keep this validation? I'm absolutely good to change to LOG.warning as you suggest if that's the go-to approach in Localstack for lack of parity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree a line needs to be drawn for this first implementation. My take is: I would prefer to not fail on valid input if it is not supported by LocalStack (yet). I think it is better to not break existing infrastructure (e.g. if a user tries to deploy their app to LocalStack, we should not block them just because we don't support a feature).

The only other option we have is to raise NotImplementedError which correctly conveys the message, however we treat that exception specially to return a response similar to "The API action '...' is not supported" which is not quite what we want. The action is supported, just that this specific input is not.

Can we maybe switch to the LOG statement in this instance, and as you say: a proper rule parser will likely be a future step and we can properly validate the input at that stage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to not break existing infrastructure (e.g. if a user tries to deploy their app to LocalStack, we should not block them just because we don't support a feature).

Great point, now what you suggest makes a lot of sense to me! Also it made me think whether a partial rule implementation does more good than harm. 😄 I mean situations like when someone has a rule with AND in their infra and Localstack will treat it as if it was OR. Might be a rare case, yet it is probably better to not delay full alarm expression grammar support by a long time after (if) this gets merged.

Anyway, will now change raising error to logging as you suggest, thanks for explaining the details about parity concept!

Copy link
Member

@pinzon pinzon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really good PR and now we're closer to provide full functionality to CW logs. Please resolve the comments from @simonrw they're very insightful.

Also another test about composite alarms is failing:
FAILED tests/aws/services/cloudwatch/test_cloudwatch.py::TestCloudwatch::test_put_composite_alarm_describe_alarms - botocore.exceptions.ClientError: An error occurred (ValidationError) when calling the PutCompositeAlarm operation: Could not save the composite alarm as alarms [a-ae401a82] in the alarm rule do not exist

Comment on lines 997 to 1038
topic_name_alarm = f"topic-alarm-{short_uid()}"
topic_name_ok = f"topic-ok-{short_uid()}"

sns_topic_alarm = sns_create_topic(Name=topic_name_alarm)
topic_arn_alarm = sns_topic_alarm["TopicArn"]
sns_topic_ok = sns_create_topic(Name=topic_name_ok)
topic_arn_ok = sns_topic_ok["TopicArn"]

# create queues for 'ALARM' and 'OK' of the composite alarm (will receive sns messages)
queue_url_alarm = sqs_create_queue(QueueName=f"AlarmQueue-{short_uid()}")
queue_url_ok = sqs_create_queue(QueueName=f"OKQueue-{short_uid()}")

arn_queue_alarm = aws_client.sqs.get_queue_attributes(
QueueUrl=queue_url_alarm, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]
arn_queue_ok = aws_client.sqs.get_queue_attributes(
QueueUrl=queue_url_ok, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]
aws_client.sqs.set_queue_attributes(
QueueUrl=queue_url_alarm,
Attributes={"Policy": get_sqs_policy(arn_queue_alarm, topic_arn_alarm)},
)
aws_client.sqs.set_queue_attributes(
QueueUrl=queue_url_ok, Attributes={"Policy": get_sqs_policy(arn_queue_ok, topic_arn_ok)}
)

# subscribe to SQS
subscription_alarm = aws_client.sns.subscribe(
TopicArn=topic_arn_alarm, Protocol="sqs", Endpoint=arn_queue_alarm
)
cleanups.append(
lambda: aws_client.sns.unsubscribe(
SubscriptionArn=subscription_alarm["SubscriptionArn"]
)
)
subscription_ok = aws_client.sns.subscribe(
TopicArn=topic_arn_ok, Protocol="sqs", Endpoint=arn_queue_ok
)
cleanups.append(
lambda: aws_client.sns.unsubscribe(SubscriptionArn=subscription_ok["SubscriptionArn"])
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SNS to SQS is the default pattern we use to test alarms, maybe we could convert it into a fixture for the next iteration 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion, added a TODO comment for now but definitely will address either in this or follow-up PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed a test by changing alarm name to alarm ARN in the rule, according to scope, but probably it is better to just add support for alarm-names, especially after what @simonrw explained about parity here. Using alarm names in rules is more human-readable and is likely used a lot. Will look into adding support for names tomorrow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually reverted to the original test and moved validation from put alarm method to alarm evaluation, skipping and logging unsupported rules. This is the least intrusive approach which won't harm users who might have valid composite alarms in their infra - they should be able to deploy on Localstack regardless on whether Localstack can evaluate rules in their alarms or not.

Copy link
Contributor

@simonrw simonrw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for keeping this PR up to date. I will ask @pinzon to do the final approval since he has a "request changes" review blocking the merge but otherwise LGTM!

@@ -2713,6 +2928,34 @@ def _sqs_messages_snapshot(expected_state, sqs_client, sqs_queue, snapshot, iden
snapshot.match(f"{identifier}-sqs-msg", found_msg)


def check_composite_alarm_message(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a reasonable compromise, thanks!

@pinzon pinzon added semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases and removed semver: patch Non-breaking changes which can be included in patch releases labels Nov 26, 2024
@localstack-bot
Copy link
Contributor

Currently, only patch changes are allowed on master. Your PR labels (aws:cloudwatch, semver: minor) indicate that it cannot be merged into the master at this time.

@pinzon
Copy link
Member

pinzon commented Nov 26, 2024

Hi @tiurin. Thank you for addressing the previous comments. I'm pleased to approve your pull request ✔️.

I noticed that the PR was labeled as a PATCH, but it should be classified as a MINOR since it introduces new functionality to the CloudWatch service. Given our current code freeze, we will merge this PR next week.

@tiurin
Copy link
Contributor Author

tiurin commented Nov 26, 2024

Thanks @pinzon and @simonrw, my pleasure to contribute! Yeah, good point on MINOR rather than PATCH. I will rebase my branch one more time later this week to make sure it builds well with all the latest changes to master.

tiurin and others added 27 commits December 3, 2024 14:19
Improves readability and fixes a bug with returning from all alarms evaluation if state of one of them hasn't changed - function complexity needs to be reduced as there are too many break and return statements.
Improves readability and reduces return statements from the original method.
Co-authored-by: Simon Walker <s.r.walker101@googlemail.com>
It relied on the alarm name reference in composite alarm rule which is not supported yet. Changing it to ARN reference. Alarm names support should be added as the next step though, it is not too difficult and is very natural to use them in alarm rule description.
Co-authored-by: Simon Walker <s.r.walker101@googlemail.com>
As discussed in the (PR)[#11828 (comment)], partial behaviour should not hurt parity. In this case, because evaluation by name is not yet implemented, it is better to let user save an alarm that is not fully supported by Localstack but skip this alarm evaluation.
@simonrw
Copy link
Contributor

simonrw commented Dec 3, 2024

FYI I re-pushed to trigger a reassessment of the PR label enforcer job

@simonrw simonrw merged commit 206be1c into localstack:master Dec 10, 2024
29 checks passed
@tiurin tiurin deleted the add-basic-composite-alarm-support branch December 10, 2024 18:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aws:cloudwatch Amazon CloudWatch semver: minor Non-breaking changes which can be included in minor releases, but not in patch releases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants