[Feat]: Deadman alerts for Parent/Child Netdata setups - detect loss of streaming #18436

Closed
dogsbody-josh opened this issue Aug 29, 2024 · 1 comment · Fixed by #19586
Labels
feature request New features needs triage Issues which need to be manually labelled

Comments

@dogsbody-josh

Problem

The problem to be solved is alerting on the loss of streaming metrics from a child node to a parent node. This request deals specifically with the use case of a child that stops streaming to a parent.

We will use the term 'Deadman Alerts' throughout this request to describe the detection of a loss of streaming from a child to a parent node. This is to distinguish the functionality from the existing Reachability Alerts Netdata Cloud provides.

On that topic, Netdata Cloud has Reachability Alerts, but they are confined to the Cloud option. At the moment they are largely unconfigurable, and no specific option exists to track or alert on Parent/Child communication loss. Issue 1039 exists to address that, but it doesn't go as far as this Feature Request in bringing the functionality into the Netdata Agent.

The problems with cloud based Reachability Alerts are:

  1. Excessive alerts during upgrades, from ephemeral nodes, and from nodes that are turned off and on to a schedule.
  2. No non-cloud based 'in-built' option to detect the loss of streaming from a child to a parent.
  3. No ability to detect partial loss of metric data, for instance if a service like Apache stops sending metrics but the remaining metrics continue to be sent.
  4. The lack of configurability of Cloud Reachability alerts prevents notifications about potential streaming problems from reaching the right teams in the right way.

Netdata agent has the ability to extensively configure notification options, roles, and recipients across a wide range of notification channels. No such ability exists with Cloud Reachability alerts. Leveraging the existing notification system by integrating it with this new 'Deadman Alert' type functionality would provide extensive notification options and ensure teams are alerted to problems right away.

Finally, because Reachability Alerts are either on or off for all nodes in a room, some ephemeral nodes have to be organised differently just to work around this limitation. Integrating Deadman Alerts into the agent itself, and giving the parent control over alerting on a loss of streaming, would remove it.

Overall, the problem being solved here is multi-faceted:

  1. Provide 'Deadman Alerts' for a child that stops streaming all or part of its data to a parent
  2. Do so without requiring cloud configuration
  3. Provide greater configurability to 'Deadman Alert' notifications using the current notification system

Description

These 'in-built' non-cloud Deadman Alerts need the following functionality:

  1. Configurable per individual node via the Parent. All the points below should be configurable in this way too.
  2. Conditions for triggering the alert should be configurable. Configuration options should include (at least) a customisable delay before triggering (to account for patching/Netdata Agent restarts) and the ability to specify all or partial streaming data loss.
  3. Control over the content of the notification, including subject and body via the existing notification system - Agent Dispatched Notifications (Roles and alternative notification methods).
  4. Ability to silence alerts per node or for a custom, selectable set of nodes.
  5. Silences should be schedulable and support recurrence, so that nodes that are switched off overnight or during some other recurring period can be silenced without causing inappropriate alerts.
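
To make the requirements above more concrete, here is a minimal Python sketch of how a parent might evaluate one child against a per-node rule. It is only an illustration under stated assumptions: `SilenceWindow`, `DeadmanRule`, `evaluate`, and the context names used are hypothetical and not part of Netdata, and the sketch assumes the parent can look up the time each metric context last received a sample from the child.

```python
from dataclasses import dataclass, field
from datetime import datetime, time, timedelta


@dataclass
class SilenceWindow:
    """Recurring daily silence, e.g. 22:00-06:00 (hypothetical structure)."""
    start: time
    end: time

    def covers(self, now: datetime) -> bool:
        t = now.time()
        if self.start <= self.end:
            return self.start <= t < self.end
        # The window wraps past midnight.
        return t >= self.start or t < self.end


@dataclass
class DeadmanRule:
    """Per-child settings held on the parent (all names are illustrative)."""
    delay: timedelta = timedelta(minutes=5)   # grace period for restarts/patching
    watch_contexts: tuple[str, ...] = ()      # empty: only total loss is checked
    silences: list[SilenceWindow] = field(default_factory=list)


def evaluate(rule: DeadmanRule, last_seen: dict[str, datetime], now: datetime) -> str:
    """Return 'silenced', 'ok', 'partial' or 'dead' for one child.

    last_seen maps a metric context name to the time its last sample arrived.
    """
    if any(w.covers(now) for w in rule.silences):
        return "silenced"
    cutoff = now - rule.delay
    if not last_seen or all(ts < cutoff for ts in last_seen.values()):
        return "dead"
    stale = [c for c in rule.watch_contexts
             if last_seen.get(c, datetime.min) < cutoff]
    return "partial" if stale else "ok"


if __name__ == "__main__":
    now = datetime(2024, 8, 29, 14, 30)
    rule = DeadmanRule(
        watch_contexts=("apache.requests",),
        silences=[SilenceWindow(time(22, 0), time(6, 0))],
    )
    seen = {
        "system.cpu": now - timedelta(seconds=2),
        "apache.requests": now - timedelta(minutes=10),
    }
    print(evaluate(rule, seen, now))  # partial
```

Keeping the delay, the watched contexts, and the silences in one per-child rule reflects point 1: everything is configured on the parent, per node. A real implementation would also need to route the resulting state through the existing notification system.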

Configuration of this functionality should be centrally manageable by the 'parent' node for the setup. This is particularly important for Business customers such as MSPs, where a single main parent may collect streaming data from hundreds of child nodes.

In these cases we shouldn't have to configure Deadman Alerts individually per child; they should be managed in one central location (the parent). Point 4 above would be used to control where Reachability Alerts are delivered.

Importance

must have

Value proposition

Deadman Alerts are a critical component of a monitoring solution and deserve first-class status. Knowing that a child node is no longer reporting to the monitoring solution is as important as the monitoring itself. A child node that is not reporting in may:

  1. Lose metrics
  2. Fail to trigger health alerts, especially when the parent is responsible for those health alerts.

If health alerts aren't triggered then the monitoring becomes effectively useless.

Because of this we believe it is absolutely essential to immediately detect if a non-ephemeral node is no longer reachable so that relevant teams can immediately investigate and rectify the issue. This is even more important in parent/child setups (the focus of this Feature Request) where the parent is responsible for health alerts.

Our aim is to avoid a situation where a node silently loses monitoring/alerting. We also want to be able to configure the parameters and notification options for such Deadman Alerts without having to rely on the limited Cloud Reachability function.

Proposed implementation

Netdata already detects the loss of communication from a child, or at least the reconnection of streaming. We see such output in the logs on the parent:

Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: thread created with task id 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: STREAM child.monitoring.com [localhost]:38164: receive thread started
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: [child.monitoring.com]: Postponing health checks for 60 seconds, because it was just connected.
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: STREAM child.monitoring.com [receive from [localhost]:38164]: established link with negotiated capabilities: VCAPS HLABELS CLAIM CLABELS FUNCTIONS REPLICATION BINARY INTERPOLATED IEEE754 ML DYNCFG SLOTS PROGRESS 
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: STREAM_RECEIVER for 'child.monitoring.com': connected and ready to receive data 
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: Queuing status update for node=REDACTED, live=1, hops=1, queryable=1
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'streaming' of host 'child.monitoring.com' changed collector from 3230785 to 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'streaming' of host 'child.monitoring.com' changed execute callback data
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'netdata-api-calls' of host 'child.monitoring.com' changed collector from 3230785 to 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'netdata-api-calls' of host 'child.monitoring.com' changed execute callback data
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'mount-points' of host 'child.monitoring.com' changed collector from 3230785 to 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'mount-points' of host 'child.monitoring.com' changed execute callback data
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'containers-vms' of host 'child.monitoring.com' changed collector from 3230785 to 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'containers-vms' of host 'child.monitoring.com' changed execute callback data
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'systemd-services' of host 'child.monitoring.com' changed collector from 3230785 to 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'systemd-services' of host 'child.monitoring.com' changed execute callback data
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'network-interfaces' of host 'child.monitoring.com' changed collector from 3230785 to 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'network-interfaces' of host 'child.monitoring.com' changed execute callback data
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'network-connections' of host 'child.monitoring.com' changed collector from 3230785 to 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'network-connections' of host 'child.monitoring.com' changed execute callback data
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'network-sockets-tracing' of host 'child.monitoring.com' changed collector from 3230785 to 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'network-sockets-tracing' of host 'child.monitoring.com' changed execute callback data
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'processes' of host 'child.monitoring.com' changed collector from 3230785 to 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'processes' of host 'child.monitoring.com' changed execute callback data
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'config' of host 'child.monitoring.com' changed collector from 3230785 to 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'config' of host 'child.monitoring.com' changed execute callback data
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: RRDCONTEXT: received checkpoint command for claim id 'REDACTED', node id 'REDACTED', while node 'child.monitoring.com' has an active context streaming.
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: RRDCONTEXT: host 'child.monitoring.com' enabling streaming of contexts

We propose that the Netdata Agent on the parent detects the loss of streaming metrics from a child in some appropriate manner, and that this detection is then used to trigger the Deadman Alert as configured. I believe that with the following enabled:

[plugins]
        netdata monitoring extended = yes

...it's possible to view the condition of streaming, so there may be existing scaffolding in place on which to build this feature.
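
As a rough illustration of what the parent already knows, the Python sketch below pulls the connection events out of log lines like those shown above. It is not how the agent would implement the feature. It assumes the syslog prefix and the `STREAM_RECEIVER ... connected and ready to receive data` wording exactly as they appear in this log excerpt, which may differ between Netdata versions, and `last_connections` is a hypothetical helper name.

```python
import re

# Matches the parent-side line from the excerpt above, e.g.
# "Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: STREAM_RECEIVER for
#  'child.monitoring.com': connected and ready to receive data"
CONNECTED = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s\S+\snetdata\[\d+\]:\s"
    r"STREAM_RECEIVER for '(?P<child>[^']+)': connected and ready to receive data"
)


def last_connections(lines):
    """Return {child hostname: syslog timestamp of its most recent connection}."""
    seen = {}
    for line in lines:
        m = CONNECTED.match(line)
        if m:
            seen[m.group("child")] = m.group("stamp")
    return seen


if __name__ == "__main__":
    sample = [
        "Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: "
        "STREAM child.monitoring.com [localhost]:38164: receive thread started",
        "Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: "
        "STREAM_RECEIVER for 'child.monitoring.com': connected and ready to receive data ",
    ]
    print(last_connections(sample))
```

This only captures reconnections, which is the weaker signal noted above. A Deadman Alert needs the opposite event, the absence of data for longer than the configured delay, so log scraping is at best a stopgap until the agent exposes that state directly.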

If this Feature is implemented, #17539 could also be integrated.

@dogsbody-josh dogsbody-josh added feature request New features needs triage Issues which need to be manually labelled labels Aug 29, 2024
@ktsaou ktsaou mentioned this issue Feb 6, 2025
11 tasks
@ktsaou
Member

ktsaou commented Feb 6, 2025

Hey @dogsbody-josh I am working on this (better late than never - sorry for that).
Follow my progress on #19586 ...
